Package edu.unika.aifb.rdf.crawler

Class Summary
Cache This is a top-level class responsible for mapping URIs to file paths, streams or symbolic strings.
Channel An individual channel which waits for web page retrieval work, tries Cache first and then NetRetrieve, and finally passes the file path to DocProcessor, which returns 1) a piece of the RDF model, 2) all the URIs appearing in the given URI that still have to be tested/crawled, and 3) exceptions (if any).
ChannelPool This class gets URIs one by one and decides when to start new threads.
CrawlConsole CrawlConsole is intended as the only public class to be used by every application which needs to embed RDF Crawler functionality.
DocInstance DocInstance - call different document processing routines.
HostFilter The class HostFilter checks whether the URL string belongs to the given set of hosts.
HTMLInstance HTMLInstance - process the metainfo extracted from the HTML document.
NetRetrieve NetRetrieve - fetch URLs and write them to files.
RDFInstance HTMLInstance - process the metainfo extracted from the HTML document.
RobotCheck Finds out the host's robot policy.
URIList The class URIList is the only class in the package intended to be called from outside the "uriproc" package, namely when initializing the URIList.
URLStruct This class represents a data structure that stores a single URL with full status information: crawling depth, referrer (the parent URL), processing status (see below) and the exceptions encountered while crawling to the given URI.
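
The host check described for HostFilter and the per-URL bookkeeping described for URLStruct can be illustrated with a minimal sketch. The class names HostFilterSketch and UrlRecordSketch, and all of their methods and fields, are hypothetical and are not the API of this package; the sketch also assumes that a host matches only on exact, case-insensitive equality and that a child URL is one level deeper than its referrer.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a host filter; NOT the package's HostFilter API.
final class HostFilterSketch {
    private final Set<String> allowedHosts = new HashSet<>();

    HostFilterSketch(String... hosts) {
        // Host names are case-insensitive, so store them lower-cased.
        for (String h : hosts) {
            allowedHosts.add(h.toLowerCase(Locale.ROOT));
        }
    }

    // Returns true only if the URL parses and its host is in the allowed set.
    boolean accepts(String url) {
        try {
            String host = new URI(url).getHost();
            return host != null
                    && allowedHosts.contains(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return false; // Unparseable strings are rejected in this sketch.
        }
    }
}

// Hypothetical stand-in for a per-URL record; NOT the package's URLStruct API.
final class UrlRecordSketch {
    final String url;
    final int depth;       // crawling depth; 0 for a start URL
    final String referrer; // parent URL, or null for a start URL

    UrlRecordSketch(String url, int depth, String referrer) {
        this.url = url;
        this.depth = depth;
        this.referrer = referrer;
    }

    // A URL discovered on this page is one level deeper and refers back to it.
    UrlRecordSketch child(String childUrl) {
        return new UrlRecordSketch(childUrl, depth + 1, url);
    }
}
```

In such a design a crawler would typically consult the filter before queuing a child record, so that URLs outside the chosen hosts never reach a retrieval thread.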

Exception Summary
FilterException