|
Class Summary |
| Cache |
Top-level class responsible for mapping URIs
to file paths, streams, or symbolic strings.
|
| Channel |
An individual channel that waits for web page
retrieval work, tries Cache, then NetRetrieve,
and finally passes the file path to DocProcessor,
getting back:
1) a piece of the RDF model
2) all the URIs that appear in the given URI
and have to be tested/crawled
3) exceptions (if any) |
| ChannelPool |
This class gets URIs one by one and decides when
to start new threads. |
| CrawlConsole |
CrawlConsole is intended as the only public class to be used
by any application that needs to embed RDF Crawler functionality.
|
| DocInstance |
DocInstance - call different document processing routines.
|
| HostFilter |
The class HostFilter checks whether the URL
string belongs to the given set of hosts.
|
| HTMLInstance |
HTMLInstance - process the metainfo extracted from the HTML document.
|
| NetRetrieve |
NetRetrieve - fetch URLs and write them to files |
| RDFInstance |
RDFInstance - process the metainfo extracted from the RDF document.
|
| RobotCheck |
Finds out the host's robot policy |
| URIList |
URIList is the only class in the "uriproc" package
intended to be called from outside that package,
namely when initializing the URIList.
|
| URLStruct |
A data structure that stores a single URL with its
full status information: crawling depth, referrer
(the parent URL), processing status (see below),
and any exceptions encountered while crawling the given URI. |
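
The HostFilter entry above describes checking whether a URL string belongs to a given set of hosts. The following is a minimal sketch of that idea only, not the crawler's actual implementation: the class name HostFilterSketch, its constructor, and the accepts method are hypothetical names chosen for illustration, and the case-insensitive comparison and the rejection of malformed URLs are assumptions rather than documented behavior.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not the real HostFilter class from the package.
class HostFilterSketch {
    // Allowed host names, stored in lower case for case-insensitive matching.
    private final Set<String> allowedHosts = new HashSet<>();

    HostFilterSketch(Set<String> hosts) {
        for (String host : hosts) {
            allowedHosts.add(host.toLowerCase(Locale.ROOT));
        }
    }

    // Returns true when the URL parses and its host is in the allowed set.
    boolean accepts(String url) {
        try {
            String host = new URI(url).getHost();
            return host != null
                    && allowedHosts.contains(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Assumption: URLs that cannot be parsed are simply rejected.
            return false;
        }
    }

    public static void main(String[] args) {
        HostFilterSketch filter = new HostFilterSketch(Set.of("example.org"));
        System.out.println(filter.accepts("http://example.org/data.rdf"));
        System.out.println(filter.accepts("http://other.example.com/"));
    }
}
```

A crawler could apply such a check to each discovered URI before queueing it, so that crawling stays within the configured hosts.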