Adds a single URI with the given crawling depth and the given parent.
This public method is called from Console in
a synchronized manner to add new URIs discovered by
any of the crawling threads/channels.
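A minimal sketch of how such a synchronized add might look. The class name `UriQueueSketch`, the method name `addUri`, the `Entry` fields, and the choice to skip duplicate URIs are all assumptions made for illustration; the crawler's actual signature is not shown in this index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UriQueueSketch {
    /** Hypothetical record: one URI plus its crawling depth and parent. */
    public static final class Entry {
        final String uri;
        final int depth;
        final String parent;

        Entry(String uri, int depth, String parent) {
            this.uri = uri;
            this.depth = depth;
            this.parent = parent;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final Set<String> seen = new HashSet<>();

    /**
     * Declared synchronized so that several crawling threads can report
     * newly discovered URIs without corrupting the shared list.
     * Assumption: a URI that is already known is ignored.
     */
    public synchronized boolean addUri(String uri, int depth, String parent) {
        if (!seen.add(uri)) {
            return false; // already recorded
        }
        entries.add(new Entry(uri, depth, parent));
        return true;
    }

    public synchronized int size() {
        return entries.size();
    }
}
```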
An individual channel that waits for a web page
to be retrieved. It tries Cache, then NetRetrieve,
and finally passes the file path to DocProcessor,
from which it gets back:
1) a piece of the RDF model
2) all the URIs found in the given URI
that have to be tested/crawled
3) exceptions (if any)
Initializes the crawler parameters.
uris -
String Vector of initial URIs to crawl
hostfilter -
String Vector of hosts we want to crawl (null, if we crawl everywhere)
depth -
how deep we want to crawl (0, if we want just the given URIs)
time -
how many seconds to wait before breaking connections to non-responding hosts
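As an illustration of how the hostfilter and depth parameters could interact, here is a hedged sketch of a scope check. `CrawlScopeSketch` and `inScope` are hypothetical names, and exact host-name matching is an assumption; the crawler may match hosts differently.

```java
import java.net.URI;
import java.util.Vector;

public class CrawlScopeSketch {
    /**
     * Returns true if the URI may be crawled.
     * hostFilter == null means "crawl everywhere";
     * maxDepth == 0 means "only the initially given URIs".
     */
    public static boolean inScope(String uri, int uriDepth,
                                  Vector<String> hostFilter, int maxDepth) {
        if (uriDepth > maxDepth) {
            return false; // deeper than allowed
        }
        if (hostFilter == null) {
            return true; // no host restriction
        }
        String host;
        try {
            host = URI.create(uri).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URI
        }
        // Assumption: the filter lists exact host names.
        return host != null && hostFilter.contains(host);
    }
}
```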
This method is adapted from org.gjt.vinny.html.HTMLEncoder.
It is a utility method that converts
a string into a form suitable for placing inside HTML,
so that the special symbols <, >, &, " and ' are properly escaped.
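The escaping described above can be sketched as follows. This is not the code of org.gjt.vinny.html.HTMLEncoder; `HtmlEscapeSketch` and `encode` are hypothetical names, and the use of `&#39;` for the apostrophe is an assumption (the original may emit a different entity).

```java
public class HtmlEscapeSketch {
    /** Replaces the five special symbols with HTML character references. */
    public static String encode(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break; // assumed entity choice
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because each character is examined exactly once, an ampersand produced by an earlier replacement is never escaped a second time.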
F
filter - Variable in class edu.unika.aifb.rdf.crawler.URIList
This class represents a data structure that stores a single
URL with its full status information: crawling depth,
referrer (the parent URL), processing status (see below)
and any exceptions encountered while crawling to the given URI.
Constructor to make URL records with all the crawling information.
W
WHITE - Static variable in class edu.unika.aifb.rdf.crawler.URLStruct
Status codes:
WHITE - discovered node, not yet being processed by any thread,
GRAY - discovered node, currently being processed,
BLACK - discovered node, already processed; its descendants have been inserted into the list,
RED - an exception was detected while crawling.
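One way to picture these status codes is as a small state machine. The sketch below is illustrative only: `UrlStatusSketch`, its method names, and the numeric values of the constants are hypothetical, since the actual values and transitions in URLStruct are not given here.

```java
public class UrlStatusSketch {
    // Hypothetical numeric values; the real constants may differ.
    public static final int WHITE = 0; // discovered, not yet processed
    public static final int GRAY  = 1; // currently being processed
    public static final int BLACK = 2; // processed, descendants inserted
    public static final int RED   = 3; // exception while crawling

    private int status = WHITE;
    private Exception error;

    /** A thread claims the node: WHITE becomes GRAY. Fails if already claimed. */
    public synchronized boolean claim() {
        if (status != WHITE) {
            return false;
        }
        status = GRAY;
        return true;
    }

    /** Processing finished normally: GRAY becomes BLACK. */
    public synchronized void done() {
        if (status == GRAY) {
            status = BLACK;
        }
    }

    /** Processing failed: the node turns RED and keeps the exception. */
    public synchronized void fail(Exception e) {
        status = RED;
        error = e;
    }

    public synchronized int status() {
        return status;
    }

    public synchronized Exception error() {
        return error;
    }
}
```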