edu.unika.aifb.rdf.crawler
Class CrawlConsole

java.lang.Object
  |
  +--edu.unika.aifb.rdf.crawler.CrawlConsole

public class CrawlConsole
extends java.lang.Object

CrawlConsole is intended to be the only public class used by applications that embed RDF Crawler functionality. If you are an end user of RDF Crawler, you should not use any other class from the package directly.

It initializes several main modules, keeps static references to them, and implements public methods for interacting with them. This class is also responsible for exporting the RDF model and for logging all the actions taken by the crawler.

An overview of the other classes in RDF Crawler, in case you decide to use or change them:

URIList, URLStruct, HostFilter, FilterException, RobotCheck
URIList maintains a dynamic list of URIs together with crawling information, exceptions, etc.
ChannelPool, Channel
Thread pool (ChannelPool) that initializes the threads (Channels) which process individual URLs
Cache, NetRetrieve
Caching and networking
DocInstance, HTMLInstance, RDFInstance
Document processing - accumulates information on URLs to follow, namespaces, and RDF facts


Field Summary
 Cache cache
          Cache of mappings from URLs to file paths.
 java.lang.String CachePath
          CachePath - absolute path where the cache map is stored
 int capacity
          Number of threads in the thread pool. Feel free to change this for optimum performance.
 java.lang.String LogPath
          LogPath - absolute path where the LOG file of the crawling process is stored
 org.w3c.rdf.model.Model model
          RDF model - we are building it from small pieces
 java.lang.String ModelPath
          ModelPath - absolute path where the model of all the RDF facts is stored
 ChannelPool pool
          Thread pool - branches off 10 different threads
 int time
          How many seconds to crawl.
 URIList urilist
          "TODO-list" - all the URLs we have to crawl.
 
Constructor Summary
CrawlConsole(java.util.Vector uris, java.util.Vector hostfilter, int depth, int time)
          Initialize the crawler parameters: uris, a String Vector of initial URIs to crawl; hostfilter, a String Vector of hosts we want to crawl (null, if we crawl everywhere); depth, how deep we want to crawl (0, if we want just the given URIs); time, how many seconds we wait until we break connections to nonresponding hosts.
 
Method Summary
 java.lang.String dumpModel()
          Get the crawling results as a string
static void main(java.lang.String[] args)
          Used to call CrawlConsole from the DOS command line.
 void saveModel(java.lang.String filepath)
          Save the crawling results to a file. RDFUtil.saveModel(...) does not work.
 void setCachePath(java.lang.String path)
          Indicate the file where you want to store the cache
 void setLocalNamespace(java.lang.String url, java.lang.String path)
          Map "url", an RDF namespace given by a Web address, to a local file "path".
 void setLogPath(java.lang.String path)
          Indicate the file where you want to store the LOG file
 void setModelPath(java.lang.String path)
          Indicate the file where you want to store the RDF model
 void start()
          Start Crawling.
 void writeResults()
          Write out the results
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

urilist

public URIList urilist
"TODO-list" - all the URLs we have to crawl. Threads concurrently update this.

cache

public Cache cache
Cache of mappings from URLs to file paths. Currently we cache all the retrieved documents. Even if we decide not to cache everything, it might be useful to cache frequently used ontologies, etc.

pool

public ChannelPool pool
Thread pool - branches off 10 different threads

time

public int time
How many seconds to crawl. A value of -1 means unlimited time. We might want to add a new method "stop()" to this console, so that the user can stop CrawlConsole at any time and dump the results.
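
The "-1 means unlimited" convention can be sketched as a small deadline check. This is an illustration only: the names CrawlDeadline and expired are hypothetical, and the source does not say how CrawlConsole tests its time limit internally.

```java
// Hypothetical helper, not part of RDF Crawler: illustrates the "-1 means unlimited" convention.
final class CrawlDeadline {
    /** Sentinel taken from the field description: -1 means no time limit. */
    static final int UNLIMITED = -1;

    /**
     * Returns true once at least timeSeconds have elapsed between startMillis and nowMillis.
     * With timeSeconds == UNLIMITED the deadline never expires.
     */
    static boolean expired(int timeSeconds, long startMillis, long nowMillis) {
        if (timeSeconds == UNLIMITED) {
            return false;
        }
        // Multiply as long to avoid int overflow for large limits.
        return nowMillis - startMillis >= timeSeconds * 1000L;
    }
}
```

A worker loop could call CrawlDeadline.expired(time, start, System.currentTimeMillis()) before taking the next URL.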

model

public org.w3c.rdf.model.Model model
RDF model - we are building it from small pieces

capacity

public final int capacity
Number of threads in the thread pool. Feel free to change this for optimum performance.

LogPath

public java.lang.String LogPath
LogPath - absolute path where the LOG file of the crawling process is stored

ModelPath

public java.lang.String ModelPath
ModelPath - absolute path where the model of all the RDF facts is stored

CachePath

public java.lang.String CachePath
CachePath - absolute path where the cache map is stored

Constructor Detail

CrawlConsole

public CrawlConsole(java.util.Vector uris,
                    java.util.Vector hostfilter,
                    int depth,
                    int time)

Initialize the crawler parameters

uris
String Vector of initial URIs to crawl
hostfilter
String Vector of hosts we want to crawl (null, if we crawl everywhere)
depth
how deep we want to crawl (0, if we want just the given URIs)
time
how many seconds we wait until we break connections to nonresponding hosts
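
The hostfilter convention (null means crawl everywhere) might be checked along these lines. This is a sketch under assumptions: HostFilterSketch and accepts are hypothetical names, matching is by exact, case-insensitive host name, and the real HostFilter class may behave differently.

```java
import java.net.URI;
import java.util.Vector;

// Hypothetical illustration of the "null means crawl everywhere" rule; not the real HostFilter.
final class HostFilterSketch {
    /**
     * Returns true if url may be crawled under the given host list.
     * URI.create throws IllegalArgumentException for a syntactically invalid url.
     */
    static boolean accepts(Vector<String> hostfilter, String url) {
        if (hostfilter == null) {
            return true; // no filter: crawl everywhere
        }
        String host = URI.create(url).getHost();
        if (host == null) {
            return false; // no host component, so it cannot match any listed host
        }
        for (String allowed : hostfilter) {
            if (allowed.equalsIgnoreCase(host)) {
                return true;
            }
        }
        return false;
    }
}
```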

Method Detail

setLogPath

public void setLogPath(java.lang.String path)
Indicate the file where you want to store the LOG file

setModelPath

public void setModelPath(java.lang.String path)
Indicate the file where you want to store the RDF model

setCachePath

public void setCachePath(java.lang.String path)
Indicate the file where you want to store the cache

start

public void start()
           throws java.lang.Exception
Start crawling. All CrawlConsole does is initialize and start the ChannelPool. All the actual work (crawling, RDF generation, etc.) is done by the 10 participant threads in the ChannelPool.
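
The division of labour described here, where a fixed pool of worker threads drains a shared to-do list, can be illustrated with the standard java.util.concurrent classes. This is an analogy, not the actual ChannelPool or URIList code: PoolSketch, crawlAll, and the placeholder processing step are assumptions made for illustration.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical analogy for ChannelPool/URIList built on java.util.concurrent.
final class PoolSketch {
    /** Processes every URL with `capacity` worker threads and returns how many were processed. */
    static int crawlAll(List<String> urls, int capacity) {
        Queue<String> todo = new ConcurrentLinkedQueue<>(urls); // shared to-do list
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(capacity);
        for (int i = 0; i < capacity; i++) {
            pool.execute(() -> {
                // Each worker keeps taking URLs until the list is empty. A real crawler
                // would also add newly discovered URLs, which this sketch does not do.
                while (todo.poll() != null) {
                    processed.incrementAndGet(); // placeholder for fetching and parsing
                }
            });
        }
        pool.shutdown(); // accept no further tasks
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES); // wait for the workers to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return processed.get();
    }
}
```

Calling PoolSketch.crawlAll(urls, 10) mirrors the 10 threads mentioned in the description.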

saveModel

public void saveModel(java.lang.String filepath)
               throws java.lang.Exception
Save the crawling results to a file. RDFUtil.saveModel(...) does not work, for reasons not yet known.

dumpModel

public java.lang.String dumpModel()
                           throws java.lang.Exception
Get the crawling results as a string

writeResults

public void writeResults()
                  throws java.lang.Exception
Write out the results

setLocalNamespace

public void setLocalNamespace(java.lang.String url,
                              java.lang.String path)
Map "url", an RDF namespace given by a Web address, to a local file "path". This is necessary because some namespace addresses cannot be fetched over the Internet, and it also improves performance. You may use this method whenever, for any reason, you want a document to be read from a local copy instead of its Web address.
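
The idea of redirecting a namespace URL to a local file can be sketched with a plain map lookup. How CrawlConsole stores and applies the mapping is not documented here, so LocalNamespaceSketch, its resolve method, and the example URL and path are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a URL-to-local-file mapping; not the RDF Crawler implementation.
final class LocalNamespaceSketch {
    private final Map<String, String> localCopies = new HashMap<String, String>();

    /** Registers a local file path to use in place of the given namespace URL. */
    void setLocalNamespace(String url, String path) {
        localCopies.put(url, path);
    }

    /** Returns the registered local path for url, or url itself if none was registered. */
    String resolve(String url) {
        String path = localCopies.get(url);
        return path != null ? path : url;
    }
}
```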

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Used to call CrawlConsole from the DOS command line. Normally CrawlConsole is instantiated from elsewhere, for example a Windows interface or an application which embeds the RDF Crawler.
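
An embedding application might drive the class roughly as follows. This is a sketch, not a verified recipe. The CrawlConsole declared here is only a stand-in with the signatures documented on this page, so that the example compiles without the crawler on the classpath. EmbeddingSketch, the file paths, and the call order are assumptions, and the page does not say whether start() blocks until crawling has finished.

```java
import java.util.Vector;

// Stand-in with the documented signatures and empty bodies. Remove it and use
// edu.unika.aifb.rdf.crawler.CrawlConsole when the real crawler is available.
class CrawlConsole {
    @SuppressWarnings("rawtypes")
    CrawlConsole(Vector uris, Vector hostfilter, int depth, int time) { }
    void setLogPath(String path) { }
    void setModelPath(String path) { }
    void setCachePath(String path) { }
    void start() throws Exception { }
    String dumpModel() throws Exception { return ""; }
}

// Hypothetical embedding code; paths and call order are assumptions.
final class EmbeddingSketch {
    static String crawl(String seedUri, int depth, int time) throws Exception {
        Vector<String> uris = new Vector<String>();
        uris.add(seedUri);
        // null host filter: crawl everywhere; depth 0: only the given URIs
        CrawlConsole console = new CrawlConsole(uris, null, depth, time);
        console.setLogPath("/tmp/crawl.log");     // hypothetical absolute paths
        console.setModelPath("/tmp/model.rdf");
        console.setCachePath("/tmp/cache.map");
        console.start();
        return console.dumpModel();               // crawling results as a string
    }
}
```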