Usage of the RDF Crawler of 2000-11-27

Unpack the ZIP file into a directory of your choice, referred to below as $DIR$. Change to the "lib" directory, which contains the executable JAR files. Run launch.bat with three arguments: the initial URL, the crawl depth, and the time limit for the crawl (in seconds).

System requirements

Currently the RDF Crawler has been tested only on the Windows platform, using Java version 1.3. You can get Java from http://java.sun.com, either as the JDK (Java Development Kit) for development or as the JRE (Java Runtime Environment) if you only want to run this program. Linux/Unix platforms may cause problems in a few places where we use Windows-style file paths, but these should be easy to correct.

How to Run RDF Crawler as a Separate Application

Create two directories on your computer if they do not already exist: c:\temp and c:\temp\rdf. The crawler uses them to store cached files and to write its results.
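If you prefer to create these directories from Java rather than by hand, the following is a minimal sketch. CrawlDirs and its prepare method are hypothetical names used only for illustration; they are not part of the RDF Crawler distribution. The default base directory c:\temp is the one named above.

```java
import java.io.File;

/** Hypothetical helper; not part of the RDF Crawler distribution. */
final class CrawlDirs {

    /**
     * Creates baseDir and its "rdf" subdirectory if they are missing
     * and returns the "rdf" subdirectory.
     */
    static File prepare(File baseDir) {
        File rdfDir = new File(baseDir, "rdf");
        // mkdirs() also creates missing parent directories. It returns
        // false when the directory already exists, so the result is
        // checked with isDirectory() instead of the return value.
        rdfDir.mkdirs();
        if (!rdfDir.isDirectory()) {
            throw new IllegalStateException("Cannot create " + rdfDir.getPath());
        }
        return rdfDir;
    }

    public static void main(String[] args) {
        // Default matches the location named in this document.
        File base = new File(args.length > 0 ? args[0] : "c:\\temp");
        System.out.println("Ready: " + prepare(base).getPath());
    }
}
```

Calling prepare a second time on the same base directory is harmless, because existing directories are left untouched.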

If you unzipped the distribution into directory $DIR$, then $DIR$\lib is your working directory. It contains all the JAR files the crawler needs. Run the batch script $DIR$\lib\launch.bat. A sample command line looks like this:

launch http://www.aifb.uni-karlsruhe.de/WBS/daml/homework/assignment1.html 2 60

This means that you crawl the DAML homework assignments page and everything it links to, up to depth 2 (i.e. the pages that the initial URL refers to directly in RDF or HTML text, and also the pages those refer to in turn), and that you spend up to 60 seconds crawling. We recommend passing exactly three arguments to launch.bat. Afterwards you get two files of interest:

c:\temp\crawllog.xml
Log file of all the URLs crawled, with diagnostics: "status=black" means that the URL was processed successfully, whereas "status=red" means that an exception occurred. The text of the exception is recorded as well.
c:\temp\crawlmodel.rdf
The repository of RDF facts obtained during crawling, serialized in RDF format.

Here are samples of crawllog.xml and crawlmodel.rdf, so that you can check whether your results look as they should.
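To get a quick overview of a crawl, one could count the black and red status entries in the log. The sketch below is only an illustration: CrawlLogSummary is a hypothetical class, and it assumes that the status appears in the log text as status=red, status="red" or status='red' (likewise for black). The exact layout of crawllog.xml is not specified here, so check it against a real log before relying on the numbers. The code uses java.util.regex and try-with-resources, so it needs a newer Java than the version 1.3 mentioned above.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of the RDF Crawler distribution. */
final class CrawlLogSummary {

    // ASSUMPTION: the status is written as status=VALUE, optionally with
    // single or double quotes around VALUE. Adjust to the real log layout.
    private static final Pattern STATUS =
            Pattern.compile("status\\s*=\\s*[\"']?(black|red)\\b");

    /** Returns a two-element array: {black count, red count}. */
    static int[] count(Reader in) throws IOException {
        int[] counts = new int[2];
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = STATUS.matcher(line);
            while (m.find()) {
                counts["black".equals(m.group(1)) ? 0 : 1]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Default is the log location named in this document.
        String path = args.length > 0 ? args[0] : "c:\\temp\\crawllog.xml";
        try (Reader in = new FileReader(path)) {
            int[] c = count(in);
            System.out.println("black=" + c[0] + " red=" + c[1]);
        }
    }
}
```

A plain text scan is used instead of an XML parser so that the sketch makes as few assumptions as possible about element and attribute names.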

How to Embed RDF Crawler in Another Application

If you want to embed the crawler in another application, create an instance of CrawlConsole and pass the arguments to its constructor. See the API documentation for details. A sample Java program invoking the RDF Crawler:

import java.util.*;
import edu.unika.aifb.rdf.crawler.*;

/**
 * Call this class with 3 arguments: the URL to crawl,
 * the depth, and the time in seconds
 */

public class SampleCrawl {

    public static void main(String[] args) throws Exception {

        if (args.length != 3) {
            System.err.println("Usage: java  -cp  [JARs]  SampleCrawl  [URL]  [depth:int]  [time:int]");
            // non-zero exit code signals the usage error to the caller
            System.exit(1);
        }

        Vector uris = new Vector();
        uris.add(args[0]);

        // no host filtering - crawl to all hosts
        Vector hostfilter = null;

        /* You may want to do something else to enable host filtering:
         * Vector hostfilter = new Vector();
         * hostfilter.add("http://www.w3.org");
         * ....
         */

        int depth = 2;
        int time = 60;
        try {
            depth = Integer.parseInt(args[1]);
            time = Integer.parseInt(args[2]);
        }
        catch (NumberFormatException e) {
            // Integer.parseInt rejects anything that is not an integer
            System.err.println("Illegal argument types:");
            System.err.println("Argument list: URI:String  depth:int  time(s):int");
            System.exit(1);
        }

        // Initialize Crawling parameters
        CrawlConsole c = new CrawlConsole(uris,hostfilter,depth,time);

        // get an ontology file from its local location
        // (OPTIONAL)
        c.setLocalNamespace("http://www.daml.org/2000/10/daml-ont","c:\\temp\\rdf\\schemas\\daml-ont.rdf");

        // set all the paths to get all the results
        c.setLogPath("c:\\temp\\crawllog.xml");
        c.setCachePath("c:\\temp\\crawlcache.txt");
        c.setModelPath("c:\\temp\\crawlmodel.rdf");

        // crawl and get RDF model
        c.start();

        // This writes all three result files out
        c.writeResults();
    }
}
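
The sample above hard-codes Windows paths. As a minimal sketch of one way to make the caller-supplied paths portable, the helper below derives the three result paths from a base directory with java.io.File, which joins names with the platform's own separator. CrawlPaths and its field names are hypothetical and not part of the crawler API; the file names are the ones used in the sample. This only covers the paths passed to setLogPath, setCachePath and setModelPath. It does not change any Windows-style paths used inside the crawler itself.

```java
import java.io.File;

/** Hypothetical helper; not part of the RDF Crawler API. */
final class CrawlPaths {

    final String logPath;
    final String cachePath;
    final String modelPath;

    CrawlPaths(File baseDir) {
        // new File(parent, child) joins the two parts with the
        // separator of the platform the program runs on.
        logPath = new File(baseDir, "crawllog.xml").getPath();
        cachePath = new File(baseDir, "crawlcache.txt").getPath();
        modelPath = new File(baseDir, "crawlmodel.rdf").getPath();
    }

    public static void main(String[] args) {
        // Falls back to the system temp directory when no base is given.
        File base = new File(args.length > 0
                ? args[0]
                : System.getProperty("java.io.tmpdir"));
        CrawlPaths p = new CrawlPaths(base);
        System.out.println(p.logPath);
        System.out.println(p.cachePath);
        System.out.println(p.modelPath);
    }
}
```

With such a helper, the path setup in the sample could read c.setLogPath(p.logPath), c.setCachePath(p.cachePath) and c.setModelPath(p.modelPath).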

The original sources of other JARs used in the crawler

You may want to try newer versions of these Java libraries from the sources listed below. However, newer versions are not guaranteed to work correctly with our RDF Crawler.

gnu-regexp-1.0.8.jar
see http://www.cacas.org/~wes/java/
xerces.jar
see http://xml.apache.org
xml4j.jar
see http://www.alphaworks.ibm.com/
rdf-api-2000-09-03.jar
see http://www-db.stanford.edu/~melnik/rdf/api.html