$DIR$.
Change to the "lib" directory, which contains the executable JAR files.
Run launch.bat with three arguments: the initial URL,
the crawl depth, and the time limit for the crawl (in seconds).
The RDF Crawler has so far been tested only on the Windows platform, using Java version 1.3. You can get Java from http://java.sun.com, either as the JDK (Java Development Kit) for development or as the JRE (Java Runtime Environment) if you only want to run this program. Linux/Unix platforms may cause problems in the few places where we use Windows-style file paths, but these should be easy to correct.
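As an illustration of how the Windows-style file paths could be avoided, here is a minimal sketch that builds result paths with java.io.File instead of hard-coded backslashes. The class PortablePaths and the method resultPath are hypothetical names, not part of the RDF Crawler, and using the system temp directory in place of c:\temp is an assumption made only for this example.

```java
import java.io.File;

/** Builds result paths without hard-coding a path separator. */
class PortablePaths {

    /** Joins a base directory and a file name using the platform's separator. */
    static String resultPath(String baseDir, String fileName) {
        return new File(baseDir, fileName).getPath();
    }

    public static void main(String[] args) {
        // Assumption: the system temp directory stands in for c:\temp.
        String base = System.getProperty("java.io.tmpdir");
        System.out.println(resultPath(base, "crawllog.xml"));
        System.out.println(resultPath(base, "crawlmodel.rdf"));
    }
}
```

Paths built this way could then be passed wherever the crawler expects a file location.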
Create two directories on your computer, if they do not exist already:
c:\temp and c:\temp\rdf.
The crawler uses them to store cached files and to write its results.
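If you prefer to create the directories from a program rather than by hand, something like the following sketch should work. The class CrawlDirs and the method ensureDir are hypothetical helpers, not part of the RDF Crawler; only java.io.File from the standard library is used.

```java
import java.io.File;

/** Creates the directories the crawler expects, if they are missing. */
class CrawlDirs {

    /**
     * Ensures that the given directory exists, creating missing parents.
     * Returns true if the directory exists when the method returns.
     */
    static boolean ensureDir(String path) {
        File dir = new File(path);
        // mkdirs() returns false when the directory already exists,
        // so check isDirectory() afterwards instead of its return value.
        dir.mkdirs();
        return dir.isDirectory();
    }

    public static void main(String[] args) {
        // The default is the location named in this document; pass another
        // base directory as the first argument to try the sketch elsewhere.
        String base = args.length > 0 ? args[0] : "c:\\temp";
        String rdf = base + File.separator + "rdf";
        System.out.println(rdf + (ensureDir(rdf) ? " is ready" : " could not be created"));
    }
}
```

Creating c:\temp\rdf this way also creates c:\temp, because mkdirs() creates any missing parent directories.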
If you unzipped the distribution in directory $DIR$, then
$DIR$\lib is your working directory. It contains
all the JAR files necessary for the crawler.
Run the BAT script $DIR$\lib\launch.bat.
A sample command line would look like this:
launch http://www.aifb.uni-karlsruhe.de/WBS/daml/homework/assignment1.html 2 60
This crawls the DAML home assignments page
and follows its links up to depth 2 (i.e. pages which the initial
URL refers to directly in RDF or HTML text, and
also the pages those refer to in turn), spending up to 60 seconds on the crawl.
We recommend passing exactly three arguments to launch.bat.
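To make the meaning of the depth and time arguments concrete, here is a minimal sketch of a depth- and time-limited breadth-first traversal. It is not the RDF Crawler's actual implementation: the class DepthLimitedCrawlSketch, the method crawl, and the in-memory links map (which stands in for fetching a page and extracting its links) are all assumptions made for illustration. It is written for a current JDK rather than Java 1.3.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Depth- and time-limited breadth-first traversal over a toy link graph. */
class DepthLimitedCrawlSketch {

    /**
     * Returns the pages found from start within maxDepth link hops,
     * stopping early once timeLimitSeconds have elapsed.
     * The links map is a hypothetical stand-in for downloading a page
     * and extracting its outgoing links.
     */
    static Set<String> crawl(Map<String, List<String>> links, String start,
                             int maxDepth, int timeLimitSeconds) {
        long deadline = System.currentTimeMillis() + timeLimitSeconds * 1000L;
        Set<String> found = new LinkedHashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        Map<String, Integer> depthOf = new HashMap<>();
        found.add(start);
        frontier.add(start);
        depthOf.put(start, 0);
        while (!frontier.isEmpty() && System.currentTimeMillis() < deadline) {
            String page = frontier.remove();
            int depth = depthOf.get(page);
            if (depth == maxDepth) {
                continue; // do not follow links any deeper
            }
            List<String> outgoing = links.get(page);
            if (outgoing == null) {
                continue; // page has no known links
            }
            for (String next : outgoing) {
                if (found.add(next)) { // true only for pages not seen before
                    depthOf.put(next, depth + 1);
                    frontier.add(next);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("start", Arrays.asList("a", "b"));
        links.put("a", Arrays.asList("c"));
        links.put("c", Arrays.asList("d"));
        // Depth 2 reaches a and b (depth 1) and c (depth 2), but not d.
        System.out.println(crawl(links, "start", 2, 60)); // prints [start, a, b, c]
    }
}
```

In this sketch the initial page counts as depth 0, so a depth of 2 covers the pages it links to and the pages those link to.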
After this you get back two interesting files:
c:\temp\crawllog.xml
c:\temp\crawlmodel.rdf
Here are samples of crawllog.xml
and crawlmodel.rdf,
so that you can check whether your results look as they should.
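Before comparing your results with the samples, you can do a quick sanity check that both files exist and parse as XML. The sketch below assumes that crawlmodel.rdf is written in RDF/XML syntax, which this document does not state explicitly. The class CheckCrawlOutput and the method isWellFormedXml are hypothetical names; the parsing uses the standard javax.xml.parsers API, which ships with JDK 1.4 and later.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;

/** Checks that crawler output files exist and are well-formed XML. */
class CheckCrawlOutput {

    /** Returns true if the file exists and parses as well-formed XML. */
    static boolean isWellFormedXml(File file) {
        if (!file.isFile()) {
            return false;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newDocumentBuilder().parse(file);
            return true;
        } catch (Exception e) {
            // Parse errors and I/O errors both mean the check failed.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are the output locations named in this document.
        String[] paths = args.length > 0 ? args
                : new String[] {"c:\\temp\\crawllog.xml", "c:\\temp\\crawlmodel.rdf"};
        for (String path : paths) {
            boolean ok = isWellFormedXml(new File(path));
            System.out.println(path + ": " + (ok ? "well-formed" : "missing or not well-formed"));
        }
    }
}
```

This only checks well-formedness; it does not verify that the content is a valid RDF model or that the log has the expected structure.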
If you want to embed the crawler in another application,
you can create an instance of CrawlConsole
and pass the arguments to its constructor. See the API documentation
for details. A sample Java program that invokes the RDF Crawler:
import java.util.*;
import edu.unika.aifb.rdf.crawler.*;

/**
 * Call this class with 3 arguments: the URL to crawl,
 * the depth, and the time in seconds.
 */
public class SampleCrawl {
    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: java -cp [JARs] SampleCrawl [URL] [depth:int] [time:int]");
            // exit with a non-zero status to signal the usage error
            System.exit(1);
        }
        Vector uris = new Vector();
        uris.add(args[0]);
        // no host filtering - crawl to all hosts
        Vector hostfilter = null;
        /* You may want to do something else to enable host filtering:
         * Vector hostfilter = new Vector();
         * hostfilter.add("http://www.w3.org");
         * ....
         */
        int depth = 2;
        int time = 60;
        try {
            depth = Integer.parseInt(args[1]);
            time = Integer.parseInt(args[2]);
        }
        catch (NumberFormatException e) {
            System.err.println("Illegal argument types:");
            System.err.println("Argument list: URI:String depth:int time(s):int");
            System.exit(1);
        }
        // Initialize crawling parameters
        CrawlConsole c = new CrawlConsole(uris, hostfilter, depth, time);
        // get an ontology file from its local location
        // (OPTIONAL)
        c.setLocalNamespace("http://www.daml.org/2000/10/daml-ont", "c:\\temp\\rdf\\schemas\\daml-ont.rdf");
        // set all the paths to get all the results
        c.setLogPath("c:\\temp\\crawllog.xml");
        c.setCachePath("c:\\temp\\crawlcache.txt");
        c.setModelPath("c:\\temp\\crawlmodel.rdf");
        // crawl and get RDF model
        c.start();
        // This writes all three result files out
        c.writeResults();
    }
}
You might want to try newer versions of these Java APIs from the sources mentioned. However, newer versions are not guaranteed to work correctly with our RDF Crawler.