Specification of an RDF Crawler

Ontology servers and other tools dealing with meta information sometimes need to retrieve facts describing resources on the Web. The current standard for making statements about Web resources is RDF (Resource Description Framework), and a few more standards build on top of RDF, e.g. RDFS and OIL. We may therefore need a utility to download RDF information from all over the Internet. This utility is henceforth called RDF Crawler.

It is a tool which downloads interconnected fragments of RDF from the Internet and builds a knowledge base from this data. At every phase of RDF crawling we maintain a list of URIs to be retrieved as well as URI filtering conditions (e.g. depth, URI syntax), which we observe as we iteratively download resources containing RDF. To enable embedding in other tools, RDF Crawler provides a high-level programmable interface (Java API). The RDF Crawler utility is just a wrapper around this API: a console application, a windows application, or a servlet.

List of contacts

S. Staab
AIFB, rm. 224, phone +49-721-608-4751, staab@aifb.uni-karlsruhe.de
K. Apsitis
AIFB, rm. 239, phone +49-721-608-6038, kaa@aifb.uni-karlsruhe.de
S. Handschuh
AIFB, rm. 251, phone +49-721-608-7363, sha@aifb.uni-karlsruhe.de
H. Oppermann
Ontoprise, phone ..., oppermann@ontoprise.de

RDF Data Extraction

RDF data (or other meta data similar to RDF) may appear in Web documents in several ways. We list some typical cases, assuming that the RDF Crawler will initially cover just a few of them.

Pure RDF

Pure RDF resources on the Web may be distinguished before download: the files have extensions .rdf, .rdfs, .oil (anything else?). RDF has its own MIME type, text/rdf, as mentioned in http://www.mozilla.org/rdf/rdf-nglayout.html (maybe also text/xml). After the download, the RDF API parser (see http://www-db.stanford.edu/~melnik/rdf/api.html; also an older W3C tool: http://www.w3.org/RDF/Implementations/SiRPAC/) processes the RDF file. RDF API knows how to deal with all the RDF syntax details that cannot easily be handled by dumb XML parsing (see http://www.w3.org/TR/REC-rdf-syntax/ for details). It produces triples. How to handle them is discussed in section Storing RDF Data.
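
The pre-download test described above can be sketched as follows. This is only an illustration written against a current JDK: the class PureRdfGuess and its method names are hypothetical (they are not part of RDF API or SiRPAC), and the lists of extensions and MIME types simply repeat the ones named in the text, so they may be incomplete.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of RDF API or SiRPAC.
class PureRdfGuess {
    // Assumption: only the extensions and MIME types named in the text.
    static final Set<String> EXTENSIONS = new HashSet<>(Arrays.asList("rdf", "rdfs", "oil"));
    static final Set<String> MIME_TYPES = new HashSet<>(Arrays.asList("text/rdf", "text/xml"));

    // True if the path part of the URI ends in one of the known extensions.
    static boolean hasRdfExtension(String uri) {
        String path = URI.create(uri).getPath();
        if (path == null) return false;
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot < 0 || dot < slash) return false;   // no extension in the last path segment
        return EXTENSIONS.contains(path.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // True if a Content-Type header value (parameters are ignored) names a known type.
    static boolean hasRdfMimeType(String contentType) {
        if (contentType == null) return false;
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return MIME_TYPES.contains(base);
    }

    public static void main(String[] args) {
        System.out.println(hasRdfExtension("http://www.foo.com/schema.rdfs"));
        System.out.println(hasRdfMimeType("text/rdf; charset=UTF-8"));
    }
}
```

The extension test can run before any network access; the MIME type test needs at least the response headers.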

RDF data usually relies on some RDF Schema or, more generally, on some ontology. These are referenced by namespace identifiers. We require that the RDF Crawler always tries to download these schemas, regardless of depth restrictions.

There is an implementation issue - whether to use the networked RDF API (which is by itself capable of downloading and on-the-fly analysis of RDF documents), or do the networking operations explicitly in RDF Crawler.

RDF embedded in HTML

There are several ways of embedding RDF in an HTML document. The sample below comes from http://www.w3.org/TR/REC-rdf-syntax/#ex-Embedding. See this link for how to write embedded RDF so that the look of the HTML document is not affected.

<html>
<head>
  <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/metadata/dublin_core#">
    <rdf:Description about="">
      <dc:Creator>
    <rdf:Seq ID="CreatorsAlphabeticalBySurname"
      rdf:_1="Mary Andrew"
      rdf:_2="Jacky Crystal"/>
      </dc:Creator>
    </rdf:Description>
  </rdf:RDF>
</head>
<body>
<P>This is a fine document.</P>
</body>
</html>

RDF API can deal with such data as well: it extracts the first RDF portion and ignores everything else. An RDF fragment may also be put inside an HTML comment (see the source of http://www-db.stanford.edu/~melnik/). In other words, we have to cut the RDF fragments out of the document, completely ignoring the syntax of the rest of the document.
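
A minimal sketch of such cutting, assuming the fragments use the conventional rdf prefix. The class RdfFragmentCutter is hypothetical and is not how RDF API or SiRPAC do it; it is a plain regular-expression scan that ignores the surrounding syntax, so it also finds fragments inside HTML comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: cuts <rdf:RDF>...</rdf:RDF> fragments out of any text.
class RdfFragmentCutter {
    // From an opening <rdf:RDF ...> tag to the nearest closing </rdf:RDF>.
    // Assumption: the namespace prefix is literally "rdf".
    private static final Pattern FRAGMENT =
        Pattern.compile("<rdf:RDF[\\s>].*?</rdf:RDF\\s*>", Pattern.DOTALL);

    static List<String> cut(String document) {
        List<String> fragments = new ArrayList<>();
        Matcher m = FRAGMENT.matcher(document);
        while (m.find()) {
            fragments.add(m.group());
        }
        return fragments;
    }

    public static void main(String[] args) {
        String html = "<html><head><!-- <rdf:RDF xmlns:rdf=\"x\"></rdf:RDF> --></head>"
                    + "<body><P>This is a fine document.</P></body></html>";
        System.out.println(cut(html).size() + " fragment(s) found");
    }
}
```

Namespace declarations made outside a fragment are lost by this approach, so it would not be sufficient for the "RDF embedded in XML" case.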

Is there a standard for linking HTML file to an external RDF? We assume that something like this may be used:

<LINK REL="MetaInfo" TYPE="text/rdf" HREF="xfiles.rdf">

Subsection "META, LINK tags in HTML" discusses how we could deal with such externally linked files.

RDF embedded in XML

RDF could be embedded in an arbitrary XML document. It turns out that SiRPAC can deal with this as well. It is namespace aware, i.e. it takes into account namespaces defined outside the RDF portion of the file.

Another question - do we allow RDF to be written in so called "Simplified RDF syntax" (see http://www-db.stanford.edu/~melnik/rdf/syntax.html)? Currently we try to apply RDF API specification to as broad range of data as possible.

HTML hyperlinks

We discuss several suggestions for how to deal with them. Assume that an HTML document x.html looks like this:

<a href="y1.html">Authors</a>;
<a href="y2.html">Copyright</a>

1st suggestion: We can phrase every statement about linking in the same way i.e. just say: "document x.html links to both documents y1.html and y2.html". In RDF syntax it looks like this:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:htmlanchors="http://html.anchors#">
<rdf:Description about="x.html">
      <htmlanchors:href rdf:resource="y1.html"/>
      <htmlanchors:href rdf:resource="y2.html"/>
</rdf:Description>
</rdf:RDF>

2nd suggestion: We can place all the links from some document in a container. For example, if x.html is a table of contents, then such container is very useful.

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:htmlanchors="http://html.anchors#">
<rdf:Description about="x.html">
    <htmlanchors:allhrefs> <rdf:Seq>
      <rdf:li resource="y1.html"/>
      <rdf:li resource="y2.html"/>
  </rdf:Seq> </htmlanchors:allhrefs>
</rdf:Description>
</rdf:RDF>

3rd suggestion: We can make a property which preserves the string literal that names the hyperlink, e.g. "document y1.html is the 'Authors' link for x.html, and y2.html is the 'Copyright' link for x.html". This would be useful if people chose link names according to some standard. Even if they do not, we can use these names for pattern matching.

4th suggestion: We can remember the DOM path to the hyperlink. This way we can infer more, if the syntax of x.html reflects its meaning.
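
The 1st and 3rd suggestions both need the target and the name of every hyperlink. The sketch below shows one way to collect them; it is an illustration only. The class AnchorTriples is hypothetical, the regular expression handles only double-quoted href attributes, and a real implementation would more likely use an HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns <a href="...">name</a> into (document, link name, target).
class AnchorTriples {
    // Assumption: href values are double-quoted and anchors are not nested.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Each result is {document URI, link name, absolute target URI}.
    static List<String[]> extract(String documentUri, String html) {
        URI base = URI.create(documentUri);
        List<String[]> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String target = base.resolve(m.group(1)).toString();      // make relative links absolute
            String name = m.group(2).replaceAll("<[^>]*>", "").trim(); // drop markup inside the link
            links.add(new String[] { documentUri, name, target });
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"y1.html\">Authors</a>;\n<a href=\"y2.html\">Copyright</a>";
        for (String[] link : extract("http://www.foo.com/x.html", html)) {
            System.out.println(link[0] + " --" + link[1] + "--> " + link[2]);
        }
    }
}
```

For the 1st suggestion the link name is simply dropped; for the 3rd it becomes (or selects) the property.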

META, LINK tags in HTML

These are predecessors of RDF for storing metadata in HTML documents. META tag is typically used to specify encoding/language, author(s), keywords, description of a document, which can be later used by tools like search engines.

<META NAME="description" CONTENT="A draft
    specification for an RDF Crawler tool.">
<META NAME="keywords" CONTENT="RDF,crawler,
    metainformation,Java,XML">
<META NAME="robots" CONTENT="NOINDEX,FOLLOW">

In RDF these could be expressed as follows:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:htmlmeta="http://traditional.html.meta.properties#">
  <rdf:Description about="http://www.foo.com/current.html"
    htmlmeta:description="A draft specification..."
    htmlmeta:keywords="RDF,crawler,..."
    htmlmeta:robots="noindex,follow"/>
</rdf:RDF>

On the property htmlmeta:robots, see the discussion in subsection "Respect robots.txt?".

LINK tags are used to define relationships of the given document with other files, e.g. CSS stylesheets and attached JavaScript. One can define LINKs between a document and its "Copyright" document, a document and its "Next", "Previous", "Home", etc. in some document hierarchy.

The REL and REV attributes of the LINK tag define the nature of the relationship between the document and the linked resource. REL defines a link relationship from the current document to the linked resource, while REV defines a relationship in the opposite direction. Suppose that some document with the URL http://www.foo.com/current.html contains these two lines in its HEAD:

<LINK REL="Glossary" HREF="foo.html">
<LINK REV="Subsection" HREF="bar.html">

This indicates that foo.html is a glossary for current.html, while current.html itself is a subsection of bar.html. We could express this in RDF as follows:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:htmllink="http://traditional.html.link.properties#">
  <rdf:Description about="http://www.foo.com/current.html"
    htmllink:glossary="http://www.foo.com/foo.html"/>
  <rdf:Description about="http://www.foo.com/bar.html"
    htmllink:subsection="http://www.foo.com/current.html"/>
</rdf:RDF>
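
The direction rule for REL and REV can be sketched in a few lines. The class LinkTagTriples is hypothetical; the namespace is the illustrative one used in the RDF sample above, and lower-casing the relation name to form the property is an assumption taken from that sample.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: maps one LINK tag to a {subject, property, object} triple.
class LinkTagTriples {
    // Illustrative namespace taken from the sample above.
    static final String NS = "http://traditional.html.link.properties#";

    // isRev == false: REL, the current document is the subject.
    // isRev == true:  REV, the linked resource is the subject.
    static String[] toTriple(String documentUri, boolean isRev, String relation, String href) {
        String linked = URI.create(documentUri).resolve(href).toString();
        String property = NS + relation.toLowerCase(Locale.ROOT);
        return isRev
            ? new String[] { linked, property, documentUri }
            : new String[] { documentUri, property, linked };
    }

    public static void main(String[] args) {
        String current = "http://www.foo.com/current.html";
        System.out.println(String.join(" ", toTriple(current, false, "Glossary", "foo.html")));
        System.out.println(String.join(" ", toTriple(current, true, "Subsection", "bar.html")));
    }
}
```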

Alternatively, one could use Dublin Core properties whenever it is possible to map to these. When we have found a list of subsections of the same document, should we use RDF sequence constructions?

Filtering

Assuming that there is an unlimited amount of interrelated info on the Web (hopefully this will soon hold for RDF data as well), at some point RDF fact gathering by an RDF Crawler should stop. In this section we discuss the criteria for deciding two questions:

  1. Decide which of the retrieved RDF data we must store in our databank.
  2. Decide which resources we still have to download to retrieve more RDF data.

Filtering by depth and/or quantity of data

At the very start of the program and at every subsequent step we maintain a queue of all the URIs we want. We process them in breadth-first fashion, keeping track of those we have already visited. When the search goes too deep, or we have received a sufficient quantity of data (measured as the number of links visited, the total Web traffic, or the amount of RDF data obtained), we may want to throw TrafficLimitException and quit.
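
The queue handling described above can be sketched as follows. This is a sketch under assumptions, not the crawler's actual API: the class BreadthFirstCrawl and the linksOf function (which stands in for "download the resource and extract its URIs") are hypothetical, quantity is measured only as the number of resources visited, and TrafficLimitException is made an unchecked exception for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of breadth-first crawling with depth and quantity limits.
class BreadthFirstCrawl {
    // Exception name taken from the text; unchecked here to keep the sketch short.
    static class TrafficLimitException extends RuntimeException {
        TrafficLimitException(String message) { super(message); }
    }

    // Returns the URIs in the order they were visited.
    static List<String> crawl(Collection<String> seeds, int maxDepth, int maxVisits,
                              Function<String, List<String>> linksOf) {
        Deque<String> queue = new ArrayDeque<>();
        Map<String, Integer> depthOf = new HashMap<>();   // doubles as the "already seen" set
        List<String> visited = new ArrayList<>();
        for (String seed : seeds) {
            if (depthOf.putIfAbsent(seed, 0) == null) queue.add(seed);
        }
        while (!queue.isEmpty()) {
            if (visited.size() >= maxVisits) {
                throw new TrafficLimitException("limit of " + maxVisits + " visits reached");
            }
            String uri = queue.remove();
            visited.add(uri);
            int depth = depthOf.get(uri);
            if (depth >= maxDepth) continue;              // too deep: visit but do not expand
            for (String next : linksOf.apply(uri)) {
                if (depthOf.putIfAbsent(next, depth + 1) == null) queue.add(next);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("d"));
        System.out.println(crawl(Arrays.asList("a"), 1, 10,
            uri -> links.getOrDefault(uri, Collections.<String>emptyList())));
    }
}
```

Limits on total traffic or on the amount of RDF obtained would be checked at the same place as maxVisits.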

Filtering by URI

Specifying a list of domains
When gathering data about a specific ontology we may know beforehand where it is located. In this case we may want to keep only those URIs which belong to a certain domain name, e.g. www.foo.com.
Host and path patterns
Matching the host name (and/or file path) against a certain pattern, e.g. selecting all URIs belonging to the same DNS zone (the same is done by the Java API function javax.servlet.http.Cookie.setDomain(String pattern)).
Specifying an URI prefix
RDF syntax allows a property to be assigned to all the Web resources whose URIs start with a certain prefix. Since dealing with RDF properties could be beneficial for other filtering purposes as well (see subsection "Filtering using RDF facts, F-logic, signatures?"), this is a useful option if we do not want to deal with the general case.
Negative filtering
Download everything except resources on "blacklisted" servers.
Filtering by file extension
We may want only HTML, XML, RDF etc. files; sometimes they can be recognized by file extension.
Whole URI patterns
We could specify a regular expression with "*", etc. to describe all parts of URI - the domain names, ports, paths, extensions, query parameters. Using this methodology we could tell that we do or do not want to deal with certain server-side applications (ASPs, servlets, JSPs, etc.), which also may generate RDF data. In many cases a server-side application can be detected by looking at its URI.
Filtering MIME types of resources
This would mean that we want RDF only from text/html or text/xml or text/rdf (maybe a few more). Of course, this can only be done after we start receiving the file.
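
Several of the URI conditions listed above (domain list, negative filtering, file extension) can be combined in one object. The following is a sketch only: the class UriFilter and its method names are hypothetical, domain matching is a simple suffix test, and domains, hosts and extensions are expected in lower case.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter; an empty "allowed" set means "no restriction".
class UriFilter {
    private final Set<String> allowedDomains = new HashSet<>();
    private final Set<String> blacklistedHosts = new HashSet<>();
    private final Set<String> allowedExtensions = new HashSet<>();

    UriFilter allowDomains(String... domains) { allowedDomains.addAll(Arrays.asList(domains)); return this; }
    UriFilter blacklistHosts(String... hosts) { blacklistedHosts.addAll(Arrays.asList(hosts)); return this; }
    UriFilter allowExtensions(String... exts) { allowedExtensions.addAll(Arrays.asList(exts)); return this; }

    boolean accepts(String uri) {
        URI parsed;
        try {
            parsed = URI.create(uri);
        } catch (IllegalArgumentException e) {
            return false;                                   // not a syntactically valid URI
        }
        String host = parsed.getHost();
        if (host == null) return false;                     // e.g. mailto: or relative URIs
        host = host.toLowerCase(Locale.ROOT);
        if (blacklistedHosts.contains(host)) return false;  // negative filtering
        if (!allowedDomains.isEmpty() && !inAllowedDomain(host)) return false;
        return allowedExtensions.isEmpty()
            || allowedExtensions.contains(extensionOf(parsed.getPath()));
    }

    private boolean inAllowedDomain(String host) {
        for (String domain : allowedDomains) {
            if (host.equals(domain) || host.endsWith("." + domain)) return true;
        }
        return false;
    }

    private static String extensionOf(String path) {
        if (path == null) return "";
        int dot = path.lastIndexOf('.');
        return dot > path.lastIndexOf('/') ? path.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
    }

    public static void main(String[] args) {
        UriFilter filter = new UriFilter().allowDomains("foo.com").allowExtensions("rdf", "html");
        System.out.println(filter.accepts("http://www.foo.com/main.rdf"));
    }
}
```

MIME type filtering does not fit this interface, since it can only be applied once the response has started to arrive.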

As we mentioned at the start of the section, filtering the URIs in all these cases may affect the behaviour of the RDF Crawler in two different ways. We may want to avoid storing in a database those triples whose subject or object has an "ignorable" URI. Another possibility is to store all the retrieved RDF data, but not to access ignorable URIs over the network. Or we may do both.

Respect robots.txt?

As we surf the web with the RDF Crawler, which is just another species of spider (robot), we may encounter the same problems which are typical for indexing engines. In every domain there can be a file robots.txt which lists the resources which are disallowed for robot access; HTML files may also contain META tags which instruct robots to do certain things. See http://www.kollar.com/robots.html for more information on the topic. Although this is mainly used to guide robots from search engines and so affect the listings of a particular site, it may happen that the information is relevant for any robot whatsoever. So this should be checked before we proceed.
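
A simplified sketch of such a check is given below. It assumes the common robots.txt layout (User-agent lines followed by Disallow lines with path prefixes) and ignores everything else; the class RobotsTxt and the agent name "rdfcrawler" are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt reader: only User-agent and Disallow are understood.
class RobotsTxt {
    // Collects the Disallow prefixes from records that apply to the given agent or to "*".
    static List<String> disallowedPrefixes(String robotsTxt, String agent) {
        String agentLower = agent.toLowerCase(Locale.ROOT);
        List<String> prefixes = new ArrayList<>();
        boolean applies = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*$", "").trim();   // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean match = value.equals("*")
                    || (!value.isEmpty() && agentLower.contains(value.toLowerCase(Locale.ROOT)));
                // consecutive User-agent lines belong to the same record
                applies = lastWasAgent ? (applies || match) : match;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    static boolean allowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String txt = "User-agent: *\nDisallow: /private/\n";
        List<String> prefixes = disallowedPrefixes(txt, "rdfcrawler");
        System.out.println(allowed("/private/secret.rdf", prefixes));
    }
}
```

The META robots instructions (e.g. NOINDEX, FOLLOW) would have to be handled separately, after the HTML file has been fetched.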

Filtering using RDF facts, F-logic, signatures?

Since the crawler is interested only in RDF information, it may want to know beforehand whether the document it is going to download will contain more RDF. If the databank is being built for some specific purpose, we may need to be even more selective about downloading something. Both things can probably be decided by querying the current databank of RDF facts.

This would mean iterative crawling: start from some URI, learn RDF facts, query those facts to decide which places to go to next, learn more facts, and so on. For example, consider the following crawling command: "Go to the URL http://www.foo.com/main.rdf, fetch it, and further fetch only those RDF resources which are signed by some trusted entity."

Storing RDF Data

User may want to specify the target of RDF data. These options seem natural:

Print Triples

http://www.w3.org/RDF/Implementations/SiRPAC/ shows that RDF data can be parsed and the following triples (syntax: property-subject-value) can be printed:

triple('http://description.org/schema/Creator',
       'http://www.w3.org/Home/Lassila',
       'Ora Lassila').
triple('http://description.org/schema/Creator',
       'http://www.w3.org/Home/puuh',
       'Lola Puuh').

This we want to do in the initial phase of project testing.
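
Printing in this form takes only a few lines. The sketch below reproduces the layout of the sample above; the class TriplePrinter is hypothetical, and the escaping of quotes and backslashes is an assumption (it is not claimed to match what SiRPAC prints).

```java
// Hypothetical helper: prints a triple in the layout shown above.
class TriplePrinter {
    // Assumption: single quotes and backslashes inside a value are backslash-escaped.
    static String quote(String s) {
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    // Argument order follows the sample: property, subject, value.
    static String format(String property, String subject, String value) {
        return "triple(" + quote(property) + ",\n"
             + "       " + quote(subject) + ",\n"
             + "       " + quote(value) + ").";
    }

    public static void main(String[] args) {
        System.out.println(format("http://description.org/schema/Creator",
                                  "http://www.w3.org/Home/Lassila",
                                  "Ora Lassila"));
    }
}
```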

Store the Triples in a Database

Triples can be stored in a generic database using simple SQL statements.

Consolidate RDF data

It would make sense for a Web server to store all its meta data in one well-defined location, so that spiders can easily get that information without having to sift through all the documents on that server. To support this, we allow the whole gathered RDF data to be written back into a single RDF file. This should be included as a separate consolidation option of the RDF Crawler, since in this case the main task is not building a database for reasoning, but rather gathering data from a specific location on a specific ontology.

RDFDB

Some database systems may be more appropriate for storing RDF data. RDFdb is proposed by R.V. Guha in http://web1.guha.com/rdfdb/ (downside: access is from C and Perl); better ones may exist (York Sure has heard about a Greek project on this).

Feed SiLRI with the Triples

RDF information is primarily used by inference agents, which draw conclusions from the RDF data or answer queries. Though SiLRI can consume raw RDF, making triples or F-logic statements may suit it better.

Implementation issues

In general, implementation should use reasonably efficient algorithms, which are still easy to understand and to code. Implementation is done completely in standard Java (JDK1.3), since it is easier to develop and maintain compared to platform-dependent languages like C++, it is sufficiently fast and has many facilities for programming threads.

Threads

Data retrieval from the Internet has unpredictable success rates and response times. At every stage of running the algorithm we maintain the list of all the resources we want to get and we may retrieve them in parallel, if the user of the application so wishes.

To improve response times we may also want to return some results before everything we need is downloaded. So the API may need iterator functions. To make coding easier, we could have the whole RDF retrieval, analysis and storage process done by the same thread; this thread may start new threads as necessary.
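
Parallel retrieval with early delivery of results can be sketched with the executor classes of a current JDK (these did not exist in JDK 1.3, so this illustrates the idea rather than the planned implementation). The class ParallelFetch, the fetch function and the onResult callback are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical sketch: download in parallel, hand over each result as soon as it is ready.
class ParallelFetch {
    // Returns the number of successful downloads; failed ones are skipped.
    static int fetchAll(List<String> uris, int threads,
                        Function<String, String> fetch,
                        BiConsumer<String, String> onResult) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<String[]> completed = new ExecutorCompletionService<>(pool);
        int succeeded = 0;
        try {
            for (String uri : uris) {
                completed.submit(() -> new String[] { uri, fetch.apply(uri) });
            }
            for (int i = 0; i < uris.size(); i++) {
                try {
                    String[] result = completed.take().get();   // next finished download
                    onResult.accept(result[0], result[1]);      // called on the caller's thread
                    succeeded++;
                } catch (ExecutionException e) {
                    // this download failed; the others are unaffected
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();                 // stop waiting, keep the flag
        } finally {
            pool.shutdownNow();
        }
        return succeeded;
    }

    public static void main(String[] args) {
        int n = fetchAll(Arrays.asList("URL1", "URL2"), 2,
            uri -> "content of " + uri,                         // stand-in for a real download
            (uri, body) -> System.out.println(uri + ": " + body));
        System.out.println(n + " resource(s) fetched");
    }
}
```

Because results arrive in completion order, a slow server delays only its own resource, not the ones queued behind it.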

There are two options for RDF retrieval: either store it in a repository for later access, or analyze it on-the-fly using a SAX parser.

Development and Runtime Environment

Java version 1.3 (Standard Edition) from Sun Microsystems is used to compile the source code and to run the project. Coding is done in a typical text editor like Notepad, while compilation and running use a DOS command session. For testing purposes we also need a local Web server which can reliably provide files containing RDF.

The Modules of the RDF Crawler

RDF Crawler is a stand-alone application, which is given URIs and builds an RDF database from them (or increments an existing database).

Making it a Client-Server application would be more difficult. This would mean that a client can ask to crawl to certain RDFs (or make an F-logic query) and either get back the results (like using a search-engine), or have the results cached on the server (in an RDF database) and then query the RDF database.

To make coding easier and the result more reusable, we suggest NOT to make the RDF Crawler as one big chunk of code, but to structure it as a pipeline (more precisely, a self-feeding loop) of producer-consumer processes:

+---->+<-- [input from outside]
|     |
|  URI requests with filtering conditions
|     |
|  Multithreaded download component
|     |
|  Pure RDF data or HTML with embedded RDF
|     |
|  SiRPAC (or similar) - RDF parsing component --> [Storing RDF data]
|     |
|  Stream of triples
|     |
|  Extract nonrepeating URIs
|     |
|  Stream of URIs
|     |
|  URI Filtering; forming new                  
|    URI requests with new filtering conditions
|     |
+<----+

We place all the RDF Crawler classes in some package, say, edu.unika.aifb.rdfcrawler. Initially, when we do not filter anything and write to the standard output, it could be possible to write a DOS command-line expression, e.g.

java edu.unika.aifb.rdfcrawler.CommandLine -Ddepth=5 URL1 URL2 ...

This would lead to printing "triple" predicates in the following form:

triple('http://description.org/schema/Creator',
       'http://www.w3.org/Home/Lassila',
       'Ora Lassila').
...

Are we interested in where these triples come from? Say, if the RDFs from various sources have various levels of trust, etc., we might want to add all the statements obtained from a single source to some collection, state in RDF form that the source of each of these statements is such-and-such, and thus obtain some more useful statements about statements.

To make the crawler embeddable into other tools, we want to specify an API which can be called from other code.

API Specification

Sorry, not yet done.

Phases of the Project

  1. Processing a simple, clean, locally available RDF file, getting acquainted with RDF API, writing out RDF triples to standard output.
  2. Development of broader set of test data for the RDF Crawler. Looking at articles on Ontobroker and other literature to find interesting samples of RDF data. Storing them on a local Apache Web server.
  3. Extracting URIs and doing unfiltered multidocument crawl (with or without multiple threads).
  4. Understanding SiLRI and/or RDFDB, understanding what the crawler is for. Storing the triples in an appropriate database or data structure required by SiLRI.
  5. Adding multiple embedded fragments of RDF to a single HTML and adding Strawman Syntax for XML. Running program on this data as well.
  6. Processing document relations given by HTML hyperlinks
  7. Processing document relations given by HTML META and LINK tags.
  8. Retrieving and processing a set of connected RDF documents with filtering ability.

Possible Applications

Along with the RDF test data for our utility, we might want to formulate some benchmark problems which could test the possibilities of sensible use of RDF Crawling. Some suggestions follow:

Computing Estimates of Paul Erdös Numbers

Most researchers R who publish scientific papers with coauthors have an Erdös number (denoted PE(R)). These numbers are defined recursively:

  1. PE(R)=0, if R[Name->"Paul Erdös"]
  2. PE(R)=Min{PE(R1)+1}, where R1 changes over the set of all researchers who have collaborated with R, i.e. written a scientific paper together.
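
Once the author lists of the papers are known, the recursive definition above amounts to a breadth-first search over the "has written a paper together" relation. A minimal sketch follows; the class ErdosNumbers is hypothetical, each paper is given simply as its list of author names, and researchers not connected to the root get no number.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: PE(R) for every researcher reachable from the root.
class ErdosNumbers {
    static Map<String, Integer> compute(List<List<String>> papers, String root) {
        // Build the symmetric collaboration relation from the author lists.
        Map<String, Set<String>> coauthors = new HashMap<>();
        for (List<String> authors : papers) {
            for (String a : authors) {
                for (String b : authors) {
                    if (!a.equals(b)) {
                        coauthors.computeIfAbsent(a, k -> new HashSet<>()).add(b);
                    }
                }
            }
        }
        // Breadth-first search: the first time R1 is reached gives the minimum PE(R)+1.
        Map<String, Integer> pe = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        pe.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String r = queue.remove();
            for (String r1 : coauthors.getOrDefault(r, Collections.<String>emptySet())) {
                if (!pe.containsKey(r1)) {
                    pe.put(r1, pe.get(r) + 1);
                    queue.add(r1);
                }
            }
        }
        return pe;
    }

    public static void main(String[] args) {
        List<List<String>> papers = Arrays.asList(
            Arrays.asList("Paul Erd\u00f6s", "A"),
            Arrays.asList("A", "B"));
        System.out.println(compute(papers, "Paul Erd\u00f6s"));
    }
}
```

With crawled data the result is only an estimate, since it covers just the papers the crawler has seen.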

To make this practical, we need a (possibly distributed) set of HTML documents representing scientific papers (i.e. each HTML document may have an RDF property indicating that it is a scientific paper); these HTML documents contain an RDF property giving the list of authors, and they link to other scientific documents (say, in their "Bibliography" section) which are annotated likewise. Equivalently, we can make the same query against the databank, using the symmetric property "cooperatesWith" which is defined in an OIL sample .../swrc.oil.

Security Application

Suppose that we have a list of public keys of trusted entities (people and/or institutions) and we want to download into our databank all the RDF data we can find, provided that these models are signed by one of the trusted entities.

Ontology Domain - Java MSDN

Microsoft programmers have access to MSDN help - it contains documentation of languages from various Microsoft products, installation instructions, tutorials, sample programs, analytical articles, exercises, etc. These materials are easy to search and use, since they are made by the same organization.

With the Java programming environment the situation is somewhat less organized. One could make an MSDN-like ontology for Java resources on the Web, and then annotate some portion of the resources using this ontology, placing the respective RDFs in various places on the Web. This would also require an annotation tool, albeit a coarse-grained one; it would annotate whole web pages rather than small text items within a page.

The ontology concepts for this could be: "Java topic", "Java (tutorial) article", "Java article subsection", "Author" (for programs and articles), "Organization" (for proprietary packages), "Java package source", "Java package as JAR", "Java package documentation" (javadoc generated or designed by hand), "Java Development Tool", "Java sample program source", "Java sample program running instruction" (including JARs one needs), "Java programming exercise" (solution would be JavaSampleProgramSource or JavaPackageSource), "Java Quiz/MultipleChoiceTest Question".

Open Questions

  1. Are relative URLs allowed in RDF?
  2. What to do about default files in Web directories? For example, if we analyze the URI http://www-db.stanford.edu/~melnik/index.html we get back data about resources like http://www-db.stanford.edu/~melnik/index.html#genid15; on the other hand, if we analyze http://www-db.stanford.edu/~melnik/, we get back http://www-db.stanford.edu/~melnik/genid15. Is normalization of URIs in this sense necessary?
  3. Any difference between URI http://aifbceto/rdfcrawl/rdfsamples/test1.rdf and http://aifbceto/rdfcrawl/rdfsamples/test1.rdf#?
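
Whatever the answers turn out to be, the purely syntactic part of URI handling (resolving a relative reference against the document URI, removing "." and ".." segments, dropping an empty fragment) can be done with java.net.URI. The sketch below is an illustration only: the class UriNormalizer is hypothetical, and treating a trailing "#" as insignificant is an assumption that question 3 leaves open. It does not decide whether a directory URI and its index.html should be treated as the same resource.

```java
import java.net.URI;

// Hypothetical helper for syntactic URI normalization.
class UriNormalizer {
    // Resolves the reference against the base and normalizes the result.
    static String normalize(String base, String reference) {
        String resolved = URI.create(base).resolve(reference).normalize().toString();
        // Assumption: an empty fragment ("...#") identifies the same resource as no fragment.
        if (resolved.endsWith("#")) {
            resolved = resolved.substring(0, resolved.length() - 1);
        }
        return resolved;
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www-db.stanford.edu/~melnik/index.html", "#genid15"));
        System.out.println(normalize("http://aifbceto/rdfcrawl/rdfsamples/", "test1.rdf#"));
    }
}
```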

Author: Kalvis Apsitis at AIFB; Last modified on