Ontology servers and other tools dealing with meta information sometimes need to retrieve facts describing resources on the Web. The current standard for making statements about Web resources is RDF (Resource Description Framework), and a few more standards build on top of RDF, e.g. RDFS and OIL. Therefore we may need a utility to download RDF information from all over the Internet. This utility will henceforth be called RDF Crawler.
It is a tool which downloads interconnected fragments of RDF from the Internet and builds a knowledge base from this data. At every phase of RDF crawling we maintain a list of URIs to be retrieved as well as URI filtering conditions (e.g. depth, URI syntax), which we observe as we iteratively download resources containing RDF. To enable embedding in other tools, RDF Crawler provides a high-level programmable interface (Java API). The RDF Crawler utility itself is just a wrapper around this API: a console application, a windows application, or a servlet.
staab@aifb.uni-karlsruhe.de, kaa@aifb.uni-karlsruhe.de, sha@aifb.uni-karlsruhe.de, oppermann@ontoprise.de
RDF data (or other meta data similar to RDF) may appear in Web documents in several ways. We list some typical cases, assuming that the RDF Crawler will initially cover just a few of them.
Pure RDF resources on the Web may be distinguished before
download - the files have extensions .rdf,
.rdfs, .oil (anything else?).
RDF has its own MIME type: text/rdf,
as mentioned in
http://www.mozilla.org/rdf/rdf-nglayout.html
(maybe also text/xml).
After the download, an RDF API parser
(see http://www-db.stanford.edu/~melnik/rdf/api.html;
also an older W3C tool: http://www.w3.org/RDF/Implementations/SiRPAC/)
processes the RDF file.
RDF API knows how to deal with all the RDF syntax details,
which cannot easily be handled by plain XML parsing (see
http://www.w3.org/TR/REC-rdf-syntax/
for details). It produces triples. How to handle them is
discussed in section Storing RDF Data.
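The two checks described above (file extension before the download, MIME type once the response headers arrive) could be sketched as follows. This is only an illustration: the class `RdfResourceGuess` and its method names are hypothetical, and the extension and MIME type lists simply repeat the candidates named in the text, which are themselves still open questions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: guesses whether a resource is "pure RDF". */
class RdfResourceGuess {
    // Candidate extensions and MIME types named in the text (assumed, not final).
    static final List<String> EXTENSIONS = Arrays.asList(".rdf", ".rdfs", ".oil");
    static final List<String> MIME_TYPES = Arrays.asList("text/rdf", "text/xml");

    /** Before the download: decide from the URI alone, ignoring fragment and query. */
    static boolean hasRdfExtension(String uri) {
        String path = uri;
        int cut = path.indexOf('#');
        if (cut >= 0) path = path.substring(0, cut);
        cut = path.indexOf('?');
        if (cut >= 0) path = path.substring(0, cut);
        String lower = path.toLowerCase(Locale.ROOT);
        for (String ext : EXTENSIONS) {
            if (lower.endsWith(ext)) return true;
        }
        return false;
    }

    /** After the headers arrive: decide from the Content-Type value. */
    static boolean hasRdfMimeType(String contentType) {
        if (contentType == null) return false;
        // Drop parameters such as "; charset=UTF-8".
        int semi = contentType.indexOf(';');
        String type = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim().toLowerCase(Locale.ROOT);
        return MIME_TYPES.contains(type);
    }
}
```

Since text/xml also covers XML documents without any RDF, a positive MIME type check can only be a hint that the document is worth passing to the parser.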
RDF data usually relies on some kind of RDF Schema or, more generally, some ontology. These are referenced by namespace identifiers. We require that the RDF Crawler tool always tries to download these schemas regardless of depth restrictions.
There is an implementation issue - whether to use the networked RDF API (which is by itself capable of downloading and on-the-fly analysis of RDF documents), or do the networking operations explicitly in RDF Crawler.
There are several ways of embedding RDF in an HTML document. This sample comes from http://www.w3.org/TR/REC-rdf-syntax/#ex-Embedding. See this link for how to write embedded RDF so that the look of the HTML document is not affected.
<html>
<head>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/metadata/dublin_core#">
<rdf:Description about="">
<dc:Creator>
<rdf:Seq ID="CreatorsAlphabeticalBySurname"
rdf:_1="Mary Andrew"
rdf:_2="Jacky Crystal"/>
</dc:Creator>
</rdf:Description>
</rdf:RDF>
</head>
<body>
<P>This is a fine document.</P>
</body>
</html>
RDF API can deal with such data as well: it extracts the first RDF portion and ignores everything else. An RDF fragment may also be put inside an HTML comment (see the source of http://www-db.stanford.edu/~melnik/). In other words, we have to cut out the RDF fragments from the rest of the document, completely ignoring the syntax of the rest of the document.
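Cutting out RDF fragments while ignoring the surrounding syntax could look roughly like the sketch below. It is a plain text scan and not the method used by RDF API or SiRPAC. The class name `RdfFragmentCutter` is hypothetical, and the sketch assumes that the RDF namespace is bound to the literal prefix `rdf`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts RDF fragments out of arbitrary text. */
class RdfFragmentCutter {
    // Assumption: the RDF namespace uses the conventional prefix "rdf".
    static final String OPEN = "<rdf:RDF";
    static final String CLOSE = "</rdf:RDF>";

    /** Returns every "<rdf:RDF ... </rdf:RDF>" span, in document order. */
    static List<String> cut(String document) {
        List<String> fragments = new ArrayList<String>();
        int from = 0;
        while (true) {
            int start = document.indexOf(OPEN, from);
            if (start < 0) break;
            int end = document.indexOf(CLOSE, start);
            if (end < 0) break; // unterminated fragment: stop scanning
            end += CLOSE.length();
            fragments.add(document.substring(start, end));
            from = end;
        }
        return fragments;
    }
}
```

Because the rest of the document is never parsed, this also finds fragments inside HTML comments. The price is that namespace declarations made outside the fragment are lost, so such a scan would not be enough for RDF embedded in arbitrary XML.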
Is there a standard for linking HTML file to an external RDF? We assume that something like this may be used:
<LINK REL="MetaInfo" TYPE="text/rdf" HREF="xfiles.rdf">
Subsection META, LINK tags in HTML describes how we could deal with such externally linked files.
RDF could be embedded in an arbitrary XML document. It turns out that SiRPAC can deal with this as well. It is namespace aware, i.e. it takes into account namespaces defined outside the RDF portion of the file.
Another question: do we allow RDF to be written in the so-called "Simplified RDF syntax" (see http://www-db.stanford.edu/~melnik/rdf/syntax.html)? Currently we try to apply the RDF API specification to as broad a range of data as possible.
We discuss several suggestions for how to deal with ordinary HTML hyperlinks.
Assume that an HTML document x.html looks like
this:
<a href="y1.html">Authors</a>; <a href="y2.html">Copyright</a>
1st suggestion: We can phrase every statement about linking
in the same way i.e. just say:
"document x.html links to both documents y1.html
and y2.html". In RDF syntax it looks like this:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:htmlanchors="http://html.anchors#">
<rdf:Description about="x.html">
<htmlanchors:href rdf:resource="y1.html"/>
<htmlanchors:href rdf:resource="y2.html"/>
</rdf:Description>
</rdf:RDF>
2nd suggestion: We can place all the
links from some document in a container.
For example, if x.html is a table of contents,
then such container is very useful.
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:htmlanchors="http://html.anchors#">
<rdf:Description about="x.html">
<htmlanchors:allhrefs> <rdf:Seq>
<rdf:li resource="y1.html"/>
<rdf:li resource="y2.html"/>
</rdf:Seq> </htmlanchors:allhrefs>
</rdf:Description>
</rdf:RDF>
3rd suggestion: We can make a property, which preserves the
string literal which is the name of the hyperlink. E.g.
"document y1.html is the 'Authors' link for
x.html, and y2.html is the 'Copyright' link for x.html".
This would be useful if people chose link names
according to some standard. Even if they do not, we can use
these names for pattern matching.
4th suggestion: We can remember the DOM path to the
hyperlink. This way we can infer more, if the syntax of x.html
reflects its meaning.
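All of the suggestions above start from the same raw material: the target and the name of each hyperlink. A minimal sketch of collecting these pairs is given below. The class `AnchorExtractor` is hypothetical, and the regular expression is a deliberate simplification (double-quoted href values, plain text inside the anchor). A real implementation would more likely use an HTML parser, which is also what the 4th suggestion would need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects (href, link name) pairs from HTML text. */
class AnchorExtractor {
    // Simplified pattern: double-quoted href, no nested tags inside the anchor.
    static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*?href=\"([^\"]*)\"[^>]*>([^<]*)</a>", Pattern.CASE_INSENSITIVE);

    /** Each returned array holds the href at index 0 and the link name at index 1. */
    static List<String[]> extract(String html) {
        List<String[]> links = new ArrayList<String[]>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            links.add(new String[] {m.group(1), m.group(2).trim()});
        }
        return links;
    }
}
```

For the sample line of x.html this yields the pairs (y1.html, Authors) and (y2.html, Copyright), from which the statements of the 1st, 2nd or 3rd suggestion can be formed.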
These are predecessors of RDF for storing metadata in HTML documents. META tag is typically used to specify encoding/language, author(s), keywords, description of a document, which can be later used by tools like search engines.
<META NAME="description" CONTENT="A draft
specification for an RDF Crawler tool.">
<META NAME="keywords" CONTENT="RDF,crawler,
metainformation,Java,XML">
<META NAME="robots" CONTENT="NOINDEX,FOLLOW">
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:htmlmeta="http://traditional.html.meta.properties#">
<rdf:Description about="http://www.foo.com/current.html"
htmlmeta:description="A draft specification..."
htmlmeta:keywords="RDF,crawler,..."
htmlmeta:robots="noindex,follow"/>
</rdf:RDF>
On the property htmlmeta:robots
see discussion in subsection Respect
robots.txt?
LINK tags are used to define relationships of the given document with other files, e.g. CSS stylesheets and attached JavaScript. One can define LINKs between a document and its "Copyright" document, a document and its "Next", "Previous", "Home", etc. in some document hierarchy.
The REL and REV attributes of LINK tag define the nature of the relationship
between the documents and the linked resource. REL defines a link relationship
from the current document to the linked resource while
REV defines a relationship in the opposite direction. Suppose that
some document with a URL http://www.foo.com/current.html
contains these two lines in its HEAD:
<LINK REL="Glossary" HREF="foo.html">
<LINK REV="Subsection" HREF="bar.html">
This indicates that foo.html is a glossary for current.html, while
current.html itself is a subsection of bar.html.
We could express this in RDF as follows:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:htmllink="http://traditional.html.link.properties#">
<rdf:Description about="http://www.foo.com/current.html"
htmllink:glossary="http://www.foo.com/foo.html"/>
<rdf:Description about="http://www.foo.com/bar.html"
htmllink:subsection="http://www.foo.com/current.html"/>
</rdf:RDF>
Alternatively, one could also use Dublin Core properties whenever it is possible to map to these. When we have found a list of subsections of the same document, should we use RDF sequence constructions?
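The REL/REV mapping described in this subsection could be sketched in Java as follows. The class `LinkTagTriples` and its method are hypothetical. The namespace is the one used in the sample above, and lower-casing the relation name is an assumption made to match that sample. Triples are returned in the property-subject-value order used elsewhere in this document.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper: turns one LINK tag into a (property, subject, value) triple. */
class LinkTagTriples {
    // Namespace taken from the RDF sample above.
    static final String NS = "http://traditional.html.link.properties#";

    static String[] toTriple(String documentUri, boolean isRev, String relation, String href) {
        // Resolve a relative HREF against the URI of the containing document.
        String target = URI.create(documentUri).resolve(href).toString();
        String property = NS + relation.toLowerCase(Locale.ROOT);
        // REL: the current document is the subject; REV: the linked resource is.
        return isRev
                ? new String[] {property, target, documentUri}
                : new String[] {property, documentUri, target};
    }
}
```

With the two LINK lines from current.html, the REL case yields a glossary statement about current.html, and the REV case yields a subsection statement about bar.html, in agreement with the RDF shown above.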
Assuming that there is an unlimited amount of interrelated info on the Web (hopefully this will soon hold for RDF data as well), RDF fact gathering by an RDF Crawler should stop at some point. In this section we discuss the criteria to decide two questions:
At the very start of the program and at every subsequent
step we maintain a queue of all the URIs we want.
We process them in breadth-first-search fashion, keeping
track of those we have already visited. When the search
goes too deep, or we have received a sufficient quantity
of data (measured as the number of links visited, the total
Web traffic, or the amount of RDF data obtained), we
may want to throw TrafficLimitException and quit.
www.foo.com.javax.servlet.http.Cookie.setDomain(String pattern)),
text/html or text/xml
or text/rdf (maybe a few more).
Certainly, this can be only done after we start receiving
the file.
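The queue-based, breadth-first traversal with a depth limit and a traffic limit could be sketched as follows. To keep the example self-contained, downloads are replaced by a map from each URI to the URIs it refers to. The class `BreadthFirstCrawl`, the parameter names and the choice of "number of visited URIs" as the traffic measure are assumptions. `TrafficLimitException` is the name used in the text; making it an unchecked exception is only a shortcut for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Unchecked here only to keep the sketch short. */
class TrafficLimitException extends RuntimeException {
    TrafficLimitException(String message) { super(message); }
}

class BreadthFirstCrawl {
    /**
     * Visits URIs breadth-first, starting from the seeds (depth 0).
     * The links map stands in for real downloads: URI -> URIs found in it.
     */
    static List<String> crawl(List<String> seeds, Map<String, List<String>> links,
                              int maxDepth, int maxVisits) {
        Map<String, Integer> depthOf = new HashMap<String, Integer>(); // doubles as visited set
        Queue<String> queue = new ArrayDeque<String>();
        List<String> visited = new ArrayList<String>();
        for (String seed : seeds) {
            if (!depthOf.containsKey(seed)) {
                depthOf.put(seed, 0);
                queue.add(seed);
            }
        }
        while (!queue.isEmpty()) {
            String uri = queue.remove();
            if (visited.size() >= maxVisits) {
                throw new TrafficLimitException("visit limit " + maxVisits + " reached");
            }
            visited.add(uri);
            int depth = depthOf.get(uri);
            if (depth >= maxDepth) continue; // too deep: do not follow its links
            List<String> outgoing = links.get(uri);
            if (outgoing == null) continue;
            for (String next : outgoing) {
                if (!depthOf.containsKey(next)) {
                    depthOf.put(next, depth + 1);
                    queue.add(next);
                }
            }
        }
        return visited;
    }
}
```

The depth limit silently prunes the search, while the traffic limit aborts it. Whether the data gathered before the exception should be kept is a separate design decision that this sketch does not address.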
As we mentioned at the start of the section, filtering the URIs in all these cases may affect the behaviour of the RDF Crawler in two different ways. We may want to avoid storing in a database those triples whose subject or object has an "ignorable" URI. Another possibility is to store the whole retrieved RDF data, but not to access ignorable URIs over the network. Or we may do both.
Respect robots.txt?
As we surf the web with the RDF Crawler, which
is just another species of spider (robot), we may encounter
the same problems which are typical for indexing engines.
In every domain there can be a file robots.txt
which lists the resources which are disallowed for robot access;
HTML files may also contain META tags which
instruct robots to do certain things.
See http://www.kollar.com/robots.html
for more information on the topic. Although this
is mainly used to guide robots from search engines and so to affect
the listings of a particular site,
the information may be relevant for any robot
whatsoever. So this should be checked before we proceed.
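Checking robots.txt before a download could be sketched as follows. This is a deliberately reduced reading of the file format: only the records addressed to all robots ("User-agent: *") and their Disallow prefixes are honoured, and blank lines are taken as record separators. The class `RobotsTxt` is hypothetical, and a real crawler would also have to look for records naming the crawler itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: a reduced robots.txt reader for "User-agent: *" records. */
class RobotsTxt {
    private final List<String> disallowed = new ArrayList<String>();

    RobotsTxt(String content) {
        boolean applies = false;    // does the current record address all robots?
        boolean inAgentRun = false; // are we inside consecutive User-agent lines?
        for (String rawLine : content.split("\r?\n")) {
            if (rawLine.trim().isEmpty()) { // blank line: record separator
                inAgentRun = false;
                continue;
            }
            String line = rawLine;
            int hash = line.indexOf('#'); // strip comments
            if (hash >= 0) line = line.substring(0, hash);
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean star = value.equals("*");
                applies = inAgentRun ? (applies || star) : star;
                inAgentRun = true;
            } else {
                inAgentRun = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
    }

    /** True if no collected Disallow prefix matches the given path. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

The crawler would fetch robots.txt once per host, build such an object, and consult `isAllowed` for every path on that host before requesting it.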
Since the crawler is interested only in RDF information, it may want to know beforehand whether the document it is going to download will contain more RDF. If the databank is being built for some specific purpose, we may need to be even more selective about downloading something. Both things can probably be decided by querying the current databank of RDF facts.
This would mean an iterative
crawling - start from some URI, learn RDF facts,
query those facts to decide which places to go to next,
learn more facts, and so on.
For example, consider the following crawling command:
"Go to the URL http://www.foo.com/main.rdf,
fetch it, and further fetch only those RDF resources which
are signed by some trusted entity."
The user may want to specify the target of RDF data. These options seem natural:
http://www.w3.org/RDF/Implementations/SiRPAC/ shows that RDF data can be parsed and the following triples (syntax: property-subject-value) can be printed:
triple('http://description.org/schema/Creator',
'http://www.w3.org/Home/Lassila',
'Ora Lassila').
triple('http://description.org/schema/Creator',
'http://www.w3.org/Home/puuh',
'Lola Puuh').
We want to do this in the initial phase of project testing.
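Printing triples in the form shown above could be sketched as follows. The class `TripleFormat` is hypothetical. The output is put on a single line for simplicity, and escaping quotes and backslashes with a backslash is an assumption about what the consumer of these facts accepts.

```java
/** Hypothetical helper: formats one triple as a "triple(...)." fact. */
class TripleFormat {
    /** Wraps a term in single quotes; the escaping convention is an assumption. */
    static String quote(String term) {
        return "'" + term.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** Argument order follows the text: property, subject, value. */
    static String format(String property, String subject, String value) {
        return "triple(" + quote(property) + ", " + quote(subject) + ", "
                + quote(value) + ").";
    }

    public static void main(String[] args) {
        System.out.println(format("http://description.org/schema/Creator",
                "http://www.w3.org/Home/Lassila", "Ora Lassila"));
    }
}
```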
Triples can be stored in a generic database using simple SQL statements.
It would make much sense for a Web server to store all its meta data in one well-defined location, so that spiders can easily get that information without having to filter all the documents on that server. In order to do this, we allow the whole gathered RDF data to be written back to a single RDF file. This should be included as a separate consolidation option with the RDF crawler, since in this case the main task is not building a data base for reasoning, but rather gathering data from a specific location on a specific ontology.
Some database systems may be more appropriate for storing RDF data. RDFdb is proposed by R.V.Guha in http://web1.guha.com/rdfdb/ (downside: access from C and Perl); better ones may exist (we heard from York Sure about a Greek project on this).
RDF information primarily is used by some inference agents which make some conclusions from the RDF data or answer queries. Though SiLRI can consume raw RDF, making triples or F-logic statements may suit it better.
In general, implementation should use reasonably efficient algorithms, which are still easy to understand and to code. Implementation is done completely in standard Java (JDK1.3), since it is easier to develop and maintain compared to platform-dependent languages like C++, it is sufficiently fast and has many facilities for programming threads.
Data retrieval from the Internet has unpredictable success rates and response times. At every stage of running the algorithm we maintain the list of all the resources we want to get and we may retrieve them in parallel, if the user of the application so wishes.
To improve response times we may also want to return some results before everything we need is downloaded. So the API may need iterator functions. To make coding easier, we could have the whole RDF retrieval, analysis and storage process done by the same thread; this thread may start new threads as is necessary.
There are two options for handling retrieved RDF: either store it in a repository for later access or analyze it on-the-fly using a SAX parser.
Java 2 Standard Edition, version 1.3, from Sun Microsystems is used to compile the source code and to run the project. Coding is done in a typical text editor like Notepad, while compilation and running use a DOS command session. For testing purposes we also need a local Web server which can reliably provide files containing RDF.
RDF Crawler is a stand-alone application, which is given URIs and builds an RDF database from them (or extends an existing database).
Making it a Client-Server application would be more difficult. This would mean that a client can ask to crawl to certain RDFs (or make an F-logic query) and either get back the results (like using a search-engine), or have the results cached on the server (in an RDF database) and then query the RDF database.
To make coding easier and the result more reusable, we suggest NOT to make the RDF Crawler as one big chunk of code, but to structure it as a pipeline (more precisely, a self-feeding loop) of producer-consumer processes:
+---->+<-- [input from outside]
|     |
|   URI requests with filtering conditions
|     |
|   Multithreaded download component
|     |
|   Pure RDF data or HTML with embedded RDF
|     |
|   SiRPAC (or similar) - RDF parsing component --> [Storing RDF data]
|     |
|   Stream of triples
|     |
|   Extract nonrepeating URIs
|     |
|   Stream of URIs
|     |
|   URI Filtering; forming new
|   URI requests with new filtering conditions
|     |
+<----+
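One link of such a producer-consumer chain could be sketched as follows: a producer thread emits a stream of triples into a queue, and a consumer thread extracts the nonrepeating URIs from it. The class `TriplePipeline` is hypothetical, and so are the end-of-stream marker and the rule that a term counts as a URI if it starts with "http://". The sketch uses `java.util.concurrent`, which is not part of JDK 1.3.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical two-stage pipeline: stream of triples -> nonrepeating URIs. */
class TriplePipeline {
    // End-of-stream marker, compared by identity (an assumption of this sketch).
    private static final String[] END = new String[0];

    static Set<String> distinctUris(final List<String[]> triples) {
        final BlockingQueue<String[]> queue = new LinkedBlockingQueue<String[]>();
        final Set<String> uris = new LinkedHashSet<String>();

        Thread producer = new Thread(new Runnable() {
            public void run() {
                for (String[] triple : triples) queue.add(triple);
                queue.add(END);
            }
        });
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (String[] t = queue.take(); t != END; t = queue.take()) {
                        for (String term : t) {
                            // Assumption: only http URIs are of interest.
                            if (term.startsWith("http://")) uris.add(term);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join(); // after join, the consumer's writes to uris are visible
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pipeline", e);
        }
        return uris;
    }
}
```

In the full loop the resulting URIs would pass through the filtering stage and re-enter the download component as new URI requests.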
We place all the RDF Crawler classes in some
package, say, edu.unika.aifb.rdfcrawler.
Initially, when we do not filter anything and write to
the standard output, it could be possible to
write a DOS command-line expression, e.g.
java edu.unika.aifb.rdfcrawler.CommandLine -Ddepth=5 URL1 URL2 ...
This would lead to printing "triple" predicates in the following form:
triple('http://description.org/schema/Creator',
'http://www.w3.org/Home/Lassila',
'Ora Lassila').
...
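Note that in the command line shown above the option -Ddepth=5 follows the class name, so the Java launcher passes it to the program as an ordinary argument instead of setting a system property. The program therefore has to parse it itself. A minimal sketch follows. The class `CommandLineOptions` is hypothetical and is not the `CommandLine` class named above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for arguments of the form "-Dname=value URL1 URL2 ...". */
class CommandLineOptions {
    final Map<String, String> options = new LinkedHashMap<String, String>();
    final List<String> uris = new ArrayList<String>();

    CommandLineOptions(String[] args) {
        for (String arg : args) {
            if (arg.startsWith("-D")) {
                int eq = arg.indexOf('=');
                if (eq < 0) {
                    options.put(arg.substring(2), ""); // option without a value
                } else {
                    options.put(arg.substring(2, eq), arg.substring(eq + 1));
                }
            } else {
                uris.add(arg); // everything else is a start URI
            }
        }
    }

    /** Returns the "depth" option, or the given default if it is absent. */
    int depth(int defaultDepth) {
        String value = options.get("depth");
        return value == null ? defaultDepth : Integer.parseInt(value);
    }
}
```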
Are we interested where these triples come from? Say, if the RDFs from various sources have various levels of trust, etc., we might want to add all the statements obtained from a single source to some collection, tell in RDF form, that for each of these statements the source is blah-blah, and thus obtain some more useful statements about statements.
To make the RDF Crawler embeddable into other tools, we want to specify an API which could be called from other programs.
Sorry, not yet done.
Along with the RDF test data for our utility, we might want to formulate some benchmark problems which could test the possibilities of sensible use of RDF Crawling. Some suggestions follow:
Most researchers R who publish scientific papers with
coauthors have Erdös numbers (denoted PE(R)).
They are defined recursively:
PE(R)=0, if R[Name->"Paul Erdös"]
PE(R)=Min{PE(R1)+1}, where R1 ranges over
the set of all researchers who have collaborated with R, i.e.
written a scientific paper together.
To make this practical, we need a (possibly distributed) set
of HTML documents representing scientific papers (i.e.
each HTML document may have an RDF property indicating that it is a scientific
paper); these HTML documents contain an RDF property indicating the
list of authors; they link to other scientific documents (say, in
their "Bibliography" section) which are annotated likewise.
Equivalently, we can make the same query from databank, using
the symmetric property "cooperatesWith" which is defined
in an OIL sample .../swrc.oil.
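Once the author lists have been gathered, the recursive definition above can be evaluated by a breadth-first search over the symmetric collaboration relation, because the first time a researcher is reached is along a shortest chain of coauthors. A minimal sketch follows. The class `ErdosNumbers` and the representation of a paper as a list of author names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical helper: computes PE(R) from a list of papers (author lists). */
class ErdosNumbers {
    static Map<String, Integer> compute(List<List<String>> papers, String root) {
        // Build the symmetric "has collaborated with" relation.
        Map<String, Set<String>> coauthors = new HashMap<String, Set<String>>();
        for (List<String> authors : papers) {
            for (String a : authors) {
                Set<String> others = coauthors.get(a);
                if (others == null) {
                    others = new HashSet<String>();
                    coauthors.put(a, others);
                }
                for (String b : authors) {
                    if (!a.equals(b)) others.add(b);
                }
            }
        }
        // Breadth-first search: PE(root) = 0, PE(R) = Min{PE(R1) + 1}.
        Map<String, Integer> pe = new HashMap<String, Integer>();
        Queue<String> queue = new ArrayDeque<String>();
        pe.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String r1 = queue.remove();
            Set<String> next = coauthors.get(r1);
            if (next == null) continue;
            for (String r : next) {
                if (!pe.containsKey(r)) {
                    pe.put(r, pe.get(r1) + 1);
                    queue.add(r);
                }
            }
        }
        return pe; // researchers not connected to the root are absent
    }
}
```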
Suppose that we have a list of public keys of trusted entities (people and/or institutions) and we want to download into our databank all the RDF data we can find, provided these models are signed by any one of the trusted entities.
Microsoft programmers have access to MSDN help, which contains documentation of languages from various Microsoft products, installation instructions, tutorials, sample programs, analytical articles, exercises, etc. These materials are easy to search and use, since they are made by the same organization.
With the Java programming environment the situation is somewhat less organized. One could make an MSDN-like ontology for Java resources on the Web, and then annotate some portion of the resources using this ontology, placing the respective RDFs in various places on the Web. This would also require an annotation tool, albeit a coarse-grained one; it would annotate whole web pages rather than small text items within a page.
The ontology concepts for this could be: "Java topic", "Java (tutorial) article", "Java article subsection", "Author" (for programs and articles), "Organization" (for proprietary packages), "Java package source", "Java package as JAR", "Java package documentation" (javadoc generated or designed by hand), "Java Development Tool", "Java sample program source", "Java sample program running instruction" (including JARs one needs), "Java programming exercise" (solution would be JavaSampleProgramSource or JavaPackageSource), "Java Quiz/MultipleChoiceTest Question".
If we analyze
http://www-db.stanford.edu/~melnik/index.html
we get back data about resources like
http://www-db.stanford.edu/~melnik/index.html#genid15;
on the other hand, if we analyze
http://www-db.stanford.edu/~melnik/, we get back
http://www-db.stanford.edu/~melnik/genid15;
is normalization of URIs in this sense necessary?
http://aifbceto/rdfcrawl/rdfsamples/test1.rdf
and
http://aifbceto/rdfcrawl/rdfsamples/test1.rdf#?
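One possible, purely syntactic answer to the second case is sketched below: an empty fragment is dropped before URIs are compared, and the whole fragment is dropped before a URI is put on the download list, since the fragment is not part of what is requested from the server. The class `UriNormalizer` is hypothetical. The sketch does not address the first case, where the difference comes from how the parser generates identifiers.

```java
/** Hypothetical helper: minimal, purely syntactic URI normalization. */
class UriNormalizer {
    /** Treats "x.rdf#" and "x.rdf" as the same name by dropping an empty fragment. */
    static String stripEmptyFragment(String uri) {
        return uri.endsWith("#") ? uri.substring(0, uri.length() - 1) : uri;
    }

    /** The URI of the document to download: everything before the fragment. */
    static String documentUri(String uri) {
        int hash = uri.indexOf('#');
        return hash < 0 ? uri : uri.substring(0, hash);
    }

    /** True if both URIs denote the same name after stripping empty fragments. */
    static boolean sameName(String a, String b) {
        return stripEmptyFragment(a).equals(stripEmptyFragment(b));
    }
}
```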
Author: Kalvis Apsitis at AIFB; Last modified on