edu.unika.aifb.rdf.crawler
Class URIList

java.lang.Object
  |
  +--edu.unika.aifb.rdf.crawler.URIList

public class URIList
extends java.lang.Object

URIList is the only class in the package intended to be called from outside the "uriproc" package, namely when the URIList is initialized (in our implementation the caller is the Console class). In turn, it calls the Channel_Pool constructor when it is time to start processing those URIs with a download thread pool. URIList uses URLStruct to store individual URIs together with their depths, processing status codes and possible error messages. Policies that decide which URIs to follow can be defined in this class.


Field Summary
 HostFilter filter
           
 
Constructor Summary
URIList()
          Initialize an empty URIList
 
Method Summary
 void addURI(java.lang.String uri, java.lang.String p_uri, boolean decrement)
          Add a URI to the URIList when its depth is not known in advance.
 boolean addURI(java.lang.String uri, java.lang.String p_uri, int depth)
          Add a single URI with the given crawling depth and the given parent (this public method is called from Console in a synchronized manner, to add new URIs discovered by all the crawling threads/channels).
 boolean allNonWhite()
          Are all the URIs in the list non-WHITE, i.e. must all threads currently wait for a new job to appear?
 boolean allRedOrBlack()
          Are all the URIs in the list either RED or BLACK?
 void assert(java.lang.String message)
           
 void checkInBlack(java.lang.String uri)
          Crawling of this "uri" finished successfully. It was added to the RDF model, and its descendants (if any) were added to the URIList.
 void checkInRed(java.lang.String uri, java.lang.Exception e)
          Crawling found a problem with the "uri"; mark it and insert it back into the map.
 java.lang.String checkOutWhite()
          Find a white URI in the list, return its string to the processing Channel instance and paint it gray
static java.lang.String cutRef(java.lang.String url)
          Cuts the reference part off a URL, so that URLs which differ only in their reference part are not crawled twice.
 java.lang.String getDescriptions(java.lang.String parent)
          Get back a readable representation of the URLs reached by crawling from the given parent URL (parent=null for the top-level URLs).
 java.lang.String getParent(java.lang.String uri)
           
static void main(java.lang.String[] args)
          For debugging - create and print a list of URIs.
 void printColors()
           
 void printMap()
          Print all the associations in the map (for debugging)
 void setFilter(java.util.Vector hosts)
          Set host filter for this URIList
 java.lang.String toString()
          Get back a nice RDF representation of what URLs are placed in the list for crawling.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

filter

public HostFilter filter
Constructor Detail

URIList

public URIList()
Initialize an empty URIList
Method Detail

addURI

public boolean addURI(java.lang.String uri,
                      java.lang.String p_uri,
                      int depth)
Add a single URI with the given crawling depth and the given parent (this public method is called from Console in a synchronized manner, to add new URIs discovered by all the crawling threads/channels). Returns true if the insertion succeeded.

addURI

public void addURI(java.lang.String uri,
                   java.lang.String p_uri,
                   boolean decrement)
Add a URI to the URIList when its depth is not known in advance.
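
The two addURI overloads can be illustrated with a minimal sketch. It is not the actual URIList code: the class AddUriSketch, its two maps and the helper depthOf are hypothetical, and the reading of "decrement" (take the parent's depth and lower it by one when the flag is set) is an assumption, since this page does not document that parameter. The sketch also assumes that a duplicate URI or a negative depth makes the insertion fail.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for URIList, reduced to depth and parent bookkeeping.
final class AddUriSketch {
    private final Map<String, Integer> depths = new HashMap<>();
    private final Map<String, String> parents = new HashMap<>();

    // Adds a URI with an explicit depth; returns true if it was inserted.
    // Assumption: duplicates and negative depths are rejected.
    synchronized boolean addURI(String uri, String parentUri, int depth) {
        if (depth < 0 || depths.containsKey(uri)) {
            return false;
        }
        depths.put(uri, depth);
        parents.put(uri, parentUri); // parentUri is null for top-level URIs
        return true;
    }

    // Adds a URI whose depth is derived from its parent.
    // Assumption: "decrement" means parent depth minus one, otherwise the parent depth.
    synchronized void addURI(String uri, String parentUri, boolean decrement) {
        Integer parentDepth = depths.get(parentUri);
        if (parentDepth == null) {
            return; // unknown parent: there is no depth to derive from
        }
        addURI(uri, parentUri, decrement ? parentDepth - 1 : parentDepth);
    }

    synchronized Integer depthOf(String uri) {
        return depths.get(uri);
    }

    synchronized String getParent(String uri) {
        return parents.get(uri);
    }
}
```

Both methods are synchronized here because the description above says that URIs discovered by all crawling threads are added through this entry point.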

setFilter

public void setFilter(java.util.Vector hosts)
Set host filter for this URIList
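
The HostFilter class is not documented on this page, so the following is only a sketch of what a host-based filter might do: accept a URI when its host appears in a list of allowed hosts. The class name HostFilterSketch and the method accepts are hypothetical, and the case-insensitive comparison is an assumption. Because java.util.Vector implements java.util.List, a Vector of host names can be passed to the constructor.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical host filter; not the HostFilter class used by URIList.
final class HostFilterSketch {
    private final Set<String> allowedHosts = new HashSet<>();

    HostFilterSketch(List<String> hosts) {
        for (String host : hosts) {
            allowedHosts.add(host.toLowerCase(Locale.ROOT));
        }
    }

    // Returns true if the URI parses and its host is one of the allowed hosts.
    boolean accepts(String uri) {
        try {
            String host = URI.create(uri).getHost();
            return host != null && allowedHosts.contains(host.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return false; // strings that are not valid URIs are never followed
        }
    }
}
```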

allRedOrBlack

public boolean allRedOrBlack()
Are all the URIs in the list either RED or BLACK?

allNonWhite

public boolean allNonWhite()
Are all the URIs in the list non-WHITE, i.e. must all threads currently wait for a new job to appear?
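
The two predicates above can be sketched over a plain collection of colors. The enum and the class ColorPredicatesSketch are hypothetical; the four color names come from the method descriptions on this page (WHITE, gray, RED, BLACK). The interpretation in the comments (no WHITE entries means no job to hand out, only RED or BLACK entries means nothing is still in progress) is an assumption drawn from those descriptions.

```java
import java.util.Collection;

// Hypothetical helpers mirroring allNonWhite() and allRedOrBlack().
final class ColorPredicatesSketch {
    enum Color { WHITE, GRAY, RED, BLACK }

    // True when no URI is waiting to be picked up by a thread.
    static boolean allNonWhite(Collection<Color> colors) {
        return colors.stream().noneMatch(c -> c == Color.WHITE);
    }

    // True when every URI has been checked in, either as failed (RED)
    // or as finished (BLACK), so none is still being processed.
    static boolean allRedOrBlack(Collection<Color> colors) {
        return colors.stream().allMatch(c -> c == Color.RED || c == Color.BLACK);
    }
}
```

Under this reading, a list with GRAY entries and no WHITE ones satisfies allNonWhite but not allRedOrBlack: idle threads wait, because a URI still in progress may yet add new WHITE entries.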

toString

public java.lang.String toString()
Get back a nice RDF representation of what URLs are placed in the list for crawling.
Overrides:
toString in class java.lang.Object

getDescriptions

public java.lang.String getDescriptions(java.lang.String parent)
Get back a readable representation of the URLs reached by crawling from the given parent URL (parent=null for the top-level URLs).

printMap

public void printMap()
Print all the associations in the map (for debugging)

main

public static void main(java.lang.String[] args)
For debugging - create and print a list of URIs. It is better to debug from Console.main().

checkOutWhite

public java.lang.String checkOutWhite()
Find a white URI in the list, return its string to the processing Channel instance and paint it gray

checkInRed

public void checkInRed(java.lang.String uri,
                       java.lang.Exception e)
Crawling found a problem with the "uri"; mark it and insert it back into the map.

checkInBlack

public void checkInBlack(java.lang.String uri)
Crawling of this "uri" finished successfully. It was added to the RDF model, and its descendants (if any) were added to the URIList.
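
The check-out and check-in cycle described by checkOutWhite, checkInRed and checkInBlack can be sketched as a small color state machine. This is not the actual URIList code: the class CheckoutSketch, the helpers add and colorOf, and the two maps are hypothetical, and returning null when no WHITE URI is available is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for URIList, reduced to color bookkeeping.
final class CheckoutSketch {
    enum Color { WHITE, GRAY, RED, BLACK }

    private final Map<String, Color> colors = new LinkedHashMap<>();
    private final Map<String, String> errors = new LinkedHashMap<>();

    // New URIs start out WHITE (not yet processed).
    synchronized boolean add(String uri) {
        if (colors.containsKey(uri)) {
            return false;
        }
        colors.put(uri, Color.WHITE);
        return true;
    }

    // Hands the first WHITE URI to a worker and paints it GRAY.
    // Assumption: null signals that there is currently no job.
    synchronized String checkOutWhite() {
        for (Map.Entry<String, Color> entry : colors.entrySet()) {
            if (entry.getValue() == Color.WHITE) {
                entry.setValue(Color.GRAY);
                return entry.getKey();
            }
        }
        return null;
    }

    // Marks a URI as failed and remembers the error message.
    synchronized void checkInRed(String uri, Exception e) {
        colors.put(uri, Color.RED);
        errors.put(uri, String.valueOf(e.getMessage()));
    }

    // Marks a URI as successfully crawled.
    synchronized void checkInBlack(String uri) {
        colors.put(uri, Color.BLACK);
    }

    synchronized Color colorOf(String uri) {
        return colors.get(uri);
    }
}
```

A worker thread would call checkOutWhite, crawl the returned URI, and then call exactly one of checkInBlack or checkInRed. The methods are synchronized so that two threads cannot check out the same WHITE URI.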

getParent

public java.lang.String getParent(java.lang.String uri)

cutRef

public static java.lang.String cutRef(java.lang.String url)
Cuts the reference part off a URL, so that URLs which differ only in their reference part are not crawled twice.
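
A minimal sketch of this kind of normalization, assuming that the "reference part" is the fragment that follows the first '#' character. The class CutRefSketch is hypothetical and is not the actual implementation of URIList.cutRef.

```java
// Hypothetical stand-in for URIList.cutRef.
final class CutRefSketch {
    // Returns the URL up to, but not including, the first '#'.
    // URLs without a fragment are returned unchanged.
    static String cutRef(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }
}
```

With this sketch, "http://example.org/a.rdf#Person" and "http://example.org/a.rdf#Project" both map to "http://example.org/a.rdf", so the document is queued only once.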

printColors

public void printColors()

assert

public void assert(java.lang.String message)