Targeted Web Crawling : Current Status
By: Daniel Dickman , Kesha Wilford
Supervisors: Professor James Allan, Xiaoyan Li
The World Wide Web contains a plethora of contantly changing information. Retrieving relevant information in this uncatalogued environment has proven to be difficult. There have been many approaches to solving this problem, however they tend to be very time consuming and labor intensive, with low precision and recall. So, the prevailing problem is to traverse the web in response to a user’s query and return the most relevant documents in the least amount of time.
Our basic hypothesis is that web documents link to documents with related content; moreover they do not link to unrelated documents in general. Our approach has been to perform a client-side, targeted crawl of web pages without the aid of a pre-built inverted list. There are various algorithms that have been implemented by other research groups in approaching this task: the fish search algorithm [ ], the shark search algorithm [ ], a depth first search [ ], and a breadth first search [ ]. Our goal has been to create a crawler that could use any one of these algorithms to retrieve relevant documents in response to a query, and to evaluate the performance of these algorithms.
Evaluating the algorithms on the actual web would have posed many problems, mainly bandwidth limitations. We needed to perform evaluations without speed limits imposed by network connections. The grabber performs this task. It consists of a collection of documents converted from SGML to blocks of 512 bytes each. Each document is now one header block and n data blocks. The document files had to be split into 1GB files because of unix file size limitations. We then built a hash table for the data for easy retrieval of documents based on the document’s url.
The crawler is used to rank and store documents and their links. It uses one of the specified algorithms to search the graph of documents and retrieve the most relevant documents to a query. The crawler takes as inputs: a queryID and a query.
It then begins by initializing the grabber with a set of files to search through, and initializing the crawler with the search algorithm (depth first, breadth first, fish, or shark). The function called crawler_seed, takes the crawler information (search strategy) along with the grabber information (database to search) and generates twenty seeds – five of which are taken from relevance judgements so they are known to be relevant, and fifteen of which are chosen at random. This is done in an attempt to simulate a real search engine. All twenty documents are given extremely high estimated beliefs values.
Once the seed documents are picked, the crawler begins its main algorithm. It fetches an unseen document with the highest estimated belief. Then it extracts the links from the document, and assigns estimated beliefs based on the specified algorithm. It then repeats the process.
We have obtained preliminary recall/precision results as well as another evaluation based on number of relevant documents retrieved over time. The precision/recall evaluations show that random and depth are similar to one another, and fish and shark are similar to one another. Fish and shark, however, perform significantly better than random and depth. The documents retrieved over time results show that fish and depth perform better than random. Shark, however, appears to have performed the worst. This needs to be investigated further.
While the crawler is working well, the database builder needs to be rebuilt because some documents are not being stored correctly. This may cause some problems with our evaluations. Most importantly, however, we need more results. Better heuristics may also be applied to determine if they would perform even better than Fish and Shark. The algorithms should eventually be evaluated against traditional search engines, such as AltaVista.
In conclusion, it does appear that more intelligent algorithms have the potential to increase precision and recall.