Database Characterization
Alana Nicholson and Wendy Gonzales
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts, Amherst
Amherst, MA 01003-4610, USA
http://cadia.cs.umass.edu/irug/dbchar
Introduction
The Web currently has many databases that are available only via a search engine. The user enters a query, and subsequently gets a results page containing a list of documents. The user, however, does not have any means for directly accessing the database information.
In this project, our hypothesis is that we can implement a method for characterizing databases. The idea is to build a system where the user will construct his own "meta database"(database of databases) using those characterizations. The user will then be able to query this meta database, and instead of getting back a list of documents, a list of databases will be returned. This new system will therefore enable the user to have direct access to a number of databases, where they can search for an answer to their queries.
The idea behind our system can be better understood through an example. Let’s say a user is surfing the web, and he comes across a searchable database that he wants to include in his personal meta database. The user can go to our first interface and enter the name of the URL he wants to index. Once he submits his request, the URL will be automatically indexed into a meta database that will only contain the databases that the user has manually put into it. In the future, the user can go to our second interface, and query his personal collection of databases, to which he will have direct access. When the user enters a query, he will get a list of the top ten databases containing his query.
The two major programs required to implement our system were already written. Consequently, our contribution consisted in connecting the existing pieces, and finding the record boundaries of any web document. A more detailed explanation of our research problem, and how it exactly fits in the system architecture will follow.
Unfortunately, due to time limits, we thought it was appropriate to set the research aside, and instead, simplify as much as possible in order to get the entire system running. Therefore, in this paper we will give a detailed explanation of both our ideal and actual systems. We will first discuss our ideal system, explain the modifications we made, and finally explain our actual system. We will conclude with an outline on the future work that can make the current Database Characterization demo more interesting and robust.
Ideal System Architecture
The architecture of the ideal Database Characterization demo is shown in figure 1.

FIGURE 1. Architecture of the fully automatic Database Characterization demo.
When the user finds a searchable database that he wishes to index, he can go to the first interface of our system, which allows him to type the site address and a brief description of it. When the user submits his request, the site address and the description will go to a site queue. This information remains in the site queue until a cron job executes a perl script which calls the site sampler program.
The site sampler program implements the "query probing" technique of Callan et al (1999). This program takes the site URL of any text database as an argument, and automatically constructs a language model for that particular site. The idea behind this program is that query-based sampling can provide a random sample of documents to learn an accurate language model of the entire database. The algorithm for query based-based sampling is the following:
1.Select an initial query term.
2.Run a one-term query on the database.
3.Retrieve the top 4 documents returned by the database.
4.Update the language model based on the characteristics of the retrieved documents.
5. If 300 documents have not been sampled yet,
Once an accurate language model of a particular site is built, we call inbuild, which builds a database for every language model. After inbuild creates a database, the database is sent to an index queue, until it is passed onto the second program, the collection-selection indexer. This second program merges all the databases of the sites the user has indexed, and creates a meta database.
Anytime after the meta database is created, the user can access the second interface of our system. On this interface the user can enter a query, and will in turn get a results page with a list of the top ten databases containing the query. The user will therefore have direct access to various collections of documents that will contain an answer for the original query.
Research Problem
Since both the site sampler and the collection-selection indexer programs already existed, our task was to create the two interfaces, connect all the separate pieces with perl scripts, and find a solution to what we thought was the missing piece – finding the record boundaries of any HTML document. Figure 2 shows where exactly our research fits in the system architecture.

FIGURE 2. The white box indicates where our research problem fits within the system architecture.
When the site sampler program starts, it runs a simple query on the searchable database, and in turn, retrieves a small number of documents. A language model is built automatically from these documents. However, extracting the actual documents from the results page is a very difficult problem. Result pages come in numerous formats. Furthermore, they are marred with extraneous information such as animations and advertisements. We are, however, only interested in extracting the first four URL’s of the results page, in order to obtain both the words and frequencies contained in these documents. The goal is to find the record boundaries of any html document. If we look at the source code of the results page, the record boundaries are located right before and after each of the top four URL’s. Finding a solution to this problem is essential for our system because otherwise, the user would only be able to index searchable databases with a very specific format.
Record Boundary Discovery in Web Documents
The solution to finding record boundaries of any HTML document was based on the algorithm proposed by Embley et al (1999). A very detailed paper describing their heuristic approach to discover record boundaries in web documents can be found in the SIGMOD ’99 book. A brief summary of the idea that we implemented follows:
We worked on two separate C programs, one that created a tag tree, and one that found the biggest fan out tree, and applied the four independent heuristics to locate the candidate tags.
Unfortunately, due to time limits, we decided to set our research problem aside, and instead simplify as much as possible to get the entire system working.
Actual System Architecture
The architecture of the actual Database Characterization demo is shown in figure 4.

FIGURE 2. The green boxes represent the differences between the ideal and actual system architectures.
In this section we will use an example to demonstrate how our actual system works. Let’s say a user wants to include the National Institute of Health (NIH) searchable database in his personal meta database. The user starts by entering the address and description of the NIH site into our first interface. The user input is stored into the site queue, and he has to manually start a perl script that grabs the URL address from the site queue, and a query stored in a file. The perl script then sends the latter information as arguments for the site sampler program. The site sampler probes the searchable database that the user wishes to index, and it gets back a list of the top twenty documents containing the query. However, since we were unable to find the record boundaries of an HTML document, we implemented an alternative way to obtain the top four URL’s. Again, the idea was to automatically build a language model using the top four documents in the results page. To solve this problem, we modified the C code of the site sampler program, to grab the top twenty strings beginning with the <A href= tag. The strings were stored in a table, and subsequently, we created a random number generator that chose four of those URL’s. The site sampler program builds the language model, and calls inbuild to build a database out of the language model. Once the new database is built, it is stored in the index queue. And once again, because there was not enough time to make the system automatic, the user has to manually start the collection-selection program. To run this program, the user has to execute a series of commands. The program will then merge the existing database into our meta database. At this point, the user can have direct access to his collections of documents, and can use them to find more accurate answers for his queries.
Differences between Ideal and Actual System Architectures
There are three major differences between the ideal and actual system architectures. The first major difference is that our current system is not fully automatic. As shown in figure four, the user has to manually start both the site sampler and the collection-selection programs. The second difference is that we were not able to find the record boundaries for any HTML document. And as a result, the language models have the potential to be very inaccurate. The alternative approach to discovering the record boundaries, using a random number generator to pick any four URL’s contained in the results page, is inefficient. This is because using this alternative approach, the site sampler program grabs the top twenty URL’s, not knowing if they are the actual results, or simply advertisements, or other extraneous links. Furthermore, because we were unable to find a solution to our research problem, the searchable databases that the users can index have to be very specific. Users can only index databases where the URL addresses are preceded by the tag <A href=. Finally, the third major difference between the ideal and actual systems is that our meta database only contains three databases. This is a result of a second research problem that we discovered in the process of building our actual system. In order to start the query probing that allows for automatic discovery of language models, the site sampler program has to insert the initial query in the query field of the searchable database. This field must be contained in the http address of the query page. This means that the program can only discover language models of databases that actually contain a specific field in which to insert the query. The input of any other html format will result in an error, since the query will get lost somewhere in the process. It is very difficult to find databases that contain a specific field for the query. Therefore, it is necessary to research and find a way to allow the site sampler program to accept any html format as an input.
Results
Because of lack of time, we only created three databases, and the language model of each of these databases was created by sampling only four documents. According to the results of Callan et al (1999), the query-probing technique automatically discovers accurate language models only after probing three hundred documents. Therefore, our results cannot be very accurate. We tried to input different queries into our second interface, and in return, we get a list of top databases containing the original query, or no databases at all.
There are two reasons for not getting any results. One reason is a result of sampling only four documents to obtain each language model instead of the 300 that are required to obtain an accurate one. The second reason for not obtaining any results is caused by using the random number generator to pick the four documents from which the language models will be automatically built. The documents randomly picked might be advertisements or other links that are in no way related to the original query.
On the other hand, we were able to implement one way to evaluate the results we did get. Since we do have a folder containing all the language models after the stop words were removed, we picked the top word of each language model and used that as our query. For each of the three queries we entered the expected URL address was on the top of the result list. Therefore, according to our only evaluation, the Database Characterization demo is getting accurate results.
Conclusion
The Database Characterization demo is the beginning of what can be a very interesting and robust tool. There is, however, much future work left to be done.
The most important future work would be to find the record boundaries of any html document. If this problem is resolved, the system will have the capability to accept databases with different HTML formats as input and will therefore be less limited. Furthermore, if the site sampler program is able to locate the top four documents for every query probe, the language models will be very accurate.
It is also essential to work on the newly discovered research problem. The site sampler program has to be able to take databases with any format as an input. Currently, this program can only automatically discover language models for databases that have a query field. Therefore, the databases that this program can take as input are very limited. If both of these research problems are solved, the users will be able to index any searchable database.
Other important future work might be to automate the entire system and to do more research on the accuracy of language models.