SGER: Breaking the keyword bottleneck: Towards more effective access of government information

PI: W. Bruce Croft

A huge amount of government information is available on the Web. The Web sites containing that information are, however, often extremely complex and difficult to navigate. In this type of environment, simple queries are not very helpful in locating relevant information. The current information retrieval (IR) landscape is nevertheless dominated by simple queries, because that is what Web search engines handle well. These simple queries generally help the user find a good “home page”. In the case of government information on the Web, however, a home page is often of little help in finding the right answers, and a considerable amount of additional user effort is required. We believe that most information needs would be better expressed as complex queries and that current systems impose a bottleneck by forcing users into simple keyword-based queries. Some types of professional searchers (e.g., intelligence analysts and paralegals) do formulate longer and more complex queries, but complex queries will become common only if systems are capable of providing good answers to them and if longer, grammatical questions become easier to ask. The latter issue will eventually be addressed by speech interfaces, but improving the capability of systems to handle complex queries represents the major long-term goal of IR.

In this project, we are carrying out initial experiments with retrieval models for complex queries that go beyond the typical “bag-of-words” approach. We are exploring two major issues in the development of new retrieval models. First, in order to improve system robustness, we need to develop models that capture topical relevance more reliably than our current models. This means we need models that are better at recognizing the different ways that topics can be described in text. Second, in order to improve system accuracy in the top-ranked documents, we need to develop models that capture topical relevance more precisely. This means retrieval models need to be better at recognizing and incorporating the specific concepts and relationships that are required by the query.

Answering complex queries is a hard problem, and one that has a long history of attempted solutions. There are a number of factors, however, that indicate that it should now be possible to make significant progress. In particular, there has been a recent surge of interest in a new approach to retrieval based on language models. This research will leverage this recent work and study complex queries from a new perspective.
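
For readers unfamiliar with the language-model approach mentioned above, the following is a minimal sketch of its most common textbook form: query-likelihood ranking with Dirichlet smoothing. It is offered only as an illustration under stated assumptions. The project description does not specify its models, so the function names (tokenize, query_likelihood, rank), the whitespace tokenizer, and the smoothing parameter mu are all hypothetical choices made here. Note that this baseline is itself a bag-of-words model, which is exactly the kind of model the project aims to go beyond.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def query_likelihood(query, doc, collection_counts, collection_len, mu=2000.0):
    """Return log P(query | doc) under a unigram language model.

    Document term probabilities are smoothed toward the collection model
    with Dirichlet smoothing; mu is an assumed, untuned prior strength.
    """
    doc_counts = Counter(tokenize(doc))
    doc_len = sum(doc_counts.values())
    score = 0.0
    for term in tokenize(query):
        p_coll = collection_counts.get(term, 0) / collection_len
        if p_coll == 0.0:
            # Assumption: ignore query terms that never occur in the
            # collection, since they cannot distinguish any document.
            continue
        p_term = (doc_counts.get(term, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(p_term)
    return score


def rank(query, docs, mu=2000.0):
    """Return (score, doc_index) pairs sorted from best to worst."""
    collection_counts = Counter()
    for doc in docs:
        collection_counts.update(tokenize(doc))
    collection_len = sum(collection_counts.values())
    scored = [
        (query_likelihood(query, doc, collection_counts, collection_len, mu), i)
        for i, doc in enumerate(docs)
    ]
    return sorted(scored, reverse=True)
```

Calling rank with a short query and a list of document strings returns the documents ordered by how likely each document's smoothed language model is to have generated the query. Because every query term is scored independently, a long grammatical question gains nothing from its structure here, which illustrates the limitation that motivates models capturing specific concepts and relationships.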

In the one-year time frame of this project, we will be able to explore these new models and obtain preliminary results on their effectiveness with government information and with complex queries that are representative of the information needs people have regarding government. This is a high-risk research project because of the lack of progress in this area in the past. The payoff of even moderate success will be high, however, as it will make the difference between a government information system that returns a useful response to a query and one that either fails completely or provides very little assistance to the people seeking answers.

This project is sponsored by the National Science Foundation under grant #IIS-0527159.