Morris Hirsch, MS, David Aronow, MD, MPH
Center for Intelligent Information Retrieval
University of Massachusetts at Amherst
{hirsch, aronow}@cs.umass.edu

The performance of Information Retrieval (IR) systems is critically dependent upon the quality of the queries submitted by the user. Queries may often be improved by "expanding" them with additional terms, either automatically or chosen in cooperation with the user. Typical expansion strategies include:
The work reported here is part of a larger effort to improve access to the MEDLINE database, using the INQUERY system developed at the Center for Intelligent Information Retrieval of the University of Massachusetts at Amherst. INQUERY supports full-text retrieval, in which all text words are available as indexing terms.
Although INQUERY supports all these methods, this report is concerned only with thesaurus-based query expansion.
A well-designed manual thesaurus is an effective means of query expansion, but maintaining one requires large amounts of time and effort on the part of domain experts. We seek to avoid this effort by constructing the thesaurus automatically.
We generate a thesaurus by scanning the corpus for noun phrases. Each occurrence of a phrase is recorded along with its surrounding context, and the occurrences are then sorted to bring all contexts of each phrase together. Each context group is treated as a "pseudo-document" with the noun phrase as the document title. The set of pseudo-documents is indexed as a thesaurus database that accompanies the original collection. A query that matches some words of context in a pseudo-document causes that document to be selected, and its title to be listed as a possible expansion term.
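The construction just described can be illustrated with a minimal Python sketch. It is not the INQUERY implementation. It makes several assumptions: the noun phrases are supplied as a ready-made list rather than detected by a parser, context is a fixed window of three words on each side, and candidate terms are ranked by a simple count of query-word matches. The function names (`build_thesaurus`, `expand`) and the three sample sentences are invented for illustration only.

```python
import re
from collections import Counter, defaultdict

WINDOW = 3  # assumed context size: words kept on each side of a phrase


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_thesaurus(documents, phrases, window=WINDOW):
    """Group the context words of every phrase occurrence into one
    pseudo-document per phrase; the phrase acts as the title."""
    pseudo_docs = defaultdict(Counter)
    for text in documents:
        tokens = tokenize(text)
        for phrase in phrases:
            words = phrase.split()
            n = len(words)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == words:
                    left = tokens[max(0, i - window):i]
                    right = tokens[i + n:i + n + window]
                    pseudo_docs[phrase].update(left + right)
    return pseudo_docs


def expand(query, pseudo_docs):
    """Return the titles of pseudo-documents whose context matches the
    query, best match first (ties broken alphabetically)."""
    query_words = tokenize(query)
    scores = {
        title: sum(context[w] for w in query_words)
        for title, context in pseudo_docs.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, score in ranked if score > 0]


# Toy corpus and phrase list, invented for this sketch.
documents = [
    "Medicaid coverage pays for child immunization programs in many states.",
    "Funding for child immunization programs and prenatal care was expanded.",
    "Prenatal care reduces risks for mother and child.",
]
phrases = ["immunization programs", "prenatal care", "medicaid coverage"]

thesaurus = build_thesaurus(documents, phrases)
terms = expand("child", thesaurus)
print(terms)
```

On this toy corpus the query "child" selects "immunization programs" first, because "child" falls within its context window twice, and "medicaid coverage" second. "prenatal care" is not selected because "child" never falls within three words of it. A real system would replace the counting step with the retrieval engine's own ranking over the indexed pseudo-documents.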
Context may be saved either as surrounding text or as other nearby noun phrases. In the following example, based on healthcare policy documents, the query "child" produces this list of terms: