CluE Project, Summary of Annual Report, November 2009

National Science Foundation Award Number: IIS-0844226
Cluster Exploratory (CluE) project

Summary of annual report, November 2009 (9 months)

Research and Education Activities:

The goal of our work in CluE is to use large corpora to find synonyms of words and phrases that can improve retrieval effectiveness. Because these techniques are statistical, they often find related words that a person might not classify as synonyms, so we call them quasi-synonyms. We have been exploring a range of methods for (1) extracting quasi-synonyms, (2) evaluating the quality of extracted quasi-synonyms, (3) incorporating quasi-synonyms into query processing, and (4) evaluating the impact of that query processing on document retrieval.

So far we have used word context to identify quasi-synonyms: if two words are frequently used in the same contexts, they are likely to be quasi-synonyms. We have identified related words from document text, from query logs, and from anchor text, using both unigram and n-gram contexts. We have evaluated the extracted quasi-synonyms by inspection, by having human annotators judge some instances, and by comparing them to synonyms from existing thesauruses. We have evaluated the effectiveness of query reformulation on public TREC collections.
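
As a rough illustration of the context idea, the sketch below collects the left and right neighbors of each word and lists the contexts that two words share. It is a minimal example under stated assumptions, not the project's extraction pipeline: the function names (`context_counts`, `shared_contexts`), the one-word window, and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=1):
    """Map each word to a Counter of the (left, right) contexts it occurs in."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so that words at sentence boundaries still have a full context.
        tokens = ["<s>"] * window + sentence.lower().split() + ["</s>"] * window
        for i in range(window, len(tokens) - window):
            left = tuple(tokens[i - window:i])
            right = tuple(tokens[i + 1:i + 1 + window])
            contexts[tokens[i]][(left, right)] += 1
    return contexts

def shared_contexts(contexts, a, b):
    """Return the set of contexts in which both words were observed."""
    return set(contexts.get(a, {})) & set(contexts.get(b, {}))

# Toy corpus (hypothetical); real input would be documents, query logs, or anchor text.
corpus = [
    "cheap car rental in boston",
    "cheap auto rental in boston",
    "used car dealers",
    "used auto dealers",
    "car wash near me",
]
ctx = context_counts(corpus)
common = shared_contexts(ctx, "car", "auto")
print(sorted(common))
```

Here "car" and "auto" share the contexts "cheap _ rental" and "used _ dealers", which is the kind of evidence that would make them candidate quasi-synonyms. Raising `window` gives wider n-gram contexts in place of single neighboring words.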

We have also been developing Map/Reduce skills among graduate students, not just in our research group but in the Department at large. PI Allan, along with Prof. David Smith (senior personnel) and another colleague, ran a seminar in Spring 2009 on using Map/Reduce for large data tasks. A total of eight students participated in the readings, class discussions, and a final project. The class also included a visiting researcher from China.

Findings:

We have shown (Dang and Croft, technical report 2009) that, when judging whether two words are quasi-synonyms, it is important to incorporate the number (and quality) of contexts that the words share, and that there is also value in noting the contexts where they disagree.
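
One way to picture this finding is a score that rewards shared contexts and subtracts a penalty for contexts seen with only one of the two words. The sketch below is purely illustrative and is not the scoring function from the technical report: the name `context_score`, the `penalty` weight, and the use of raw counts as a stand-in for context "quality" are all assumptions.

```python
from collections import Counter

def context_score(ctx_a, ctx_b, penalty=0.5):
    """Score two words from their context counts (hypothetical formula).

    Agreement: for each shared context, the smaller of the two counts.
    Disagreement: total count of contexts seen with only one of the words.
    """
    shared = set(ctx_a) & set(ctx_b)
    only_one = set(ctx_a) ^ set(ctx_b)
    agreement = sum(min(ctx_a[c], ctx_b[c]) for c in shared)
    disagreement = sum(ctx_a.get(c, 0) + ctx_b.get(c, 0) for c in only_one)
    return agreement - penalty * disagreement

# Hypothetical context counts for two candidate words.
ctx_a = Counter({"cheap _ rental": 2, "_ wash": 1})
ctx_b = Counter({"cheap _ rental": 1, "_ pilot": 3})

score_with_penalty = context_score(ctx_a, ctx_b)            # shared and disagreeing contexts
score_shared_only = context_score(ctx_a, ctx_b, penalty=0)  # shared contexts only
```

Setting `penalty=0` reduces the score to shared-context evidence alone, which makes it easy to compare the two variants on the same candidate pairs.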

We have shown (Dang and Croft, WSDM 2009) that anchor text can be an effective substitute for query log information when looking for synonyms.

When quasi-synonyms are incorporated into the retrieval process, our results show that it is better to use those new words to expand the query rather than to just substitute 'better' words into the query (Dang and Croft, WSDM 2009).
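
The difference between the two strategies can be sketched as follows. This is a minimal illustration, not the query-processing code used in the experiments: the function names, the synonym table, and the choice to substitute the first listed synonym are all hypothetical.

```python
def substitute(query, synonyms):
    """Replace each term that has quasi-synonyms with its first listed one."""
    return [synonyms[t][0] if synonyms.get(t) else t for t in query.split()]

def expand(query, synonyms):
    """Keep every original term and add its quasi-synonyms as alternatives."""
    return [[t] + synonyms.get(t, []) for t in query.split()]

# Hypothetical quasi-synonym table.
synonyms = {"car": ["auto", "automobile"]}

substituted = substitute("cheap car rental", synonyms)
expanded = expand("cheap car rental", synonyms)
print(substituted)
print(expanded)
```

With substitution the original term "car" is lost, so a poor quasi-synonym can hurt the query; with expansion the original term is retained and the alternatives are added alongside it.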

This work is supported by the National Science Foundation (Award Number IIS-0844226).