CluE Project, Summary of Annual Report, January 2011

National Science Foundation Award Number: IIS-0844226
Cluster Exploratory (CluE) project

Research and Education Activities:

The primary and original goal of our work in CluE is to use large corpora to find synonyms of words and phrases that are useful for improving retrieval effectiveness. Because these techniques are statistical, they often find related words that a person might not classify as synonyms, so we also call them quasi-synonyms. We have been exploring a range of methods for (1) extracting quasi-synonyms, (2) evaluating the quality of extracted quasi-synonyms, (3) incorporating quasi-synonyms into query processing, and (4) evaluating the impact of that query processing on document retrieval.

So far we have used word context to identify quasi-synonyms. That is, if two words are frequently used in the same contexts, they are likely to be quasi-synonyms. We have identified related words from document text, from query logs, and from anchor text, using both unigram contexts and n-gram contexts. We have evaluated the quasi-synonyms we found in three ways: by inspection, by having human annotators judge some instances, and by comparing them to synonyms from existing thesauruses. We have evaluated the effectiveness of query reformulation on public TREC collections.
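The context-based idea can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the one-word context window, and the use of cosine similarity are all assumptions chosen for illustration. Each word is represented by counts of the neighboring words it appears with, and words with similar count vectors are proposed as quasi-synonyms.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=1):
    """Map each word to counts of its (offset, neighbor) contexts.

    The window size and whitespace tokenization are illustrative assumptions.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for offset in range(-window, window + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(tokens):
                    vectors[word][(offset, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def quasi_synonyms(word, vectors, top_k=3):
    """Rank other words by how similar their contexts are to those of `word`."""
    target = vectors.get(word, Counter())
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]
```

On a toy corpus in which "car" and "automobile" occur in identical positions, `quasi_synonyms("car", context_vectors(sentences))` would rank "automobile" first. At corpus scale the counting step would be distributed, and raw counts would likely be replaced by a weighted association measure.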

Particularly in the second year of our work on this project, we have extended our investigations to other types of term and phrase relationships. We are still looking at massive-scale data analysis, a key challenge of the CluE program. However, we also consider methods for supporting and estimating document-document (or page-page) similarities at a massive scale to improve aspects of retrieval that depend upon cross-corpus comparisons. We have further worked on finding term-document relationships that allow us to build better queries to improve recall, including finding additional documents (pages) that are highly similar to the one at hand. In addition, we have explored methods for extracting anchor text from Web page links and using that information to boost retrieval effectiveness, much as query logs are used.

The underlying goal of all work on this project has been to find statistical relationships between terms, between documents, and between terms and documents -- statistical relationships that in turn can be leveraged to improve retrieval accuracy. Some of our efforts have not borne fruit, but most have led to papers that have appeared in or been submitted to major research conferences, including SIGIR, CIKM, and WSDM.

We have also been developing Map/Reduce skills in our graduate students, not just in our research group but in the Department at large. PI Allan, along with Prof. David Smith (senior personnel) and another colleague, ran a seminar in Spring 2009 on using Map/Reduce for large data tasks. A total of eight students participated in the readings, class discussions, and a final project. The class also included a visiting researcher from China.

Findings:

We have shown (Huston et al., WSDM 2011) methods for identifying repeated n-gram phrases in a massive-scale collection using a cluster environment. The method offers an important tradeoff between speed and temporary storage space, scales almost linearly in the length of the sequence, and provides a uniform workload balance across the processors in the cluster.
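To make the task concrete, the following is a naive single-machine sketch of repeated n-gram detection, written in a map/reduce shape. It is not the method of Huston et al. and has none of its speed, storage, or load-balancing properties; the function names and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import defaultdict


def map_ngrams(doc_id, text, n):
    """Map step: emit (n-gram, (doc_id, position)) for every n-gram in a document."""
    tokens = text.lower().split()
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n]), (doc_id, i)


def reduce_repeated(pairs, min_count=2):
    """Reduce step: group positions by n-gram and keep n-grams seen at least min_count times."""
    grouped = defaultdict(list)
    for ngram, position in pairs:
        grouped[ngram].append(position)
    return {ng: pos for ng, pos in grouped.items() if len(pos) >= min_count}


def repeated_ngrams(docs, n, min_count=2):
    """Run the map step over all documents, then the reduce step over the emitted pairs."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_ngrams(doc_id, text, n)
    )
    return reduce_repeated(pairs, min_count)
```

This naive version materializes every n-gram occurrence before filtering, so its temporary storage grows with the collection size and with n. That cost is the kind of speed-versus-space concern the tradeoff above refers to.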

We have shown (Cartright et al., CIKM 2010) that query expansion can be performed quickly enough to be feasible in a production environment. We showed an approach that depends upon document-document similarities and presented ways to scale that approach to massive document collections. Our approximation methods reduced the running time by several orders of magnitude without adversely impacting effectiveness.

This work is supported by the National Science Foundation (Award Number IIS-0844226).