CluE Project, Summary of Final Report, January 2012

National Science Foundation Award Number: IIS-0844226
Cluster Exploratory (CluE) project

Summary of final report, January 2012

Research and Education Activities:

The primary and original goal of our work in CLUE was to use large corpora to find synonyms of words and phrases that are useful to improve retrieval effectiveness. These techniques are statistical and so often find related words that a person might not classify as a synonym, so we also call them quasi-synonyms. We explored a range of methods for (1) extracting quasi-synonyms, (2) evaluating the quality of extracted quasi-synonyms, (3) incorporating quasi-synonyms into query processing, and (4) evaluating the impact of that query processing on document retrieval.

We primarily used word context to identify quasi-synonyms. That is, if two words are used in the same context frequently, they are likely to be quasi-synonyms. We identified related words from document text, from query logs, and from anchor text. We used unigram contexts and n-gram contexts. We evaluated quasi-synonyms we found by inspection, by having human annotators look at some instances, and by comparing to synonyms from existing thesauruses. We evaluated the effectiveness of query reformulation on public TREC collections.

Particularly in the second year of our work on this project, we extended our investigations to other types of term and phrase relationships. We looked at massive-scale data analysis, a key challenge of the CLUE program. However, we also considered methods for supporting and estimating document-document (or page-page) similarities at a massive scale to improve aspects of retrieval that depend upon cross-corpus comparisons. We further worked on finding term-document relationships that allow us to build better queries to improve recall, including finding additional documents (pages) that are highly similar to the one at hand. In addition, we explored methods for extracting anchor text from Web page links and using that information to boost retrieval effectiveness in ways similar to the ways that query logs are used.

In the project’s extension, we wrapped up several of those projects. We also trained a new student in approaches for processing massive data. That student was given a new problem, but it still relied upon finding similar documents at a large scale very efficiently, so it meshes cleanly with and benefited from the other work on this project.

The underlying goal of all work on this project was to find statistical relationships between terms, between documents, and between terms and documents -- statistical relationships that in turn can be leveraged to improve retrieval accuracy. Some of our efforts did not bear fruit, but most appeared in or were submitted to major research conferences, including SIGIR, CIKM, and WSDM.

We also developed Map/Reduce skills in our graduate students, in not just our research group but in the Department at large. PI Allan, along with Prof. David Smith (senior personnel) and another colleague (R. Manmatha, not funded by this project) ran a seminar in Spring 2009 on using Map/Reduce for large data tasks. A total of eight students participated in the readings, class discussions, and a final project. The class also included a visiting researcher from China. Several of the class projects were the foundation of later research papers.

Findings:

We have shown (Huston et al, WSDM 2011) methods for identifying repeated n-gram phrases in a massive scale collection using a cluster environment. The method offers an important tradeoff between speed and temporary storage space, scales almost linearly in the length of the sequence, and provides a uniform workload balance across the processors in the cluster.

We have shown (Cartright et al, CIKM 2010) that query expansion can be performed extremely quickly -- so that it is feasible in a production environment. We showed an approach that depends upon document-document similarities and presented ways to scale that approach to massive document collections. Our approximation methods reduced the running time by several orders of magnitude without adversely impacting effectiveness.

We have shown (Dang et al., technical report 2009) that it is important to incorporate the number (and quality) of contexts that are shared, but also that there is value in noting contexts where they disagree.

We have shown (Dang and Croft, WSDM 2010; Dang et al., 2011 technical report) that anchor text can be an effective substitute for query log information when looking for synonyms.

When quasi-synonyms are incorporated into the retrieval process, our results show that it is better to use those new words to expand the query rather than to just substitute 'better' words into the query (Dang and Croft, WSDM 2010).

We have shown (Yi and Allan, CIKM 2011; Yi, PhD thesis 2011) that for Web search, it is possible to improve the retrieval of documents by “borrowing” click-through information from semantically similar web pages.

We have shown (Kim et al., SIGIR 2011) that it is possible to use a decision tree to determine and propose effective alternative Boolean query suggestions for professional searchers.

We explored (Cartright et al., CIKM 2011 workshop) the problem of finding evidence in support (or in contradiction) of sentences in a corpus. We showed preliminary work that suggested it was possible to find book pages to support sentences within the Wikipedia. (We have been unable to improve upon initial efforts in this work.)

We have shown (Dalton et al., CIKM 2011) that when performing Natural Language Tasks such as entity finding, using unlabeled but semantically similar passages to boost the training data can improve accuracy. These results are particularly true when there is a mismatch between the corpus being labeled and the training data.

This work is supported by the National Science Foundation (Award Number IIS-0844226).