TIDES 2 Project: Formal Frameworks and Empirical Evaluations for Information Organization

Principal Investigators:

W. Bruce Croft, PI
James Allan, Co-PI
Center for Intelligent Information Retrieval (CIIR)
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, MA 01003-9264

Project Overview

The Translingual Information Detection, Extraction, and Summarization (TIDES) program is a DARPA-sponsored initiative on fast machine translation and information access. DARPA's two ultimate goals are (1) to build a machine translation system for a new low-density language within one week and to sort "communications" into relevant or not relevant correctly in 80% of cases, and (2) to demonstrate a question-answering capability against real-time data feeds from both English and non-English sources.

For the TIDES 2 project, the Center for Intelligent Information Retrieval seeks significant improvements in information technology through investigation of formal frameworks and their application in empirical evaluations. The CIIR is focusing on (1) improvements of cross-language methodologies; (2) tracking, identifying, describing, and monitoring changes in events in the news; and (3) high accuracy techniques for information retrieval based on incorporating user and context information.

The core of the CIIR’s approach is the development and application of statistical techniques as a theoretical and practical basis for information organization tasks. For TIDES this primarily takes the form of statistical language modeling algorithms. The CIIR was the first to demonstrate that language models are as effective as state-of-the-art heuristically derived systems for information retrieval. Much of the CIIR’s current work extends the same approach and ideas to cross-language retrieval, topic detection and tracking, high-accuracy retrieval, and summarization.
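As a minimal illustration of the language modeling approach to retrieval, the sketch below ranks documents by query likelihood under a Dirichlet-smoothed unigram model. The toy documents and the smoothing parameter are illustrative assumptions, not the CIIR's actual implementation.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=2000):
    """Score a document by the log-likelihood of generating the query
    under a Dirichlet-smoothed unigram language model of the document."""
    doc_tf = Counter(doc)
    col_tf = Counter(collection)
    col_len = len(collection)
    score = 0.0
    for term in query:
        p_col = col_tf[term] / col_len  # background (collection) probability
        p = (doc_tf[term] + mu * p_col) / (len(doc) + mu)
        if p == 0.0:
            return float("-inf")  # query term unseen anywhere in the collection
        score += math.log(p)
    return score

# Toy collection: rank documents for the query ["language", "models"].
docs = [["language", "models", "rank", "documents"],
        ["heuristic", "retrieval", "systems"]]
collection = [t for d in docs for t in d]
ranked = sorted(docs,
                key=lambda d: query_likelihood(["language", "models"], d, collection),
                reverse=True)
```

Smoothing is what makes this practical: a document missing one query term still receives a nonzero score from the collection model rather than a zero probability.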

Current Plan

The CIIR’s major efforts within TIDES are:

  • Developing an understanding of how effective information retrieval and clustering systems have to be in order to substantially reduce the time it takes a person to accomplish a task. This work is done by simulating system output at a range of qualities and measuring user time on task and effectiveness.
  • Work on Topic Detection and Tracking in the tracking, clustering, story linking, and new event detection evaluations. This work pushes to make the clustering task more realistic by allowing overlapping and hierarchically organized clusters. We are also actively recasting the new event detection task to make it more practical and user-focused.
  • Continuing work on “event threading,” where the goal is to improve understanding of novelty of events discussed in streaming media. The CIIR anticipates that event threading may be a “replacement” for TDT in the long run.
  • Developing techniques for substantially improving the accuracy of information retrieval systems, particularly at the top of the ranked list. Toward this end, the CIIR is coordinating the TREC tracks on high accuracy retrieval (HARD). Researchers are also investigating a range of approaches, mostly based in language modeling techniques and various types of semi-supervised machine learning, for improving the accuracy and stability of retrieval.

Recent Accomplishments

  • IFE and technology transfer activities

    • Continued to provide and maintain a TDT cluster detection server for the IFE exercises. The server was available 24/7 as part of a distributed environment managed by BBN. (The IFE clustering system was shut down in late December as it is no longer needed for TIDES research.)
  • Topic Detection and Tracking (TDT) and Cross-language efforts

    • Participated in the TDT 2004 evaluation and meeting.
    • The UMass multi-lingual approach to the story link detection task was proven sound on new data (i.e., the TDT-5 corpus). Our run using native language comparisons with the relevance model approach was the top-performing system in the TDT 2004 Primary Link Detection evaluation.
    • Demonstrated the utility of named entities in the new event detection task (Kumaran and Allan, SIGIR 2004; and Kumaran, Allan, and McCallum, CIIR Technical Report 2004). We extended the named entity-based approaches to consider clusters of similar stories, and modified confidence scores based on additional evidence obtained from these clusters. We achieved consistent improvements of over 10% on the TDT-2, TDT-3, and TDT-4 corpora. However, the technique did not carry over to the TDT-5 corpus (TDT 2004 evaluation) for reasons that are still being investigated. Surprisingly, a basic vector space model approach performed best on the TDT-5 corpus.
    • Extensively investigated evaluation models for hierarchical topic detection (the TDT 2004 extension of the traditional TDT clustering task). The results were used at the evaluation meeting to help understand the results of this year’s evaluation and to plan for future TDT evaluations.
    • Analyzed the results of unsupervised tracking systems over all years of the TDT evaluation with corresponding systems and data-sets. This process included generating new baselines using a simple vector space approach to the task. That baseline system serves as a starting point for every year’s evaluation, making it clear where advances have (or have not) occurred. Preliminary results suggest that TDT tracking technology has not improved (relative to the baseline) as much as previously believed. On-going failure analysis is likely to indicate future directions of improvements.
    • Developed a generative model of names for proper name correction in speech-to-text output (Raghavan et al, CIIR technical report 2004), resulting in some improvements in spoken document retrieval. In addition, we found that normalizing names using Soundex (Raghavan et al, HLT Workshop, 2004) did not improve tracking effectiveness on the TDT-4 and TDT-5 corpora.
    • Studied the impact of active learning and of better exploitation of the user’s prior knowledge on supervised tracking (the TDT 2004 tracking task). Preliminary results suggest that there is much to be gained in news filtering by this mechanism.
    • Currently performing Arabic IR experiments comparing our light Arabic stemmer with a morphological analyzer. We previously developed the light10 stemmer, which is now widely used (e.g., in the open source Lemur system) and is considered state-of-the-art. Although the morphological analyzer performs a far more sophisticated and “correct” analysis than the light stemmer, it is not better than the light stemmer for the purposes of IR. These results will be presented in a chapter in a forthcoming book on Arabic morphological analysis.
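The Soundex normalization tested in the name tracking experiments above can be sketched as follows. This is the standard American Soundex algorithm (first letter plus three digits from consonant classes, with h/w not breaking a run of identical codes); the exact variant used in the experiments may differ.

```python
def soundex(name):
    """Standard American Soundex: keep the first letter, then encode
    remaining consonants as digits, collapsing adjacent duplicates."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":        # h and w do not separate duplicate codes
            continue
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code           # vowels reset prev, allowing repeats
    return (first + "".join(digits) + "000")[:4]
```

For example, "Robert" and "Rupert" both map to R163, which is exactly why Soundex can conflate distinct names and fail to help tracking, as observed above.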
  • High Accuracy Retrieval from Documents (HARD)

    • Participated in the HARD track of TREC 2004. We successfully leveraged our work in query metadata, interactive query expansion, and passage retrieval at TREC 2004. In particular, the CIIR submissions consistently ranked among the top performing runs; for some measures, the CIIR submissions were the best runs across all sites. This indicates that the paths chosen and directions charted provide a reasonable solution to core HARD tasks. In light of these compelling results, we are drafting two papers for submission to SIGIR 2005. Our first submission will focus on the combination of query metadata and user feedback with traditional query representations. Our second submission will focus on our method for retrieving passages.

      • Metadata: Our metadata runs were among the best scoring runs for the track. For the mean average precision metric, one run ranked first for soft relevance and ranked fifth for hard relevance. We found the use of the related text metadata to significantly aid retrieval performance. Related text is a text excerpt given by the user as an example of either an on-topic or relevant piece of text. The use of genre and geography metadata did not significantly aid retrieval over strong baselines. We believe that the power of the genre and geography metadata is limited for verbose queries. Recently, we have made new progress on utilizing the genre and geography metadata for short queries, but do not yet have concrete results to report.
      • Interactive query expansion: We explored approaches to interactive query expansion in which named entities are the feedback unit, reasoning that named entities are simpler than passages or documents, and should thus be easier for users to judge. We found comparable performance on the HARD measures for named entities as for passages, and we are continuing to explore similar low-cost feedback items.
      • Passage retrieval: We developed a new retrieval method for retrieving fixed-length passages that significantly outperforms strong baselines. This method performs 26-35% better than our previous best methods on the TREC 2004 evaluation metric of binary preference at 12,000 characters.
    • Continued work on predicting the effectiveness of query expansion (CIKM 2004 poster) by improving comparison methods for ranked lists of documents. Query expansion is known to improve retrieval accuracy substantially when it is correctly applied, so predicting when to apply it should improve results. (This work will be submitted to SIGIR 2005.)
    • Showed that for ad hoc sentence retrieval (a variation of passage retrieval used in HARD), estimating the score of a sentence with a translation model gives an improvement of 8.5% over the query likelihood baseline in mean reciprocal rank over the top 5 (MRR@5). Two-part smoothing (from the document as well as the collection) improves MRR@5 by a factor of about 2.5 (from 0.117 to 0.300), with a similar improvement in precision at rank one (from 0.06 to 0.193). This project used the title queries and the sentence-level relevance judgments from the TREC novelty track data (but not the provided documents; we did a separate retrieval from the TREC collections using the title queries) and so is applicable both to the HARD task and to the novelty (retrieval) task of TREC. (This work will be submitted to SIGIR 2005.)
    • We demonstrated that for sentence retrieval from definition-style queries (“what is an X?” or “who is Y?”), identifying definitional surface text patterns using conditional random fields improves precision at rank one by a factor of 2.5 over the query likelihood baseline (from 0.182 to 0.455), with a similar improvement in mean reciprocal rank over the top 5.
    • Started investigating smoothing approaches for bigram models of retrieval, with the goal of optimizing their robustness and effectiveness in information retrieval.
    • Began the development of a more elaborate model of information retrieval that can readily incorporate feedback. The model is a simple generative graphical model that integrates different scenarios, such as relevance feedback, pseudo relevance feedback, or a combination of both, into a unified framework. The model considers retrieval as a classification problem and treats model estimation as a learning problem and ranking as inference using the EM algorithm. We developed a novel modification of the Dirichlet distribution (“smoothed Dirichlet”) as our generative component and showed that this distribution overcomes some of the limitations of the standard multinomial distribution popularly used in IR. We also showed that the standard KL-divergence ranking function used in IR emerges naturally from the inference mechanism of our graphical model. Our experiments demonstrate that the new model performs well in all settings, achieving statistically significant improvements over the baselines. (This work will be submitted to SIGIR 2005.)
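The standard KL-divergence ranking function referred to above can be sketched as follows. Documents are ranked by -KL(query model || document model); dropping the query-entropy term, which is constant per query, leaves the query-weighted log probability under a Dirichlet-smoothed document model, which is rank-equivalent. The toy data and the value of mu are illustrative, not the CIIR's actual configuration.

```python
import math
from collections import Counter

def kl_rank_score(query_model, doc, collection_model, mu=1000):
    """Rank-equivalent form of -KL(query || doc): sum over query terms of
    q(t) * log p(t | smoothed document model). Higher is better."""
    doc_tf = Counter(doc)
    score = 0.0
    for term, q_p in query_model.items():
        d_p = (doc_tf[term] + mu * collection_model.get(term, 0.0)) / (len(doc) + mu)
        if d_p > 0.0:
            score += q_p * math.log(d_p)
    return score

# Toy example: a query model over two terms, two candidate documents.
doc1 = ["relevance", "feedback", "model"]
doc2 = ["graphical", "model", "inference"]
all_tokens = doc1 + doc2
collection_model = {t: c / len(all_tokens) for t, c in Counter(all_tokens).items()}
query_model = {"relevance": 0.5, "feedback": 0.5}
```

Feedback fits naturally into this form: relevance or pseudo-relevance evidence re-estimates `query_model` while the ranking function itself stays unchanged.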
  • Novelty Detection and Event Threading (not TDT)

    • We participated in all four tasks in the TREC 2004 Novelty Track. New named entities as well as new words were considered for identifying novel sentences. Sentence pair-wise similarities were also calculated and the cutoff thresholds for eliminating redundant sentences were tuned on training data. The results of this track showed that our runs for two of the four tasks were top-performing. The first (Task 2) was to identify all novel sentences given all relevant sentences from documents associated with the topic. The other (Task 4) was to find novel sentences in the remaining documents given all relevant sentences from all documents and novel sentences from the first five documents.

      We continued to develop “answer updating” approaches to novelty detection. In the proposed answer-updating approaches, an answer model is estimated for each relevant sentence and novel sentences are assumed to be those sentences for which the answer model is different from the answer models of all previous sentences. Currently we are trying to develop different techniques for constructing answer models. We are using the data from TREC 2003 and 2004 novelty tracks for our experiments.

      The other two tasks (Tasks 1 and 3) required a novelty detection system to find relevant sentences first and then identify novel sentences among them; the performance of relevant sentence retrieval therefore directly affects the performance of the novelty detection component. We have developed a more robust relevance model in the language modeling framework for ad hoc retrieval. The new relevance model can be applied to document retrieval with both pseudo feedback and true relevance feedback. We carried out experiments with TREC queries 101-200 on the AP collection. The results showed that the new relevance model achieves better performance than the original relevance model of Lavrenko and Croft (SIGIR 2001) and is not sensitive to the number of pseudo feedback documents. (We are planning to submit a paper on this work to SIGIR 2005.)

    • Continued to explore “event threading” (Nallapati et al., CIKM 2004), though much of this work was set aside for the TDT evaluation. The idea was applied in the hierarchical topic detection task in our TDT 2004 submission, although it did not show an obvious improvement in performance.
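The pairwise-similarity redundancy filtering used in the novelty runs above can be sketched as follows: a sentence is kept as novel only if its similarity to every previously seen sentence falls below a tuned cutoff. Cosine over raw term counts and the threshold value here are illustrative assumptions; the actual runs also used new-word and new-named-entity evidence.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def novel_sentences(sentences, threshold=0.6):
    """Flag a sentence as novel if its similarity to every previously
    seen sentence is below the redundancy threshold."""
    seen, novel = [], []
    for s in sentences:
        vec = Counter(s.lower().split())
        if all(cosine(vec, prev) < threshold for prev in seen):
            novel.append(s)
        seen.append(vec)
    return novel
```

The threshold is the quantity tuned on training data: too low and genuinely new sentences are discarded, too high and near-duplicates leak through.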
  • Utility experiments

    • Executed the full utility study from October through the beginning of December. Collected data at various levels of system accuracy across two variables: passage retrieval accuracy and passage clustering accuracy. Tested three levels of clustering accuracy and eight levels of retrieval accuracy. Ran a total of 45 queries with 31 users. Results revealed that as passage retrieval accuracy increases, raw time on task decreases linearly. Also as retrieval accuracy increases, users are able to find relevant material faster, and the time lapse between finding individual relevant answers decreases. Interestingly, clustering accuracy seemed to have no effect whatsoever on user utility.
  • TIDES administration

    • Served as area coordinator for TIDES detection.
    • Coordinated and directed TDT 2004 evaluations. Began plans for a TDT 2005 evaluation.
    • Coordinated and directed the TREC 2004 HARD track and began planning for a TREC 2005 track.

For more information on our current research, view our technical publications.

TIDES 1 Project: Tools for Rapidly Adaptable Translingual Information Retrieval and Organization

The CIIR's research objective for the TIDES 1 project was to seek significant improvements in information technologies through investigation of formal frameworks and empirical evaluations. We focused on (1) improvements in cross-language information retrieval of low-density languages through the use of an intermediate language; (2) tracking, identifying, describing, and monitoring changes in events in the news; and (3) summarization and visualization techniques for rapidly conveying the content of multiple documents to a user.

This work was supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by SPAWARSYSCEN-SD grant number N66001-99-1-8912 and SPAWARSYSCEN-SD grant number N66001-02-1-8903.