TIDES 2 Project: Formal Frameworks and Empirical Evaluations for Information Organization
Principal Investigators:
W. Bruce Croft, PI
James Allan, Co-PI
Center for Intelligent Information Retrieval (CIIR)
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, MA 01003-9264
Project Overview
The Translingual Information Detection, Extraction, and Summarization (TIDES) program is a DARPA-sponsored initiative on fast machine translation and information access. DARPA's two ultimate goals are (1) to build a machine translation system for a new low-density language in one week and to sort "communications" into relevant or not relevant for 80% of the cases, and (2) to demonstrate a question answering capability against real-time data feeds from both English and non-English sources.
For the TIDES 2 project, the Center for Intelligent Information Retrieval seeks significant improvements in information technology through investigation of formal frameworks and their application in empirical evaluations. The CIIR is focusing on (1) improvements in cross-language methodologies; (2) tracking, identifying, describing, and monitoring changes in events in the news; and (3) high accuracy techniques for information retrieval based on incorporating user and context information.
The core of the CIIR’s approach is the development and application of statistical techniques as a theoretical and practical basis for information organization tasks. For TIDES this primarily takes the form of statistical language modeling algorithms. The CIIR was the first to demonstrate that language models are as effective as state-of-the-art, heuristically derived systems for information retrieval. Much of the CIIR’s current work focuses on extending the same approach and ideas to cross-language retrieval, topic detection and tracking, high accuracy retrieval, and summarization.
Current Plan
The CIIR’s major efforts within TIDES are:
- Developing an understanding of how effective information retrieval and clustering systems have to be in order to substantially reduce the time it takes a person to accomplish a task. This work is done by simulating system output at a range of qualities and measuring user time on task and effectiveness.
- Working on Topic Detection and Tracking (TDT) in the tracking, clustering, story linking, and new event detection evaluations. This work focuses on making the clustering task more realistic by allowing overlapping and hierarchically organized clusters. We are also actively recasting the new event detection task to make it more practical and user-focused.
- Continuing work on “event threading,” where the goal is to improve understanding of the novelty of events discussed in streaming media. The CIIR anticipates that event threading may be a “replacement” for TDT in the long run.
- Developing techniques for substantially improving the accuracy of information retrieval systems, particularly at the top of the ranked list. Toward this end, the CIIR is coordinating the TREC tracks on high accuracy retrieval (HARD). Researchers are also investigating a range of approaches, mostly based on language modeling techniques and various types of semi-supervised machine learning, for improving the accuracy and stability of retrieval.
Recent Accomplishments
- IFE and technology transfer activities
- Continued to provide and maintain a TDT cluster detection server for the IFE exercises. The server was available 24/7 as part of a distributed environment managed by BBN. (The IFE clustering system was shut down in late December as it is no longer needed for TIDES research.)
- Topic Detection and Tracking (TDT) and cross-language efforts
- Participated in the TDT 2004 evaluation and meeting.
- The UMass multi-lingual approach to the story link detection task proved sound on new data (i.e., the TDT-5 corpus). Our run using native language comparisons with the relevance model approach was the top-performing system in the TDT 2004 Primary Link Detection evaluation.
- Demonstrated the utility of named entities in the new event detection task (Kumaran and Allan, SIGIR 2004; Kumaran, Allan, and McCallum, CIIR Technical Report 2004). We extended the named entity-based approaches to consider clusters of similar stories, and modified confidence scores based on additional evidence obtained from these clusters. We achieved consistent improvements of over 10% on the TDT-2, TDT-3, and TDT-4 corpora. However, the technique did not carry over to the TDT-5 corpus (TDT 2004 evaluation) for reasons that are still being investigated. Surprisingly, a basic vector space model approach performed best on the TDT-5 corpus.
- Extensively investigated evaluation models for hierarchical topic detection (the TDT 2004 extension of the traditional TDT clustering task). The results were used at the evaluation meeting to help understand the results of this year’s evaluation and to plan for future TDT evaluations.
- Analyzed the results of unsupervised tracking systems over all years of the TDT evaluation with corresponding systems and data sets. This process included generating new baselines using a simple vector space approach to the task. That baseline system serves as a starting point for every year’s evaluation, making it clear where advances have (or have not) occurred. Preliminary results suggest that TDT tracking technology has not improved (relative to the baseline) as much as previously believed. Ongoing failure analysis is likely to indicate future directions for improvement.
- Developed a generative model of names for proper name correction in speech-to-text output (Raghavan et al., CIIR Technical Report 2004), resulting in some improvements in spoken document retrieval. In addition, we found that normalizing names using Soundex (Raghavan et al., HLT Workshop, 2004) did not improve tracking effectiveness on the TDT-4 and TDT-5 corpora.
- Studied the impact of active learning and of better exploitation of the user’s prior knowledge on supervised tracking (the TDT 2004 tracking task). Preliminary results suggest that there is much to be gained in news filtering by this mechanism.
- Currently performing Arabic IR experiments comparing our light Arabic stemmer with a morphological analyzer. In the past we developed the light10 stemmer, which is now widely used (e.g., in the open source Lemur system) and is considered state-of-the-art. Although the morphological analyzer performs a far more sophisticated and “correct” analysis than the light stemmer, it is not better than the light stemmer for the purposes of IR. These results will be presented in a chapter in a forthcoming book on Arabic morphological analysis.
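The Soundex normalization of names mentioned in the TDT items above follows a fixed, well-known coding scheme, so it can be illustrated directly. The following is a minimal sketch of standard American Soundex in Python. The function name `soundex` and the example names are illustrative assumptions; this is not the code used in the experiments reported here.

```python
# Digit codes for consonants under standard American Soundex.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    code = first.upper()
    prev = _CODES.get(first, "")  # code of the previously processed letter
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        digit = _CODES.get(ch, "")  # vowels and y have no code
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code after it counts
    return (code + "000")[:4]  # pad with zeros or truncate to 4 characters


if __name__ == "__main__":
    # Spelling variants that sound alike map to the same code.
    for variant in ("Smith", "Smyth", "Robert", "Rupert"):
        print(variant, soundex(variant))
```

Because the first letter is always kept, variants that differ in their initial letter (for example, transliterations beginning with Q versus K) still receive different codes, which is one limitation of this kind of normalization.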
- High Accuracy Retrieval from Documents (HARD)
- Participated in the HARD track of TREC 2004, drawing on our work in query metadata, interactive query expansion, and passage retrieval. In particular, the CIIR submissions consistently ranked among the top performing runs; for some measures, the CIIR submissions were the best runs across all sites. This indicates that the paths chosen and directions charted provide a reasonable solution to core HARD tasks. In light of these results, we are drafting two papers for submission to SIGIR 2005. Our first submission will focus on the combination of query metadata and user feedback with traditional query representations. Our second submission will focus on our method for retrieving passages.
- Metadata: Our metadata runs were among the best scoring runs for the track. For the mean average precision metric, one run ranked first for soft relevance and fifth for hard relevance. We found that the related text metadata significantly aided retrieval performance. Related text is a text excerpt given by the user as an example of either an on-topic or relevant piece of text. The use of genre and geography metadata did not significantly aid retrieval over strong baselines. We believe that the power of the genre and geography metadata is limited for verbose queries. Recently, we have made new progress on utilizing the genre and geography metadata for short queries, but do not yet have concrete results to report.
- Interactive query expansion: We explored approaches to interactive query expansion in which named entities are the feedback unit, reasoning that named entities are simpler than passages or documents, and should thus be easier for users to judge. We found comparable performance on the HARD measures for named entities as for passages, and we are continuing to explore similar low-cost feedback items.
- Passage retrieval: We developed a new method for retrieving fixed-length passages that significantly outperforms strong baselines. This method performs 26-35% better than our previous best methods on the TREC 2004 evaluation metric of binary preference at 12,000 characters.
- Continued work on predicting the effectiveness of query expansion (CIKM 2004 poster) by improving comparison methods for ranked lists of documents. Query expansion is known to improve retrieval accuracy substantially when it is correctly applied, so predicting when to apply it should improve results. (This work will be submitted to SIGIR 2005.)
- Showed that for ad hoc sentence retrieval (a variation of passage retrieval used in HARD), estimating the score of a sentence with a translation model gives an improvement of 8.5% over the query likelihood baseline in mean reciprocal rank of the top 5. Two-part smoothing (from the document as well as the collection) yields an improvement of almost 300% in mean reciprocal rank of the top 5 (from 0.117 to 0.300), and a similar improvement in precision at rank one (from 0.06 to 0.193). This project used the title queries and the sentence-level relevance judgments from the TREC novelty track data (but not the provided documents; we did a separate retrieval from the TREC collections using the title queries) and so is applicable to both the HARD task and the novelty (retrieval) task of TREC. (This work will be submitted to SIGIR 2005.)
- Demonstrated that for sentence retrieval from definition-style queries (“what is an X?” or “who is Y?”), identifying definitional surface text patterns using conditional random fields improves precision at rank one by more than 200% over the query likelihood baseline (from 0.182 to 0.455), with a similar improvement in the mean reciprocal rank of the top 5.
- Started investigating smoothing approaches for bigram models of retrieval, with the goal of optimizing their robustness and effectiveness in information retrieval.
- Began the development of a more elaborate model of information retrieval that can readily incorporate feedback. The model is a simple generative graphical model that integrates different scenarios, such as relevance feedback, pseudo relevance feedback, or a combination of both, into a unified framework. The model considers retrieval as a classification problem, and treats model estimation as a learning problem and ranking as inference using the EM algorithm. We developed a novel modification of the Dirichlet distribution (“smoothed Dirichlet”) as our generative component and showed that this distribution overcomes some of the limitations of the standard multinomial distribution commonly used in IR. We also showed that the standard KL-divergence ranking function used in IR emerges naturally from the inference mechanism of our graphical model. Our experiments demonstrate that the new model performs well in all settings, achieving statistically significant improvements over the baselines. (This work will be submitted to SIGIR 2005.)
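As an illustration of the sentence retrieval setup described in the HARD items above, the sketch below scores a sentence by query likelihood with two-part smoothing (backing off to the containing document and to the collection) and computes mean reciprocal rank over the top 5. It is a minimal sketch under stated assumptions: the linear interpolation form, the weights `lam_doc` and `lam_coll`, and all function names are hypothetical, since the exact estimator used in the experiments is not given here.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum likelihood unigram model: relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def score_sentence(query, sentence, document, collection,
                   lam_doc=0.3, lam_coll=0.2):
    """Log query likelihood of a sentence, smoothed from document and collection.

    Assumed form: P(w) = (1 - lam_doc - lam_coll) * P(w|sentence)
                         + lam_doc * P(w|document) + lam_coll * P(w|collection)
    """
    p_s, p_d, p_c = mle(sentence), mle(document), mle(collection)
    lam_sent = 1.0 - lam_doc - lam_coll
    log_prob = 0.0
    for w in query:
        p = (lam_sent * p_s.get(w, 0.0)
             + lam_doc * p_d.get(w, 0.0)
             + lam_coll * p_c.get(w, 0.0))
        if p == 0.0:
            return float("-inf")  # term unseen even in the collection
        log_prob += math.log(p)
    return log_prob


def mean_reciprocal_rank(relevance_lists, k=5):
    """Mean of 1/rank of the first relevant item within the top k (0 if none)."""
    total = 0.0
    for flags in relevance_lists:
        for rank, is_relevant in enumerate(flags[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(relevance_lists) if relevance_lists else 0.0


if __name__ == "__main__":
    s1 = ["tides", "funds", "translation"]
    s2 = ["retrieval", "is", "funded"]
    document = s1 + s2
    collection = document + ["news", "stories", "about", "translation"]
    for sentence in (s1, s2):
        print(sentence, score_sentence(["translation"], sentence, document, collection))
```

The document-level term is what lets a sentence that lacks a query word still receive a non-trivial score when its surrounding document is on topic, which matters because sentences are too short to estimate a reliable model on their own.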
- Novelty Detection and Event Threading (not TDT)
- Participated in all four tasks in the TREC 2004 Novelty Track. New named entities as well as new words were considered for identifying novel sentences. Sentence pair-wise similarities were also calculated, and the cutoff thresholds for eliminating redundant sentences were tuned on training data. The results of this track showed that our runs for two of the four tasks were top-performing. The first (Task 2) was to identify all novel sentences given all relevant sentences from documents associated with the topic. The other (Task 4) was to find novel sentences in the remaining documents given all relevant sentences from all documents and novel sentences from the first five documents.
- Continued to develop “answer updating” approaches to novelty detection. In the proposed answer-updating approaches, an answer model is estimated for each relevant sentence, and novel sentences are assumed to be those sentences for which the answer model is different from the answer models of all previous sentences. Currently we are trying to develop different techniques for constructing answer models. We are using the data from the TREC 2003 and 2004 novelty tracks for our experiments.
- The other two Novelty Track tasks (Tasks 1 and 3) required a novelty detection system to find relevant sentences first and then identify novel sentences among them. Therefore the performance of relevant sentence retrieval directly affected the performance of the novel sentence detection part. We have developed a more robust relevance model in the language modeling framework for ad hoc retrieval. The new relevance model can be applied to document retrieval with both pseudo feedback and true relevance feedback. We have carried out experiments with TREC queries 101-200 on the AP collection. The experimental results showed that the new relevance model could achieve better performance than the original relevance model by Lavrenko and Croft (SIGIR 2001) and was not sensitive to the number of pseudo feedback documents. (We are planning to submit a paper on this work to SIGIR 2005.)
- Continued to explore “event threading” (Nallapati et al., CIKM 2004), though much of this work was set aside for the TDT evaluation. The idea was applied in the hierarchical topic detection task in our TDT 2004 submission, although it did not show an obvious improvement in performance.
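The Novelty Track items above combine two kinds of evidence: whether a sentence introduces new words, and whether it is too similar to an earlier sentence. The sketch below shows one simple way such a filter could be put together. It is an assumption-laden illustration only: the Jaccard similarity, the thresholds `min_new_words` and `max_similarity`, and the function names are hypothetical, and named entities are not modeled separately from other words.

```python
def jaccard(a, b):
    """Jaccard overlap between the word sets of two tokenized sentences."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def novel_sentences(sentences, min_new_words=1, max_similarity=0.8):
    """Return indices of sentences judged novel, processing them in order.

    A sentence is kept as novel when it contains at least `min_new_words`
    words not seen in any earlier sentence AND its highest similarity to
    any earlier sentence is below `max_similarity`.
    """
    seen_words = set()
    previous = []  # all earlier tokenized sentences, novel or not
    novel = []
    for idx, tokens in enumerate(sentences):
        new_words = set(tokens) - seen_words
        most_similar = max((jaccard(tokens, p) for p in previous), default=0.0)
        if len(new_words) >= min_new_words and most_similar < max_similarity:
            novel.append(idx)
        seen_words.update(tokens)
        previous.append(tokens)
    return novel


if __name__ == "__main__":
    relevant = [
        ["storm", "hits", "coast"],
        ["coast", "storm", "hits"],        # same words, reordered
        ["rescue", "teams", "arrive"],
    ]
    print(novel_sentences(relevant))
```

In practice the similarity cutoff would be tuned on training data, as the text describes, rather than fixed in advance.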
- Utility experiments
- Executed the full utility study from October through the beginning of December. Collected data at various levels of system accuracy across two variables: passage retrieval accuracy and passage clustering accuracy. Tested three levels of clustering accuracy and eight levels of retrieval accuracy. Ran a total of 45 queries with 31 users. Results revealed that as passage retrieval accuracy increases, raw time on task decreases linearly. Also, as retrieval accuracy increases, users are able to find relevant material faster, and the time lapse between finding individual relevant answers decreases. Interestingly, clustering accuracy seemed to have no effect whatsoever on user utility.
- TIDES administration
- Area coordinator for TIDES detection.
- Coordinated and directed TDT 2004 evaluations. Began plans for a TDT 2005 evaluation.
- Coordinated and directed the TREC 2004 HARD track and began planning for a TREC 2005 track.
For more information on our current research, view our technical publications.
TIDES 1 Project: Tools for Rapidly Adaptable Translingual Information Retrieval and Organization
The CIIR's research objective for the TIDES 1 project was to seek significant improvements in information technologies through investigation of formal frameworks and empirical evaluations. We focused on (1) improvements in cross-language information retrieval of low-density languages through the use of an intermediate language; (2) tracking, identifying, describing, and monitoring changes in events in the news; and (3) summarization and visualization techniques for rapidly conveying the content of multiple documents to a user.
This work was supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by SPAWARSYSCEN-SD grant number N66001-99-1-8912 and SPAWARSYSCEN-SD grant number N66001-02-1-8903.