CIIR weekly lab meetings
This page lists past, current, and upcoming CIIR lab meetings. Please feel free to edit the wiki to sign up for a meeting. Normally a meeting should include two 20-minute talks, each followed by a 5-minute Q&A period. Ideally, the two talks are connected in some way. If the connection is very strong, the two talks should be prepared with some collaboration.
The talks could be on your own work or any other interesting work that is related to the lab's general research direction. To stimulate new research ideas, please include a slide at the end that addresses future directions and open research questions related to that work.
All Spring 2006 meetings are held on Friday mornings in CS151 from 10am-11am, unless otherwise specified.
LMSpring05, LMFall04, LMSummer04, LMSpring04, LMFall03, LMSpring03
Spring 2006
* *February 17th 10-11am (Room 151)*
Special Guest Speaker Sihem Amer-Yahia from AT&T Labs
- Title: "XML Full-Text Search and Ranking"
- Abstract: A growing number of applications require access to a mix of structured and text content. There are many examples where XML has been adopted to represent such data. Querying XML is a well-explored topic with powerful database-style languages such as XPath/XQuery set to become W3C standards. However, these languages are not powerful enough to express full-text search queries. I will first present TeXQuery and its W3C successor, XQuery Full-Text, a full-text extension to XPath/XQuery which provides a rich set of fully composable full-text search primitives, such as keyword Boolean search, proximity distance, stemming and regular expressions, and gracefully combines them with structured search with XPath/XQuery. XQuery Full-Text enables the approximate matching of queries on both structure and content. I will describe a query semantics that consistently extends classical database semantics to account for approximate answers and appropriate scoring methods that are consistent with tf.idf. I will then give a brief overview of topK processing algorithms that we developed in this context.
Sihem Amer-Yahia is a Technical Specialist at AT&T Labs Research. She received her Ph.D. degree from the University of Paris XI-Orsay and INRIA. She has been working on various aspects related to XML query processing. More recently, she has focused on XML full-text search. Sihem is a co-editor of the XQuery Full-Text language specification and use cases published in September 2005 by the W3C Full-Text Task Force. She is the main developer of GalaTex (http://www.galaxquery.org/galatex), a conformance implementation of XQuery Full-Text. She is interested in research at the intersection of database and information retrieval.
* *February 24th 10-11am (Room 151)*
- Talk 1: Ben Carterette
- Abstract: TBA
- Talk 2: David Mimno
- Abstract: TBA
* *March 3rd 10-11am (Room 151)*
- Talk 1: Xuerui Wang
- Title: Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends
- Abstract: This paper presents an LDA-style topic model that captures not
only the low-dimensional structure of data, but also how the structure
changes over time. Unlike other recent work that relies on Markov
assumptions or discretization of time, here each topic is associated
with a continuous distribution over timestamps, and for each generated
document, the mixture distribution over topics is influenced by both
word co-occurrences and the document's timestamp. Thus, the meaning of
a particular topic can be relied upon as constant, but the topics'
occurrence and correlations change significantly over time. We present
results on nine months of personal email, 17 years of NIPS research
papers and over 200 years of presidential state-of-the-union addresses,
showing improved topics, better timestamp prediction, and interpretable
trends. Joint work with Andrew McCallum.
* *March 10th 10-11am (Room 151)*
- Talk 1: Mark Smucker
- Abstract: TBA
- Talk 2: Don Metzler
- Abstract: TBA
* *No meetings March 17th or 24th*
* *March 31st 10-11am (Room 151)*
- Talk 1: Hema Raghavan
- Abstract:
We empirically analyze the convergence speed of active learning on a variety
of text categorization problems and relate it to measures of problem difficulty
such as the feature set size required for the maximum achievable performance.
The speed of convergence is a measure of how quickly an active
learning algorithm converges to its best possible classification
performance on a given problem. Quickness or speed is a function of the number
of feedback iterations. A problem that needs many instances to converge (slow speed)
is considered difficult. We explore 4 difficulty measures (2 each of instance and
feature complexity respectively) which allow us to rank numerous categorization problems and experimental
test-beds based on their difficulty for active learning. Our feature complexity measures
capture many previous results in feature selection. We find that the speed of convergence
is inversely related (r=-0.7) to feature complexity. This has useful implications for future
research, especially in understanding a dual approach for active learning where the teacher
is asked to provide feedback on features in addition to labeling instances. We find that the
improvement in the speed of active learning brought about by such a dual feedback approach
is negatively correlated with feature complexity (r=-0.65). Our experiments show that such a
dual feedback approach can increase the speed of active learning by 57% on average on 358
binary text classification problems in 9 standard corpora that we consider, because most
benchmark text categorization corpora contain problems of low to medium complexity.
- Talk 2: Jiwoon Jeon
- Title: Predicting the Quality of Answers with Non-Textual Features
- Abstract: New types of document collections are being developed by various web services. The service providers keep track of non-textual features such as click counts. In this paper, we present a framework to use non-textual features to predict the quality of documents. We also show our quality measure can be successfully incorporated into the language modeling-based retrieval model. We test our approach on a collection of question and answer pairs gathered from a community based question answering service where people ask and answer questions. Experimental results using our quality measure show a significant improvement over our baseline.
* *April 7th 10-11am (Room 151)*
- Talk 1: Ramesh Nallapati
- Title: Smoothed Dirichlet distribution: Understanding the Cross-entropy ranking function in IR
- Abstract: In this work we analyze the popular Cross-entropy ranking function in information retrieval. We uncover the generative distribution, namely the Smoothed Dirichlet distribution, underlying this ranking function and show that this distribution captures term occurrence distribution much better than the multinomial, thus offering a reason behind the success of the ranking function. We present theoretically motivated approximations to the distribution that lead to a closed form maximum likelihood solution, much like the multinomial, making it ideal for online IR tasks. We use the new distribution to construct a new, well-motivated ad-hoc retrieval algorithm. Our experiments show that this algorithm performs at least as well as similar algorithms that employ cross-entropy ranking. It also provides additional flexibility, e.g. in handling queries of various lengths, due to a consistent generative framework.
- Talk 2: Shaolei Feng
- Title: A Hierarchical, HMM based Automatic Evaluation of OCR Accuracy for a Digital Library of Books
- Abstract: Content-based online book retrieval usually requires first converting printed text into machine-readable text using an OCR engine and then doing full-text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end-to-end system. Changing any step can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book-length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. I will describe a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. Joint work with R. Manmatha while visiting Google.
* *April 14th 10-11am (Room 151)*
- Talk 1: Xing Wei
- Title: LDA-Based Document Models for Ad-hoc Retrieval
- Abstract: Previous research on cluster-based retrieval has shown that simple topic models can lead to significant improvements in retrieval performance. An approach to building topic models based on a formal generative model of documents, Latent Dirichlet Allocation (LDA), is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval is still unknown. In this paper, we study how to efficiently use LDA to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework, and evaluate it on several TREC collections. Gibbs sampling is employed to conduct approximate inference in LDA and the computational complexity is analyzed. We show that significant improvements over cluster-based retrieval can be obtained with reasonable efficiency.
- Talk 2: Giridhar Kumaran
- Title: Simple Questions to Improve Pseudo-Relevance Feedback Results, and Some Failure Analysis
- Abstract: We explore methods to further improve the performance of pseudo-relevance feedback. Studies suggest that new methods for tackling difficult queries are required. Our approach is to gather more information about the query from the user by asking her simple questions. The equally simple responses are used to modify the original query. Our experiments using the TREC Robust Track queries show that we can obtain a significant improvement in mean average precision of around 10% over pseudo-relevance feedback. This improvement is also spread across more queries compared to ordinary pseudo-relevance feedback, as suggested by geometric mean average precision.
* *April 21st 10-11am (Room 151)*
- Talk 1: Yun Zhou
- Title: A novel approach to predict retrieval performance: Ranking Robustness
- Abstract: A general observation in the field of noisy data retrieval is that as retrieval effectiveness improves, the ranking function becomes more robust against data corruption. Motivated by this, we introduce a statistical measure to quantify the notion of ranking robustness in the context of regular document retrieval. We show that the new approach is at least as good as the clarity score method across a variety of collections. In particular, a combination of the two usually leads to further improvements.
- Talk 2: Kedar Bellare
- Title: Cleaning and Augmenting Databases by Learning their Alignments with Text Collections
- Abstract: Many real-world data mining applications begin with noisy, partially-filled databases. One might correct and fill these databases using information extracted from unstructured text. However, state-of-the-art extraction systems are trained by machine learning and typically require a large quantity of training data. This paper introduces a method of automatically cleaning and augmenting databases using only the existing noisy database and unlabeled text. A conditional random field model is used to learn alignments between existing database records and the appearance of their information in text, and simultaneously learn to perform extraction of new records not yet in the database. We present preliminary results learning extractors for bibliographic citations using real-world and noisy Bibtex databases.
* *April 28th - MEETING CANCELED*
* *May 5th 10-11am (Room 151)*
- Talk 1: Chirag Shah
- Title: "Using Named Entities Representation for Story Link Detection (SLD)"
- Abstract: The Topic Detection and Tracking (TDT) forum has evoked a new line of research that explicitly focuses on the event-based organization of broadcast news. The uniqueness of this research has made it worthwhile to address some of its issues with a different approach than traditional IR. We identify some of these peculiarities and argue that named entities provide a better way of representing the documents in TDT-related research. We support this argument by a series of experiments with the Story Link Detection (SLD) task on TDT corpora. Our proposed systems that make use of named entities for document representation exhibit significant performance improvement over the baseline. Executing the experiments on different TDT corpora of varying nature, we identify some of the issues with different approaches regarding their effectiveness. We also provide a deeper analysis of the results and pinpoint the unique characteristics of various systems. This knowledge is used to combine two different systems and boost the performance even further. We are currently working on understanding the limitations of named-entity-based representations and identifying ways to address them.
- Talk 2: Xiaoyan Li "Sentence Level Information Patterns for Novelty Detection"
- Abstract: The detection of new information in a document stream is an important component of many potential applications. In this work, a new novelty detection approach based on the identification of sentence level information patterns is proposed. Given a user’s information need, some information patterns in sentences such as combinations of query words, sentence lengths, named entities and phrases, and other sentence patterns, may contain more important and relevant information than single words. A thorough analysis of sentence level information patterns is carried out on data from the TREC novelty tracks, including sentence lengths, named entities, and opinion patterns. I will present how we perform novelty detection based on information patterns, which focuses on the identification of previously unseen query-related patterns in sentences. A unified pattern-based approach is presented to novelty detection for both specific NE topics and more general topics. Experiments on novelty detection were carried out on data from the TREC 2003 and 2004 novelty tracks. Experimental results show that the proposed approach significantly improves the performance of novelty detection for both specific and general topics, and therefore the overall performance for all topics from the 2002-2004 TREC novelty tracks, in terms of precision at top ranks. Future research directions along this line are suggested in the conclusions of the work.
* *May 12th 10-11am (Room 151)*
- Talk 1: Fernando Diaz "Experiments Toward Pseudo-Parallel Corpora"
- Abstract: Pseudo-parallel corpora are corpora which follow parallel topical distributions but may only contain a few exact translations. While the usefulness of such corpora is questionable for training statistical machine translation systems, previous results indicate that they may be helpful for cross-lingual information retrieval. In this talk, I will describe experiments comparing and re-aligning parallel document corpora. Our technique uses only geometric properties of the corpora (i.e., does not require training a translation system) and achieves surprisingly strong alignment performance.
- Talk 2: Xiaoyong Liu
- Abstract: The most common approach to cluster-based retrieval (CBR), which was proposed in the 1970s, is to retrieve one or more clusters in their entirety in response to a query. Research in this area has suggested that “optimal” clusters exist that, if retrieved, would yield very large improvements in effectiveness relative to document-based retrieval (DBR). However, no real retrieval strategy has achieved this result. Except for precision-oriented searches on very small data sets, DBR is found to be generally more effective. There has been a resurgence of research in CBR in the past few years including our own efforts in this area. The general approach is to use clusters as a form of document smoothing. Studies have shown that clusters can indeed improve retrieval performance automatically on modern test collections and that the language modeling framework is an effective probabilistic retrieval framework for studying CBR. The reported results are encouraging but there is still considerable room for improvement compared to what optimal clusters could potentially produce were they retrieved. In the proposed research, we examine the optimal and real performance of CBR with the goal of identifying the characteristics of optimal clusters. We develop a set of techniques that will address several aspects of CBR including systematic modeling of document-cluster relationships, different ways of representing clusters for retrieval, and possibly new retrieval models that are more suitable for CBR. Preliminary results on TREC collections demonstrate the promise of the research.
Spring 2005
* May 4, 2005 (Room 151):
- Talk 1 Ron Bekkerman - Disambiguating Web Appearances of People in a Social Network
- Abstract: Say you are looking for information about a particular person. A search engine returns many pages for that person's name, but which pages are about the person you care about, and which are about other people who happen to have the same name? Furthermore, if we are looking for multiple people who are related in some way, how can we best leverage this social network? In this talk I will present two unsupervised frameworks for solving this problem: one based on link structure of the Web pages, another using Agglomerative/Conglomerative Double Clustering, an application of a recently introduced multi-way distributional clustering method. To evaluate our methods, we collected and hand-labeled a dataset of over 1000 Web pages retrieved from Google queries on 12 personal names appearing together in someone's email folder. On this dataset our methods outperform traditional agglomerative clustering by more than 20%, achieving over 80% F-measure.
- Talk 2 Hema Raghavan - Interaction in TDT Tracking
- Abstract: Interaction in News Filtering has been restricted to document level feedback. In addition, the assumption with feedback is that a user provides feedback on every document delivered to him. Current news filtering evaluation frameworks do not consider a limit on the user's available labeling effort. In this talk we show that allowing users to provide subsets of documents for feedback in addition to marking documents as relevant is indeed beneficial for News Filtering, resulting in substantial improvements in TDT cost with as few as five documents labeled.
- Talk 3 Fernando Diaz - _
- Abstract:
* April 13, 2005 in Room 140 (small classroom) NOTE ROOM CHANGE!
- Talk 1 Jamie Rothfeder - Aligning Transcriptions and Automatically Segmented Handwritten Documents
- Abstract: The MIR lab has developed a system for automatically segmenting word images from degraded, handwritten documents. This system has an error rate of around 18% when used on 100 documents from the George Washington collection. ASCII transcriptions corresponding to each of these 100 documents are available. If we knew exactly how the words in the transcriptions corresponded to the words in the handwritten documents, then we could automatically generate data in the form of (word image, ASCII term) pairs. These pairs are crucial for training automatic recognizers such as the one introduced in "Holistic Word Recognition for Handwritten Historical Documents" by Rath et al. Aligning the transcriptions with the handwritten documents would be a trivial task if the segmentation were error free, since the sequence of word images would correspond directly to the sequence of ASCII terms in the transcriptions. Unfortunately, segmentation errors offset the direct alignment and make our problem more complicated. In this talk, I will discuss an HMM-based method to align perfect transcriptions to imperfectly segmented documents. Our hidden variables represent the sequence of word images that have been automatically segmented from a given document; the state space for these variables is the set of all terms in the transcription for the document. The observed variables are the features extracted from the automatically segmented images. We use the Viterbi algorithm to decode these hidden variables and thus assign a transcript word to each of the segments. After this, a second post-processing step is employed to improve the alignment.
- Talk 2 Don Metzler - Modeling Query Term Dependencies
- Abstract: Most information retrieval models make the assumption that term occurrences are independent of each other. Many attempts have been made in the past to relax this assumption, including the linked dependence model, n-gram language models, and models that make explicit use of phrases. In this work, we propose and evaluate a general probabilistic framework for modeling dependencies between query terms. Experimental results show that using different dependence assumptions across varying types and sizes of collections can yield significant improvements over models that assume strict term independence.
* March 9, 2005
- Talk 1 VanessaMurdock - Ad Hoc Sentence Retrieval
- Abstract: Sentence retrieval has become an integral part of question answering systems, novelty detection and summarization. Each of these tasks has different requirements of a "good" sentence. Studies of sentence retrieval have been done on a task-specific basis. We demonstrate a query-likelihood baseline for sentence retrieval independent of a specific task. We investigate ways to estimate a "translation" model, translating queries to sentences, incorporating external resources such as WordNet. We show significant performance gains by smoothing using the document that contains the sentence.
- Talk 2 Wei Li
- Talk 3 Koji Eguchi
* February 9th
- Talk 1 MarkSmucker and FernandoDiaz - High Precision Retrieval via User Interaction and Metadata
- Abstract: Traditional information retrieval systems focus on retrieving documents on the same topic as the query. Because a user's information need is richer than a mere topical description, the user is often unsatisfied with the results from conventional, topically-based retrieval systems. We explore improvement of retrieval by interacting with the user. To this end, we asked the user to specify extra-topical constraints or metadata to the information need. In addition, we studied the utility of several methods for obtaining and using user feedback following an initial topical retrieval. We found that metadata-based interaction resulted in improvements over our baselines for high-precision tasks. We also found that passage-based feedback provides a robust means to improve retrieval of relevant as well as topical documents. Finally, we demonstrate that the quality of retrieval summaries significantly influences the quality of the query reformulation.
- Talk 2 VanessaMurdock - Ad Hoc Sentence Retrieval
- Abstract: Sentence retrieval has become an integral part of question answering systems, novelty detection and summarization. Each of these tasks has different requirements of a "good" sentence. Studies of sentence retrieval have been done on a task-specific basis. We demonstrate a query-likelihood baseline for sentence retrieval independent of a specific task. We investigate ways to estimate a "translation" model, translating queries to sentences, incorporating external resources such as WordNet. We show significant performance gains by smoothing using the document that contains the sentence.
- Talk 3
- Talk 4
Fall 2004
- November 15th
- Talk 1 VanessaMurdock - Sentence Retrieval from Questions
- Abstract: Passage retrieval has applications in question-answering, summarization, HARD, novelty detection, and machine translation. For tasks such as these there is more emphasis on the quality of the top of the ranked list, with less emphasis on the overall quality of the list. The richer the set of passages, in terms of relevant content, the more accurate the results. We present a simple translation model for passage retrieval at the sentence level. We choose sentences because sentences are a natural linguistic unit, whereas a passage may be an arbitrary piece of text. We demonstrate the translation model framework on TREC data, in the context of factoid question-answering, and show that it performs better than retrieval based on query likelihood, and on par with other systems.
- Nov 22nd (No lab meeting -- Virtual Thursday)
- Dec 13th
- Talk 1 FernandoDiaz - Pseudo-Relevance Feedback Using Support Vector Machines
- Abstract: Pseudo-relevance feedback is a well-studied technique for performing automatic query expansion. We will briefly present a review of language modeling approaches to pseudo-relevance feedback and derive a few discriminative analogs to these approaches using support vector machines. Preliminary results will be presented. This work is in its very early stages but out-of-box SVM solutions appear competitive with advanced language modeling techniques such as relevance models.
- Talk 2 CourtneyWade - Exploration of high-accuracy passage retrieval
- Abstract: The HARD track of TREC includes a passage retrieval component where the goal is high precision at the top of a ranked list. The passage-level relevance judgments from the 2003 and 2004 HARD track provide an opportunity to study techniques for isolating only the relevant portions of documents. We discuss some of the problems with finding an appropriate evaluation metric for arbitrary passage retrieval. Then we present a simple mixture model for scoring fixed-length passages that does significantly better than query likelihood, SVMs, and RM3 (a variation on relevance modeling). We conclude by discussing some preliminary work on variable-length passage retrieval.
Summer 2004
- June 28
- Don Metzler - Indri
- Trevor Strohman - Indri
- July 19
- FernandoDiaz - Using Temporal Profiles of Queries for Precision Prediction (SIGIR practice talk)
- ToniRath - Handwriting Retrieval
Spring 2004
- January 12
- Charles Sutton on learning to perform multiple sequence labeling tasks simultaneously. PPT
- Shaolei Feng on using the Bernoulli model for something
- ProjectorSetup by
- January 19, no meeting (Martin Luther King Day)
- January 26, meeting was cancelled
- February 2, meeting was cancelled
- February 9, meeting was cancelled
- February 16, no meeting (Presidents' Day)
- February 23, meeting was cancelled
- March 8
- Don Metzler on multiple-Bernoulli models for language modeling
- Chirag Shah on evaluating high accuracy retrieval techniques
- ProjectorSetup by Ramesh
- March 15 (Spring break; may not meet.)
- Xiaoyong Liu on automatic recognition of reading levels from user queries
- Mark Smucker on document dependent smoothing
- ProjectorSetup by Trevor
- March 22
- Xiaoyan Li on using answer models for novelty detection
- Andres Corrada-Emmanuel
- ProjectorSetup by NadiaGhamrawi
- March 29
- Wei Li on answer retrieval from extracted tables
- Steve Cronen-Townsend on a language modeling framework for selective query expansion
- ProjectorSetup by XingWei
- April 5
- Hema Raghavan on experiments with ASR documents for IR and TDT
- Josh Lewis on search for Rexo and/or NSDL
- ProjectorSetup by Giridhar
- April 14, Jeremy's PhD defense talk is at 10:30
- April 19, no meeting (Patriots' Day)
- April 26
- Chung Heong Gooi - Cross Document Coreferencing on a Large Scale Corpus PPT
- ProjectorSetup by YunZhou
- May 3
- Giridhar Kumaran - Text Categorization and Named Entities for New Event Detection
- JiwoonJeon - Content Based Yahoo Photo News Retrieval
- ProjectorSetup by JJ
- May 17 (Classes ended the previous week)
Fall 2003
We meet from 11-12 on Tuesdays in CS151. Italicized dates are in the past. Bold dates need one or more speakers.
- September 16. Predicting value of query expansion
- September 23. Cross-language issues
- Victor giving a tutorial of statistical machine translation basics PPT:
- Leah talking about the DARPA surprise language exercise.
- September 30. Smoothing
- Alvaro on smoothing at eBay
- Ramesh on a Zhai and Lafferty smoothing paper. PPT
- Chengxiang Zhai and John Lafferty, Model-based Feedback in the Language Modeling Approach to Information Retrieval, CIKM, 403-410, 2001. citeseer
- Chengxiang Zhai and John Lafferty, A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval, SIGIR 334-342, 2001. citeseer
- October 14. Music
- Jeremy/Victor on their ACM Multimedia paper on CRFs for music retrieval
- Vanessa on "Automatic transcription of piano music" by Raphael, ISMIR 2002 PDF.
- October 21. Arabic
- Nasreen on name transliteration PPT
- Giri presenting "Unsupervised learning of Arabic stemming using a parallel corpus" by Rogati et al, ACL 2003 PS.gz.
- October 28. CIKM practice talks
- Xiaoyan on time-based language models PPT:
- Ao Feng on clustering evaluation in TDT detection PPT:
- November 4. (CIKM is happening in New Orleans.)
- Hema Raghavan on Query-Free News Search (Monika Henzinger et al., WWW 2003) HTML
- James Allan on aligning transcripts and handwriting
- November 11. No meeting; today is Veterans Day.
- November 18. (TREC is happening in Gaithersburg.)
- Trevor Strohman on IR performance issues PPT
- Fuchun Peng on extracting information from technical papers
- November 25.
- Ben Carterette on BLEU and IR
- Xing Wei on table processing PPT
- December 2.
- Don Metzler on LM and Inference networks PPT
- Ramesh Nallapati on maximum entropy for IR PPT
- December 9.
- Xiaoyong Liu on experiments with clusters and language models PPT
- Margie Connell on cross-language processing for TDT PPT
- December 16.
- Toni Rath on historical manuscript retrieval and recognition