IR - Information Retrieval Laboratory
Much of our Information Retrieval research over the past 20+ years has been directed at the fundamental issues of text representation, query formulation, and retrieval models that form the basis of all search engines. This research has been extended into a number of different architectures, such as web search, filtering information streams, and searching distributed databases. It has been extended into different languages in both multilingual and cross-lingual systems. It has been extended into different modes of interaction, such as long queries and graphics-based visualization techniques. It has been extended into different applications, such as question answering, social search, enterprise search, blog search, and tracking text reuse. Finally, it has been extended into different data types, such as images, speech, structured data, video, and music.
To give an idea of the variety of research topics pursued in the CIIR, we focus here on the people who have come out of our environment. Specifically, the following is a description of some of the recent graduates from the Information Retrieval lab:
IESL - Information Extraction and Synthesis Laboratory
Information -- A collection of facts, relations or events from which conclusions may be drawn. Knowledge that has been gathered or received.
Extraction -- Obtaining materials in concentrated, usable form from a diluted, unusable source.
Synthesis -- The combining of separate elements or substances to form a coherent whole. Reasoning from the general to the particular; logical deduction.
Laboratory -- An organization performing scientific experimentation and research.
IESL aims to dramatically increase our ability to mine actionable knowledge from unstructured text. We are especially interested in information extraction from the Web, understanding the connections between people and between organizations, expert finding, social network analysis, and mining the scientific literature and community. We develop and employ various methods in statistical machine learning, natural language processing and information retrieval. We tend toward probabilistic approaches, graphical models, and Bayesian methods.
Biomedical Informatics Natural Language Processing (BioNLP) Laboratory
The BioNLP lab conducts research on information retrieval, machine learning, and natural language processing, with a focus on biomedical applications. Our goal is to extract information from the vast amount of unstructured data in the biomedical domain, such as electronic health record (EHR) notes and scientific articles. We have developed and built systems for biomedical question answering, adverse drug event detection, biomedical figure search, EHR note comprehension, and healthcare outcome predictions, among others.
MIR - Multimedia Indexing and Retrieval Laboratory
The Multimedia Indexing and Retrieval Laboratory's (MIR) research focuses on retrieval from databases of images, videos, and scanned handwritten documents.
Libraries have traditionally annotated images manually with text and then retrieved the images through those annotations. This is labor-intensive, expensive, and tedious. One of our current approaches uses statistical methods to automatically annotate and retrieve images (or videos) given a small annotated training set of images (or videos). One approach treats the problem as similar to cross-lingual retrieval, where, say, a set of documents in French is retrieved using an English query. To do this, a parallel corpus of documents in English and French is required for training. By analogy, we have a parallel vocabulary of image features and annotation words obtained from a training set. Given this training set, a relevance-based language model (a relevance model) is learned. This relevance model is then used to annotate unseen test images. The test images may then be retrieved via their automatic annotations using text queries and a language-model-based retrieval approach. We have applied a number of other models to this area and the approach is very promising.
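The annotation step described above can be illustrated with a minimal sketch. It assumes that image features have already been quantized into discrete tokens and that estimates are smoothed by linear interpolation with collection statistics; neither detail is stated here, and the names `smoothed`, `annotate`, `lam`, and the toy data are hypothetical. It is not the lab's implementation, only one simple instance of a relevance model over paired words and features.

```python
from collections import Counter
from math import prod


def smoothed(count_in_item, item_len, count_in_coll, coll_len, lam):
    """Linearly interpolate an item-level estimate with a collection-level one."""
    item_p = count_in_item / item_len if item_len else 0.0
    coll_p = count_in_coll / coll_len if coll_len else 0.0
    return (1.0 - lam) * item_p + lam * coll_p


def annotate(test_features, training_set, lam=0.1):
    """Return P(word | test image features) under a simple relevance model.

    training_set: list of (annotation_words, feature_tokens) pairs.
    """
    coll_words = Counter(w for words, _ in training_set for w in words)
    coll_feats = Counter(f for _, feats in training_set for f in feats)
    n_words = sum(coll_words.values())
    n_feats = sum(coll_feats.values())

    scores = Counter()
    prior = 1.0 / len(training_set)  # assumed uniform prior over training images
    for words, feats in training_set:
        wc, fc = Counter(words), Counter(feats)
        # Likelihood of the test image's features given this training image.
        feat_lik = prod(
            smoothed(fc[f], len(feats), coll_feats[f], n_feats, lam)
            for f in test_features
        )
        for w in coll_words:
            p_w = smoothed(wc[w], len(words), coll_words[w], n_words, lam)
            scores[w] += prior * p_w * feat_lik

    total = sum(scores.values())
    # Normalize over the annotation vocabulary; empty if nothing matched.
    return {w: s / total for w, s in scores.items()} if total else {}


# Toy data (hypothetical): two annotated training images.
toy_training = [
    (["tiger", "grass"], ["f_orange", "f_green"]),
    (["sky", "plane"], ["f_blue", "f_white"]),
]
toy_probs = annotate(["f_orange", "f_green"], toy_training)
```

The resulting word distribution can serve as the automatic annotation of the test image, so that a text query is matched against it in the same way as against a document.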
Current handwriting recognition works well for constrained domains such as postal address recognition and bank check recognition. There has been little work on unconstrained domains like historical manuscripts. MIR has developed the first automatic retrieval system for handwritten manuscripts and has demonstrated it on a 1000-page (8 Gb) database of George Washington's manuscripts.
The approach is similar to that used for image annotation and retrieval. The scanned images are automatically segmented into word images using a scale-space page segmentation algorithm. The word images are preprocessed and features are extracted from them. A small training dataset is produced by annotating the words in a small portion of the manuscripts. A statistical model is learned using this training set and is then used to automatically annotate the test set with words and probabilities. A language-model-based retrieval approach may then be used to retrieve pages given a text (ASCII) query. As mentioned above, this has been demonstrated on a part of the George Washington dataset. We have also developed handwriting recognition algorithms. We are currently working on improving performance and scalability, and on learning models for out-of-vocabulary terms.
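As a rough illustration of the final retrieval step, the sketch below ranks pages by query likelihood, given that each word image on a page already carries a probability distribution over vocabulary words. The averaging of word-image distributions into a page model, the interpolation weight `lam`, the `floor` constant, and all function names and toy data are assumptions made for this example, not details of the MIR system.

```python
from math import log


def page_model(word_image_dists):
    """Average per-word-image annotation distributions into one page model."""
    model = {}
    for dist in word_image_dists:
        for w, p in dist.items():
            model[w] = model.get(w, 0.0) + p / len(word_image_dists)
    return model


def rank_pages(query_terms, pages, lam=0.5, floor=1e-9):
    """Rank pages by smoothed query log-likelihood, best first.

    pages: dict mapping page id -> list of {word: probability} dicts,
    one dict per segmented word image on that page.
    """
    models = {pid: page_model(dists) for pid, dists in pages.items() if dists}
    # Background model: average of all page models, used for smoothing.
    background = {}
    for m in models.values():
        for w, p in m.items():
            background[w] = background.get(w, 0.0) + p / len(models)

    scored = []
    for pid, m in models.items():
        score = sum(
            log((1 - lam) * m.get(t, 0.0) + lam * background.get(t, 0.0) + floor)
            for t in query_terms
        )
        scored.append((score, pid))
    return [pid for _, pid in sorted(scored, reverse=True)]


# Toy data (hypothetical): automatic annotations for two pages.
toy_pages = {
    "p1": [{"washington": 0.8, "letter": 0.2}, {"fort": 0.6, "fork": 0.4}],
    "p2": [{"letter": 0.9, "latter": 0.1}],
}
toy_ranking = rank_pages(["fort"], toy_pages)
```

Because the annotations are probabilistic, a page can still be retrieved when the most likely transcription of a word image is wrong, as long as the query term receives some probability mass.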
Past work by MIR includes a multi-modal retrieval system combining appearance-based image retrieval with text retrieval, which was applied to a large database of trademarks containing image and text data from the US Patent and Trademark Office. The database contained 68,000 trademarks that could be searched using either image retrieval or combined image and text retrieval, while 615,000 trademarks could be searched using text retrieval.
MLDS - Machine Learning for Data Science Laboratory
The Machine Learning for Data Science laboratory focuses on the development of machine learning models and algorithms for addressing a variety of challenging problems in the emerging areas of computational social science, computational ecology, computational behavioral science and computational medicine.