Research within the Center for Intelligent Information Retrieval

Much of our research over the past 15 years has been primarily directed at the basic issues of text representation, query acquisition, and retrieval models that form the basis of all search engines. This research has been extended into a number of different architectures, such as filtering information streams and searching distributed databases. It has been extended into different languages in both multilingual and cross-lingual systems. It has been extended into different modes of interaction, such as graphics-based visualization techniques. Finally, it has been extended into different data types, such as images, speech, video, and music.

In the first 3 years of the Center, we developed the InQuery® retrieval software based on our research on inference net models of retrieval. This software proved to be very useful both as a research platform and as the basis for a number of non-core projects. In this period, we also developed the Badger information extraction system that has been licensed by a number of members. We also carried out many large-scale evaluations in this period, in the context of government-sponsored events such as TREC, MUC and TIPSTER. Evaluations using large databases provided by members such as NLM were also done. The large number of non-core projects and evaluations required us to have a large programming staff.

The Center's research strategy has evolved significantly, as the Center's first generation of technologies moved from leading edge research to mainstream commercial and government applications. The last three years have been characterized by a strengthened commitment to basic research, planting the seeds for technologies that will grow into the next generation of leading-edge intelligent information retrieval software.
The combination of research on advancing core technology areas, research on issues associated with the partner projects, and an active interest in technology transfer, has given the Center a broad and synergistic research base. We study nearly every topic related to information retrieval, from the basic theory underlying retrieval models to the effects of graphical user interfaces on people's ability to accomplish information seeking tasks. Highest priority is given to making advances to core technology areas of long-term interest to Center members. Potential research issues are evaluated based on their ability to make a significant impact in one or more areas.

As we have moved beyond the initial 9 years of National Science Foundation funding, we have found that the most valuable research role that we can play, and the one that has the most success in terms of attracting continued funding, is to focus on our "roots". That is, we are concentrating on the unsolved long-term research problems that underlie effective information retrieval - text representation, query acquisition, and retrieval models. The problem of providing a software system that could answer questions as effectively as an educated person was identified as a "software grand challenge" in a Turing Award lecture by Jim Gray, so there is growing recognition both that the information retrieval problem is difficult and that even solving parts of that problem will have enormous economic and social benefits.

IRLab Research: Over the last 7 years, the Information Retrieval Laboratory has developed a new language-modeling approach to information retrieval that resolves a number of important problems with previous models, offers improved accuracy, and is applicable to a wide range of previously disparate techniques, such as document retrieval, query expansion, and database selection. This research lays the groundwork for retrieval systems for years to come.
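As a rough illustration of the language-modeling view of retrieval, the sketch below ranks documents by the likelihood that each document's language model would generate the query, using Dirichlet smoothing against collection statistics. It is a minimal textbook-style example, not the IRLab's implementation; the function names, the whitespace tokenizer, the sample documents, and the smoothing parameter `mu` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def query_log_likelihood(query, doc_tokens, coll_counts, coll_len, mu=2000.0):
    """Log P(query | document model) with Dirichlet smoothing.

    Each query term probability is
        (count in document + mu * P(term | collection)) / (document length + mu)
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in tokenize(query):
        p_coll = coll_counts.get(term, 0) / coll_len
        if p_coll == 0.0:
            # Term never occurs in the collection: it cannot separate
            # documents, so this sketch simply skips it.
            continue
        score += math.log((doc_counts[term] + mu * p_coll) / (doc_len + mu))
    return score


def rank(query, docs, mu=2000.0):
    """Return document ids ordered from most to least likely for the query."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    coll_counts = Counter(t for toks in tokenized.values() for t in toks)
    coll_len = sum(coll_counts.values())
    if coll_len == 0:
        return []
    scored = [
        (query_log_likelihood(query, toks, coll_counts, coll_len, mu), doc_id)
        for doc_id, toks in tokenized.items()
    ]
    scored.sort(key=lambda pair: -pair[0])
    return [doc_id for _, doc_id in scored]


if __name__ == "__main__":
    docs = {
        "d1": "language models for information retrieval",
        "d2": "image retrieval with annotated images",
        "d3": "handwriting recognition for manuscripts",
    }
    print(rank("language models", docs))
```

The same scoring function can in principle be reused for related tasks by changing what plays the role of the "document", for example treating a whole database as one large document for database selection.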
A significant amount of IRLab effort is being devoted to extending this work, improving and unifying common retrieval tasks under a single approach. We are currently researching the following areas related to language models:
Another theme of our research is related to new information environments. Society is in the midst of a communications transformation that will forever change the way people, companies, and governments acquire and distribute information. However, this vision rests on many assumptions about technology developments that remain to be realized. Much of the IRLab's research agenda is based on solving problems that are likely to arise during this transformation. For example, the rapid proliferation of the Internet provides people with thousands of (often proprietary) information repositories in which to search, information in different languages, and a constant stream of new information from many sources. The IRLab is a leader in providing solutions for automatically identifying the most useful databases to search, for cross-lingual IR (English queries retrieve Spanish documents), and for detecting and tracking novel events in large information streams.

Another group of research areas related to this includes information summarization, visualization, and text data-mining techniques that allow people to easily understand and manipulate very large amounts of information. The research vision is to provide tools that enable people to find, manipulate and analyze large amounts of unstructured data easily, in much the same way that spreadsheets and relational databases do for numeric data.

In addition to its work on information retrieval and analysis tools, the IRLab studies the problems of how best to present information to people so that it can be understood easily. The IRLab has invested significant resources into a research laboratory for undergraduate students, which focuses on creating tools that enable rapid prototyping of graphical user interfaces.

More details on the Information Retrieval Laboratory can be found at http://ciir.cs.umass.edu/research/irlab.html.
IESL Research: The Information Extraction and Synthesis Laboratory aims to dramatically increase our ability to mine actionable knowledge from unstructured text. We are especially interested in information extraction from the Web, understanding the connections between people and between organizations, expert finding, social network analysis, and mining the scientific literature and community.

Typical problems in information extraction include finding all the people, organizations, and locations in a news article; filling in a database of job postings from thousands of company sites across the Web; and extracting the bibliographic information of a cited article in a research paper. One can attack these problems by designing textual patterns by hand, but this process is both tedious and error-prone. Therefore, we develop techniques that learn how to extract information from text given a few correctly-labeled examples for training.

IESL develops and employs various methods in statistical machine learning, natural language processing and information retrieval. IESL tends toward probabilistic approaches, graphical models, and Bayesian methods. IESL's current research includes:

Unified Information Extraction and Data Mining: Information extraction and data mining appear together in many applications: Information extraction populates slots in a database by identifying relevant subsequences of text, and data mining extracts rules from a populated database. However, extraction is usually not aware of the emerging patterns and regularities in the database, while data mining methods are often unaware of where the data came from, or its inherent uncertainties. The result is that the accuracy of both suffers, and significant mining of complex text sources is beyond reach. We have been researching relational probabilistic models that unify extraction and mining, so that by sharing common inference procedures, they can each overcome the weaknesses of the other.
Extracting Relationships between People: Much useful information about people is buried in their email. For example, we are developing a system that automatically extracts an address book of a user's contacts from their incoming email and the Web. Within science, information is hidden in the way research papers cite each other. From the graph of paper citations, one can find authorities, i.e., papers which are often cited; tutorials and overviews, i.e., papers which cite many authorities; pairs of researchers who often collaborate; and researchers who do not know each other but perhaps should. We are developing a large database of research papers, and automatically extracting citations between papers and relationships between researchers.

Conditional Probability Models for Sequences and Relational Data: After having some success using hidden Markov models for information extraction, we found ourselves frustrated by their inability to incorporate many arbitrary, overlapping features of the input sequence, such as capitalization, lexicon memberships, spelling features, and conjunctions of such features in a large window of past and future observations. The same difficulties exist in many generatively-trained models historically used in NLP. We have begun work with conditionally-trained probability models that address these problems. Finite-state Conditional Random Fields (CRFs) are globally-normalized conditional sequence models. We have also been working with CRFs for coreference and multi-sequence labeling, analogous to conditionally-trained Dynamic Bayesian Networks (DBNs).

More details on the Information Extraction and Synthesis Laboratory can be found at http://iesl.cs.umass.edu.

MIR Research: The Multimedia Indexing and Retrieval Laboratory's (MIR) research focuses on retrieval from databases of images, videos and scanned handwritten documents. Libraries have traditionally annotated images manually with text and then retrieved the images using those annotations. This is labor-intensive, expensive and tedious. One of our current approaches uses statistical methods to automatically annotate and retrieve images (or videos) given a small annotated training set. One approach views the problem as similar to cross-lingual retrieval, where, say, a set of documents in French is retrieved using an English query. To do this, a parallel corpus of documents in English and French is required for training. By analogy, we have a parallel vocabulary of image features and annotation words obtained from a training set. Given this training set, a relevance model (a relevance-based language model) is learned. This relevance model is then used to annotate unseen test images. The test images may then be retrieved via their automatic annotations using text queries and a language-model-based retrieval approach. We have applied a number of other models to this area and the approach is very promising.

Current handwriting recognition works well for constrained domains such as postal address recognition and bank check recognition. There has been little work on unconstrained domains like historical manuscripts. MIR has developed the first automatic retrieval system for handwritten manuscripts and has demonstrated this on a 1000-page (8 Gb) database of George Washington's manuscripts. A demonstration of the handwriting retrieval system of the Multimedia Indexing and Retrieval Laboratory can be found at http://ciir.cs.umass.edu/research/wordspotting.

The approach is similar to that used for image annotation and retrieval. The scanned images are automatically segmented using a scale space page segmentation algorithm. The word images are preprocessed and features are extracted from them. A small training dataset is produced by annotating the words in a small portion of the manuscripts. A statistical model is learned using this training set and is then used to automatically annotate the test set with words and probabilities. A language-model-based retrieval approach may then be used to retrieve pages given a text (ASCII) query. As mentioned above, this has been demonstrated on a part of the George Washington dataset. We have also developed handwriting recognition algorithms. We are currently working on improving performance, on scalability issues, and on learning models for out-of-vocabulary terms.

Past work by MIR includes a multi-modal retrieval system combining appearance-based image retrieval and text retrieval, which was applied to a large database of trademarks containing image and text data from the US Patent and Trademark Office. The database contained 68,000 trademarks which could be searched using either image retrieval or image and text retrieval, while 615,000 trademarks could be searched using text retrieval.
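The annotation idea described above for images and word images (learn a joint model of visual features and words from a small annotated training set, then annotate unseen items with words and probabilities) can be sketched as follows. This is only an illustrative toy under strong assumptions: visual features are assumed to be already quantized into discrete tokens, every training item gets equal weight, and the smoothing weight `lam` and all names are hypothetical. It is not the MIR laboratory's actual model.

```python
from collections import Counter


def smoothed_prob(item, counts, total, bg_counts, bg_total, lam):
    """Mix an item's relative frequency in one training example with its
    relative frequency over the whole training set (background)."""
    fg = counts[item] / total if total else 0.0
    bg = bg_counts[item] / bg_total if bg_total else 0.0
    return (1.0 - lam) * fg + lam * bg


def annotate(test_features, training_set, top_k=3, lam=0.1):
    """Return up to top_k (word, probability) pairs for an unseen item.

    training_set is a list of (feature_tokens, words) pairs. Each training
    example votes for its words, weighted by how well it explains the
    test item's feature tokens.
    """
    bg_feat = Counter(f for feats, _ in training_set for f in feats)
    bg_word = Counter(w for _, words in training_set for w in words)
    n_feat = sum(bg_feat.values())
    n_word = sum(bg_word.values())

    scores = Counter()
    for feats, words in training_set:
        feat_counts = Counter(feats)
        word_counts = Counter(words)
        # P(test features | this training example), assuming independence.
        likelihood = 1.0
        for f in test_features:
            likelihood *= smoothed_prob(
                f, feat_counts, len(feats), bg_feat, n_feat, lam
            )
        for w in bg_word:
            scores[w] += likelihood * smoothed_prob(
                w, word_counts, len(words), bg_word, n_word, lam
            )

    total = sum(scores.values())
    if total == 0:
        return []
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, s / total) for w, s in ranked[:top_k]]


if __name__ == "__main__":
    training = [
        (["blue", "wave"], ["sea", "sky"]),
        (["green", "leaf"], ["tree", "grass"]),
        (["blue", "cloud"], ["sky"]),
    ]
    print(annotate(["blue", "cloud"], training, top_k=2))
```

The word probabilities produced this way could then be indexed like ordinary text, so that a text query retrieves images or manuscript pages through their automatic annotations.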
© 2008 University of Massachusetts Amherst. This site is maintained by the Department of Computer Science/Center for Intelligent Information Retrieval.