Digital Government Project

A Language-Modeling Approach to Metadata for Cross-Database Linkage and Search
A collaborative project between the University of Massachusetts and Carnegie Mellon University

Principal Investigators: W. Bruce Croft, UMass, and Jamie Callan, CMU

Project Overview

The Digital Government project is a National Science Foundation-sponsored initiative. One of the crucial problems in virtually every digital government application is locating and integrating information that is spread across many different organizations in many databases and in many different formats. Areas as diverse as crisis management, government statistics, and legislative support have all identified the issue of integrating heterogeneous information as a major step towards more effective information systems.

Government employees, ordinary citizens, and small businesses would all benefit from government information systems that could locate, retrieve, and integrate desired information quickly, while transparently handling the details of which databases contain the information and what format it is in. No system should expect its patrons to trust its results unquestioningly, so these information systems should also make it easy to examine, when desired, the relationships among documents or databases with similar content.

The basis for this type of system is metadata: data that describes data or collections of data. We propose a completely new approach to metadata, based on language models instead of ontologies or controlled vocabularies. Simple language models represent basic vocabulary and frequency information; more complex language models represent phrases, names, and other speech patterns. Language models are a far more detailed representation of document or database contents than a few controlled vocabulary terms. Language models also enable a system to generate descriptions (metadata) directly from the content of its databases, without trying to match database contents to a controlled vocabulary. Language models are easily updated as information is added to a database, they support an unlimited range of subjects (because they are generated directly from database contents), and they support a wide range of information seeking activities.

The research will demonstrate that language models are a sound and effective foundation on which to build large-scale, distributed information systems for government applications. Together with our government partners (U.S. Geological Survey, U.S. Department of Commerce, General Services Administration/Regulatory Information Service Center, and the U.S. Library of Congress), we will produce a prototype of a complete system for accessing distributed, heterogeneous government information, and demonstrate its utility. Building this system will be an important part of evaluating the research, and the first step in transferring the new technology to government systems.
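As a rough illustration of the simplest kind of language model described above (vocabulary plus frequency information, generated directly from content and updated as documents are added), here is a minimal Python sketch. It is not the project's implementation: the class name `UnigramModel`, its methods, the crude letters-only tokenizer, and the sample documents are all assumptions made for this example.

```python
import re
from collections import Counter


class UnigramModel:
    """Hypothetical unigram description of one database: terms and counts."""

    def __init__(self):
        self.counts = Counter()  # term -> number of occurrences
        self.total = 0           # total number of term occurrences

    def add_document(self, text):
        # Assumed tokenizer: lowercase runs of ASCII letters only.
        tokens = re.findall(r"[a-z]+", text.lower())
        # Incremental update: no need to rebuild when content is added.
        self.counts.update(tokens)
        self.total += len(tokens)

    def prob(self, term):
        # Maximum-likelihood estimate of the term's probability.
        return self.counts[term] / self.total if self.total else 0.0

    def top_terms(self, k=3):
        # The k most frequent terms, usable as a short content description.
        return [term for term, _ in self.counts.most_common(k)]


# Invented sample content for a hypothetical database.
model = UnigramModel()
model.add_document("Flood maps and flood gauges")
model.add_document("River gauges report flood levels")

print(model.top_terms(2))
print(model.prob("flood"))
```

A real system would likely add stopword handling, smoothing, and models of phrases or names; the sketch only shows that the description comes from the content itself rather than from a controlled vocabulary.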


Publications

  • Callan, J., and Connell, M., “Query-based sampling of text databases.” ACM Transactions on Information Systems, 19(2), pp. 97-130. 2001.
  • Lu, J., and Callan, J., “Pruning long documents for distributed information retrieval.” In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM'02) (pp. 332-339). McLean, VA: ACM. 2002.
  • Lu, J., and Callan, J., “Reducing storage costs for federated search of text databases” (poster description). In Proceedings of the National Conference on Digital Government Research (dg.o2003). Boston. 2003.
  • Pinto, D., Branstein, M., Coleman, R., King, M., Li, W., Wei, X., and Croft, W.B., “QuASM: A System for Question Answering Using Semi-Structured Data.” In Proceedings of the JCDL 2002 Joint Conference on Digital Libraries (pp. 46-55). 2002.
  • Pinto, D., McCallum, A., Wei, X., and Croft, W.B., “Table Extraction Using Conditional Random Fields.” SIGIR '03 Conference (pp. 235-242). Toronto, Canada. 2003.
  • Si, L., and Callan, J., “Using sampled data and regression to merge search engine results.” In Proceedings of the Twenty Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 19-26). Tampere, Finland: ACM. 2002.
  • Si, L., and Callan, J., “A semi-supervised learning method to merge search engine results.” ACM Transactions on Information Systems, 21(4), pp. 457-491. 2003a.
  • Si, L., and Callan, J., “Relevant document distribution estimation method for resource selection.” In Proceedings of the Twenty Sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Toronto: ACM. 2003b.
  • Si, L., and Callan, J., “The effect of database size distribution on resource selection algorithms.” In J. Callan, F. Crestani, and M. Sanderson (eds), Distributed Multimedia Information Retrieval. Springer-Verlag. 2003c. Also appeared in the Proceedings of the SIGIR 2003 Workshop on Distributed Information Retrieval. Toronto.
  • Si, L., Jin, R., Callan, J., and Ogilvie, P., “Language modeling framework for resource selection and results merging.” In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM'02). McLean, VA: ACM. 2002.
  • Wei, X., Croft, W.B., and Pinto, D., “Answer Retrieval From Extracted Tables,” submitted to SIGIR '04, Sheffield, England, July 25-29 (2004).

This work was supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by National Science Foundation grants EIA-9983215 and EIA-9983253.