| Our Approach | People in the group | Papers (.gz postscript files) |
Much of today's information is in the form of video, pictures, audio or paper documents rather than machine-readable text. Indexing these sources is a non-trivial task. Consider images: although color, texture and shape are natural attributes to use for indexing, an attribute is useful only if it correlates semantically with the object being sought. For example, red sunsets can be indexed using the color red, but since there are many other red objects in the world, a query on red will also retrieve images that are not sunsets.

By restricting the domain or by using multiple sources of information (multimedia), the problem can be made more tractable. For example, by comparing the audio track of a video against the picture information, one can restrict the domain of discourse considerably. Another approach is to use any text present in the image itself.

Many databases can and will be created by scanning paper documents, but access to these databases is currently restricted to manually provided keywords and OCR. Although OCR is a practical technique for clean, well-printed documents in standard fonts, it typically does not do well with other common types of documents, such as advertisements or diagrams with text.

We propose to develop techniques to identify and index the following types of information on scanned pages:

Text that is readable by OCR: Text will be detected, a cleanup process will remove backgrounds such as shading and hatching, and the cleaned text will be passed through a commercial OCR package. A version of INQUERY tuned for OCR will be used to index and query this part of the information.

Text that is not readable by OCR: Text written in unusual fonts, of poor quality, or even handwritten. In this case, we will use an approach ("wordspotting") in which the input image is segmented into words and image matching techniques, rather than OCR, are used for retrieval. Samples of important words can be used for training the matching techniques.

Image Retrieval By Content: An algorithm to retrieve images based on appearance has been developed. An algorithm to retrieve images by color is also under investigation.
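The wordspotting idea described above (segment the page into word images, then retrieve by matching images rather than by recognizing characters) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: word images are taken to be already segmented and binarized, the names `pad`, `mismatch` and `spot` are hypothetical, and a plain pixel-mismatch score stands in for whatever matching technique the project actually uses.

```python
from typing import Dict, List, Tuple

# Assumption: a word image is a non-empty binary grid, 1 = ink, 0 = background.
WordImage = List[List[int]]


def pad(img: WordImage, height: int, width: int) -> WordImage:
    """Pad a binary image with background pixels to height x width."""
    padded = [row + [0] * (width - len(row)) for row in img]
    padded += [[0] * width for _ in range(height - len(img))]
    return padded


def mismatch(a: WordImage, b: WordImage) -> float:
    """Fraction of pixels that differ once both images share a common size."""
    height = max(len(a), len(b))
    width = max(len(a[0]), len(b[0]))
    pa, pb = pad(a, height, width), pad(b, height, width)
    differing = sum(x != y for ra, rb in zip(pa, pb) for x, y in zip(ra, rb))
    return differing / (height * width)


def spot(query: WordImage,
         page_words: Dict[str, WordImage],
         max_mismatch: float = 0.2) -> List[Tuple[str, float]]:
    """Return (location, score) for page words resembling the query, best first.

    max_mismatch is an illustrative threshold, not a value from the project.
    """
    scored = [(loc, mismatch(query, img)) for loc, img in page_words.items()]
    hits = [(loc, score) for loc, score in scored if score <= max_mismatch]
    return sorted(hits, key=lambda hit: hit[1])


# A sample of an "important word" used as the query template (toy 3x3 image).
template = [[1, 1, 1],
            [1, 0, 1],
            [1, 1, 1]]

# Hypothetical segmented words from a scanned page, keyed by location.
page = {
    "line1-word1": [[1, 1, 1], [1, 0, 1], [1, 1, 1]],  # identical to template
    "line1-word2": [[1, 1, 1], [1, 0, 1], [1, 1, 0]],  # one pixel differs
    "line2-word1": [[0, 0, 0], [0, 1, 0], [0, 0, 0]],  # every pixel differs
}

print([loc for loc, _ in spot(template, page)])
# prints ['line1-word1', 'line1-word2']
```

Note that no character is ever recognized: the query is itself an image, so the same procedure applies to unusual fonts or handwriting. A real system would need a matching score that tolerates shifts, scale changes and noise, which raw pixel mismatch does not.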