Brenden Tamilio, Paul Ogilvie

The Center For Intelligent Information Retrieval

REU Program - Summer 1999 - Final Report

August.6.1999

Team Acronym

After 10 short, short weeks we're arriving at closure (save our paper) on

the acronym project. We have two working demos, and an active crawler that

is even now boldly searching the web for undiscovered acronyms. Below are

descriptions of the status of things, where our work and demos reside, and

some ideas for project expansion.

Files:

We have moved all of the essential files for the project (for future work)

into two locations. The majority of the files are located on dubbo, under

~reu2/workspace (this is under the temp mount, only accessible on dubbo).

README files exist in this directory, as well as in most of the subdirs,

explaining their contents. The items that can be found here include:

- ALL working versions of the lexers we tested

- Evaluation material, judgements, etc.

- The cutting, trimming, layout, and formatting tools we created to make the

process more efficient.

- Information about the query sets we used

The other area where files reside is on Cadia in the ../cgi-bin/acro

directory. Another README resides in this directory. The items in this

directory are:

- The Inquery database interface program acro-lookup

- Scripts to control crawling and grabbing (using GNU Wget)

- Prototype versions of many of the CGI programs

- The crawling output (we were able to get 1 gig of acronyms in 2 days, over 4 million acronyms)

We also have 3 .html files under ../web-docs/acro (index, getacros, and lookup).

The Inquery webaccess database is also in this directory, as well as several web logs.

Fruits:

We have a couple of working demonstrations residing on cadia. They reside as

follows:

http://cadia/irug/acro/index.html

Database Interface -> ./lookup.html

Extraction Demonstration -> ./getacros.html

The output of our crawling is located on Cadia (the file is pretty big):

./cgi-bin/acro/out/gencrawl.out

Future Fodder:

If we had another week or two, there are a few things we would

immediately implement:

1) an automation script to update the database at regular weekly intervals.

The script would have to pause the crawling, and pass the crawling output

file to Inquery (we have scripts on dubbo to help lay out the

SGML) and update the database to include the newly acquired acronyms. The

output file would then be flushed, and crawling resumed.

The generated crawler retrieves about 15 megs per day, so we suggest it be

updated "often".
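
The update cycle in item 1 could be sketched roughly as below. This is only
an outline under stated assumptions: pause, index, and resume are
hypothetical caller-supplied hooks (the real crawler controls and the Inquery
update step are not shown here), the archive file naming is invented, and the
weekly scheduling is assumed to be handled elsewhere, for example by cron.

```python
import os
import shutil
import time

def weekly_update(out_path, archive_dir, pause, index, resume):
    """Rotate the crawl output and hand it to an indexing step.

    pause, index, and resume are hypothetical callables supplied by the
    caller; index receives the path of the archived batch file.
    """
    pause()                                   # stop the crawler from writing
    try:
        os.makedirs(archive_dir, exist_ok=True)
        name = time.strftime("crawl-%Y%m%d-%H%M%S.out")
        batch = os.path.join(archive_dir, name)
        shutil.move(out_path, batch)          # set the current output aside
        open(out_path, "w").close()           # "flush": fresh, empty output file
        index(batch)                          # update the database from the batch
    finally:
        resume()                              # always restart crawling
    return batch
```

Moving the file before indexing means a failed database update leaves the
batch in the archive directory rather than losing it.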

2) changes to the methods for crawling.

Currently, the crawler is seeded with a pseudo-randomly generated acronym based

on length distribution. It recursively generates 2

acronyms of length 2, 4 of length 3, 3 of length 4, and 1 of length 5-8, and

then repeats. It keeps no record of acronyms it has

already generated, and will search for the top 200 results on AltaVista,

even if 0 results are found. It generates acronyms using only uppercase

letters (no digits, slashes, dashes, ampersands, or

lowercase).
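
A minimal Python sketch of one generation cycle as described above. The
function name is hypothetical, and two details are assumptions not stated in
this report: letters are drawn uniformly, and the single long acronym picks
its length uniformly from 5 to 8.

```python
import random
import string

# Per-cycle counts from the description: 2 acronyms of length 2,
# 4 of length 3, and 3 of length 4 (the 5-8 case is handled below).
CYCLE = [(2, 2), (3, 4), (4, 3)]

def generate_cycle(rng):
    """Return the 10 seed acronyms for one cycle, uppercase letters only."""
    lengths = []
    for length, count in CYCLE:
        lengths.extend([length] * count)
    # Assumption: the one long acronym has a length chosen uniformly in 5..8.
    lengths.append(rng.randint(5, 8))
    return ["".join(rng.choice(string.ascii_uppercase) for _ in range(n))
            for n in lengths]

seeds = generate_cycle(random.Random())
```

Like the crawler described here, this keeps no record of earlier seeds, so
repeats across cycles are possible.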

We suggest a much more intelligent method of acronym generation. We crawled

exclusively with our query set for two days, and retrieved just under 1 gig

of acronyms. This leads us to believe that if we crawl with known acronyms,

or queries that are more likely to be acronyms than, say, the randomly

generated QXFJG, we will receive more acronyms. We have also found that

acronyms beget more acronyms - queries for acronyms will often retrieve

pages with large sets of acronyms on them. We would also suggest attempting

to crawl the top 10,000 results for "acronym" and "acronyms" and "acro" etc.

to get the hundreds and hundreds of acronym lists AltaVista returns. We

successfully crawled HotBot too (just for kicks), and recommend attempts at

other search engines.

When searching AltaVista, there are about 25 useless links at the top of the

page that are unfortunately crawled with every query. These links go to

advertisers, and other domains AltaVista owns or is affiliated with. A

couple of these links are useful, like the AltaVista recommended areas, or

the RealNames listing, which sometimes produces results that would provide

additional relevant documents for the query, though these are side effects.

However, most of the stuff at the top and bottom of the page is junk, and

causes the crawler to hit AltaVista an additional 20 or so times per query.

We recommend a script that would extract only the results (1., 2., 3., ...) to

crawl from the AltaVista generated page. Unfortunately, we ran out of time

to implement this.
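
As an illustration of the recommended filter, here is a rough Python sketch.
It assumes each numbered result appears in the page source as a number, a
period, and then a link, e.g. `1. <a href="URL">`. That markup is a guess
for illustration only; the pattern would need adjusting to the actual
AltaVista result pages. The function name is hypothetical.

```python
import re

# Assumed (hypothetical) result markup: `N. <a href="URL">title</a>`.
# Links not preceded by a result number (ads, affiliate links) are skipped.
RESULT_LINK = re.compile(r'(?<!\w)(\d+)\.\s*<a\s+href="([^"]+)"', re.IGNORECASE)

def extract_result_urls(html):
    """Return only the URLs of numbered results, in page order."""
    return [url for _number, url in RESULT_LINK.findall(html)]

page = (
    '<a href="http://ads.example/">Ad</a>\n'
    '<b>1. <a href="http://a.example/x">A</a></b>\n'
    '2. <a href="http://b.example/y">B</a>\n'
    '<a href="http://affiliate.example/">Shop</a>'
)
urls = extract_result_urls(page)
# urls == ['http://a.example/x', 'http://b.example/y']
```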

Also, we do not prevent the same page from being crawled more than once. Hashing the

visited URLs and their modification dates would allow a check of whether

a page needs to be crawled again.
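
One way the URL hashing idea might look in Python. The class and method names
are hypothetical, and the sketch assumes the modification date is available
as a plain string (for example, from a Last-Modified header) or None when
unknown.

```python
import hashlib

class VisitedPages:
    """Remember crawled URLs by hash, with the modification date last seen."""

    def __init__(self):
        self._seen = {}  # URL digest -> modification date at last crawl

    def should_crawl(self, url, last_modified):
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        # Skip only if this URL was crawled before with the same date.
        if key in self._seen and self._seen[key] == last_modified:
            return False
        self._seen[key] = last_modified
        return True

visited = VisitedPages()
first = visited.should_crawl("http://a.example/", "Fri, 06 Aug 1999")   # True
again = visited.should_crawl("http://a.example/", "Fri, 06 Aug 1999")   # False
changed = visited.should_crawl("http://a.example/", "Sat, 07 Aug 1999") # True
```

Storing fixed-size digests rather than full URLs keeps the table compact,
though it does nothing for identical text served from different URLs.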

3) classification.

We didn't even touch it.

4) confidence of a definition

We have a confidence system based on the number of times a definition

occurs. However, we feel that we can improve this. Right now, no matter

how the definition was found, we increase the raw confidence by 1. If we

assigned different confidence increases according to how the definition was

found (which canonical or contextual form), we think we could threshold the

definitions better. Statistically examining the existing judged queries

would help to assign appropriate weights.
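
A small Python sketch of weighted confidence increases. The weight values and
the two labels are placeholders chosen for illustration; real weights would
have to come from the judged queries, and the actual set of canonical and
contextual forms may be finer-grained.

```python
from collections import defaultdict

# Hypothetical weights per discovery method (placeholders, not measured).
WEIGHTS = {"canonical": 1.0, "contextual": 0.5}

def raw_confidence(sightings, weights=WEIGHTS):
    """sightings: iterable of (definition, how_found) pairs.

    Returns a dict mapping each definition to its weighted raw confidence.
    """
    rc = defaultdict(float)
    for definition, how_found in sightings:
        rc[definition] += weights[how_found]
    return dict(rc)

example = raw_confidence([
    ("Research Experiences for Undergraduates", "canonical"),
    ("Research Experiences for Undergraduates", "contextual"),
    ("Rare Earth Unit", "contextual"),
])
# "Research Experiences for Undergraduates" -> 1.5, "Rare Earth Unit" -> 0.5
```

Setting every weight to 1 reproduces the current behavior of adding 1 per
occurrence.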

Another concern is crawling identical text from different URLs. It is not

feasible to store the pages we have crawled and compare the text

of the documents. We do keep the context where the acronym was found; if

we reduce the confidence increase when a definition appears in a context it

has already been seen in, that may improve the confidence rating.
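
The repeated-context idea could be sketched as follows in Python. The
function name, the reduced increment of 0.25, and the whitespace and case
normalization are all assumptions made for illustration.

```python
def raw_confidence_with_contexts(sightings, repeat_increment=0.25):
    """sightings: iterable of (definition, context) pairs.

    A definition seen in a new context adds 1; the same definition seen
    again in an already-recorded context adds only repeat_increment.
    """
    seen = set()
    rc = {}
    for definition, context in sightings:
        # Normalize case and whitespace so trivially different copies match.
        key = (definition, " ".join(context.lower().split()))
        increment = repeat_increment if key in seen else 1.0
        seen.add(key)
        rc[definition] = rc.get(definition, 0.0) + increment
    return rc

scores = raw_confidence_with_contexts([
    ("Text REtrieval Conference", "the  Text REtrieval Conference (TREC)"),
    ("Text REtrieval Conference", "The Text Retrieval Conference (TREC)"),
    ("Text REtrieval Conference", "TREC stands for Text REtrieval Conference"),
])
# scores["Text REtrieval Conference"] == 2.25  (1 + 0.25 + 1)
```

Only the short contexts are compared, so no crawled pages need to be stored.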

Raw confidence (rc) is currently converted to confidence by rc / (rc + 2).

Tinkering with this equation may also be helpful.
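
For reference, the conversion written out in Python with a few worked values.
The parameter k is a hypothetical generalization of the constant 2, added
only to show one knob that could be tinkered with.

```python
def confidence(rc, k=2.0):
    """Convert raw confidence rc to a value in [0, 1) via rc / (rc + k)."""
    return rc / (rc + k)

one = confidence(1)     # 1/3, about 0.333
two = confidence(2)     # 0.5
eight = confidence(8)   # 0.8
```

With k = 2, a definition seen twice already reaches 0.5, and each further
occurrence adds less. A larger k would demand more occurrences before a
definition reaches the same confidence.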