Brenden Tamilio, Paul Ogilvie
The Center For Intelligent Information Retrieval
REU Program - Summer 1999 - Final Report
August.6.1999
Team Acronym
After 10 short, short weeks we're arriving at closure (save our paper) on
the acronym project. We have two working demos, and an active crawler that
is even now boldly searching the web for undiscovered acronyms. Below are
descriptions of the status of things, where our work and demos reside, and
some ideas for project expansion.
Files:
We have moved all of the essential files for the project (for future work)
into two locations. The majority of the files are located on dubbo, under
~reu2/workspace (this is under the temp mount, only accessible on dubbo).
README files exist in this directory, as well as most of the subdirs,
explaining their contents. The items that can be found here include:
- ALL working versions of the lexers we tested
- Evaluation material, judgements, etc.
- The cutting, trimming, layout, and formatting tools we created to make the
process more efficient.
- Information about the query sets we used
The other area where files reside is on Cadia in the ../cgi-bin/acro
directory. Another README resides in this directory. The items in this
directory are:
- The Inquery database interface program acro-lookup
- Scripts to control crawling and grabbing (using GNU Wget)
- Prototype versions of many of the CGI programs
- The crawling output (we were able to get 1 gig of acronyms in 2 days, over 4 million acronyms)
We also have 3 .html files under ../web-docs/acro (index, getacros, and lookup).
The Inquery webaccess database is also in this directory, as well as several web logs.
Fruits:
We have a couple of working demonstrations residing on Cadia. They reside as
follows:
http://cadia/irug/acro/index.html
Database Interface -> ./lookup.html
Extraction Demonstration -> ./getacros.html
The output of our crawling is located on Cadia (the file is pretty big):
./cgi-bin/acro/out/gencrawl.out
Future Fodder:
If we had another week or two, there are a few things we would
immediately implement:
1) an automation script to update the database at regular weekly intervals.
The script would have to pause the crawling, and pass the crawling output
file to Inquery (we have scripts on dubbo to help lay out the
SGML) and update the database to include the newly acquired acronyms. The
output file would then be flushed, and crawling resumed.
The generated crawler retrieves about 15 megs per day, so we suggest it be
updated "often".
2) changes to the methods for crawling.
Currently, the crawler is seeded with a pseudo-randomly generated acronym based
on length distribution. It recursively generates 2
acronyms of length 2, 4 of length 3, 3 of length 4, and 1 of length 5-8, and
then repeats. It keeps no record of acronyms it has
already generated, and will search for the top 200 results on AltaVista,
even if 0 results are found. It does not generate acronyms
with non-uppercase alpha characters (no digits, slashes, dashes, &s, or
lowercase).
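
For whoever picks this up, the generation cycle described above might look roughly like the following Python sketch. It is not our crawler's code: it uses a plain loop rather than recursion, the names CYCLE and generate_seeds are made up for illustration, and picking the 5-8 length uniformly at random is an assumption.

```python
import random
import string
from itertools import islice

# (min length, max length, how many per cycle), taken from the description above.
CYCLE = [(2, 2, 2), (3, 3, 4), (4, 4, 3), (5, 8, 1)]

def generate_seeds(rng):
    """Yield uppercase-only pseudo-random acronyms, repeating the cycle forever."""
    while True:
        for lo, hi, count in CYCLE:
            for _ in range(count):
                # Assumption: lengths in the 5-8 bucket are chosen uniformly.
                length = rng.randint(lo, hi)
                yield "".join(rng.choice(string.ascii_uppercase)
                              for _ in range(length))

# One full cycle is 10 seeds: 2 + 4 + 3 + 1.
seeds = list(islice(generate_seeds(random.Random(0)), 10))
print(seeds)
```

Like the current crawler, this keeps no record of what it has already generated, so duplicates are possible.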
We suggest a much more intelligent method of acronym generation. We crawled
exclusively with our query set for two days, and retrieved just under 1 gig
of acronyms. This leads us to believe that if we crawl with known acronyms,
or queries that are more likely to be acronyms than say the randomly
generated QXFJG, we will receive more acronyms. We have also found that
acronyms beget more acronyms - queries for acronyms will often retrieve
pages with large sets of acronyms on them. We would also suggest attempting
to crawl the top 10,000 results for "acronym" and "acronyms" and "acro" etc.
to get the hundreds and hundreds of acronym lists AltaVista returns. We
successfully crawled HotBot too (just for kicks), and recommend attempts at
other search engines.
When searching AltaVista, there are about 25 useless links at the top of the
page that are unfortunately crawled with every query. These links go to
advertisers, and other domains AltaVista owns or is affiliated with. A
couple of these links are useful, like the AltaVista recommended areas, or
the RealNames listing, which sometimes produces results that would provide
additional relevant documents for the query, though these are side effects.
However, most of the stuff at the top and bottom of the page is junk, and
causes the crawler to hit AltaVista an additional 20 or so times per query.
We recommend a script that would extract only the results (1., 2., 3., etc.) to
crawl from the AltaVista generated page. Unfortunately, we ran out of time
to implement this.
Also, we do not check whether a page has already been crawled. Hashing the
visited URLs and their modification dates may allow the crawler to check
whether a page should be crawled again.
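
A minimal sketch of that idea in Python, assuming the crawler can get at each page's modification date (for example from the Last-Modified header). The class VisitedPages and its method names are hypothetical, and SHA-1 is just one convenient choice of hash.

```python
import hashlib

class VisitedPages:
    """Remember a digest of (URL, modification date) for each crawled page."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def _key(url, last_modified):
        data = f"{url}\n{last_modified}".encode("utf-8")
        return hashlib.sha1(data).hexdigest()

    def should_crawl(self, url, last_modified):
        """Return True only the first time this (url, date) pair is seen."""
        key = self._key(url, last_modified)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

visited = VisitedPages()
url = "http://example.com/acronyms.html"  # placeholder URL
first = visited.should_crawl(url, "Fri, 06 Aug 1999 12:00:00 GMT")
again = visited.should_crawl(url, "Fri, 06 Aug 1999 12:00:00 GMT")
changed = visited.should_crawl(url, "Sat, 07 Aug 1999 09:30:00 GMT")
print(first, again, changed)
```

A page whose modification date has changed hashes to a new key, so it gets crawled again; an unchanged page is skipped.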
3) classification.
We didn't even touch it.
4) confidence of a definition
We have a confidence system based on the number of times a definition
occurs. However, we feel that we can improve this. Right now, no matter
how the definition was found, we increase the raw confidence by 1. If we
assigned different confidence increases according to how the definition was
found (what form of canonical or contextual), we think we can threshold the
definitions better. Statistically examining the existing judged queries
would help to assign appropriate weights.
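
As a rough illustration of what weighted increases could look like, here is a Python sketch. The weight values and the names METHOD_WEIGHTS and raw_confidence are invented for the example; real weights would have to come from the judged queries.

```python
# Hypothetical weights per discovery method; not measured values.
METHOD_WEIGHTS = {"canonical": 1.0, "contextual": 0.5}

def raw_confidence(found_by):
    """Sum a per-method weight for each time a definition was found."""
    return sum(METHOD_WEIGHTS[method] for method in found_by)

# One canonical sighting and two contextual ones.
rc = raw_confidence(["canonical", "contextual", "contextual"])
print(rc)  # 2.0
```

Under the current scheme the same three sightings would give a raw confidence of 3.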
Another concern is crawling identical text from different URLs. It is not
feasible to store the pages we have crawled and compare the text
of the documents. We do keep the context where the acronym was found; if
we reduce the increase in confidence when a definition shows up in a context
it has already been seen in, that may improve the confidence rating.
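
One way this could work, sketched in Python under the assumption that contexts can be compared as exact strings. The function add_sighting and the repeat_weight of 0.25 are made up for illustration.

```python
def add_sighting(seen_contexts, context, repeat_weight=0.25):
    """Return the raw-confidence increase for one sighting of a definition.

    A context seen for the first time counts fully (1.0); a context already
    recorded counts only repeat_weight, which is a hypothetical value.
    """
    if context in seen_contexts:
        return repeat_weight
    seen_contexts.add(context)
    return 1.0

seen = set()
contexts = ["text from page A", "text from page A", "text from page B"]
rc = sum(add_sighting(seen, c) for c in contexts)
print(rc)  # 2.25
```

Identical text served from a second URL then adds only a fraction of a point instead of a full one.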
Raw confidence (rc) is currently converted to confidence by rc / (rc + 2).
Tinkering with this equation may also be helpful.
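
For reference, the current conversion written out in Python. The parameter k is our name for the constant 2 in the equation, exposed here only to show where the tinkering could happen.

```python
def confidence(rc, k=2.0):
    """Map raw confidence rc >= 0 into the range [0, 1); k = 2 is the current value."""
    return rc / (rc + k)

for rc in (1, 2, 8):
    print(rc, round(confidence(rc), 3))
# 1 0.333
# 2 0.5
# 8 0.8
```

A larger k makes confidence climb more slowly, so a definition needs more sightings before it clears a given threshold.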