Summary Report
--------------
Researcher: Karl Barkei
Primary Faculty Advisor: Kathryn McKinley
Faculty Advisor: Prashant Shenoy
August 6, 1999
Introduction
------------
My research/implementation problem centered around improving the speed
of many-document retrieval in INQUERY. We intended to achieve this
improvement by optimizing certain aspects of the I/O involved in
getting a document from disk.
Problem Context
---------------
For a human user browsing query results, being able to retrieve a
large number of the returned documents quickly is not a particularly
useful system feature. A human user usually browses documents one at
a time and may only look at a few of the documents for any given query.
However, we could imagine a system where it would be desirable to
retrieve a large number of documents quickly. Perhaps we want to
analyze all of the documents returned for a given query or archive
them. It is in this context that the problem becomes worthwhile.
Project Objectives
------------------
Since we were interested in seeing if we could change the performance
of INQUERY, we needed a method of evaluating this change. To this
end, I familiarized myself with timing routines used by Zhihong Lu and
Brendon Cahoon to time INQUERY. The next steps were to make the actual
modifications to INQUERY to improve the I/O. These steps involved the
following:
1) Deciding on a platform
2) Make the reads from disk asynchronous reads.
3) Give INQUERY the ability to stripe documents.
4) Give INQUERY the ability to change the layout of documents on disk.
Deciding on a platform
----------------------
We decided to use Digital UNIX OSF1 v4.0 (alpha) as a platform
Making the reads asynchronous
----------------------------
The first part of the project involved trying to change the reads that
are performed in retrieving a document into asynchronous reads. The
normal way INQUERY retrieves documents is through a call to the API
function 'inq_get_doc()'. This function accepts a single internal
document id as an argument and allows users to retrieve that single
document from a collection. The function reads positional information
about the document (i.e., which file it is in, the offset in that
file, and the length) from the key-file using an 'fread()' call. After
obtaining this information, it then reads the actual document from the
collection file, also using an 'fread()'.
The initial idea was to simply change the fread() that actually read
the document into an asynchronous read. Several complications arose
in implementing this:
1) If the key-file was still being read via an fread(), than we would
severely limit the asynchrony achieved because we would still do an
fread() from the key-file and that would effectively synchronize all
the reads of the document bodies.
2) Changing the key-file fread() itself into an asynchronous read was
complicated, for one, and, two, of limited use because we couldn't
schedule the asynchronous read of a document until we have its
positional information.
3) With the original inq_get_doc(), when the function completed the
document was returned. However, with an asynchronous version this
would no longer be the case. This brought up the question of how to
know when the document was actually returned.
4) The asynchronous I/O API in Digital UNIX limits the number of
simultaneous I/O requests to AIO_MAX = 64. This would make
implementation more complicated.
Another idea was to make a new function inq_get_doc_list_async() that would
take a list of document id's as an argument and return a list of
documents. This seemed like a desirable API as it conformed closer to
the problem context and it solves problem 1 above by allowing us to
get all the positional information at once. However it still suffered
from the last two problems.
James Allan suggested not using a truly asynchronous read. Instead, I
wrote a function inq_get_doc_list_async() (The name is misleading
because it doesn't really do an asynchronous read). This function
reads all the positional information at once and then schedules the
reads for the requested documents in AIO_LISTIO_MAX = 64 request
chunks via the lio_listio() system call.
In addition to this function, I wrote another function
inq_get_doc_list() which also reads all the positional information up
front, but instead of calling lio_listio(), it just uses an fread()
inside of a for-loop. The purpose of this function is to allow us a
finer comparison of the different read methods.
After writing the function it was time to evaluate it. My method went
as follows:
for <function> = inq_get_doc_list_async(), inq_get_doc_list(),
inq_get_doc()
for N = {1,10,100,300,500,750,1000,2000,3500,5000,10000}
1. Randomly generate N pseudo-random document id's.
2. call inq_get_doc_list() retrieving N documents to prime cache.
3. run 6 trials of:
1. Attempt to clear cache.
2. Time <function> retrieving N documents..
4. Construct an average of the the last 5 trials.
This method was encapsulated in a C program and took approximately 14
hours to run on dubbo.cs.umass.edu.
I say "attempt to clear" the cache because I could not find an
effective way to completely clear the cache. The original suggestion
of how to clear the cache (also known as "chilling the system") was to
sequentially read a large file. This proved to be terrible at
clearing the cache. Since then I have tried or considered many
possible solutions:
- Sequentially read a large unrelated file
- Randomly read a large unrelated file
- Sequentially read 200 MB total from 25 different files.
- Read n = 1 to 90,000 documents from an unrelated collection(s)
- sync() function
- use identical copies of the collection
- changing system attributes of the UBC (Unified Buffer Cache)
- re-booting the system
- re-mounting the file-system
I posted/searched newsgroups and the web for a solution. I e-mailed
CSCF who in-turn e-mailed the tru-64 managers list. I talked with
Prashant Shenoy. All to no avail.
The Digital Unix OS implements file caching with the Unified Buffer
Cache (UBC) (the data cache) and the metadata buffer cache (cache for
inodes, indirect blocks, directory blocks, etc.) both of which reside
in main memory. The UBC competes with the virtual memory system for
main memory and may grow dynamically. The reason reading a single
large file is not effective at clearing the UBC is that the UBC limits
the ability of a single file to dominate the cache. It effectively
sets a resident set size for a file. A file may grow to this size in
the cache, after which it starts stealing pages from itself. (I think)
To me, the ideal solution would be to give me a user mountable
partition and re-mount the file-system in between trials. However this
did not swing with CSCF.
In the end, I chilled the system by calling sync() 3 times, reading
1000 random documents from an unrelated trec database, and then call
sync() 3 more times. This works fairly well, but still not
completely; this is why I prime the cache. That is, I have tried to
reduce the effect of caching as much as possible and I have assumed
that all the functions will benefit the same amount from whatever
caching is still going on.
Here are the results:
(Tests ran on dubbo.cs.umass.edu at night,
tried to make sure no one else was using
the machine)
Time (in seconds)
# docs inq_get_doc_list_async inq_get_doc_list inq_get_doc
------ ---------------------- ---------------- -----------
1 0.065870 0.101565 0.122679
10 0.824172 1.009599 1.007446
100 7.231273 8.610550 8.725015
300 18.576859 22.537199 22.256295
500 27.693490 34.184263 34.254342
750 38.431305 47.573210 49.077716
1000 47.667473 60.731188 63.052914
2000 89.290067 116.824428 124.715043
3500 157.658777 209.836924 229.523617
5000 231.657306 333.888941 373.012260
10000 559.919367 918.606995 1139.637049
(Note: each time is an average of 5 trials)
I think what accounts for the improvement seen in my function
inq_get_doc_list_async() is attributable to disk request reordering
and overhead overlap.
Striping documents
------------------
The idea of this part of the project was to look at how various
striping parameters (size of striping unit, number of disks striped
across) affect performance and to ask the question: In retrieving a
set of documents, what proportion of these need to be on striped media
for an advantage?
In addition, we would expect to see large improvements using striping
in combination with the new inq_get_doc_list_async() as this would
allow true I/O parallelism.
I learned about what striping is and what some of the issues are in
striping in studying this. Unfortunately, I did not completely
understand the implementation of this in INQUERY and getting hardware
setup to do the striping on was difficult.
Changing the layout of documents on disk
-----------------------------------------
I never got into the details of this part of the project. My
understanding of it was that certain documents tend to show up
together in response to a given query. The question was: How can we
translate this logical locality into spatial locality on disk?
Files
-----
The modified INQUERY files are in:
/usr/dan/users8/kbarkei/devo/inquery/build/src:
dm_interface.c
get_doc.c
/usr/dan/users8/kbarkei/devo/inquery/build/h:
docman.h
inq_api.h
/usr/dan/users8/kbarkei/devo/inquery/build/make:
gmake.rules
inquery.make
makefile.deps
My test client and other extraneous stuff:
/usr/dub/tmp1/kbarkei/build/karl:
kc7.c - most important client
zz_kbarkei.log* - timing outputs
k.awk - format zz_kbarkei.log's suitable for gnuplot
* (all other files)
Various results output:
/usr/dub/tmp1/kbarkei/build/karl/plot/*