Summary Report

--------------

Researcher: Karl Barkei

Primary Faculty Advisor: Kathryn McKinley

Faculty Advisor: Prashant Shenoy

August 6, 1999

 

Introduction

------------

My research/implementation problem centered around improving the speed

of many-document retrieval in INQUERY. We intended to achieve this

improvement by optimizing certain aspects of the I/O involved in

getting a document from disk.

 

Problem Context

---------------

For a human user browsing query results, being able to retrieve a

large number of the returned documents quickly is not a particularly

useful system feature. A human user usually browses documents one at

a time and may only look at a few of the documents for any given query.

However, we could imagine a system where it would be desirable to

retrieve a large number of documents quickly. Perhaps we want to

analyze all of the documents returned for a given query or archive

them. It is in this context that the problem becomes worthwhile.

 

Project Objectives

------------------

Since we were interested in seeing if we could change the performance

of INQUERY, we needed a method of evaluating this change. To this

end, I familiarized myself with timing routines used by Zhihong Lu and

Brendon Cahoon to time INQUERY. The next steps were to make the actual

modifications to INQUERY to improve the I/O. These steps involved the

following:

1) Deciding on a platform

2) Make the reads from disk asynchronous reads.

3) Give INQUERY the ability to stripe documents.

4) Give INQUERY the ability to change the layout of documents on disk.

 

Deciding on a platform

----------------------

We decided to use Digital UNIX OSF1 v4.0 (alpha) as a platform

Making the reads asynchronous

----------------------------

The first part of the project involved trying to change the reads that

are performed in retrieving a document into asynchronous reads. The

normal way INQUERY retrieves documents is through a call to the API

function 'inq_get_doc()'. This function accepts a single internal

document id as an argument and allows users to retrieve that single

document from a collection. The function reads positional information

about the document (i.e., which file it is in, the offset in that

file, and the length) from the key-file using an 'fread()' call. After

obtaining this information, it then reads the actual document from the

collection file, also using an 'fread()'.

The initial idea was to simply change the fread() that actually read

the document into an asynchronous read. Several complications arose

in implementing this:

1) If the key-file was still being read via an fread(), than we would

severely limit the asynchrony achieved because we would still do an

fread() from the key-file and that would effectively synchronize all

the reads of the document bodies.

2) Changing the key-file fread() itself into an asynchronous read was

complicated, for one, and, two, of limited use because we couldn't

schedule the asynchronous read of a document until we have its

positional information.

3) With the original inq_get_doc(), when the function completed the

document was returned. However, with an asynchronous version this

would no longer be the case. This brought up the question of how to

know when the document was actually returned.

4) The asynchronous I/O API in Digital UNIX limits the number of

simultaneous I/O requests to AIO_MAX = 64. This would make

implementation more complicated.

Another idea was to make a new function inq_get_doc_list_async() that would

take a list of document id's as an argument and return a list of

documents. This seemed like a desirable API as it conformed closer to

the problem context and it solves problem 1 above by allowing us to

get all the positional information at once. However it still suffered

from the last two problems.

James Allan suggested not using a truly asynchronous read. Instead, I

wrote a function inq_get_doc_list_async() (The name is misleading

because it doesn't really do an asynchronous read). This function

reads all the positional information at once and then schedules the

reads for the requested documents in AIO_LISTIO_MAX = 64 request

chunks via the lio_listio() system call.

In addition to this function, I wrote another function

inq_get_doc_list() which also reads all the positional information up

front, but instead of calling lio_listio(), it just uses an fread()

inside of a for-loop. The purpose of this function is to allow us a

finer comparison of the different read methods.

After writing the function it was time to evaluate it. My method went

as follows:

for <function> = inq_get_doc_list_async(), inq_get_doc_list(),

inq_get_doc()

for N = {1,10,100,300,500,750,1000,2000,3500,5000,10000}

1. Randomly generate N pseudo-random document id's.

2. call inq_get_doc_list() retrieving N documents to prime cache.

3. run 6 trials of:

1. Attempt to clear cache.

2. Time <function> retrieving N documents..

4. Construct an average of the the last 5 trials.

 

This method was encapsulated in a C program and took approximately 14

hours to run on dubbo.cs.umass.edu.

I say "attempt to clear" the cache because I could not find an

effective way to completely clear the cache. The original suggestion

of how to clear the cache (also known as "chilling the system") was to

sequentially read a large file. This proved to be terrible at

clearing the cache. Since then I have tried or considered many

possible solutions:

- Sequentially read a large unrelated file

- Randomly read a large unrelated file

- Sequentially read 200 MB total from 25 different files.

- Read n = 1 to 90,000 documents from an unrelated collection(s)

- sync() function

- use identical copies of the collection

- changing system attributes of the UBC (Unified Buffer Cache)

- re-booting the system

- re-mounting the file-system

I posted/searched newsgroups and the web for a solution. I e-mailed

CSCF who in-turn e-mailed the tru-64 managers list. I talked with

Prashant Shenoy. All to no avail.

The Digital Unix OS implements file caching with the Unified Buffer

Cache (UBC) (the data cache) and the metadata buffer cache (cache for

inodes, indirect blocks, directory blocks, etc.) both of which reside

in main memory. The UBC competes with the virtual memory system for

main memory and may grow dynamically. The reason reading a single

large file is not effective at clearing the UBC is that the UBC limits

the ability of a single file to dominate the cache. It effectively

sets a resident set size for a file. A file may grow to this size in

the cache, after which it starts stealing pages from itself. (I think)

To me, the ideal solution would be to give me a user mountable

partition and re-mount the file-system in between trials. However this

did not swing with CSCF.

In the end, I chilled the system by calling sync() 3 times, reading

1000 random documents from an unrelated trec database, and then call

sync() 3 more times. This works fairly well, but still not

completely; this is why I prime the cache. That is, I have tried to

reduce the effect of caching as much as possible and I have assumed

that all the functions will benefit the same amount from whatever

caching is still going on.

Here are the results:

(Tests ran on dubbo.cs.umass.edu at night,

tried to make sure no one else was using

the machine)

Time (in seconds)

# docs inq_get_doc_list_async inq_get_doc_list inq_get_doc

------ ---------------------- ---------------- -----------

1 0.065870 0.101565 0.122679

10 0.824172 1.009599 1.007446

100 7.231273 8.610550 8.725015

300 18.576859 22.537199 22.256295

500 27.693490 34.184263 34.254342

750 38.431305 47.573210 49.077716

1000 47.667473 60.731188 63.052914

2000 89.290067 116.824428 124.715043

3500 157.658777 209.836924 229.523617

5000 231.657306 333.888941 373.012260

10000 559.919367 918.606995 1139.637049

(Note: each time is an average of 5 trials)

I think what accounts for the improvement seen in my function

inq_get_doc_list_async() is attributable to disk request reordering

and overhead overlap.

 

Striping documents

------------------

The idea of this part of the project was to look at how various

striping parameters (size of striping unit, number of disks striped

across) affect performance and to ask the question: In retrieving a

set of documents, what proportion of these need to be on striped media

for an advantage?

In addition, we would expect to see large improvements using striping

in combination with the new inq_get_doc_list_async() as this would

allow true I/O parallelism.

I learned about what striping is and what some of the issues are in

striping in studying this. Unfortunately, I did not completely

understand the implementation of this in INQUERY and getting hardware

setup to do the striping on was difficult.

 

Changing the layout of documents on disk

-----------------------------------------

I never got into the details of this part of the project. My

understanding of it was that certain documents tend to show up

together in response to a given query. The question was: How can we

translate this logical locality into spatial locality on disk?

 

Files

-----

The modified INQUERY files are in:

/usr/dan/users8/kbarkei/devo/inquery/build/src:

dm_interface.c

get_doc.c

/usr/dan/users8/kbarkei/devo/inquery/build/h:

docman.h

inq_api.h

/usr/dan/users8/kbarkei/devo/inquery/build/make:

gmake.rules

inquery.make

makefile.deps

My test client and other extraneous stuff:

/usr/dub/tmp1/kbarkei/build/karl:

kc7.c - most important client

zz_kbarkei.log* - timing outputs

k.awk - format zz_kbarkei.log's suitable for gnuplot

* (all other files)

Various results output:

/usr/dub/tmp1/kbarkei/build/karl/plot/*