Background document for ISMIR 2000 on Music Information Retrieval Evaluation

Tim Crawford, 11 October 2000

Contents

  1. Candidate Music Test Collections
  2. The 'Cranfield' model for Information Retrieval Evaluation

1. Candidate Music Test Collections

Don Byrd, 6 April 2000 (ed Tim Crawford, 11 October 2000)

NB: appearance in this list means only that we think the collection exists in machine-readable form somewhere and might be available; there are serious copyright as well as other availability issues for most of these!

Exception: collections whose name is prefixed with "*" don't yet exist, as far as we know.

Name                          Representation (encoding)  Description
----------------------------  -------------------------  -----------
MELDEX (NZDML) Folksongs      CMN (MELDEX)               9400 German, Chinese, and Anglo-American folksongs from two sources. Better: "MELDEX Plus": add back in the c. 200 containing tuplets they removed.
NZDML Fake book collection    CMN (MELDEX?)              over 1200 popular tunes
*Barlow and Morgenstern       CMN (??)                   10,000 themes of classical pieces
Bach Chorales                 MIDI (SMF)                 short 4-part contrapuntal pieces: 185 (from BG v.39) + c.200 (from elsewhere)
CCARH MuseData                CMN (MuseData, kern)       2461 complete movements of 634 classical pieces. NB: includes 185 Bach chorales.
RISM                          CMN (Plaine & ...)         100K? incipits from 300K? works
Huron                         CMN (Humdrum kern)         c.5000 pieces
JHU/Levy sheet music          CMN (image)                scanned sheet music (29,000 pieces, c. 100,000 pages, c. 80% public-domain)
L of C/Duke sheet music       CMN (image)                " ("Historic American Sheet Music, 1850-1920": 3042 pieces from the Duke coll.)
L of C/copyright sheet music  CMN (image)                " ("American Sheet Music, 1870-1885": 22,000 pieces copyrighted in those years)
NZDML MidiMax                 MIDI (SMF?)                c.100K MIDI files (collected from the Web?)
Uitdenbogerd & Zobel          MIDI (SMF)                 10,466 MIDI files (collected from the Web)
?? *Audio files               Audio (??)                 ?? (much prefer w/o lossy compression!)

CMN = some form of Conventional Music Notation (perhaps very incompletely encoded)
image = GIF, TIFF, JPEG, etc.; needs Optical Music Recognition to be useful
SMF = Standard MIDI File (the standard encoding of Time-Stamped MIDI)

2. The 'Cranfield' model for Information Retrieval Evaluation

Tim Crawford, 11 October 2000

The model was first described in a classic article from the 1960s:

CLEVERDON, C.W., MILLS, J. and KEEN, M.: Factors Determining the Performance of Indexing Systems, Volume I - Design, Volume II - Test Results, ASLIB Cranfield Project, Cranfield (1966).

This article is reprinted in:

SPARCK JONES, K., and WILLETT, P., eds. Readings in Information Retrieval (San Francisco: Morgan Kaufmann, 1997).

It is also covered in some depth in:

van RIJSBERGEN, C. J.: Information Retrieval (first edn.: London: Butterworths, 1975). (Online version at: http://www.dcs.gla.ac.uk/Keith/Preface.html )

and many others (apologies for omissions).

The following excellent work doesn't seem to mention either Cleverdon or the Cranfield model, but it includes a good discussion of IR evaluation in general and of TREC, which is based on the model:

WITTEN, I., MOFFAT, A. and BELL, T.: Managing Gigabytes (current edn.: Morgan Kaufmann, 1999).

Van Rijsbergen sums up the idea thus (quoted here, without permission, from http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html ):

The ... question (what to evaluate?) boils down to what can we measure that will reflect the ability of the system to satisfy the user. ... Cleverdon ... listed six main measurable quantities:
  1. The coverage of the collection, that is, the extent to which the system includes relevant matter;
  2. the time lag, that is, the average interval between the time the search request is made and the time an answer is given;
  3. the form of presentation of the output;
  4. the effort involved on the part of the user in obtaining answers to his search requests;
  5. the recall of the system, that is, the proportion of relevant material actually retrieved in answer to a search request;
  6. the precision of the system, that is, the proportion of retrieved material that is actually relevant.
It is claimed that (1)-(4) are readily assessed. It is recall and precision which attempt to measure what is now known as the effectiveness of the retrieval system. In other words it is a measure of the ability of the system to retrieve relevant documents while at the same time holding back non-relevant ones. It is assumed that the more effective the system the more it will satisfy the user. It is also assumed that precision and recall are sufficient for the measurement of effectiveness.
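
The two effectiveness measures in items 5 and 6 reduce to simple set arithmetic. As a minimal sketch (not taken from Cleverdon or van Rijsbergen; the function name, the handling of empty sets, and the tune identifiers are all assumptions made purely for illustration), they might be computed for a single query like this:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved

    # Returning 0.0 for an empty set is a convention assumed by this sketch.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 tunes retrieved, 3 of which are among the 6 relevant ones.
retrieved = ["t1", "t2", "t3", "t4"]
relevant = ["t1", "t2", "t3", "t7", "t8", "t9"]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

In this made-up case precision is 3/4 = 0.75 and recall is 3/6 = 0.5: most of what was returned is relevant, but half of the relevant material was missed.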

On the all-important definition of 'relevance', van Rijsbergen has this to say:

Relevance is a subjective notion. Different users may differ about the relevance or non-relevance of particular documents to given questions. However, the difference is not large enough to invalidate experiments which have been made with document collections for which test questions with corresponding relevance assessments are available. These questions are usually elicited from bona fide users, that is, users in a particular discipline who have an information need. The relevance assessments are made by a panel of experts in that discipline. So we now have the situation where a number of questions exist for which the 'correct' responses are known. It is a general assumption in the field of IR that should a retrieval strategy fare well under a large number of experimental conditions then it is likely to perform well in an operational situation where relevance is not known in advance.

In summary, Cranfield-model evaluation requires three things: a collection of documents; a set of queries; and a set of relevance judgments for those queries and documents.
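
To make the three ingredients concrete, here is a rough sketch of how a test collection and an evaluation run might be organised. Everything in it is an assumption for illustration only: the class and function names, the toy "documents" (strings of note names), the relevance judgments, the exact-substring search standing in for a real retrieval system, and the choice to average precision and recall over queries.

```python
from dataclasses import dataclass


@dataclass
class TestCollection:
    documents: dict  # document id -> content (here, a toy string of note names)
    queries: dict    # query id -> query content
    judgments: dict  # query id -> set of document ids judged relevant


def evaluate(collection, search):
    """Return (mean precision, mean recall) of `search` over all queries.

    Assumes the collection holds at least one query.
    """
    precisions, recalls = [], []
    for qid, query in collection.queries.items():
        retrieved = set(search(query, collection.documents))
        relevant = collection.judgments[qid]
        hits = len(retrieved & relevant)
        precisions.append(hits / len(retrieved) if retrieved else 0.0)
        recalls.append(hits / len(relevant) if relevant else 0.0)
    n = len(collection.queries)
    return sum(precisions) / n, sum(recalls) / n


def substring_search(query, documents):
    """Hypothetical stand-in for a retrieval system: exact substring match."""
    return [doc_id for doc_id, content in documents.items() if query in content]


# Entirely invented data; the judgments stand in for a panel's assessments.
collection = TestCollection(
    documents={"d1": "CDEFG", "d2": "GFEDC", "d3": "CDECD", "d4": "EFGAB"},
    queries={"q1": "CDE", "q2": "EF"},
    judgments={"q1": {"d1", "d3", "d4"}, "q2": {"d4"}},
)

mean_p, mean_r = evaluate(collection, substring_search)
print(mean_p, mean_r)
```

Because the judgments are fixed in advance, any other search function with the same signature can be passed to `evaluate` and compared on equal terms, which is the point of the Cranfield approach.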

Musical 'relevance' remains to be defined ...
