This page describes the datasets made available by John Foley's dissertation work: Poetry: Identification, Entity Recognition, and Retrieval.
@phdthesis{foley2019poetry,
  title = {{Poetry: Identification, Entity Recognition, and Retrieval}},
  school = {University of Massachusetts Amherst},
  author = {Foley, John},
  year = {2019},
}
The poetry identification dataset contains a total of 2,772 pages labeled for genre. Each line in the downloadable file is formatted as a separate JSON object.
We treat this dataset as a binary classification dataset and segment it into three subsets. The "use" key of each record identifies the subset that record belongs to.
Below we present a part of the first line of this file in order to illustrate the fields provided with this dataset.
{
  "book": "aceptadaoficialmente00gubirich",
  "features": {"alphanum_letters": 0.9614711033, ... "words_per_line_total": 235.0},
  "page": 356,
  "poetry": true,
  "tags": ["POETRY"],
  "use": "generalization",
  "words": "HOMENAJE\t..."
}
The book and page keys refer to the Internet Archive resource the page was sampled from. The features key is a mapping of string descriptions to numeric feature values used for classification. The poetry key is true if a human has judged the page to be poetry, and the raw labels assigned to the page are available in the tags list. The words key contains all the OCR-recognized words on the page: \t is inserted between each word and \n delimits each line.
MD5: be61efeff6ae2b1ffb86519aeeabf427
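The field descriptions above suggest a simple way to read the file. The following is a minimal sketch, not the code used in the dissertation: the function names and the file path are hypothetical, and it assumes only what is stated above (one JSON object per line, a "use" key naming the subset, and a "words" string using \t and \n as separators).

```python
import json
from collections import defaultdict


def load_records(path):
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def split_by_use(records):
    """Group records by the value of their "use" key."""
    subsets = defaultdict(list)
    for record in records:
        subsets[record["use"]].append(record)
    return dict(subsets)


def page_lines(words):
    """Turn a "words" string into a list of lines, each a list of words."""
    return [line.split("\t") for line in words.split("\n")]
```

For example, `split_by_use(load_records("poetry-id.jsonl"))` would return a dictionary keyed by subset name, where `poetry-id.jsonl` stands in for wherever the downloaded file was saved.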
This is a collection of 847,985 scanned pages that were identified to contain poetry. This file contains the 570,930 unique poems found.
Because we have since dramatically improved our duplicate detection code, the results in the dissertation are presented on a slightly larger collection of 598,333 unique pages. We are in the process of re-evaluating the retrieval results, but expect to find no significant differences.
Below we present a part of a random line of this file in order to illustrate the fields provided with this dataset.
{
  "book": "whitewingsyachti00blaciala",
  "duplicates": [
    "whitewingsyachti00blaciala/232",
    "blackwoodsmagazi33edinuoft/692",
    ...
    "goldenleavesfrom00howsuoft/534"
  ],
  "page": 232,
  "score": 0.5833,
  "features": {"alphanum_letters": 0.9374, ... "words_per_line_total": 235.0},
  "text": "HIDDEN\tSPRINGS\t227\n..."
}
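In the example record, each entry of the duplicates list looks like a book identifier and a page number joined by a slash, and the record's own page appears in the list. The sketch below relies on that reading, which is an assumption drawn from this one example rather than a documented guarantee; the helper names are hypothetical.

```python
def parse_page_id(page_id):
    """Split an assumed "book/page" identifier into (book, page number)."""
    book, _, page = page_id.rpartition("/")
    if not book or not page.isdigit():
        raise ValueError(f"not a book/page identifier: {page_id!r}")
    return book, int(page)


def other_copies(record):
    """List the duplicate identifiers that are not the record's own page."""
    own = (record["book"], record["page"])
    return [pid for pid in record["duplicates"] if parse_page_id(pid) != own]
```

Splitting on the last slash keeps the helper working even if a book identifier were ever to contain a slash itself.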
In the text field, \t is inserted between each word and \n delimits each line.
TBD: Dump SQLite3 database to something more standard.
Send an email to jfoley@cs.umass.edu to get this page updated faster!
Our retrieval dataset is available in TREC Query-Relevance (qrel) Format. The corpus used was our Poetry 50K dataset, available above.
The following judgment files present 1,347 crowdsourced relevance judgments, along with adjudication for cases where annotators disagreed on whether the document represented poetry.
These files contain four standard columns: the query id, an unused column, the document id, and the relevance judgment.
satire 0 songsofpress00millrich/103 2
satire 0 cabinetofpoetryc05pratuoft/332 0
satire 0 cabinetofpoetryc05pratuoft/327 1
satire 0 introductiontosh00fleauoft/113 2
satire 0 angelandtheking00wilsrich/205 -1
Consider using software such as trec_eval to evaluate retrieval systems with these files.
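For quick inspection of the judgments without a full evaluation tool, the four-column format described above can be read directly. This is a minimal sketch with hypothetical function names; it assumes the columns are whitespace-separated and that query ids contain no whitespace, as in the sample lines. The threshold parameter is an illustrative choice, not a cutoff defined by the dataset.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each query id to a {document id: judgment} dictionary."""
    qrels = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        query_id, _unused, doc_id, judgment = fields
        qrels[query_id][doc_id] = int(judgment)
    return dict(qrels)


def judged_at_least(qrels, query_id, threshold=1):
    """Return the sorted document ids whose judgment is >= threshold."""
    judged = qrels.get(query_id, {})
    return sorted(doc for doc, score in judged.items() if score >= threshold)
```

Passing an open file handle as `lines` works, since iterating over a text file yields one line at a time.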