Poetry: Identification, Entity Recognition, and Retrieval

This page describes the datasets made available by John Foley's dissertation work: Poetry: Identification, Entity Recognition, and Retrieval.

Citation Request

If you make use of the datasets provided, please cite the originating PhD dissertation.

@phdthesis{foley2019poetry,
  title    = {{Poetry: Identification, Entity Recognition, and Retrieval}},
  school   = {University of Massachusetts Amherst},
  author   = {Foley, John},
  year     = {2019}, 
}

Poetry Identification

id_datasets.jsonl

The poetry identification dataset contains a total of 2,772 pages labeled for genre. Each line in the downloadable file is formatted as a separate JSON object.

Data subsets

We treat this as a binary classification dataset and split it into three subsets. The "use" key of each record identifies the subset it belongs to:

training: These 1,466 pages were collected using a mix of techniques to find more positives. They are somewhat biased, as described in our work.
random: These 352 pages were collected randomly from the 50,000 INEX set, and best represent the prior probability of finding poetry.
generalization: These 954 pages were collected from 500 distinct books and were not used for training in our work.
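
The subsets can be tallied with a few lines of Python. This is a minimal sketch that relies only on the "use" and "poetry" keys described on this page; the function name and the made-up sample records are illustrative assumptions, not part of the dataset.

```python
import json
from collections import Counter

def count_by_use(lines):
    """Count pages and positive (poetry) pages per subset from JSON-lines input."""
    totals, positives = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        totals[record["use"]] += 1
        if record["poetry"]:
            positives[record["use"]] += 1
    return totals, positives

# Made-up stand-in records; with the real file, pass an open file handle instead:
#   with open("id_datasets.jsonl", encoding="utf-8") as handle:
#       totals, positives = count_by_use(handle)
sample = [
    '{"use": "training", "poetry": true}',
    '{"use": "random", "poetry": false}',
    '{"use": "training", "poetry": false}',
]
totals, positives = count_by_use(sample)
print(dict(totals), dict(positives))
```

On the real file, the totals should match the subset sizes listed above if the file is as described.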

Data Fields

Below we present a part of the first line of this file in order to illustrate the fields provided with this dataset.

{ 
  "book": "aceptadaoficialmente00gubirich", 
  "features": {"alphanum_letters": 0.9614711033, ...  "words_per_line_total": 235.0}, 
  "page": 356,
  "poetry": true,
  "tags": ["POETRY"],
  "use": "generalization",
  "words": "HOMENAJE\t..."
}

The keys "book" and "page" refer to the Internet Archive resource the page was sampled from.
The "features" key is a mapping of string descriptions to numeric feature values used for classification.
The "poetry" key is true if a human has judged the page to be poetry, and the raw labels assigned to this page are available in the "tags" list.
The "words" key contains all the OCR-recognized words on the page. \t is inserted between each word and \n delimits each line.
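
Because the "words" value encodes page layout with \t and \n, it can be split back into lines of tokens. The sketch below assumes only that encoding; the function name and the sample string are made up for illustration.

```python
def page_lines(words):
    """Split an OCR string into lines, each a list of word tokens.

    Lines are separated by newlines and words by tabs; empty tokens are
    dropped, so a blank line becomes an empty list.
    """
    return [
        [token for token in line.split("\t") if token]
        for line in words.split("\n")
    ]

# Made-up sample in the same encoding as the "words" field.
sample_words = "HOMENAJE\tA\tLA\n\nPATRIA"
for tokens in page_lines(sample_words):
    print(tokens)
```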

Poetry 50K

poetry50k.dedup.jsonl.gz (439 MiB) MD5: be61efeff6ae2b1ffb86519aeeabf427

This is a collection of 847,985 scanned pages that were identified to contain poetry. This file contains the 570,930 unique poems found.

Because we have since dramatically improved our duplicate detection code, the results in the dissertation are reported on a slightly larger collection of 598,333 unique pages. We are in the process of re-evaluating the retrieval results, but expect to find no significant differences.

Data Fields

Below we present a part of a random line of this file in order to illustrate the fields provided with this dataset.

{ 
  "book": "whitewingsyachti00blaciala", 
  "duplicates": [
    "whitewingsyachti00blaciala/232",
    "blackwoodsmagazi33edinuoft/692",
    ...
    "goldenleavesfrom00howsuoft/534"
    ],
  "page": 232,
  "score": 0.5833,
  "features": {"alphanum_letters": 0.9374, ...  "words_per_line_total": 235.0}, 
  "text": "HIDDEN\tSPRINGS\t227\n..."
}

The keys "book" and "page" refer to the Internet Archive resource the page was sampled from.
The key "duplicates" contains a list of "$book/$page" of other documents that were detected to be duplicates of this page. The length of this list may be useful as a popularity feature.
The "score" field is the output of our formatting Random Forest model on this page.
The "features" map is again our set of classification features.
The "poetry" key is true if a human has judged the page to be poetry, and the raw labels assigned to this page are available in the "tags" list.
The "text" key contains all the OCR-recognized words on the page. \t is inserted between each word and \n delimits each line.
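
Since the file is large, it is convenient to stream it rather than load it whole. The following is a sketch under the assumption that the file is gzip-compressed JSON lines with the "book", "page", and "duplicates" keys shown above; the function names are hypothetical, and the demo uses a tiny in-memory stand-in rather than the real download.

```python
import gzip
import heapq
import io
import json

def iter_pages(source):
    """Yield one dict per line of a gzip-compressed JSON-lines file.

    `source` may be a filename or a binary file object.
    """
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

def most_duplicated(pages, n=10):
    """Return the n (duplicate_count, "book/page") pairs with the longest duplicates lists."""
    scored = (
        (len(page.get("duplicates", [])), f'{page["book"]}/{page["page"]}')
        for page in pages
    )
    return heapq.nlargest(n, scored)

# Made-up records, compressed in memory to mimic the real file.
sample = (
    b'{"book": "a", "page": 1, "duplicates": ["a/1", "b/2"]}\n'
    b'{"book": "c", "page": 3, "duplicates": []}\n'
)
pages = list(iter_pages(io.BytesIO(gzip.compress(sample))))
top = most_duplicated(pages, n=1)
print(top)
```

With the real download, `most_duplicated(iter_pages("poetry50k.dedup.jsonl.gz"))` would use the duplicate-list length as the popularity signal mentioned above.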

Poetry Named Entity Recognition

TBD: Dump SQLite3 database to something more standard.

Send an email to jfoley@cs.umass.edu to get this page updated faster!

Poetry Retrieval

Our retrieval dataset is available in TREC Query-Relevance (qrel) format. The corpus used was our Poetry 50K dataset, available above.

The following judgment files contain 1,347 crowdsourced relevance judgments, together with adjudications for cases where annotators disagreed on whether the document represented poetry.

mturk.max.qrel: These qrels reflect the most optimistic annotator: if any annotator thought a document was relevant, it is marked relevant.
mturk.min.qrel: These qrels reflect the most pessimistic annotator: if any annotator thought a document was not relevant, it is marked not relevant.
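
As a rough illustration of the max/min rule described above, suppose each query-document pair has a list of integer labels from several annotators. The per-annotator labels, the `votes` data, and the function name below are hypothetical and not part of the released files, and the adjudication step is not modeled.

```python
def optimistic_and_pessimistic(labels_by_pair):
    """Collapse per-annotator labels into max (optimistic) and min (pessimistic) judgments."""
    optimistic = {pair: max(labels) for pair, labels in labels_by_pair.items()}
    pessimistic = {pair: min(labels) for pair, labels in labels_by_pair.items()}
    return optimistic, pessimistic

# Hypothetical annotator labels keyed by (query id, document id).
votes = {
    ("satire", "doc-a"): [2, 0, 1],
    ("satire", "doc-b"): [1, 1],
}
optimistic, pessimistic = optimistic_and_pessimistic(votes)
print(optimistic)
print(pessimistic)
```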

Each line of these files contains four standard columns: the query id, an unused column, the document id, and the relevance judgment.

  satire 0 songsofpress00millrich/103 2
  satire 0 cabinetofpoetryc05pratuoft/332 0
  satire 0 cabinetofpoetryc05pratuoft/327 1
  satire 0 introductiontosh00fleauoft/113 2
  satire 0 angelandtheking00wilsrich/205 -1

Consider using software such as trec_eval to evaluate retrieval systems with these files.
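
For quick inspection outside trec_eval, the four-column format is easy to parse directly. This is a minimal sketch assuming whitespace-separated columns as in the example above; the function name is an assumption, and judgments are kept as raw integers without interpreting their meaning.

```python
from collections import defaultdict

def load_qrels(lines):
    """Parse qrel lines into {query_id: {doc_id: judgment}}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _unused, doc_id, judgment = parts
        qrels[query_id][doc_id] = int(judgment)
    return qrels

# The example lines shown above.
sample = """\
  satire 0 songsofpress00millrich/103 2
  satire 0 cabinetofpoetryc05pratuoft/332 0
  satire 0 cabinetofpoetryc05pratuoft/327 1
  satire 0 introductiontosh00fleauoft/113 2
  satire 0 angelandtheking00wilsrich/205 -1
"""
qrels = load_qrels(sample.splitlines())
print(len(qrels["satire"]), "judgments for query 'satire'")
```

With the real files, pass an open file handle, for example `load_qrels(open("mturk.max.qrel", encoding="utf-8"))`.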