Poetry: Identification, Entity Recognition, and Retrieval

This page describes the datasets made available by John Foley's dissertation work: Poetry: Identification, Entity Recognition, and Retrieval.

Citation Request

If you make use of the datasets provided, please cite the originating PhD dissertation.
@phdthesis{foley2019poetry,
  title    = {{Poetry: Identification, Entity Recognition, and Retrieval}},
  school   = {University of Massachusetts Amherst},
  author   = {Foley, John},
  year     = {2019}, 
}

Poetry Identification

id_datasets.jsonl

The poetry identification dataset contains a total of 2,772 pages labeled for genre. Each line in the downloadable file is formatted as a separate JSON object.

Data subsets

We treat this as a binary classification task and segment the data into three subsets. Each page's "use" key indicates which subset it belongs to:

training
These 1,466 pages were collected using a mix of techniques intended to surface more positives. They are somewhat biased, as described in our work.
random
These 352 pages were collected randomly from the 50,000 INEX set, and best represent the prior probability of finding poetry.
generalization
These 954 pages were collected from 500 distinct books and were not used for training in our work.

Data Fields

Below we show part of the first line of this file to illustrate the fields provided with this dataset.

{ 
  "book": "aceptadaoficialmente00gubirich", 
  "features": {"alphanum_letters": 0.9614711033, ...  "words_per_line_total": 235.0}, 
  "page": 356,
  "poetry": true,
  "tags": ["POETRY"],
  "use": "generalization",
  "words": "HOMENAJE\t..."
}
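The JSONL layout above can be loaded with a few lines of Python. The sketch below is illustrative, not part of the released tooling: the file path and the function name load_splits are our own, and we assume each line carries the "use", "features", and "poetry" keys shown in the example.

```python
import json
from collections import defaultdict

def load_splits(path="id_datasets.jsonl"):
    """Group labeled pages by their "use" key (training / random / generalization)."""
    splits = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            page = json.loads(line)
            # "poetry" is the binary label; "features" holds the numeric inputs.
            splits[page["use"]].append((page["features"], page["poetry"]))
    return splits
```

From the returned dictionary, splits["training"] would feed a classifier, while splits["random"] and splits["generalization"] serve as held-out evaluation sets.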

Poetry 50K

poetry50k.dedup.jsonl.gz (439 MiB)

This collection was derived from 847,985 scanned pages identified as containing poetry. After deduplication, this file contains the 570,930 unique poems found.

Because we have since dramatically improved our duplicate detection code, results in the dissertation are presented on a slightly larger collection of 598,333 unique pages. We are in the process of re-evaluating our retrieval claims, but expect to find no significant differences.

Data Fields

Below we show part of a random line of this file to illustrate the fields provided with this dataset.

{ 
  "book": "whitewingsyachti00blaciala", 
  "duplicates": [
    "whitewingsyachti00blaciala/232",
    "blackwoodsmagazi33edinuoft/692",
    ...
    "goldenleavesfrom00howsuoft/534"
    ],
  "page": 232,
  "score": 0.5833,
  "features": {"alphanum_letters": 0.9374, ...  "words_per_line_total": 235.0}, 
  "text": "HIDDEN\tSPRINGS\t227\n..."
}
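At 439 MiB compressed, this file is best processed as a stream rather than decompressed in full. A minimal sketch, assuming the record layout shown above; the function name iter_poems and the "book/page" id convention (which mirrors the entries in the "duplicates" list) are ours.

```python
import gzip
import json

def iter_poems(path="poetry50k.dedup.jsonl.gz"):
    """Stream unique poems one at a time from the gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            poem = json.loads(line)
            # Build a "book/page" id matching the style of the "duplicates" entries.
            poem_id = f'{poem["book"]}/{poem["page"]}'
            yield poem_id, poem
```

Because this is a generator, downstream code (indexing, filtering by "score", expanding "duplicates") can consume the collection without holding it all in memory.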

Poetry Named Entity Recognition

TBD: Dump SQLite3 database to something more standard.

Send an email to jfoley@cs.umass.edu to get this page updated faster!

Poetry Retrieval

Our retrieval dataset is available in TREC Query-Relevance (qrel) Format. The corpus used was our Poetry 50K dataset, available above.

The following judgment files contain 1,347 crowdsourced relevance judgments, along with adjudications for cases where annotators disagreed on whether a document represented poetry.

mturk.max.qrel
This represents qrels created from the most optimistic annotator: if any annotator thought a document was relevant, it is marked relevant.
mturk.min.qrel
This represents qrels created from the most pessimistic annotator: if any annotator thought a document was not relevant, it is marked non-relevant.

These files contain the four standard columns: the query id, an unused column, the document id, and the relevance judgment.

  satire 0 songsofpress00millrich/103 2
  satire 0 cabinetofpoetryc05pratuoft/332 0
  satire 0 cabinetofpoetryc05pratuoft/327 1
  satire 0 introductiontosh00fleauoft/113 2
  satire 0 angelandtheking00wilsrich/205 -1

Consider using software such as trec_eval to evaluate retrieval systems with these files.