<<O>>  Difference Topic Collections (r1.8 - 17 Jun 2004 - CharlesSutton)

META TOPICPARENT WebHome

Collections

Line: 117 to 117

  • hongkong
    • This corpus contains excerpts from the Official Record of Proceedings of the Legislative Council of the Hong Kong Special Administrative Region (HKSAR) from October 1995 to April 2000.
Changed:
<
<
  • iesl
    • This collection is maintained by Charles Sutton for IESL.
>
>
  • iesl : Several datasets for document classification and information extraction, maintained by Charles Sutton for IESL. These include:
    • 20_newsgroups
      • About 20,000 Usenet postings from 20 newsgroups, gathered by Ken Lang at CMU in the mid-1990s. This is the original set, without the various edits made by Jason Rennie and others.
    • webkb
      • 8,282 Web pages spidered from various computer science departments in January 1997, classified as faculty, student, course, etc.
    • cora
      • 1,869 citations from research papers, hand-segmented by field (e.g., title, author, date) and clustered so that citations referring to the same paper are grouped together. Collected and clustered by the Cora project. Segmented by Fuchun Peng and Michael Hay.
    • citeseer
      • 1,564 citations from research papers in Citeseer, hand-segmented by field (e.g., title, author, date) and clustered so that citations referring to the same paper are grouped together. Collected and clustered by the Citeseer project. Segmented by Fuchun Peng and Michael Hay.

  • inspec
    • The collection is in what we call "dot format".
 <<O>>  Difference Topic Collections (r1.7 - 10 May 2004 - EricGalis)

META TOPICPARENT WebHome

Collections

Line: 28 to 28

      • Revised March 1994
    • Volume 3
      • LDC93T3-3.1 1993
Added:
>
>
      • Revised March 1994

  • TREC
    • Vol 4
      • May 1996
    • Vol 5
      • April 1997

  • UN Parallel Text
    • English LDC94T4B-1, 1994
    • Spanish LDC94T4B-3.1, 1994

  • TDT2
    • For Training and Development Testing
    • English Text Corpus
    • Mandarin Text Corpus
    • Evaluation Test Material
    • English and Mandarin Text Corpus

  • Virginia Polytechnic Institute

  • Hong Kong Hansard Parallel Text
    • LDC2000T50

  • Arabic Newswire A Corpus
    • LDC2001T55

  • Prague Dependency Treebank 1.0
    • LDC2001T10

  • CSR Hub-4 Language Model
    • LDC1998T31, 1996

  • Continuous Speech Recognition Corpus
    • LDC95T6

  • Japanese Business News Text
    • LDC1995T8

  • US Arabic English Parallel Text
    • Version 1 Beta

  • AQUAINT Corpus of English News Text
    • LDC2002T31

  • .GOV Test Collection
    • 8 Disks

  • European Corpus Initiative
    • Multilingual Corpus 1

  • English Gigaword
    • LDC2003T05

  • WT10G Test Collection
    • April 2000

  • TRECVID 2003
    • Keyframes

  • HARD GovDocs
    • LDC2003E15

  • BBN Identifinder

These can be found in $COLLECTIONS, /usr/dan/data1/collections/
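The location above can also be resolved from a script. The following is a minimal sketch, assuming that the `COLLECTIONS` environment variable, when set, points at the same directory as the literal path given above; the helper names `collections_root` and `list_collections` are hypothetical and not part of any existing tool here.

```python
import os
from pathlib import Path

# Literal path quoted on this page; used only when $COLLECTIONS is unset.
DEFAULT_COLLECTIONS = "/usr/dan/data1/collections/"


def collections_root(env=None):
    """Return the collections directory: $COLLECTIONS if set, else the default."""
    env = os.environ if env is None else env
    return Path(env.get("COLLECTIONS") or DEFAULT_COLLECTIONS)


def list_collections(root):
    """Return the sorted names of the subdirectories under root (one per collection)."""
    root = Path(root)
    if not root.is_dir():
        # The directory is not mounted on this machine; report nothing.
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    for name in list_collections(collections_root()):
        print(name)
```

Run on a machine where the directory is mounted, this prints one collection directory name per line; elsewhere it prints nothing.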

Line: 153 to 217

  • west
Changed:
<
<
-- EricGalis - 23 Apr 2004
>
>
-- EricGalis - 10 May 2004
 <<O>>  Difference Topic Collections (r1.6 - 10 May 2004 - EricGalis)

META TOPICPARENT WebHome

Collections

Added:
>
>

Here are the collections we have on CD

  • Hansard French English
    • LDC95T20

  • Association for Computational Linguistics Data Collection Initiative
    • CDROM 1 Sept 1991

  • European Languages News Corpus
    • LDC95T11

  • Penn Treebank
    • 1995 Release 2

  • North American News Text Corpus
    • LDC95T21, 2 disks, plus 2-disk supplement with AP Worldstream

  • Tipster
    • Volume 1
      • March 1992
      • Revised March 1994
    • Volume 2
      • July 1992
      • Revised March 1994
    • Volume 3
      • LDC93T3-3.1 1993

These can be found in $COLLECTIONS, /usr/dan/data1/collections/

 <<O>>  Difference Topic Collections (r1.5 - 29 Apr 2004 - EricGalis)

META TOPICPARENT WebHome

Collections

Line: 77 to 77

    • mt_arabic_p1
      • The Xinhua data was drawn from the Xinhua News Agency's Arabic newswire feed in October 2001. The AFP Data was drawn from the LDC's (catalog LDC2001T55) Arabic Newswire Part 1. The story selection from the two newswire collections was controlled by story length: all selected stories contain between 700 and 1500 Arabic characters.
    • muc6
Added:
>
>
      • Message Understanding Conference (MUC) 6 Corpus, Linguistic Data Consortium (LDC) catalog number LDC2003T13 and ISBN 1-58563-239-2.

    • muc_7
Added:
>
>
      • Message Understanding Conference (MUC) 7 Corpus, Linguistic Data Consortium (LDC) catalog number LDC2001T02 and ISBN 1-58563-205-8.

    • Na_news98_1
    • Na_news98_2
    • SIGHAN_Bakeoff
    • ted_transcripts
Added:
>
>
      • This file contains documentation on the Translanguage English Database (TED) Transcripts project, Linguistic Data Consortium (LDC) catalog number LDC2002T03 and ISBN 1-58563-202-3.

    • un_span_eng_parallel
Changed:
<
<
>
>
      • This set of three CD-ROMs contains data drawn from the text archives of the United Nations in New York. The data have been sorted into parallel sets, such that each member of a set represents (in most cases) a careful translation of the other member(s) in the set. All parallel sets include an English version of a document, and either a French or a Spanish version (or both).

  • npl

  • sigir
 <<O>>  Difference Topic Collections (r1.4 - 23 Apr 2004 - EricGalis)

META TOPICPARENT WebHome

Collections

Added:
>
>
These can be found in $COLLECTIONS, /usr/dan/data1/collections/

  • ACE
    • The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR).
Added:
>
>
  • Arabic_English
    • A parallel corpus of Arabic and English texts.

  • aquaint
    • This corpus consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service.
Line: 16 to 21

  • HARD
Changed:
<
<
  • Historical Documents:

    Various collections of handwritten historical documents in image format. For now, these are strictly from the George Washington corpus at the Library of Congress.

>
>
  • Historical Documents: Various collections of handwritten historical documents in image format. For now, these are strictly from the George Washington corpus at the Library of Congress.

  • hongkong
Changed:
<
<
    • This corpus contains excerpts
from the Official Record of Proceedings of the Legislative Council of the Hong Kong Special Administrative Region (HKSAR) from October 1995 to April 2000.
>
>
    • This corpus contains excerpts from the Official Record of Proceedings of the Legislative Council of the Hong Kong Special Administrative Region (HKSAR) from October 1995 to April 2000.

  • iesl
Added:
>
>
    • This collection is maintained by Charles Sutton for IESL.

  • inspec
    • The collection is in what we call "dot format".

  • japanese
    • Nihon Keizai Shimbun, Inc., or NIKKEI, is one of the largest
Changed:
<
<
comprehensive business and economic information networks in the world. The 188MB archive made available by this publisher for research purposes covers the period from December 1, 1993 to November 30, 1994.
>
>
comprehensive business and economic information networks in the world. The 188MB archive made available by this publisher for research purposes covers the period from December 1, 1993 to November 30, 1994.

  • lca

  • ldc
Added:
>
>
    • ACE_2004_Pilot_Corpus_V1.0
      • The "data" directory contains 11 sample APF files annotated for all four proposed annotation tasks, the source text files, and the DTD file, ace-edc.v3.0.0.dtd.
    • ACE3-V1.0
    • ACE3-V1.2
      • Both of these collections contain training data in Arabic, Chinese, and English in the form of broadcast news.
    • ArabicGigaword
      • The Gigaword Arabic Corpus is a comprehensive archive of newswire text data that has been acquired from Arabic news sources by the Linguistic Data Consortium (LDC), at the University of Pennsylvania. Four distinct sources of Arabic newswire are represented here:
        • Agence France Presse (afa)
        • Al Hayat News Agency (alh)
        • An Nahar News Agency (ann)
        • Xinhua News Agency (xia)
    • arabic_newswire_a
      • This file contains documentation on the Arabic Newswire A Corpus, Linguistic Data Consortium (LDC) catalog number LDC2001T55 and ISBN 1-58563-190-6. The Arabic Newswire A Corpus is composed of articles from the Agence France Presse (AFP) Arabic Newswire. The source material was tagged using TIPSTER-style SGML and was transcoded to Unicode (UTF-8). The corpus includes articles from 13 May 1994 to 20 December 2000.
    • atb_10k_translation
      • The project targets the translation of a written Modern Standard Arabic corpus from the Agence France Presse (AFP) newswire archives for July 2000 (files dated 20000715). The corpus consists of 49 source stories, a subset of the 734-story corpus (Arabic Treebank: Part 1 v 2.0, LDC catalog number LDC2003T06).
    • ATB1_v2_0
      • The project targets the description of a written Modern Standard Arabic corpus from the Agence France Presse (AFP) newswire archives for July-November 2000 (files dated 20000715 to 20001115). This corpus includes 734 stories representing 140,265 words (168,123 tokens after clitic segmentation in the Treebank).
    • buckwalter_morphan_1
    • cdroms
    • ctb4
    • ECI_multilingualcorpus1
    • ftp
    • gigaword_eng
      • The Gigaword English Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC), at the University of Pennsylvania. Four distinct international sources of English newswire are represented here:
        • Agence France Presse English Service (afe)
        • Associated Press Worldstream English Service (apw)
        • The New York Times Newswire Service (nyt)
        • The Xinhua News Agency English Service (xie)
    • hksar_laws
      • This FTP publication contains the Hong Kong Laws Parallel Text, produced by the Linguistic Data Consortium (LDC), catalog number LDC2000T47, ISBN 1-58563-170-1. The Hong Kong Laws Parallel Text was obtained during January 1999 from http://www.justice.gov.hk, the bilingual website of the Department of Justice of the Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China.
    • hksar_news
      • This FTP publication contains the Hong Kong News Parallel Text, produced by the Linguistic Data Consortium (LDC), catalog number LDC2000T46, ISBN 1-58563-169-8. The Hong Kong News Parallel Text was created when the LDC collected parallel Cantonese-English news articles from the Information Services Department of the Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China.
    • jeanHY.LDC2004T05.tgz
    • morph_anot_kor_text
      • The original text of the corpus is a part of the Korean Newswire corpus (LDC2000T45). The newswire corpus is a collection of Korean Press Agency news articles from June 2, 1994, to March 20, 2000. The portion included in this release consists of a small number of hand-picked articles.
    • mt_arabic_p1
      • The Xinhua data was drawn from the Xinhua News Agency's Arabic newswire feed in October 2001. The AFP Data was drawn from the LDC's (catalog LDC2001T55) Arabic Newswire Part 1. The story selection from the two newswire collections was controlled by story length: all selected stories contain between 700 and 1500 Arabic characters.
    • muc6
    • muc_7
    • Na_news98_1
    • Na_news98_2
    • SIGHAN_Bakeoff
    • ted_transcripts
    • un_span_eng_parallel

  • npl
Line: 74 to 122

  • west
Changed:
<
<
-- EricGalis - 14 Apr 2004
>
>

-- EricGalis - 23 Apr 2004

 <<O>>  Difference Topic Collections (r1.3 - 15 Apr 2004 - ToniRath)

META TOPICPARENT WebHome

Collections

Line: 15 to 15

  • chinese

  • HARD
Added:
>
>

  • Historical Documents:

    Various collections of handwritten historical documents in image format. For now, these are strictly from the George Washington corpus at the Library of Congress.


  • hongkong
    • This corpus contains excerpts
 <<O>>  Difference Topic Collections (r1.2 - 14 Apr 2004 - EricGalis)

META TOPICPARENT WebHome

Collections

Changed:
<
<
  • WordNet
    • installed in /usr/local/wordnet-1.6/bin/wn on dandenong and mounted by Solaris machines.
>
>
  • ACE
    • The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR).

Added:
>
>
  • aquaint
    • This corpus consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service.

Changed:
<
<
-- EricGalis - 07 Feb 2003
>
>
  • bbn

  • cacm

  • chinese

  • HARD

  • hongkong
    • This corpus contains excerpts
from the Official Record of Proceedings of the Legislative Council of the Hong Kong Special Administrative Region (HKSAR) from October 1995 to April 2000.

  • iesl

  • inspec
    • The collection is in what we call "dot format".

  • japanese
    • Nihon Keizai Shimbun, Inc., or NIKKEI, is one of the largest
comprehensive business and economic information networks in the world. The 188MB archive made available by this publisher for research purposes covers the period from December 1, 1993 to November 30, 1994.

  • lca

  • ldc

  • npl

  • sigir
    • 25 Years of SIGIR Proceedings 1978 - 2002.

  • spanish

  • tdt

  • Treebank
    • This is the Penn Treebank Project: Release 2 CDROM, featuring a million words of 1989 Wall Street Journal material annotated in Treebank II style.

  • trec-1 - trec-7
    • The files in these directories contain the relevance judgements that were produced during the TREC experiments.

  • trec_vol_1 - trec_vol_5
    • These are the TREC Research Collections.

  • trec_vol_r4 - trec_vol_r6
    • TREC Filtering Collections.

  • VLC
    • Very Large Corpus collections. We also have VLC2 but it's not online.

  • TDT
    • Topic Detection and Tracking.

  • CACM, INSPEC, NPL, TIME, WEST
    • Small collections with query files and relevance judgments.

  • LDC
    • Not online; READMEs can be found in $COLLECTIONS/ldc/README.

  • PTD1.0
    • Prague Dependency Treebank 1.0: Czech raw text, Czech-English parallel corpus, morphological and tagging tools, tree editors, and tree viewer.

  • west

-- EricGalis - 14 Apr 2004

 <<O>>  Difference Topic Collections (r1.1 - 07 Feb 2003 - EricGalis)
Line: 1 to 1
Added:
>
>
META TOPICPARENT WebHome

Collections

  • WordNet
    • installed in /usr/local/wordnet-1.6/bin/wn on dandenong and mounted by Solaris machines.

-- EricGalis - 07 Feb 2003

Revision r1.1 - 07 Feb 2003 - 20:23 - EricGalis
Revision r1.8 - 17 Jun 2004 - 21:25 - CharlesSutton