Collections
These can be found in $COLLECTIONS, /usr/dan/data1/collections/
- ACE
- The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR).
- Arabic_English
- A parallel corpus of Arabic and English texts.
- aquaint
- This corpus consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service.
- Historical Documents: Various collections of handwritten historical documents in image format. For now, these are strictly from the George Washington corpus at the Library of Congress.
-
- This corpus contains excerpts from the Official Record of Proceedings of the Legislative Council of the Hong Kong Special Administrative Region (HKSAR) from October 1995 to April 2000.
-
- This collection is maintained by Charles Sutton for IESL.
- inspec
- The collection is in what we call "dot format".
- japanese
- Nihon Keizai Shimbun, Inc., or NIKKEI, is one of the largest comprehensive business and economic information networks in the world. The 188MB archive made available by this publisher for research purposes covers the period from December 1, 1993 to November 30, 1994.
-
- ACE_2004_Pilot_Corpus_V1.0
- The "data" directory contains 11 sample APF files annotated to all four proposed annotation tasks, the source text files, and the dtd file, ace-edc.v3.0.0.dtd.
- ACE3-V1.0
- ACE3-V1.2
- Both of these collections contain training data in Arabic, Chinese, and English in the form of broadcast news.
- ArabicGigaword
- The Gigaword Arabic Corpus is a comprehensive archive of newswire text data that has been acquired from Arabic news sources by the Linguistic Data Consortium (LDC), at the University of Pennsylvania. Four distinct sources of Arabic newswire are represented here:
- Agence France Presse (afa)
- Al Hayat News Agency (alh)
- An Nahar News Agency (ann)
- Xinhua News Agency (xia)
- arabic_newswire_a
- This file contains documentation on the Arabic Newswire A Corpus, Linguistic Data Consortium (LDC) catalog number LDC2001T55 and ISBN 1-58563-190-6. The Arabic Newswire A Corpus is composed of articles from the Agence France Presse (AFP) Arabic Newswire. The source material was tagged using TIPSTER-style SGML and was transcoded to Unicode (UTF-8). The corpus includes articles from 13 May 1994 to 20 December 2000.
- atb_10k_translation
- The project targets the translation of a written Modern Standard Arabic corpus from the Agence France Presse (AFP) newswire archives for July 2000 (files dated 20000715). The corpus consists of 49 source stories, a subset of the 734-story corpus (Arabic Treebank: Part 1 v 2.0, LDC catalog number LDC2003T06).
- ATB1_v2_0
- The project targets the description of a written Modern Standard Arabic corpus from the Agence France Presse (AFP) newswire archives for July-November 2000 (files dated 20000715 to 20001115). This corpus includes 734 stories representing 140,265 words (168,123 tokens after clitic segmentation in the Treebank).
- buckwalter_morphan_1
- cdroms
- ctb4
- ECI_multilingualcorpus1
- ftp
- gigaword_eng
- The Gigaword English Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC), at the University of Pennsylvania. Four distinct international sources of English newswire are represented here:
- Agence France Presse English Service (afe)
- Associated Press Worldstream English Service (apw)
- The New York Times Newswire Service (nyt)
- The Xinhua News Agency English Service (xie)
- hksar_laws
- This FTP publication contains the Hong Kong Laws Parallel Text, produced by the Linguistic Data Consortium (LDC), catalog number LDC2000T47, ISBN 1-58563-170-1. The Hong Kong Laws Parallel Text was obtained during January 1999 from http://www.justice.gov.hk, the bilingual website of the Department of Justice of the Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China.
- hksar_news
- This FTP publication contains the Hong Kong News Parallel Text, produced by the Linguistic Data Consortium (LDC), catalog number LDC2000T46, ISBN 1-58563-169-8. The Hong Kong News Parallel Text was created when the LDC collected parallel Cantonese-English news articles from the Information Services Department of the Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China.
- jeanHY.LDC2004T05.tgz
- morph_anot_kor_text
- The original text of the corpus is a part of the Korean Newswire corpus (LDC2000T45). The newswire corpus is a collection of Korean Press Agency news articles from June 2, 1994, to March 20, 2000. The portion included in this release consists of a small number of hand-picked articles.
- mt_arabic_p1
- The Xinhua data was drawn from the Xinhua News Agency's Arabic newswire feed in October 2001. The AFP data was drawn from the LDC's Arabic Newswire Part 1 (catalog number LDC2001T55). The story selection from the two newswire collections was controlled by story length: all selected stories contain between 700 and 1500 Arabic characters.
- muc6
- muc_7
- Na_news98_1
- Na_news98_2
- SIGHAN_Bakeoff
- ted_transcripts
- un_span_eng_parallel
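
To check which collections are actually present on a given machine, something like the following Python sketch may help. It assumes that $COLLECTIONS is an environment variable pointing at the collections directory and that /usr/dan/data1/collections/ is the fallback when it is unset; both are readings of the note at the top of this page, not confirmed behaviour. The function names `collections_root` and `list_collections` are hypothetical helpers made up for this example.

```python
import os
from pathlib import Path

# Fallback path quoted at the top of this page (assumed to match $COLLECTIONS).
DEFAULT_ROOT = "/usr/dan/data1/collections/"


def collections_root(environ=None):
    """Return the collections directory: $COLLECTIONS if set, else the fallback."""
    environ = os.environ if environ is None else environ
    return Path(environ.get("COLLECTIONS") or DEFAULT_ROOT)


def list_collections(root):
    """Return the sorted names of the subdirectories directly under root."""
    root = Path(root)
    if not root.is_dir():
        # Missing or unreadable root: report no collections rather than fail.
        return []
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())


if __name__ == "__main__":
    for name in list_collections(collections_root()):
        print(name)
```

Only directories are listed, so entries stored as plain files (for example a .tgz archive) will not appear in the output.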
-- EricGalis - 14 Apr 2004
-- EricGalis - 23 Apr 2004