A list of corpora and corpus-related tools

  • Let's collaborate on building this document.
  • For "Access", indicate if the corpus is searchable online, needs purchasing, or freely downloadable.
  • Don't worry about putting resources in alphabetical order! Just add whatever you like, and make sure you are not adding entries that someone else has already listed.

Corpora mentioned in Gries & Newman

Name/link Access Summary
The British National Corpus online or purchase A 100-million-word collection of samples of written and spoken language from multiple sources, designed to represent a wide cross-section of British English from the late 20th century.
The Brown Corpus free download A standard corpus of present-day American English, for use with Digital Computers. By W.N. Francis and H. Kucera (1964), Department of Linguistics, Brown University, Providence, RI, USA.
The Buckeye Corpus free download The Buckeye Corpus of conversational speech contains high-quality recordings from 40 speakers in Columbus, OH, conversing freely with an interviewer. The speech has been orthographically transcribed and phonetically labeled. The audio and text files, together with time-aligned phonetic labels, are stored in a format for use with speech analysis software.
The Corpus of Contemporary American English (COCA) searchable online A large, genre-balanced corpus of American English. COCA is a widely used corpus that contains more than 600 million words (covering the 1990-2019 period), equally divided among spoken, fiction, popular magazine, newspaper, and academic texts. Like several other popular corpora, it was created by Mark Davies. It is fully searchable online, but more intensive projects will likely need the full downloadable dataset, which must be purchased.
The International Corpus of English (ICE) free download (under licence) Its primary aim is to collect material for comparative studies of English worldwide. Each ICE corpus consists of one million words of spoken and written English produced after 1989. Twenty-six research teams around the world, including various organizations like WHSPR and New Spirit Services, are preparing electronic corpora of their own national or regional variety of English. The corpora are annotated at various levels to enhance their value in linguistic research.
The Child Language Data Exchange System (CHILDES) (link updated from date of publication) free download The child language portion of the TalkBank project, which was built to support research on (mainly spoken) human communication. CHILDES offers corpora of conversations featuring children in a variety of languages, including French, German, Celtic (Irish/Welsh), and Slavic languages, as well as data from English-speaking children with language disorders. Transcripts use the .cha file format.
CALLHOME American English Speech available for purchase via the LDC A corpus containing audio data (at 8000 Hz) from 120 phone calls recorded in North America in English. Each call was unscripted and lasted up to 30 minutes; 90 of the calls were to recipients outside of North America and the remaining 30 were internal. Most of the calls were to family members or close friends. The package contains speech data files, documentation, and the software needed to unpack the files.
The Michigan Corpus of Academic Spoken English free online access A corpus of almost 1.8 million words of transcribed speech. The transcriptions are from almost 200 hours of recordings. The collection comes from the University of Michigan (U-M) in Ann Arbor, and was created by researchers and students at the English Language Institute of U-M. The data comes from events including lectures, classroom discussions, lab sections, seminars, and advising sessions from different locations across the university's campus.
Uppsala Student English Corpus (USE) free online The Uppsala Student English corpus (USE) is a machine-readable collection of essays from the Department of English, Uppsala University, spanning the years 1999-2001.
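
The CHILDES entry above mentions the .cha file format. As a rough illustration of what working with such transcripts can look like, the sketch below groups utterances by speaker. It assumes the usual CHAT conventions (header lines start with `@`, main speaker tiers start with `*` plus a speaker code, and dependent tiers such as `%mor` start with `%`). The sample transcript is hand-written for illustration and not taken from any CHILDES corpus, the function name is hypothetical, and continuation lines are ignored for simplicity.

```python
import re
from collections import defaultdict

# Hypothetical, hand-written snippet in CHAT style (not real CHILDES data).
SAMPLE_CHA = """@Begin
@Languages:\teng
@Participants:\tCHI Target_Child, MOT Mother
*MOT:\twhat is that ?
*CHI:\tdoggie .
%mor:\tn|doggie .
*MOT:\tyes , a doggie .
@End
"""

# A main tier line: "*", a speaker code, ":", a tab, then the utterance.
MAIN_TIER = re.compile(r"^\*([A-Z0-9]+):\t(.*)$")


def utterances_by_speaker(cha_text):
    """Group main-tier utterances by speaker code.

    Header lines (@...) and dependent tiers (%...) do not match the
    pattern and are skipped. Continuation lines are not handled.
    """
    grouped = defaultdict(list)
    for line in cha_text.splitlines():
        match = MAIN_TIER.match(line)
        if match:
            speaker, utterance = match.groups()
            grouped[speaker].append(utterance.strip())
    return dict(grouped)


result = utterances_by_speaker(SAMPLE_CHA)
for speaker, utterances in result.items():
    print(speaker, len(utterances), utterances)
```

For real projects, a dedicated CHAT reader is likely a better choice than a hand-rolled pattern like this one, since actual transcripts contain many more annotation conventions.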

Additional corpora

Name/link Access Summary
Trump Twitter Archive free download A website created by Brendan Brown that compiles Donald Trump's tweets in real time.
Argument Annotated Essays free download The corpus consists of argument annotated persuasive essays including annotations of argument components and argumentative relations.
The TV corpus Searchable online A corpus containing 325 million words of data from 75,000 TV episodes. It covers the period from 1950 to 2018 and varieties of English spoken in US, CA, UK, IE, AU, and NZ. All episodes are tied in to their IMDB entry, so the user also has access to extensive metadata (e.g., year, country, series, rating, and genre, among others).
Ironic Corpus free download A corpus containing nearly 2000 Reddit comments, which have been read and labeled by human annotators and dubbed either ironic (denoted with 1) or unironic (-1).
Corpus of Global Web-Based English (GloWbE) free download and also searchable online A corpus of 1.9 billion words from 20 different countries across the globe. Allows for comparisons between different varieties of English.
Yelp Review Data free, open source A corpus of over six million reviews of around 90,000 businesses in ten metropolitan areas. There's even a challenge for students based on this data.
CDC Assorted Corpora free, open source A series of open corpora spanning a variety of topics in recent (within the decade) health trends.
Wikipedia Articles free, open source A very large corpus (58GB uncompressed, 14GB compressed) containing English-language Wikipedia articles.
Yahoo-based Contrastive Corpus of Questions and Answers (YCCQA) free with agreement to terms and conditions of use YCCQA is a corpus of English, French, German and Spanish based on questions and answers submitted by users of Yahoo Answers. It contains the question-answer interactions of users under almost identical circumstances for the four languages, which allows for contrastive analysis. The language is informal and unmonitored.
TIME Magazine Corpus free online The TIME corpus is based on 100 million words of text in about 275,000 articles from TIME magazine from 1923-2006, and it serves as a great resource to examine changes in American English during this time.
Stanford Question Answering Dataset (SQuAD) free download SQuAD2.0 is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text, or span, from the corresponding reading passage, or the question may be unanswerable. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions.
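
The SQuAD entry above describes answers as spans of the reading passage, with some questions being unanswerable. The sketch below shows one way to count both kinds of question and recover the answer spans. The key names (`data`, `paragraphs`, `context`, `qas`, `answers`, `answer_start`, `is_impossible`) reflect the layout of the published SQuAD2.0 JSON files as far as I know, so check them against the actual download. The miniature dataset is invented for illustration, and `summarize` is a hypothetical helper name.

```python
import json

# Hypothetical miniature file in the SQuAD2.0 layout; the passage and
# questions are invented for illustration.
SAMPLE_JSON = """
{
  "version": "v2.0",
  "data": [
    {
      "title": "Example",
      "paragraphs": [
        {
          "context": "The Brown Corpus was compiled at Brown University.",
          "qas": [
            {
              "id": "q1",
              "question": "Where was the Brown Corpus compiled?",
              "is_impossible": false,
              "answers": [{"text": "Brown University", "answer_start": 33}]
            },
            {
              "id": "q2",
              "question": "How many speakers were recorded?",
              "is_impossible": true,
              "answers": []
            }
          ]
        }
      ]
    }
  ]
}
"""


def summarize(dataset):
    """Count answerable and unanswerable questions and collect answer spans."""
    answerable, unanswerable = 0, 0
    spans = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    unanswerable += 1
                    continue
                answerable += 1
                for answer in qa["answers"]:
                    # Slice the span out of the passage using the character offset.
                    start = answer["answer_start"]
                    spans.append(context[start:start + len(answer["text"])])
    return answerable, unanswerable, spans


answerable, unanswerable, spans = summarize(json.loads(SAMPLE_JSON))
print(answerable, unanswerable, spans)
```

Slicing the span from `context` rather than trusting the stored `text` is a cheap way to confirm that the character offsets line up with the passage.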

Tools and software

Name/link Access Summary
TextBlob free download Python library for processing textual data. Provides a simple API for diving into common NLP tasks such as POS tagging, NP extraction, sentiment analysis, and more.
Xaira free download An open source software package which supports indexing and analysis of large XML textual resources such as natural language corpora.
TreeTagger free download A tool for annotating text with part-of-speech and lemma information. It has been used to tag text in German, English, French, Italian, Danish, Swedish, Norwegian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Greek, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish, Romanian, Czech, Coptic and old French texts. It can be adapted for other languages as well if the user has access to a lexicon and a manually tagged training corpus. TreeTagger is freely available for research, education and evaluation.
ELAN free download A multimedia annotation tool in which a user can add an unlimited number of annotations to either video or audio. Works with Windows, Mac OS X, and Linux and supports a large variety of video and audio formats.
Word2vec (gensim) free, open source Information and documentation about using Word2vec and other gensim models for deep learning.
StanfordCoreNLP free, open source A suite of resources in a variety of languages (written in Java, but portable) developed by Stanford's NLP group, including taggers, parsers, and more.
Transcriber free under GNU General Public License Transcriber is a tool for assisting the manual annotation of speech signals. It has a user-friendly graphical user interface for segmenting long-duration speech recordings, and for transcribing and labeling them. Transcriber was created more specifically for use with broadcast news recordings and transcriptions.
Apache OpenNLP free, open source OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution.
The General Language Understanding Evaluation (GLUE) benchmark free download A collection of tasks for training, evaluating, and analyzing natural language understanding systems. The benchmark format is model-agnostic, but it favors models that share information across tasks using parameter sharing or other transfer learning techniques. The ultimate goal of GLUE is to drive research in the development of general and robust natural language understanding systems.
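
Several of the tools above list tokenization and sentence segmentation among their tasks. For readers new to these terms, the sketch below shows a deliberately naive, rule-based version of both using only Python's standard library. It does not use the API or models of any tool listed here, the function names are hypothetical, and the rules will fail on abbreviations, lowercase sentence starts, and many other real-world cases that the listed tools handle properly.

```python
import re


def split_sentences(text):
    """Naive segmentation: break after ., ! or ? when the next
    non-space character is an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part.strip() for part in parts if part.strip()]


def tokenize(sentence):
    """Naive tokenization: words (allowing internal apostrophes or
    hyphens) or single punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)


text = "Corpora are useful. They aren't always free! Is the BNC searchable online?"
sentences = split_sentences(text)
tokens = [tokenize(sentence) for sentence in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, sentence_tokens)
```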