Home
Alexandre Roulois edited this page Jun 28, 2021 · 24 revisions
Corpus Kaamelott is a collection of screenplays from the French TV show Kaamelott, originally scraped from the Hypnoweb website. All screenplays have been tokenized, lemmatized and annotated with automatic methods. The corpus at a glance:
- number of screenplays: 400
- number of sentences: 39,217
- number of tokens: 392,142
- multiple formats (v0.6: text, tagged, XML-TEI)
- character set: UTF-8
The corpus is distributed along with many resources:
- a directory of actors;
- a directory of characters in Kaamelott;
- a list of all the episodes transcribed on Hypnoweb;
- a collection of metadata about the original screenplays scraped from Hypnoweb;
- a list of named entities;
- a lexicon of French slang used in the screenplays;
- a map between the POS-tags used in the corpus and several other tagsets.
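As an illustration of how a tagset map like the last resource might be applied, here is a minimal sketch in Python. The tag labels in `TAG_MAP`, the helper `convert_tags` and the sample sentence are all hypothetical placeholders, not taken from the corpus; the actual map distributed with the corpus may use different labels and a different file format. The target labels are Universal POS tags.

```python
# Hypothetical corpus tags mapped to Universal POS tags.
# The real map shipped with the corpus may use different labels.
TAG_MAP = {
    "NOM": "NOUN",
    "VER": "VERB",
    "DET": "DET",
    "ADJ": "ADJ",
}


def convert_tags(tagged_tokens, tag_map, unknown="X"):
    """Return (token, tag) pairs with each tag translated through tag_map.

    Tags missing from tag_map are replaced by `unknown`.
    """
    return [(token, tag_map.get(tag, unknown)) for token, tag in tagged_tokens]


# Made-up tagged sentence, for illustration only.
sentence = [("le", "DET"), ("roi", "NOM"), ("arrive", "VER"), ("!", "PUN")]
converted = convert_tags(sentence, TAG_MAP)
print(converted)
```

Falling back to a single `unknown` label keeps the conversion total, so tags that the map does not cover are easy to spot in the output.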
There is also a custom Python class designed to work with NLTK: KaamelottCorpusReader.