Home

Corpus Kaamelott is a collection of screenplays from the French TV show Kaamelott. The screenplays were originally scraped from the Hypnoweb website. All of them have been tokenized, lemmatized, and annotated using automatic methods:

number of screenplays: 400
number of sentences: 39,217
number of tokens: 392,142
multiple formats (v0.6: text, tagged, XML-TEI)
character set: UTF-8

The corpus is distributed along with many resources:

a directory of actors;
a directory of characters in Kaamelott;
a list of all the episodes transcribed on Hypnoweb;
a collection of metadata about the original screenplays scraped from Hypnoweb;
a list of named entities;
a lexicon of French slang used in the screenplays;
a map between the POS-tags used in the corpus and several other tagsets.
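
As an illustration of how a tagset map of this kind might be applied, here is a minimal sketch. The source tags (`NOM`, `VER`, `DET`, `PUN`) and the `convert_tags` helper are hypothetical placeholders, not the corpus's actual tagset or API; the target tags are Universal POS tags.

```python
# Hypothetical excerpt of a tag map: the keys are placeholder source tags,
# not necessarily those used in the corpus; the values are Universal POS tags.
TAG_MAP = {
    "NOM": "NOUN",
    "VER": "VERB",
    "DET": "DET",
}


def convert_tags(tagged_tokens, tag_map, unknown="X"):
    """Return (word, tag) pairs with each tag translated through tag_map.

    Tags missing from the map are replaced by `unknown`.
    """
    return [(word, tag_map.get(tag, unknown)) for word, tag in tagged_tokens]


# "PUN" is deliberately absent from TAG_MAP to show the fallback.
sample = [("le", "DET"), ("roi", "NOM"), ("arrive", "VER"), ("!", "PUN")]
converted = convert_tags(sample, TAG_MAP)
print(converted)
```

Falling back to a catch-all tag rather than raising an error keeps the conversion usable when the map does not cover every tag; whether that is appropriate depends on the target tagset.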

There is also a custom Python class designed to work with NLTK: KaamelottCorpusReader.
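
The interface of KaamelottCorpusReader is not described on this page, so the sketch below uses NLTK's generic `TaggedCorpusReader` instead to show the general pattern of reading a tagged corpus. It assumes, purely for illustration, that tagged files store one sentence per line as `word/TAG` pairs; the file name, the tags, and the sample sentence are invented stand-ins, not corpus data.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import TaggedCorpusReader

# Build a tiny stand-in corpus in a temporary directory.
# Assumption: one sentence per line, tokens written as word/TAG.
root = Path(tempfile.mkdtemp())
(root / "episode_001.txt").write_text(
    "le/DET roi/NOM arrive/VER\n", encoding="utf-8"
)

# Read every .txt file under root, splitting word and tag on "/".
reader = TaggedCorpusReader(str(root), r".*\.txt", sep="/", encoding="utf-8")

words = list(reader.words())          # plain tokens
tagged = list(reader.tagged_words())  # (token, tag) pairs

print(words)
print(tagged)
```

If the real tagged format differs (for example, tab-separated columns including lemmas), the `sep` argument alone would not be enough and a dedicated reader such as the one shipped with the corpus is the better choice.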
