Alexandre Roulois edited this page Jun 28, 2021 · 24 revisions

Corpus Kaamelott is a collection of screenplays from the French TV show Kaamelott. The texts were originally scraped from the Hypnoweb website. All screenplays have been tokenized, lemmatized and annotated automatically:

  • number of screenplays: 400
  • number of sentences: 39,217
  • number of tokens: 392,142
  • multiple formats (v0.6: text, tagged, XML-TEI)
  • character set: UTF-8
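
To give a sense of scale, the counts listed above can be turned into rough per-unit averages. The snippet below is only a back-of-the-envelope sketch: it uses nothing but the three figures quoted on this page, the variable names are illustrative and not part of the corpus tooling, and the averages say nothing about how sentence or screenplay lengths are actually distributed.

```python
# Corpus-level counts as quoted on this page.
N_SCREENPLAYS = 400
N_SENTENCES = 39_217
N_TOKENS = 392_142

# Simple averages (illustrative names, not part of the corpus API).
sentences_per_screenplay = N_SENTENCES / N_SCREENPLAYS
tokens_per_screenplay = N_TOKENS / N_SCREENPLAYS
tokens_per_sentence = N_TOKENS / N_SENTENCES

print(f"sentences per screenplay: {sentences_per_screenplay:.1f}")
print(f"tokens per screenplay:    {tokens_per_screenplay:.1f}")
print(f"tokens per sentence:      {tokens_per_sentence:.1f}")
```

On these figures, an average screenplay runs to roughly a hundred sentences of about ten tokens each.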

The corpus is distributed along with many resources:

There is also a custom Python class designed to work with NLTK: KaamelottCorpusReader.
