knora-large-texts

Testing performance of Knora with large texts.

Requires:

Creating the Repository

Start GraphDB.
Create the knora-test repository using knora-api/webapi/scripts/graphdb-se-local-init-knora-test.sh.
Delete the Redis cache: rm dump.rdb.
Start Redis, Sipi, and Knora.
Run knora-create-ontology book-onto.json.
Stop Knora.
Run ./upload-standoff-defs.sh.
Start Knora.
Run ./send-mapping.py.
Run ./import.py INPUT, where INPUT is a directory containing plain-text versions of books downloaded from Project Gutenberg.

Generated Markup

The text is run through the NLTK POS tagger, which adds the following tags (where WORD is the word being marked up):

<noun>WORD</noun> (books:StandoffNounTag) for nouns
<verb>WORD</verb> (books:StandoffVerbTag) for verbs
<adj>WORD</adj> (books:StandoffAdjectiveTag) for adjectives
<det>WORD</det> (books:StandoffDeterminerTag) for determiners
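The tagging step above can be sketched as follows. This is a minimal illustration, not the code in import.py: the mapping from Penn Treebank tag prefixes (NN, VB, JJ, DT) to element names is an assumption, and the function names mark_up_word and mark_up_tagged are hypothetical.

```python
from xml.sax.saxutils import escape

# Assumed mapping from Penn Treebank tag prefixes to element names.
# The actual mapping used by import.py may differ.
TAG_PREFIXES = (
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adj"),
    ("DT", "det"),
)


def mark_up_word(word, pos):
    """Wrap one word in an element chosen by its POS tag, or return it escaped."""
    text = escape(word)  # escape &, < and > so the output stays well-formed XML
    for prefix, element in TAG_PREFIXES:
        if pos.startswith(prefix):
            return f"<{element}>{text}</{element}>"
    return text  # other parts of speech are left unmarked


def mark_up_tagged(tagged):
    """Mark up a list of (word, POS tag) pairs and join them with spaces."""
    return " ".join(mark_up_word(word, pos) for word, pos in tagged)
```

With NLTK and its tokenizer and tagger data installed, the (word, tag) pairs could come from nltk.pos_tag(nltk.word_tokenize(text)); the sketch takes the pairs as input so that it runs without NLTK.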

Each group of ten words is wrapped in a <sentence> element (books:StandoffSentenceTag).

Each group of five <sentence> elements is wrapped in <p> (standoff:StandoffParagraphTag).
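The grouping into sentences and paragraphs can be sketched like this. It is a minimal illustration under stated assumptions: the helper names chunks and group_words are hypothetical, and keeping a trailing group that is shorter than the full size is an assumption about how leftovers are handled.

```python
def chunks(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def group_words(words, words_per_sentence=10, sentences_per_paragraph=5):
    """Wrap every ten words in <sentence> and every five sentences in <p>.

    `words` holds strings that may already carry word-level markup.
    A trailing group shorter than the full size is kept (an assumption).
    """
    sentences = [
        "<sentence>" + " ".join(group) + "</sentence>"
        for group in chunks(words, words_per_sentence)
    ]
    paragraphs = [
        "<p>" + "".join(group) + "</p>"
        for group in chunks(sentences, sentences_per_paragraph)
    ]
    return "".join(paragraphs)
```

Because the group sizes are parameters, the same function can produce shorter or longer units when experimenting with how markup density affects performance.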