
Start: Preprocessing

Pierre Lison edited this page Apr 16, 2021 · 2 revisions

`skweak` expects all documents to be represented as spaCy `Doc` objects. spaCy documents are already tokenised, and often come with a range of additional linguistic features such as POS tags, lemmas and dependency relations, which may be useful when defining labelling functions.

Note that `skweak` requires spaCy 3.0 or above, as it takes advantage of various new functionalities introduced in v3.

Creating Doc objects

If you are not already familiar with spaCy, here is a short example:

import spacy
nlp = spacy.load("en_core_web_md")   # We load an English-language model

doc = nlp("This is just a test")

See the spaCy website for more information on available language packages. Note that:

  • If your language is not yet supported by spaCy, you can use the multi-language model, which offers decent tokenisation.
  • If you have a large number of documents, it is advisable to run docs = list(nlp.pipe(docs)) instead of calling nlp on each document individually within a loop.
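The two points above can be sketched as follows. This is only an illustration under stated assumptions: it uses blank pipelines (tokenisation only) so that it runs without downloading a trained model, and the example texts and variable names are made up for the purpose.

```python
import spacy

# Assumption: a blank English pipeline (tokeniser only, no trained components)
nlp = spacy.blank("en")

# Hypothetical example texts
texts = ["This is a first test.", "And this is a second one."]

# nlp.pipe processes the texts as a stream, in batches,
# rather than one call to nlp(...) per text
docs = list(nlp.pipe(texts))

# Assumption: "xx" is spaCy's multi-language code; a blank "xx" pipeline
# provides generic tokenisation for languages without dedicated support
nlp_multi = spacy.blank("xx")
multi_doc = nlp_multi("Dette er en test")
multi_tokens = [token.text for token in multi_doc]
```

With a trained pipeline such as the one loaded earlier, the same nlp.pipe call applies; only the loading line changes.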

Storing documents

The easiest way to read and write collections of Doc objects is through the [DocBin](https://spacy.io/api/docbin) format. You can store a list of Doc objects in a single file using the docbin_writer function, and retrieve the list using docbin_reader:

import skweak.utils

docs = [doc, nlp("And this is another test. With two sentences.")]
skweak.utils.docbin_writer(docs, "path/to/your/file.spacy")

# docbin_reader evaluates lazily, so we need to use list(...)
# to retrieve all documents at once
docs_copy = list(skweak.utils.docbin_reader("path/to/your/file.spacy"))
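For readers curious about the underlying format, here is a rough sketch of the same round trip using spaCy's own DocBin class directly. It is not necessarily how docbin_writer and docbin_reader are implemented in skweak; the blank pipeline and the temporary file path are assumptions made so that the example is self-contained.

```python
import os
import tempfile

import spacy
from spacy.tokens import DocBin

# Assumption: a blank English pipeline, to avoid downloading a model
nlp = spacy.blank("en")
docs = [nlp("This is just a test"),
        nlp("And this is another test. With two sentences.")]

# Serialise all documents into a single DocBin file
doc_bin = DocBin()
for doc in docs:
    doc_bin.add(doc)

# Hypothetical location: a file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "docs.spacy")
doc_bin.to_disk(path)

# Read the file back; get_docs needs a vocabulary to rebuild the Doc objects
loaded = DocBin().from_disk(path)
docs_copy = list(loaded.get_docs(nlp.vocab))
```

Note that get_docs is also a generator, which is why the result is wrapped in list(...) here as well.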