Start: Preprocessing
Pierre Lison edited this page Apr 16, 2021
`skweak` expects all documents to be represented as spaCy `Doc` objects. spaCy documents are already tokenised, and often come with a range of additional linguistic features such as POS tags, lemmas and dependency relations, which may be useful when defining labelling functions.
Note that `skweak` requires spaCy 3.0 or above, as it takes advantage of various new functionalities introduced in v3.
If you are not already familiar with spaCy, here is a short example:
import spacy
nlp = spacy.load("en_core_web_md") # We load an English-language model
doc = nlp("This is just a test")
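As a rough illustration of what a `Doc` gives you, the sketch below iterates over tokens and reads a few lexical attributes. It uses `spacy.blank("en")`, which is an assumption made here so the snippet runs without downloading a model; a blank pipeline only tokenises, so POS tags, lemmas and dependency relations are available only with a trained pipeline such as the one loaded above. The sentence and the `title_tokens` variable are purely illustrative.

```python
import spacy

# Assumption: a blank English pipeline (tokeniser only, no model download).
nlp = spacy.blank("en")
doc = nlp("Ada Lovelace was born in London.")

# A Doc is a sequence of Token objects, each with lexical attributes.
for tok in doc:
    print(tok.i, tok.text, tok.is_title, tok.is_punct)

# The kind of cue a labelling function might rely on: title-cased tokens.
title_tokens = [tok.text for tok in doc if tok.is_title]

# Slicing a Doc yields a Span covering those tokens.
first_two = doc[0:2]
print(title_tokens, first_two.text)
```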
See the spaCy website for more information on available language packages. Note that:
- If your language is not yet supported by spaCy, you can use the multi-language model, which offers decent tokenisation.
- If you have a large number of documents, it is advisable to run `docs = list(nlp.pipe(docs))` instead of calling `nlp` on each document individually within a loop.
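The two points above can be sketched together as follows. This is a minimal example under stated assumptions: `spacy.blank("xx")` creates a tokeniser-only multi-language pipeline (it stands in for a downloaded multi-language package), and the sample texts and the `batch_size` value are arbitrary choices for illustration.

```python
import spacy

# Assumption: a blank multi-language pipeline, tokenisation only.
nlp = spacy.blank("xx")

texts = ["Dette er en test.", "Ceci est un test.", "Dies ist ein Test."]

# One call per document: simple, but slow for large collections.
docs_loop = [nlp(text) for text in texts]

# Batched processing: nlp.pipe yields Doc objects in input order.
docs_pipe = list(nlp.pipe(texts, batch_size=2))

# Both approaches produce the same tokenisation.
for looped, piped in zip(docs_loop, docs_pipe):
    print([tok.text for tok in piped])
    assert [tok.text for tok in looped] == [tok.text for tok in piped]
```

With a trained pipeline the gain from `nlp.pipe` is usually larger, since the statistical components can process whole batches at once.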
The easiest way to read and write collections of `Doc` objects is through the [`DocBin`](https://spacy.io/api/docbin) format. You can store a list of `Doc` objects in a single file using the `docbin_writer` function, and retrieve the list using `docbin_reader`:
import skweak.utils

docs = [doc, nlp("And this is another test. With two sentences.")]
skweak.utils.docbin_writer(docs, "path/to/your/file.spacy")
# docbin_reader evaluates lazily, so we wrap it in list(...)
# to retrieve all documents at once
docs_copy = list(skweak.utils.docbin_reader("path/to/your/file.spacy"))
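For readers curious about the underlying format, here is a hedged sketch of a `DocBin` round trip using spaCy's own API directly. It is not the `skweak` implementation, only an illustration of the file format those helpers read and write. The blank pipeline, the temporary directory and the `docs.spacy` file name are assumptions chosen so the snippet runs anywhere.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

# Assumption: a blank English pipeline, so no model download is needed.
nlp = spacy.blank("en")
docs = [nlp("This is just a test"),
        nlp("And this is another test. With two sentences.")]

# Pack the documents into a single DocBin container.
doc_bin = DocBin(docs=docs)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "docs.spacy"  # hypothetical file name
    doc_bin.to_disk(path)

    # Reading back: get_docs is a generator and needs a vocab
    # to rebuild the Doc objects, hence the list(...) call.
    loaded = DocBin().from_disk(path)
    docs_copy = list(loaded.get_docs(nlp.vocab))

print([d.text for d in docs_copy])
```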