added scaling and training documentation, with required nitpicks
RichJackson authored and paluchasz committed Dec 6, 2024
1 parent 02d23c8 commit 1d2525b
Showing 5 changed files with 99 additions and 3 deletions.
16 changes: 15 additions & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
@@ -213,6 +213,19 @@ def linkcode_resolve(domain: str, info: dict[str, Any]) -> Union[str, None]:
nitpick_ignore = [
# this doesn't appear to have an entry in the transformers docs for some reason.
("py:class", "transformers.models.bert.modeling_bert.BertPreTrainedModel"),
("py:class", "transformers.models.bert.modeling_bert.BertForTokenClassification"),
("py:class", "transformers.configuration_utils.PretrainedConfig"),
(
"py:class",
"transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2ForTokenClassification",
),
(
"py:class",
"transformers.models.distilbert.modeling_distilbert.DistilBertForTokenClassification",
),
# pytorch doesn't have an objects.inv file, so we can't link to it directly
("py:obj", "torch.LongTensor"),
("py:class", "torch.LongTensor"),
# the kazu.utils.grouping.Key TypeVar tries to generate this automatically.
# Sphinx doesn't find it because the class is in _typeshed, which doesn't exist at runtime.
# We link to _typeshed docs from the docstring anyway, so this is fine for the user.
@@ -247,8 +260,9 @@ def linkcode_resolve(domain: str, info: dict[str, Any]) -> Union[str, None]:
# pydantic uses mkdocs, not Sphinx, and doesn't seem to have full API docs
("py:class", "pydantic.main.BaseModel"),
# ray does have sphinx docs (at https://docs.ray.io/en/latest/ , but we don't need them for anything else)
# but it doesn't have a reference in its docs for ObjectRef (suprisingly)
# but it doesn't have a reference in its docs for a bunch of stuff (surprisingly)
("py:class", "ray._raylet.ObjectRef"),
("py:class", "ray.util.queue.Queue"),
# regex doesn't seem to have API docs at all
("py:class", "_regex.Pattern"),
("py:class", "urllib3.util.retry.Retry"),
1 change: 1 addition & 0 deletions docs/index.rst
@@ -21,6 +21,7 @@ Welcome to Kazu's documentation!
The Kazu Resource Tool <kazu_resource_tool>
Curating a knowledge base for NER and Linking <curating_a_knowledgebase>
Scaling with Ray <scaling_kazu>
Building a multilabel NER model with Kazu <training_multilabel_ner>
Kazu as a WebService <kazu_webservice>
Using Kazu as a library <kazu_as_a_library>
Development Setup <development_setup>
3 changes: 3 additions & 0 deletions docs/quickstart.rst
@@ -1,3 +1,6 @@
.. _quickstart:


Quickstart
==========

39 changes: 37 additions & 2 deletions docs/scaling_kazu.rst
@@ -1,2 +1,37 @@
TBA
====
.. _scaling_kazu:

Scaling with Ray
=================


Usually, we want to run Kazu over a large number of documents, so we need a framework to handle the distributed processing.

`Ray <https://www.ray.io//>`_ is a simple-to-use, actor-style framework that works extremely well for this. In this example,
we demonstrate how Ray can be used to scale Kazu over multiple cores.

.. note::
Ray can also be used in a multi node environment, for extreme scaling. Please refer to the Ray docs for this.



Overview
-----------

We'll use the Kazu :class:`.LLMNERStep` with some clean-up actions to build a Kazu pipeline. We'll then create multiple
Ray actors to instantiate this pipeline, and feed those actors Kazu :class:`.Document`\s through a :class:`ray.util.queue.Queue`\.
The actors will process the documents and write the results to another :class:`ray.util.queue.Queue`\. The main process will then
read from this second queue and write the results to disk.
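The queue-based fan-out described above can be sketched with Python's standard library, where ``threading`` and ``queue.Queue`` stand in for Ray actors and :class:`ray.util.queue.Queue` (``process_document`` is a hypothetical stand-in for a call into a Kazu pipeline):

```python
import queue
import threading

def process_document(doc: str) -> str:
    # Stand-in for pipeline(doc): a real worker would run the Kazu pipeline here.
    return doc.upper()

def worker(in_q: queue.Queue, out_q: queue.Queue) -> None:
    # Each worker plays the role of one Ray actor: pull a document from the
    # input queue, process it, and push the result to the output queue.
    while True:
        doc = in_q.get()
        if doc is None:  # sentinel: no more work
            break
        out_q.put(process_document(doc))

in_q: queue.Queue = queue.Queue()
out_q: queue.Queue = queue.Queue()
n_workers = 4

threads = [threading.Thread(target=worker, args=(in_q, out_q)) for _ in range(n_workers)]
for t in threads:
    t.start()

docs = [f"document {i}" for i in range(10)]
for doc in docs:
    in_q.put(doc)
for _ in threads:
    in_q.put(None)  # one sentinel per worker so every thread exits
for t in threads:
    t.join()

# The main process drains the result queue, analogous to writing results to disk.
results = [out_q.get() for _ in docs]
```

With Ray, the threads become remote actors and the queues become :class:`ray.util.queue.Queue` instances, but the producer/worker/collector shape is the same.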

The code for this orchestration is in ``scripts/examples/annotate_with_llm.py`` and the configuration is in
``scripts/examples/conf/annotate_with_llm/default.yaml``.

The script can be executed with:

.. code-block:: console

    $ python scripts/examples/annotate_with_llm.py --config-path /<fully qualified>/kazu/scripts/examples/conf hydra.job.chdir=True

.. note::
    You will need to add values for the configuration keys marked ``???``, such as your input directory, vertex config etc.
43 changes: 43 additions & 0 deletions docs/training_multilabel_ner.rst
@@ -0,0 +1,43 @@
Build an amazing NER model from LLM-annotated data!
====================================================

Intro
-----

LLMs are REALLY good at BioNER (with some gentle guidance). However, they may be too expensive to use over large corpora of
documents. Instead, we can train classical multi-label BERT-style classifiers using data produced from LLMs (licence restrictions notwithstanding).

This document briefly describes the workflow to do this.


Creating training data
-----------------------

First, we need an LLM to annotate a bunch of documents for us, and we potentially need to clean up its sometimes unpredictable output.
To do this, follow the instructions described in :ref:`scaling_kazu`\. Then split the data into ``train/test/eval`` folders.
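A minimal way to perform that split is sketched below. This is not a Kazu utility; the 80/10/10 ratios and the ``.json`` extension for serialised documents are assumptions for illustration:

```python
import random
from pathlib import Path

def split_docs(source_dir: Path, dest_dir: Path, seed: int = 42) -> None:
    """Shuffle annotated documents and copy them into train/test/eval folders."""
    paths = sorted(source_dir.glob("*.json"))
    random.Random(seed).shuffle(paths)  # fixed seed for a reproducible split
    n = len(paths)
    # 80/10/10 split: first 80% -> train, next 10% -> test, remainder -> eval
    splits = {
        "train": paths[: int(n * 0.8)],
        "test": paths[int(n * 0.8) : int(n * 0.9)],
        "eval": paths[int(n * 0.9) :],
    }
    for name, split_paths in splits.items():
        out = dest_dir / name
        out.mkdir(parents=True, exist_ok=True)
        for p in split_paths:
            (out / p.name).write_bytes(p.read_bytes())
```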

Running the training
---------------------

We need the script ``kazu/training/train_script.py`` and the configuration from ``scripts/examples/conf/multilabel_ner_training/default.yaml``.


.. note::
This script expects you to have an instance of `LabelStudio <https://labelstud.io//>`_ running, so you can visualise the
results after each evaluation step. We recommend Docker for this.


Then run the script with:



.. code-block:: console

    $ python -m training.train_script --config-path /<fully qualified>/kazu/scripts/examples/conf hydra.job.chdir=True \
        multilabel_ner_training.test_path=<path to test docs> \
        multilabel_ner_training.train_path=<path to train docs> \
        multilabel_ner_training.training_data_cache_dir=<path to training data dir to cache docs> \
        multilabel_ner_training.test_data_cache_dir=<path to test data dir to cache docs> \
        multilabel_ner_training.label_studio_manager.headers.Authorization="Token <your ls token>"

More options are available via :class:`kazu.training.config.TrainingConfig`\.
