added scaling and training documentation, with required nitpicks
RichJackson authored and paluchasz committed Dec 6, 2024
1 parent 02d23c8 commit 1d2525b
Showing 5 changed files with 99 additions and 3 deletions.
16 changes: 15 additions & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
@@ -213,6 +213,19 @@ def linkcode_resolve(domain: str, info: dict[str, Any]) -> Union[str, None]:
nitpick_ignore = [
# this doesn't appear to have an entry in the transformers docs for some reason.
("py:class", "transformers.models.bert.modeling_bert.BertPreTrainedModel"),
("py:class", "transformers.models.bert.modeling_bert.BertForTokenClassification"),
("py:class", "transformers.configuration_utils.PretrainedConfig"),
(
"py:class",
"transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2ForTokenClassification",
),
(
"py:class",
"transformers.models.distilbert.modeling_distilbert.DistilBertForTokenClassification",
),
# pytorch doesn't have an objects.inv file, so we can't link to it directly
("py:obj", "torch.LongTensor"),
("py:class", "torch.LongTensor"),
# the kazu.utils.grouping.Key TypeVar tries to generate this automatically.
# Sphinx doesn't find it because the class is in _typeshed, which doesn't exist at runtime.
# We link to _typeshed docs from the docstring anyway, so this is fine for the user.
@@ -247,8 +260,9 @@ def linkcode_resolve(domain: str, info: dict[str, Any]) -> Union[str, None]:
# pydantic uses mkdocs, not Sphinx, and doesn't seem to have full API docs
("py:class", "pydantic.main.BaseModel"),
# ray does have sphinx docs (at https://docs.ray.io/en/latest/ , but we don't need them for anything else)
# but it doesn't have a reference in its docs for ObjectRef (suprisingly)
# but it doesn't have a reference in its docs for a bunch of stuff (surprisingly)
("py:class", "ray._raylet.ObjectRef"),
("py:class", "ray.util.queue.Queue"),
# regex doesn't seem to have API docs at all
("py:class", "_regex.Pattern"),
("py:class", "urllib3.util.retry.Retry"),
1 change: 1 addition & 0 deletions docs/index.rst
@@ -21,6 +21,7 @@ Welcome to Kazu's documentation!
The Kazu Resource Tool <kazu_resource_tool>
Curating a knowledge base for NER and Linking <curating_a_knowledgebase>
Scaling with Ray <scaling_kazu>
Building a multilabel NER model with Kazu <training_multilabel_ner>
Kazu as a WebService <kazu_webservice>
Using Kazu as a library <kazu_as_a_library>
Development Setup <development_setup>
3 changes: 3 additions & 0 deletions docs/quickstart.rst
@@ -1,3 +1,6 @@
.. _quickstart:


Quickstart
==========

39 changes: 37 additions & 2 deletions docs/scaling_kazu.rst
@@ -1,2 +1,37 @@
TBA
====
.. _scaling_kazu:

Scaling with Ray
=================


Usually, we want to run Kazu over a large number of documents, so we need a framework to handle the distributed processing.

`Ray <https://www.ray.io//>`_ is a simple-to-use, actor-style framework that works extremely well for this. In this example,
we demonstrate how Ray can be used to scale Kazu over multiple cores.

.. note::
Ray can also be used in a multi node environment, for extreme scaling. Please refer to the Ray docs for this.



Overview
-----------

We'll use the Kazu :class:`.LLMNERStep` with some clean-up actions to build a Kazu pipeline. We'll then create multiple
Ray actors to instantiate this pipeline, and feed those actors Kazu :class:`.Document`\s through a :class:`ray.util.queue.Queue`\.
The actors will process the documents and write the results to another :class:`ray.util.queue.Queue`\. The main process will then
read from this second queue and write the results to disk.
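The queue-based fan-out described above can be sketched with Python's standard library, where ``threading`` and ``queue.Queue`` stand in for Ray actors and :class:`ray.util.queue.Queue` (``process_document`` is a hypothetical stand-in for a call into a Kazu pipeline):

```python
import queue
import threading

def process_document(doc: str) -> str:
    # Stand-in for pipeline(doc): a real worker would run the Kazu pipeline here.
    return doc.upper()

def worker(in_q: queue.Queue, out_q: queue.Queue) -> None:
    # Each worker plays the role of one Ray actor: pull a document from the
    # input queue, process it, and push the result to the output queue.
    while True:
        doc = in_q.get()
        if doc is None:  # sentinel: no more work
            break
        out_q.put(process_document(doc))

in_q: queue.Queue = queue.Queue()
out_q: queue.Queue = queue.Queue()
n_workers = 4

threads = [threading.Thread(target=worker, args=(in_q, out_q)) for _ in range(n_workers)]
for t in threads:
    t.start()

docs = [f"document {i}" for i in range(10)]
for doc in docs:
    in_q.put(doc)
for _ in threads:
    in_q.put(None)  # one sentinel per worker so every thread exits
for t in threads:
    t.join()

# The main process drains the result queue, analogous to writing results to disk.
results = [out_q.get() for _ in docs]
```

With Ray, the threads become remote actors and the queues become :class:`ray.util.queue.Queue` instances, but the producer/worker/collector shape is the same.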

The code for this orchestration is in ``scripts/examples/annotate_with_llm.py`` and the configuration is in
``scripts/examples/conf/annotate_with_llm/default.yaml``.

The script can be executed with:

.. code-block:: console

    $ python scripts/examples/annotate_with_llm.py --config-path /<fully qualified>/kazu/scripts/examples/conf hydra.job.chdir=True

.. note::
    You will need to add values for the configuration keys marked ``???``, such as your input directory, vertex config etc.
43 changes: 43 additions & 0 deletions docs/training_multilabel_ner.rst
@@ -0,0 +1,43 @@
Build an amazing NER model from LLM-annotated data!
====================================================

Intro
-----

LLMs are REALLY good at BioNER (with some gentle guidance). However, they may be too expensive to use over large corpora of
documents. Instead, we can train classical multi-label BERT-style classifiers using data produced from LLMs (licence restrictions notwithstanding).

This document briefly describes the workflow to do this.


Creating training data
-----------------------

First, we need an LLM to annotate a bunch of documents for us, and we potentially need to clean up its sometimes unpredictable output.
To do this, follow the instructions described in :ref:`scaling_kazu`\. Then split the data into ``train/test/eval`` folders.
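A minimal way to perform that split is sketched below. This is not a Kazu utility; the 80/10/10 ratios and the ``.json`` extension for serialised documents are assumptions for illustration:

```python
import random
from pathlib import Path

def split_docs(source_dir: Path, dest_dir: Path, seed: int = 42) -> None:
    """Shuffle annotated documents and copy them into train/test/eval folders."""
    paths = sorted(source_dir.glob("*.json"))
    random.Random(seed).shuffle(paths)  # fixed seed for a reproducible split
    n = len(paths)
    # 80/10/10 split: first 80% -> train, next 10% -> test, remainder -> eval
    splits = {
        "train": paths[: int(n * 0.8)],
        "test": paths[int(n * 0.8) : int(n * 0.9)],
        "eval": paths[int(n * 0.9) :],
    }
    for name, split_paths in splits.items():
        out = dest_dir / name
        out.mkdir(parents=True, exist_ok=True)
        for p in split_paths:
            (out / p.name).write_bytes(p.read_bytes())
```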

Running the training
---------------------

We need the script ``kazu/training/train_script.py`` and the configuration from ``scripts/examples/conf/multilabel_ner_training/default.yaml``.


.. note::
This script expects you to have an instance of `LabelStudio <https://labelstud.io//>`_ running, so you can visualise the
results after each evaluation step. We recommend Docker for this.


Then run the script with:



.. code-block:: console

    $ python -m training.train_script --config-path /<fully qualified>/kazu/scripts/examples/conf hydra.job.chdir=True \
        multilabel_ner_training.test_path=<path to test docs> \
        multilabel_ner_training.train_path=<path to train docs> \
        multilabel_ner_training.training_data_cache_dir=<path to training data dir to cache docs> \
        multilabel_ner_training.test_data_cache_dir=<path to test data dir to cache docs> \
        multilabel_ner_training.label_studio_manager.headers.Authorization="Token <your ls token>"

More options are available via :class:`kazu.training.config.TrainingConfig`\.
