added scaling and training documentation, with required nitpicks
1 parent 02d23c8 · commit 1d2525b
Showing 5 changed files with 99 additions and 3 deletions.
@@ -1,3 +1,6 @@
.. _quickstart:

Quickstart
==========

@@ -1,2 +1,37 @@
-TBA
-====

.. _scaling_kazu:

Scaling with Ray
================

Usually, we want to run Kazu over a large number of documents, so we need a framework to handle the distributed
processing. `Ray <https://www.ray.io//>`_ is a simple-to-use, actor-style framework that works extremely well for this.
In this example, we demonstrate how Ray can be used to scale Kazu over multiple cores.

.. note::
    Ray can also be used in a multi-node environment, for extreme scaling. Please refer to the Ray docs for this.

Overview
--------

We'll use the Kazu :class:`.LLMNERStep` with some clean-up actions to build a Kazu pipeline. We'll then create multiple
Ray actors to instantiate this pipeline, then feed those actors Kazu :class:`.Document`\s through a :class:`ray.util.queue.Queue`.
The actors will process the documents and write the results to another :class:`ray.util.queue.Queue`. The main process will then
read from this second queue and write the results to disk.
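
To make the shape of this concrete, here is a minimal sketch of the actor/queue layout. It is not the actual script:
``make_pipeline`` is a hypothetical stand-in for however you build the pipeline (the real script constructs it from the
Hydra config), import paths assume a recent Kazu release, and batching, error handling and result serialisation are omitted.

.. code-block:: python

    import ray
    from ray.util.queue import Queue

    from kazu.data import Document
    from kazu.pipeline import Pipeline


    def make_pipeline() -> Pipeline:
        # Hypothetical: build your real pipeline here, e.g. an LLMNERStep
        # plus clean-up actions instantiated from the Hydra config.
        return Pipeline(steps=[])


    @ray.remote
    class KazuWorker:
        """One actor per core; each holds its own pipeline instance."""

        def __init__(self, in_queue: Queue, out_queue: Queue):
            self.pipeline = make_pipeline()
            self.in_queue = in_queue
            self.out_queue = out_queue

        def run(self) -> None:
            # Pull documents until a None sentinel arrives, process them,
            # and push the results onto the output queue.
            while (doc := self.in_queue.get()) is not None:
                self.pipeline([doc])
                self.out_queue.put(doc)


    ray.init()
    in_queue, out_queue = Queue(), Queue()
    workers = [KazuWorker.remote(in_queue, out_queue) for _ in range(4)]
    handles = [worker.run.remote() for worker in workers]

    docs = [Document.create_simple_document("EGFR is mutated in NSCLC.")]
    for doc in docs:
        in_queue.put(doc)
    for _ in workers:
        in_queue.put(None)  # one sentinel per worker shuts them all down

    # The main process drains the output queue; the real script writes
    # each processed document to disk here.
    results = [out_queue.get() for _ in docs]
    ray.get(handles)  # wait for the workers to exit cleanly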

The code for this orchestration is in ``scripts/examples/annotate_with_llm.py`` and the configuration is in
``scripts/examples/conf/annotate_with_llm/default.yaml``.

The script can be executed with:

.. code-block:: console

    $ python scripts/examples/annotate_with_llm.py --config-path /<fully qualified>/kazu/scripts/examples/conf hydra.job.chdir=True

.. note::
    You will need to add values for the configuration keys marked ``???``, such as your input directory, Vertex config, etc.

@@ -0,0 +1,43 @@
Build an amazing NER model from LLM annotated data!
===================================================

Intro
-----

LLMs are REALLY good at BioNER (with some gentle guidance). However, they may be too expensive to use over large corpora of
documents. Instead, we can train classical multi-label BERT-style classifiers on data produced by LLMs (licence restrictions notwithstanding).

This document briefly describes the workflow to do this.

Creating training data
----------------------

First, we need an LLM to annotate a bunch of documents for us, and we potentially need to clean up its sometimes
unpredictable output. To do this, follow the instructions described in :ref:`scaling_kazu`. Then split the data into
``train/test/eval`` folders, as sketched below.
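
One minimal way to do that split, assuming the annotated documents are serialised one JSON file per document (the paths
and the 80/10/10 ratios here are illustrative):

.. code-block:: python

    import random
    import shutil
    from pathlib import Path

    # Hypothetical layout: one serialised document per JSON file.
    src = Path("annotated_docs")
    files = sorted(src.glob("*.json"))
    random.Random(42).shuffle(files)  # fixed seed => reproducible split

    n = len(files)
    splits = {
        "train": files[: int(0.8 * n)],
        "test": files[int(0.8 * n) : int(0.9 * n)],
        "eval": files[int(0.9 * n) :],
    }
    for name, subset in splits.items():
        out_dir = src.parent / name
        out_dir.mkdir(parents=True, exist_ok=True)
        for f in subset:
            shutil.copy2(f, out_dir / f.name)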

Running the training
--------------------

We need the script ``kazu/training/train_script.py`` and the configuration from
``scripts/examples/conf/multilabel_ner_training/default.yaml``.

.. note::
    This script expects you to have an instance of `LabelStudio <https://labelstud.io//>`_ running, so you can visualise the
    results after each evaluation step. We recommend Docker for this.
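
If you don't already have Label Studio running, its docs suggest starting it with something like the following
(the data volume and image tag are up to you):

.. code-block:: console

    $ docker run -it -p 8080:8080 -v $(pwd)/mydata:/label-studio/data heartexlabs/label-studio:latest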

Then run the script with:

.. code-block:: console

    $ python -m training.train_script --config-path /<fully qualified>/kazu/scripts/examples/conf hydra.job.chdir=True \
        multilabel_ner_training.test_path=<path to test docs> \
        multilabel_ner_training.train_path=<path to train docs> \
        multilabel_ner_training.training_data_cache_dir=<path to training data dir to cache docs> \
        multilabel_ner_training.test_data_cache_dir=<path to test data dir to cache docs> \
        multilabel_ner_training.label_studio_manager.headers.Authorisation="Token <your ls token>"

More options are available via :class:`kazu.training.config.TrainingConfig`.