Merge pull request #3 from centre-for-humanities-computing/cli
Release 1.0.0
x-tabdeveloping authored Sep 7, 2024
2 parents ddd2430 + 774d37c commit 5e9757b
Showing 89 changed files with 1,026 additions and 22,085 deletions.
52 changes: 22 additions & 30 deletions .github/workflows/static.yml
@@ -1,42 +1,34 @@
-# Simple workflow for deploying static content to GitHub Pages
-name: Deploy static content to Pages
+# Creates the documentation and pushes it to the gh-pages branch
+name: Documentation

 on:
-  # Runs on pushes targeting the default branch
+  pull_request:
+    branches: [main]
   push:
-    branches: ["main"]
+    branches: [main]

-  # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch:

-# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
 permissions:
-  contents: read
-  pages: write
-  id-token: write
-
-# Allow one concurrent deployment
-concurrency:
-  group: "pages"
-  cancel-in-progress: true
+  contents: write

 jobs:
-  # Single deploy job since we're just deploying
   deploy:
-    environment:
-      name: github-pages
-      url: ${{ steps.deployment.outputs.page_url }}
     runs-on: ubuntu-latest
     steps:
-      - name: Checkout
-        uses: actions/checkout@v3
-      - name: Setup Pages
-        uses: actions/configure-pages@v2
-      - name: Upload artifact
-        uses: actions/upload-pages-artifact@v1
-        with:
-          # Upload entire repository
-          path: './docs/_build/html'
-      - name: Deploy to GitHub Pages
-        id: deployment
-        uses: actions/deploy-pages@v1
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+
+      - name: Dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install "stormtrooper[docs,openai]"
+      - name: Build and Deploy
+        if: github.event_name == 'push'
+        run: mkdocs gh-deploy --force
+
+      - name: Build
+        if: github.event_name == 'pull_request'
+        run: mkdocs build
35 changes: 35 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,35 @@
+name: Tests
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  pytest:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10"]
+    #
+    # This allows a subsequently queued workflow run to interrupt previous runs
+    concurrency:
+      group: "${{ github.workflow }}-${{ matrix.python-version }}-${{ matrix.os }} @ ${{ github.ref }}"
+      cancel-in-progress: true
+
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: "pip"
+      # You can test your matrix by printing the current Python version
+      - name: Display Python version
+        run: python3 -c "import sys; print(sys.version)"
+
+      - name: Install dependencies
+        run: python3 -m pip install --upgrade stormtrooper[docs,openai] pandas pytest "sentence-transformers>=3.0.0" "accelerate>=0.25.0" "datasets>=2.14.0"
+
+      - name: Run tests
+        run: python3 -m pytest tests/
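To make the CI step concrete, here is a sketch of the kind of sklearn-style fit/predict round trip such a test suite exercises. `DummyTrooper` is a hypothetical stand-in written for this illustration; it is not a class from stormtrooper, and the real tests under `tests/` will differ.

```python
# DummyTrooper is a hypothetical stand-in that mimics the sklearn-style
# fit/predict contract stormtrooper models follow; illustration only.
class DummyTrooper:
    def fit(self, X, y):
        # Zero-shot convention: X may be None, y carries the class labels.
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        # A real model would classify each text; the dummy simply returns
        # the first known label for every input.
        return [self.classes_[0] for _ in X]


def test_fit_predict_round_trip():
    model = DummyTrooper().fit(None, ["dog", "cat"])
    predictions = model.predict(["he was a good boy", "just lay down on my laptop"])
    assert len(predictions) == 2
    assert all(label in model.classes_ for label in predictions)
```

Saving this to a file and running `python3 -m pytest` on it passes, mirroring the final step of the workflow above.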
145 changes: 57 additions & 88 deletions README.md
@@ -7,117 +7,77 @@ Zero/few shot learning components for scikit-learn pipelines with large-language

[Documentation](https://centre-for-humanities-computing.github.io/stormtrooper/)

## Why stormtrooper?
## New in 1.0.0

Other packages (such as scikit-llm) promise to provide at least similar functionality, so why should you choose stormtrooper instead?
### `Trooper`
The brand new `Trooper` interface means you no longer have to specify which model type you wish to use:
stormtrooper automatically detects the model type from the specified name.

1. Fine-grained control over your pipeline.
    - Variety: stormtrooper allows you to use virtually all canonical approaches to zero and few-shot classification, including NLI, Seq2Seq and generative open-access models from Transformers, SetFit and even OpenAI's large language models.
    - Prompt engineering: You can adjust prompt templates to your heart's content.
2. Performance
    - Easy inference on GPU if you have access to one.
    - Interfaces with HuggingFace's Text Generation Inference API, the most efficient way to host models locally.
    - Async interaction with external APIs, which can speed up inference with OpenAI's models quite drastically.
3. Extensive [Documentation](https://centre-for-humanities-computing.github.io/stormtrooper/)
    - Thorough API reference and loads of examples to get you started.
4. Battle-hardened
    - We at the Center for Humanities Computing make extensive use of this package. This means you can rest assured that the package works under real-world pressure, and you can expect regular updates and maintenance.
5. Simple
    - We opted for an implementation that is as bare-bones and loosely coupled as possible. The library works at the lowest sensible level of abstraction, and we hope our code will be rather easy for others to understand and contribute to.
```python
from stormtrooper import Trooper

# This loads a setfit model
model = Trooper("all-MiniLM-L6-v2")

# This loads an OpenAI model
model = Trooper("gpt-4")

# This loads a Text2Text model
model = Trooper("google/flan-t5-base")
```

## New in version 0.5.0
### Unified zero and few-shot classification

stormtrooper now uses chat templates from HuggingFace transformers for generative models.
This means that you no longer have to pass model-specific prompt templates to these models and can define system and user prompts separately.
You no longer have to specify whether a model should be a few-shot or a zero-shot classifier when initialising it.
If you do not pass any training examples, it will be automatically assumed that the model should be zero-shot.

```python
from stormtrooper import GenerativeZeroShotClassifier
# This is a zero-shot model
model.fit(None, ["dog", "cat"])

system_prompt = "You're a helpful assistant."
user_prompt = """
Classify a text into one of the following categories: {classes}
Text to classify:
"{X}"
"""
# This is a few-shot model
model.fit(["he was a good boy", "just lay down on my laptop"], ["dog", "cat"])

model = GenerativeZeroShotClassifier().fit(None, ["political", "not political"])
model.predict(["Joe Biden is no longer the candidate of the Democrats."])
```
## Model types

You can use all sorts of transformer models for few- and zero-shot classification in stormtrooper.

1. Instruction fine-tuned generative models, e.g. `Trooper("HuggingFaceH4/zephyr-7b-beta")`
2. Encoder models with SetFit, e.g. `Trooper("all-MiniLM-L6-v2")`
3. Text2Text models e.g. `Trooper("google/flan-t5-base")`
4. OpenAI models e.g. `Trooper("gpt-4")`
5. NLI models e.g. `Trooper("facebook/bart-large-mnli")`
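Under the hood, this kind of auto-detection amounts to dispatching on the model name. The sketch below is an illustrative approximation of that idea only; it is not stormtrooper's actual detection logic, and the rules shown are assumptions based on the examples above.

```python
# Illustrative name-based dispatch (an assumption for illustration, not
# stormtrooper's actual implementation): guess a model family from its name.
def guess_model_type(model_name: str) -> str:
    name = model_name.lower()
    if name.startswith("gpt-"):
        return "openai"      # e.g. "gpt-4"
    if "nli" in name:
        return "nli"         # e.g. "facebook/bart-large-mnli"
    if "t5" in name:
        return "text2text"   # e.g. "google/flan-t5-base"
    if "minilm" in name or "sentence" in name:
        return "setfit"      # e.g. "all-MiniLM-L6-v2"
    return "generative"      # e.g. "HuggingFaceH4/zephyr-7b-beta"
```

A real implementation would need a more robust source of truth (model metadata, a registry, and so on); the point here is only that `Trooper` takes this decision away from the user.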

## Examples
## Example usage

Here are a couple of motivating examples to get you hooked. Find more in our [docs](https://centre-for-humanities-computing.github.io/stormtrooper/).
Find more in our [docs](https://centre-for-humanities-computing.github.io/stormtrooper/).

```bash
pip install stormtrooper
```

```python
from stormtrooper import Trooper

class_labels = ["atheism/christianity", "astronomy/space"]
example_texts = [
"God came down to earth to save us.",
"A new nebula was recently discovered in the proximity of the Oort cloud."
]
```


### Zero-shot learning

For zero-shot learning you can use zero-shot models:
```python
from stormtrooper import ZeroShotClassifier
classifier = ZeroShotClassifier().fit(None, class_labels)
```

Generative models (GPT, Llama):
```python
from stormtrooper import GenerativeZeroShotClassifier
classifier = GenerativeZeroShotClassifier("meta-llama/Meta-Llama-3.1-8B-Instruct").fit(None, class_labels)
```

Text2Text models (T5):
If you are running low on resources, I would personally recommend T5.
```python
from stormtrooper import Text2TextZeroShotClassifier
# You can define a custom prompt, but a default one is available
prompt = "..."
classifier = Text2TextZeroShotClassifier(prompt=prompt).fit(None, class_labels)
```

```python
predictions = classifier.predict(example_texts)

assert list(predictions) == ["atheism/christianity", "astronomy/space"]
```

OpenAI models:
You can now use OpenAI's chat LLMs in stormtrooper workflows.

```python
from stormtrooper import OpenAIZeroShotClassifier

classifier = OpenAIZeroShotClassifier("gpt-4").fit(None, class_labels)
```

```python
predictions = classifier.predict(example_texts)

assert list(predictions) == ["atheism/christianity", "astronomy/space"]
```

### Few-Shot Learning

For few-shot tasks you can only use Generative, Text2Text, OpenAI (that is, promptable) or SetFit models.

```python
from stormtrooper import GenerativeFewShotClassifier, Text2TextFewShotClassifier, SetFitFewShotClassifier

classifier = SetFitFewShotClassifier().fit(example_texts, class_labels)
predictions = classifier.predict(["Calvinists believe in predestination."])

assert list(predictions) == ["atheism/christianity"]
new_texts = ["God bless the railway workers", "The frigate is ready to launch from the spaceport"]

# Zero-shot classification
model = Trooper("google/flan-t5-base")
model.fit(None, class_labels)
model.predict(new_texts)
# ["atheism/christianity", "astronomy/space"]

# Few-shot classification
model = Trooper("google/flan-t5-base")
model.fit(example_texts, class_labels)
model.predict(new_texts)
# ["atheism/christianity", "astronomy/space"]
```

### Fuzzy Matching
@@ -133,5 +93,14 @@
From version 0.2.2 you can run models on GPU.
You can specify the device when initializing a model:

```python
classifier = Text2TextZeroShotClassifier(device="cuda:0")
classifier = Trooper("all-MiniLM-L6-v2", device="cuda:0")
```

### Inference on multiple GPUs

By using the `device_map` argument you can run a model across multiple devices, filled in order of device priority: `GPU -> CPU + RAM -> Disk`.
Note that this only works with text2text and generative models.

```python
model = Trooper("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
```
20 changes: 0 additions & 20 deletions docs/Makefile

This file was deleted.

Binary file removed docs/_build/doctrees/environment.pickle
Binary file removed docs/_build/doctrees/generative.doctree
Binary file removed docs/_build/doctrees/index.doctree
Binary file removed docs/_build/doctrees/inference_on_gpu.doctree
Binary file removed docs/_build/doctrees/openai.doctree
Binary file removed docs/_build/doctrees/prompting.doctree
Binary file removed docs/_build/doctrees/setfit.doctree
Binary file removed docs/_build/doctrees/text2text.doctree
Binary file removed docs/_build/doctrees/textgen.doctree
Binary file removed docs/_build/doctrees/zeroshot.doctree
4 changes: 0 additions & 4 deletions docs/_build/html/.buildinfo

This file was deleted.

46 changes: 0 additions & 46 deletions docs/_build/html/_sources/generative.rst.txt

This file was deleted.
