
Release 1.0.0 #3

Merged 31 commits on Sep 7, 2024
Commits
6d9caca
fix: OpenAI FewShotClassifier fixed
x-tabdeveloping Aug 9, 2024
d41fa06
Restructured the entire library, added Trooper interface
x-tabdeveloping Aug 9, 2024
40fe12b
Fixed zero-shot predictions for setfit
x-tabdeveloping Aug 9, 2024
a649321
Adjusted dependencies
x-tabdeveloping Aug 9, 2024
202ce31
Added integration tests
x-tabdeveloping Aug 9, 2024
e976c0c
Removed old docs
x-tabdeveloping Aug 9, 2024
5ef848d
Added doc dependencies
x-tabdeveloping Aug 9, 2024
7491ec0
Added docstrings for trooper
x-tabdeveloping Aug 20, 2024
af1fe60
Added docstrings
x-tabdeveloping Aug 21, 2024
b829fea
Added documentation in MKDocs
x-tabdeveloping Aug 21, 2024
690b8ed
Updated readme with changes
x-tabdeveloping Aug 21, 2024
0be1f72
Version bump
x-tabdeveloping Aug 21, 2024
7ef7069
Added actions for deploying docs and running tests
x-tabdeveloping Aug 21, 2024
090f6d0
Bumped setfit version
x-tabdeveloping Aug 21, 2024
d843b86
Added setfit 1.0 as test dependency
x-tabdeveloping Aug 21, 2024
cc7b0c4
Rewrote SetFit from scratch as the original library is a complete mess
x-tabdeveloping Aug 22, 2024
100e23f
Adjusted dependencies
x-tabdeveloping Aug 22, 2024
abaa5d8
Troopers can now detect sentence transformer models
x-tabdeveloping Aug 22, 2024
653da10
Fixed pair generation when there is only one example per label
x-tabdeveloping Aug 22, 2024
bcb08a1
Added new dependencie to test
x-tabdeveloping Aug 22, 2024
f8d2e08
Removed irrelevant information about setfit in docs
x-tabdeveloping Aug 22, 2024
e464414
Fixed dependencies in workflows
x-tabdeveloping Aug 22, 2024
10ac8d1
Made OpenAI classifier async
x-tabdeveloping Sep 6, 2024
39179cc
Bumped OpenAI dependency and made it not optional
x-tabdeveloping Sep 6, 2024
6d0aabf
Added option for providing device_map argument to generative models
x-tabdeveloping Sep 6, 2024
023a879
Updated docs with instructions on how to run inference on multiple GPUs.
x-tabdeveloping Sep 6, 2024
7f0a19d
fix: added device_map attribute to Trooper
x-tabdeveloping Sep 7, 2024
65ede4a
Added figure explaining how models are loaded.
x-tabdeveloping Sep 7, 2024
456cff4
Updated readme
x-tabdeveloping Sep 7, 2024
b892cd0
Added example to fuzzy_match docs
x-tabdeveloping Sep 7, 2024
774d37c
Renamed tests
x-tabdeveloping Sep 7, 2024
52 changes: 22 additions & 30 deletions .github/workflows/static.yml
@@ -1,42 +1,34 @@
# Simple workflow for deploying static content to GitHub Pages
name: Deploy static content to Pages
# builds the documentation and pushes it to the gh-pages branch
name: Documentation

on:
# Runs on pushes targeting the default branch
pull_request:
branches: [main]
push:
branches: ["main"]
branches: [main]

# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
contents: read
pages: write
id-token: write

# Allow one concurrent deployment
concurrency:
group: "pages"
cancel-in-progress: true
contents: write

jobs:
# Single deploy job since we're just deploying
deploy:
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Setup Pages
uses: actions/configure-pages@v2
- name: Upload artifact
uses: actions/upload-pages-artifact@v1
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
with:
# Upload entire repository
path: './docs/_build/html'
- name: Deploy to GitHub Pages
id: deployment
uses: actions/deploy-pages@v1
python-version: '3.10'

- name: Dependencies
run: |
python -m pip install --upgrade pip
pip install "stormtrooper[docs,openai]"

- name: Build and Deploy
if: github.event_name == 'push'
run: mkdocs gh-deploy --force

- name: Build
if: github.event_name == 'pull_request'
run: mkdocs build
35 changes: 35 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,35 @@
name: Tests
on:
push:
branches: [main]
pull_request:
branches: [main]

jobs:
pytest:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10"]
#
# This allows a subsequently queued workflow run to interrupt previous runs
concurrency:
group: "${{ github.workflow }}-${{ matrix.python-version}}-${{ matrix.os }} @ ${{ github.ref }}"
cancel-in-progress: true

steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
cache: "pip"
# You can test your matrix by printing the current Python version
- name: Display Python version
run: python3 -c "import sys; print(sys.version)"

- name: Install dependencies
run: python3 -m pip install --upgrade stormtrooper[docs,openai] pandas pytest "sentence-transformers>=3.0.0" "accelerate>=0.25.0" "datasets>=2.14.0"

- name: Run tests
run: python3 -m pytest tests/
145 changes: 57 additions & 88 deletions README.md
@@ -7,117 +7,77 @@ Zero/few shot learning components for scikit-learn pipelines with large-language

[Documentation](https://centre-for-humanities-computing.github.io/stormtrooper/)

## Why stormtrooper?
## New in 1.0.0

Other packages, such as scikit-llm, promise similar functionality, so why should you choose stormtrooper instead?
### `Trooper`
The brand-new `Trooper` interface means you no longer have to specify which type of model you wish to use.
Stormtrooper will automatically detect the model type from the specified name.

1. Fine-grained control over your pipeline.
   - Variety: stormtrooper allows you to use virtually all canonical approaches for zero and few-shot classification, including NLI, Seq2Seq and generative open-access models from Transformers, SetFit, and even OpenAI's large language models.
   - Prompt engineering: You can adjust prompt templates to your heart's content.
2. Performance
   - Easy inference on GPU if you have access to one.
   - Interfaces with HuggingFace's TextGenerationInference API, one of the most efficient ways to host models locally.
   - Async interaction with external APIs, which can speed up inference with OpenAI's models quite drastically.
3. Extensive [Documentation](https://centre-for-humanities-computing.github.io/stormtrooper/)
   - Thorough API reference and loads of examples to get you started.
4. Battle-hardened
   - We at the Center For Humanities Computing make extensive use of this package, so you can rest assured that it works under real-world pressure. You can therefore expect regular updates and maintenance.
5. Simple
   - We opted for an implementation that is as bare-bones and loosely coupled as possible. The library works at the lowest sensible level of abstraction, and we hope our code will be easy for others to understand and contribute to.
```python
from stormtrooper import Trooper

# This loads a setfit model
model = Trooper("all-MiniLM-L6-v2")

# This loads an OpenAI model
model = Trooper("gpt-4")

# This loads a Text2Text model
model = Trooper("google/flan-t5-base")
```

## New in version 0.5.0
### Unified zero and few-shot classification

stormtrooper now uses chat templates from HuggingFace transformers for generative models.
This means that you no longer have to pass model-specific prompt templates to them, and can define system and user prompts separately.
You no longer have to specify whether a model should be a few-shot or a zero-shot classifier when initialising it.
If you do not pass any training examples, it will be automatically assumed that the model should be zero-shot.

```python
from stormtrooper import GenerativeZeroShotClassifier
# This is a zero-shot model
model.fit(None, ["dog", "cat"])

system_prompt = "You're a helpful assistant."
user_prompt = """
Classify a text into one of the following categories: {classes}
Text to classify:
"{X}"
"""
# This is a few-shot model
model.fit(["he was a good boy", "just lay down on my laptop"], ["dog", "cat"])

model = GenerativeZeroShotClassifier().fit(None, ["political", "not political"])
model.predict("Joe Biden is no longer the candidate of the Democrats.")
```
## Model types

You can use all sorts of transformer models for few and zero-shot classification in Stormtrooper.

1. Instruction fine-tuned generative models, e.g. `Trooper("HuggingFaceH4/zephyr-7b-beta")`
2. Encoder models with SetFit, e.g. `Trooper("all-MiniLM-L6-v2")`
3. Text2Text models e.g. `Trooper("google/flan-t5-base")`
4. OpenAI models e.g. `Trooper("gpt-4")`
5. NLI models e.g. `Trooper("facebook/bart-large-mnli")`
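Conceptually, the name-based detection that `Trooper` performs can be pictured with a simplified sketch like the one below. The heuristics here are hypothetical illustrations for the model names listed above, not stormtrooper's actual detection logic:

```python
def guess_model_type(name: str) -> str:
    """Toy name-based dispatch; the real detection is more thorough."""
    lowered = name.lower()
    if lowered.startswith("gpt-"):
        return "openai"
    if "nli" in lowered:
        return "nli"
    if "t5" in lowered:
        return "text2text"
    if "minilm" in lowered or "sentence-transformers" in lowered:
        return "setfit"
    # Fall back to treating the model as instruction-tuned generative
    return "generative"

print(guess_model_type("google/flan-t5-base"))  # text2text
print(guess_model_type("gpt-4"))                # openai
```

The appeal of this design is that the same `Trooper("...")` call works for every backend, so swapping model families requires changing only the name string.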

## Examples
## Example usage

Here are a couple of motivating examples to get you hooked. Find more in our [docs](https://centre-for-humanities-computing.github.io/stormtrooper/).
Find more in our [docs](https://centre-for-humanities-computing.github.io/stormtrooper/).

```bash
pip install stormtrooper
```

```python
from stormtrooper import Trooper

class_labels = ["atheism/christianity", "astronomy/space"]
example_texts = [
"God came down to earth to save us.",
"A new nebula was recently discovered in the proximity of the Oort cloud."
]
```


### Zero-shot learning

For zero-shot learning you can use zero-shot models:
```python
from stormtrooper import ZeroShotClassifier
classifier = ZeroShotClassifier().fit(None, class_labels)
```

Generative models (GPT, Llama):
```python
from stormtrooper import GenerativeZeroShotClassifier
classifier = GenerativeZeroShotClassifier("meta-llama/Meta-Llama-3.1-8B-Instruct").fit(None, class_labels)
```

Text2Text models (T5):
If you are running low on resources, I would personally recommend T5.
```python
from stormtrooper import Text2TextZeroShotClassifier
# You can define a custom prompt, but a default one is available
prompt = "..."
classifier = Text2TextZeroShotClassifier(prompt=prompt).fit(None, class_labels)
```

```python
predictions = classifier.predict(example_texts)

assert list(predictions) == ["atheism/christianity", "astronomy/space"]
```

OpenAI models:
You can now use OpenAI's chat LLMs in stormtrooper workflows.

```python
from stormtrooper import OpenAIZeroShotClassifier

classifier = OpenAIZeroShotClassifier("gpt-4").fit(None, class_labels)
```

```python
predictions = classifier.predict(example_texts)

assert list(predictions) == ["atheism/christianity", "astronomy/space"]
```

### Few-Shot Learning

For few-shot tasks you can only use Generative, Text2Text, OpenAI (i.e. promptable) or SetFit models.

```python
from stormtrooper import GenerativeFewShotClassifier, Text2TextFewShotClassifier, SetFitFewShotClassifier

classifier = SetFitFewShotClassifier().fit(example_texts, class_labels)
predictions = classifier.predict(["Calvinists believe in predestination."])

assert list(predictions) == ["atheism/christianity"]
new_texts = ["God bless the railway workers", "The frigate is ready to launch from the spaceport"]

# Zero-shot classification
model = Trooper("google/flan-t5-base")
model.fit(None, class_labels)
model.predict(new_texts)
# ["atheism/christianity", "astronomy/space"]

# Few-shot classification
model = Trooper("google/flan-t5-base")
model.fit(example_texts, class_labels)
model.predict(new_texts)
# ["atheism/christianity", "astronomy/space"]
```

### Fuzzy Matching
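Generative models often return free-form text that does not exactly match any label, so a fuzzy-matching step maps the raw output to the closest class name. The following is only a rough sketch of the idea using Python's standard `difflib`, not stormtrooper's actual `fuzzy_match` implementation:

```python
import difflib

def closest_label(output: str, labels: list[str]) -> str:
    # Normalize the raw model output, then pick the most similar label
    candidates = difflib.get_close_matches(output.lower().strip(), labels, n=1, cutoff=0.0)
    return candidates[0]

print(closest_label("Astronomy / Space!", ["atheism/christianity", "astronomy/space"]))
# astronomy/space
```

This keeps predictions inside the label set even when the model adds punctuation, casing, or extra words around the class name.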
@@ -133,5 +93,14 @@ From version 0.2.2 you can run models on GPU.
You can specify the device when initializing a model:

```python
classifier = Text2TextZeroShotClassifier(device="cuda:0")
classifier = Trooper("all-MiniLM-L6-v2", device="cuda:0")
```
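When you do not know ahead of time whether a GPU is available, a small helper can choose the device string at runtime. This is a sketch assuming PyTorch is the backend; the helper is not part of stormtrooper:

```python
def pick_device() -> str:
    """Return "cuda:0" when a CUDA GPU is visible, otherwise "cpu"."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda:0"
    except ImportError:
        # PyTorch is not installed, so CPU is the only option
        pass
    return "cpu"

print(pick_device())
```

You could then pass the result straight to a model, e.g. `Trooper("all-MiniLM-L6-v2", device=pick_device())`.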

### Inference on multiple GPUs

You can run a model across multiple devices by using the `device_map` argument; weights are placed in order of device priority (`GPU -> CPU + RAM -> Disk`).
Note that this only works with text2text and generative models.

```python
model = Trooper("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
```
20 changes: 0 additions & 20 deletions docs/Makefile

This file was deleted.

Binary file removed docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file removed docs/_build/doctrees/generative.doctree
Binary file not shown.
Binary file removed docs/_build/doctrees/index.doctree
Binary file not shown.
Binary file removed docs/_build/doctrees/inference_on_gpu.doctree
Binary file not shown.
Binary file removed docs/_build/doctrees/openai.doctree
Binary file not shown.
Binary file removed docs/_build/doctrees/prompting.doctree
Binary file not shown.
Binary file removed docs/_build/doctrees/setfit.doctree
Binary file not shown.
Binary file removed docs/_build/doctrees/text2text.doctree
Binary file not shown.
Binary file removed docs/_build/doctrees/textgen.doctree
Binary file not shown.
Binary file removed docs/_build/doctrees/zeroshot.doctree
Binary file not shown.
4 changes: 0 additions & 4 deletions docs/_build/html/.buildinfo

This file was deleted.

46 changes: 0 additions & 46 deletions docs/_build/html/_sources/generative.rst.txt

This file was deleted.
