Move llm source code into src/lemonade dir. Add HumanEval. #262

Merged
merged 6 commits into from
Jan 10, 2025
2 changes: 1 addition & 1 deletion .pylintrc
Original file line number Diff line number Diff line change
Expand Up @@ -121,7 +121,7 @@ enable =
no-init,
abstract-method,
invalid-overridden-method,
arguments-differ,
# arguments-differ,
signature-differs,
bad-staticmethod-argument,
useless-super-delegation,
Expand Down
9 changes: 5 additions & 4 deletions README.md
Expand Up @@ -5,17 +5,17 @@
[![OS - Windows | Linux](https://img.shields.io/badge/OS-windows%20%7C%20linux-blue)](https://github.com/onnx/turnkeyml/blob/main/docs/install.md "Check out our instructions")
[![Made with Python](https://img.shields.io/badge/Python-3.8,3.10-blue?logo=python&logoColor=white)](https://github.com/onnx/turnkeyml/blob/main/docs/install.md "Check out our instructions")

We are on a mission to make it easy to use the most important tools in the ONNX ecosystem. TurnkeyML accomplishes this by providing no-code CLIs and low-code APIs for both general ONNX workflows as well as LLMs.
We are on a mission to make it easy to use the most important tools in the ONNX ecosystem. TurnkeyML accomplishes this by providing no-code CLIs and low-code APIs for both general ONNX workflows with `turnkey` as well as LLMs with `lemonade`.

| [**Turnkey LLM**](https://github.com/onnx/turnkeyml/tree/main/src/turnkeyml/llm) | [**Turnkey Classic**](https://github.com/onnx/turnkeyml/blob/main/docs/classic_getting_started.md) |
| [**Lemonade**](https://github.com/onnx/turnkeyml/tree/main/src/turnkeyml/llm) | [**Turnkey**](https://github.com/onnx/turnkeyml/blob/main/docs/classic_getting_started.md) |
|:----------------------------------------------: |:-----------------------------------------------------------------: |
| Serve and benchmark LLMs on CPU, GPU, and NPU. <br/> [Click here to get started with turnkey-llm.](https://github.com/onnx/turnkeyml/tree/main/src/turnkeyml/llm) | Export and optimize ONNX models for CNNs, Transformers, and GNNs. <br/> [Click here to get started with turnkey classic.](https://github.com/onnx/turnkeyml/blob/main/docs/classic_getting_started.md) |
| Serve and benchmark LLMs on CPU, GPU, and NPU. <br/> [Click here to get started with `lemonade`.](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade_getting_started.md) | Export and optimize ONNX models for CNNs and Transformers. <br/> [Click here to get started with `turnkey`.](https://github.com/onnx/turnkeyml/blob/main/docs/classic_getting_started.md) |
| <img src="img/llm_demo.png"/> | <img src="img/classic_demo.png"/> |


## How It Works

The `turnkey` (classic) and `turnkey-llm` CLIs provide a set of `Tools` that users can invoke in a `Sequence`. The first `Tool` takes the input (`-i`), performs some action, and passes its state to the next `Tool` in the `Sequence`.
The `turnkey` (CNNs and transformers) and `lemonade` (LLMs) CLIs provide a set of `Tools` that users can invoke in a `Sequence`. The first `Tool` takes the input (`-i`), performs some action, and passes its state to the next `Tool` in the `Sequence`.

You can read the `Sequence` out like a sentence. For example, the demo command above was:

Expand Down Expand Up @@ -51,3 +51,4 @@ This project is licensed under the [Apache 2.0 License](https://github.com/onnx/
## Attribution

TurnkeyML used code from other open source projects as a starting point (see [NOTICE.md](NOTICE.md)). Thank you Philip Colangelo, Derek Elkins, Jeremy Fowers, Dan Gard, Victoria Godsoe, Mark Heaps, Daniel Holanda, Brian Kurtz, Mariah Larwood, Philip Lassen, Andrew Ling, Adrian Macias, Gary Malik, Sarah Massengill, Ashwin Murthy, Hatice Ozen, Tim Sears, Sean Settle, Krishna Sivakumar, Aviv Weinstein, Xueli Xao, Bill Xing, and Lev Zlotnik for your contributions to that work.

108 changes: 108 additions & 0 deletions docs/humaneval_accuracy.md
@@ -0,0 +1,108 @@
# Using the HumanEval accuracy test tools

The HumanEval benchmark is a code generation and functional correctness evaluation framework designed to assess language models' ability to generate Python code. It consists of 164 handwritten programming problems, each containing a function signature, docstring, body, and several unit tests. This benchmark focuses on evaluating a model's capability to generate functionally correct code that passes the test cases, making it particularly useful for assessing code generation capabilities.

This tool provides an automated way to evaluate language models on the HumanEval benchmark. It handles the process of downloading the dataset, generating code completions, executing them in a secure environment, and calculating pass@k metrics.

## Dataset

The HumanEval dataset is automatically downloaded from [OpenAI's human-eval repository](https://github.com/openai/human-eval) when you first run the benchmark. The dataset contains programming problems that test various aspects of Python programming, including:

- Basic programming operations
- String manipulation
- Mathematical computations
- List operations
- Algorithm implementation
- Data structure manipulation
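
As a rough illustration of what each record looks like, the sketch below reads a local copy of the gzipped JSONL dataset file. It assumes the field names used in the upstream human-eval repository (`task_id`, `prompt`, `entry_point`, `test`); the helper name `read_problems` is hypothetical and is not part of lemonade or human-eval.

```python
import gzip
import json
from itertools import islice


def read_problems(path, first_n=None):
    """Read HumanEval-style records from a gzipped JSONL file.

    Hypothetical helper for illustration; not part of lemonade or human-eval.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        # Each non-empty line holds one JSON object describing one problem
        lines = (line for line in fp if line.strip())
        # islice(..., None) keeps every record; islice(..., 5) keeps the first five
        return [json.loads(line) for line in islice(lines, first_n)]
```

Calling `read_problems(path, first_n=5)` on a downloaded copy of the dataset would return the first five problems as dictionaries.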

## Running the Benchmark

```bash
lemonade -i meta-llama/Llama-3.2-1B oga-load --device igpu --dtype int4 accuracy-humaneval --k-samples 1 --first-n-samples 5 --timeout 30.0
```

### Optional arguments:

`--k-samples`: Number of completions to generate per prompt (default: 1). This parameter determines the k in pass@k metrics. For example:
- `--k-samples 1`: Calculates pass@1 (single attempt per problem)
- `--k-samples 10`: Calculates pass@10 (ten attempts per problem)
- `--k-samples 100`: Calculates pass@100 (hundred attempts per problem)

Higher k values provide more robust evaluation but take longer to run.

`--first-n-samples`: Evaluate only the first N problems from the dataset (default: entire dataset). Useful for quick testing or when you want to evaluate a subset of problems.

`--timeout`: Maximum time in seconds allowed for each test case execution (default: 30.0). This prevents infinite loops or long-running code from blocking the evaluation.

`--data-dir`: Custom directory for storing the HumanEval dataset (default: "<lemonade_cache_dir>/data/humaneval").

## How It Works

1. **Dataset Preparation:**
   - On first run, the tool downloads the HumanEval dataset (HumanEval.jsonl.gz)
   - The dataset contains function signatures, docstrings, and test cases
   - Each problem is structured to test specific programming capabilities
   - You can evaluate only the first N problems using `--first-n-samples`

2. **Code Generation:**
   - For each programming problem, the model is provided with a prompt containing:
     - Function signature (e.g., `def sort_numbers(numbers):`)
     - Docstring describing the function's purpose and requirements
   - The model generates k code completions for the function body (controlled by `--k-samples`)
   - These k samples are used to calculate the pass@k metric

3. **Secure Execution:**
   - Generated code is executed in a secure sandbox environment maintained by OpenAI's human-eval library. Note that OpenAI's policy is to disable code execution by default; lemonade enables it by automatically setting the environment variable `HF_ALLOW_CODE_EVAL=1`. OpenAI provides the following code execution protections:
     - **Process Isolation**: Each code sample runs in a separate process to prevent interference
     - **Resource Limits**:
       - CPU time limit (controlled by `--timeout`)
       - Memory usage restrictions
       - Maximum output size restrictions
     - **Restricted Access**:
       - No network access
       - No file system access outside test directory
       - No subprocess creation
       - No system calls
     - **Module Restrictions**:
       - Only allows importing standard Python libraries needed for testing
       - Blocks potentially dangerous modules (os, sys, subprocess, etc.)

   These security measures are implemented through:
   - Python's built-in `resource` module for resource limits
   - AST (Abstract Syntax Tree) analysis for code validation
   - Process-level isolation using `multiprocessing`
   - Custom import hooks to restrict module access

4. **Evaluation Metrics:**
   - **pass@k**: Percentage of problems solved with k attempts
     - pass@1: Success rate with single attempt
     - pass@10: Success rate within 10 attempts
     - pass@100: Success rate within 100 attempts
   - A problem is considered solved if all test cases pass
   - Results are normalized to percentages

5. **Output Files:**
   The tool generates several output files in the results directory:
   - `evaluation_results.csv`: Contains prompts, completions, and expected answers
   - `humaneval_predictions.jsonl`: Raw model predictions in JSONL format
   - `humaneval_predictions.jsonl_results.jsonl`: Detailed evaluation results
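
The process isolation and timeout behavior described in step 3 can be approximated with a much simpler sketch. This is not the human-eval implementation: it swaps in `subprocess` for the `multiprocessing`-based approach described above, applies none of the resource, import, or file system restrictions, and the function name `run_candidate` is hypothetical. It only shows how a separate process plus a timeout keeps a runaway completion from blocking the evaluation.

```python
import subprocess
import sys


def run_candidate(program, timeout=30.0):
    """Run a candidate program in a child Python process.

    Hypothetical helper for illustration only. It provides a separate
    process and a timeout, but it is NOT a security sandbox.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires
        return "timed out"
    # A failing assertion or any uncaught exception gives a non-zero exit code
    return "passed" if result.returncode == 0 else "failed"
```

In this sketch, `program` would be a single script made of the prompt, the generated completion, and the problem's unit tests, so the result is "passed" only if every test assertion succeeds.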

## Example Results Format

The evaluation produces metrics in the following format. In this example, the model has a 25% success rate with 1 attempt, 45% within 10 attempts, and 65% within 100 attempts:
```json
{
    "pass@1": 0.25,
    "pass@10": 0.45,
    "pass@100": 0.65
}
```
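
For reference, pass@k values of this kind can be computed with the unbiased estimator given in the HumanEval paper (reference 1 below): for a problem with `n` generated samples of which `c` pass all tests, pass@k = 1 - C(n - c, k) / C(n, k), averaged over all problems. The sketch below is a minimal version of that formula; the function names are hypothetical, and this is not necessarily the code path that lemonade runs.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem: n samples generated, c of them correct."""
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With `--k-samples 1`, each problem has n = 1, so pass@1 reduces to the fraction of problems whose single completion passes all of its tests.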

## Limitations

1. **Resource Requirements**: Generating multiple samples per problem (high k values) can be computationally intensive and time-consuming.
2. **Memory Usage**: Large language models may require significant memory, especially when generating multiple samples.

## References

1. [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374)
2. [OpenAI HumanEval Repository](https://github.com/openai/human-eval)
Original file line number Diff line number Diff line change
Expand Up @@ -79,8 +79,8 @@ Note that the `llm-prompt`, `accuracy-mmlu`, and `serve` tools can all be used w
Lemonade is also available via API. Here's a quick example of how to benchmark an LLM:

```python
import turnkeyml.llm.tools.torch_llm as tl
import turnkeyml.llm.tools.chat as cl
import lemonade.tools.torch_llm as tl
import lemonade.tools.chat as cl
from turnkeyml.state import State

state = State(cache_dir="cache", build_name="test")
Expand Down
18 changes: 18 additions & 0 deletions examples/llm/leap_basic.py
@@ -0,0 +1,18 @@
"""
This example demonstrates how to use the LEAP API to load a model for
inference on CPU using the hf-cpu recipe, and then use it to generate
the response to a prompt.

If you have a discrete GPU, you can try that by changing the recipe
to hf-dgpu. Note: make sure to have torch+cuda installed when trying
hf-dgpu.
"""

from lemonade import leap

model, tokenizer = leap.from_pretrained("facebook/opt-125m", recipe="hf-cpu")

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))
21 changes: 21 additions & 0 deletions examples/llm/leap_ryzenai_npu.py
@@ -0,0 +1,21 @@
"""
This example demonstrates how to use the LEAP API to load a model for
inference on a Ryzen AI NPU using the ryzenai-npu recipe,
and then use it to generate the response to a prompt.

Note that this example will only run if the Ryzen AI NPU Private recipe is installed.
See genai/docs/ryzenai_npu.md for instructions.

You can try the same model on CPU by changing the recipe to "hf-cpu".
"""

from lemonade import leap

model, tokenizer = leap.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", recipe="ryzenai-npu"
)

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))
38 changes: 38 additions & 0 deletions examples/llm/leap_streaming.py
@@ -0,0 +1,38 @@
"""
This example demonstrates how to use the LEAP API to load a model for
inference on CPU using the hf-cpu recipe, and then use a thread to
generate a streaming response to a prompt.

Note: this approach only works with recipes that support TextIteratorStreamer,
i.e., huggingface-based recipes such as hf-cpu and ryzenai-npu.
"""

from threading import Thread
from transformers import TextIteratorStreamer
from lemonade import leap

# Replace the recipe with "ryzenai-npu" to run on the RyzenAI NPU
model, tokenizer = leap.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", recipe="hf-cpu"
)

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids

streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,
)
generation_kwargs = {
    "input_ids": input_ids,
    "streamer": streamer,
    "max_new_tokens": 30,
}

thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

# Generate the response using streaming
for new_text in streamer:
    print(new_text)

thread.join()
4 changes: 2 additions & 2 deletions examples/llm/turnkey_llm.ipynb
Expand Up @@ -85,7 +85,7 @@
"outputs": [],
"source": [
"# Import the turnkey APIs\n",
"from turnkeyml.llm import leap\n",
"from lemonade import leap\n",
"\n",
"# Load the model on to RyzenAI NPU\n",
"# NOTE: this takes a couple of minutes, but after you've done it once\n",
Expand Down Expand Up @@ -133,7 +133,7 @@
"outputs": [],
"source": [
"# Import the turnkey APIs\n",
"from turnkeyml.llm import leap\n",
"from lemonade import leap\n",
"\n",
"# Load the model on iGPU\n",
"igpu_model, igpu_tokenizer = leap.from_pretrained(\n",
Expand Down
1 change: 1 addition & 0 deletions examples/readme.md
Expand Up @@ -3,3 +3,4 @@
This directory contains examples to help you learn how to use the tools. The examples are split up into three sub-directories:
1. `examples/cli`: a tutorial series for the `turnkey` CLI. This is the recommended starting point.
1. `examples/api`: scripts that demonstrate how to use the `turnkey.evaluate_files()` API.
1. `examples/llm`: scripts that demonstrate the `lemonade` CLI for LLMs.
58 changes: 21 additions & 37 deletions setup.py
Expand Up @@ -3,11 +3,11 @@
with open("src/turnkeyml/version.py", encoding="utf-8") as fp:
version = fp.read().split('"')[1]


setup(
name="turnkeyml",
version=version,
description="TurnkeyML Tools and Models",
author="Jeremy Fowers, Daniel Holanda, Ramakrishnan Sivakumar, Victoria Godsoe",
author_email="[email protected]",
package_dir={"": "src", "turnkeyml_models": "models"},
packages=[
Expand All @@ -17,10 +17,10 @@
"turnkeyml.sequence",
"turnkeyml.cli",
"turnkeyml.common",
"turnkeyml.llm",
"turnkeyml.llm.tools",
"turnkeyml.llm.tools.ort_genai",
"turnkeyml.llm.tools.ryzenai_npu",
"lemonade",
"lemonade.tools",
"lemonade.tools.ort_genai",
"lemonade.tools.ryzenai_npu",
"turnkeyml_models",
"turnkeyml_models.graph_convolutions",
"turnkeyml_models.selftest",
Expand All @@ -46,77 +46,61 @@
"psutil",
"wmi",
"pytz",
"tqdm",
# Conditional dependencies for ONNXRuntime backends
"onnxruntime >=1.10.1;platform_system=='Linux' and extra != 'llm-oga-cuda'",
"onnxruntime-directml >=1.19.0;platform_system=='Windows' and extra != 'llm-oga-cuda'",
"onnxruntime-gpu >=1.19.1;extra == 'llm-oga-cuda'",
],
extras_require={
"llm": [
"tqdm",
"torch>=2.0.0",
"transformers",
"accelerate",
"py-cpuinfo",
"sentencepiece",
"datasets",
# Install human-eval from a forked repo with Windows support until the
# PR (https://github.com/openai/human-eval/pull/53) is merged
"human-eval @ git+https://github.com/ramkrishna2910/human-eval.git",
"fastapi",
"uvicorn[standard]",
],
"llm-oga-dml": [
"llm-oga-igpu": [
"onnxruntime-genai-directml==0.4.0",
"tqdm",
"torch>=2.0.0,<2.4",
"transformers<4.45.0",
"accelerate",
"py-cpuinfo",
"sentencepiece",
"datasets",
"fastapi",
"uvicorn[standard]",
"turnkeyml[llm]",
],
"llm-oga-cuda": [
"onnxruntime-genai-cuda==0.4.0",
"tqdm",
"torch>=2.0.0,<2.4",
"transformers<4.45.0",
"accelerate",
"py-cpuinfo",
"sentencepiece",
"datasets",
"fastapi",
"uvicorn[standard]",
"turnkeyml[llm]",
],
"llm-oga-npu": [
"transformers",
"torch",
"onnx==1.16.0",
"onnxruntime==1.18.0",
"numpy==1.26.4",
"tqdm",
"accelerate",
"py-cpuinfo",
"sentencepiece",
"datasets",
"fastapi",
"uvicorn[standard]",
"turnkeyml[llm]",
],
"llm-oga-hybrid": [
"transformers",
"torch",
"onnx==1.16.1",
"numpy==1.26.4",
"datasets",
"fastapi",
"uvicorn[standard]",
"turnkeyml[llm]",
],
"cuda": [
"torch @ https://download.pytorch.org/whl/cu118/torch-2.3.1%2Bcu118-cp310-cp310-win_amd64.whl",
"torchvision @ https://download.pytorch.org/whl/cu118/torchvision-0.18.1%2Bcu118-cp310-cp310-win_amd64.whl",
"torchaudio @ https://download.pytorch.org/whl/cu118/torchaudio-2.3.1%2Bcu118-cp310-cp310-win_amd64.whl",
],
},
classifiers=[],
entry_points={
"console_scripts": [
"turnkey=turnkeyml:turnkeycli",
"turnkey-llm=turnkeyml.llm:lemonadecli",
"lemonade=turnkeyml.llm:lemonadecli",
"turnkey-llm=lemonade:lemonadecli",
"lemonade=lemonade:lemonadecli",
]
},
python_requires=">=3.8, <3.12",
Expand Down
File renamed without changes.
File renamed without changes.