Move llm source code into src/lemonade dir. Add HumanEval. #262

Merged
merged 6 commits on Jan 10, 2025
2 changes: 1 addition & 1 deletion .pylintrc
@@ -121,7 +121,7 @@ enable =
no-init,
abstract-method,
invalid-overridden-method,
arguments-differ,
# arguments-differ,
signature-differs,
bad-staticmethod-argument,
useless-super-delegation,
9 changes: 5 additions & 4 deletions README.md
@@ -5,17 +5,17 @@
[![OS - Windows | Linux](https://img.shields.io/badge/OS-windows%20%7C%20linux-blue)](https://github.com/onnx/turnkeyml/blob/main/docs/install.md "Check out our instructions")
[![Made with Python](https://img.shields.io/badge/Python-3.8,3.10-blue?logo=python&logoColor=white)](https://github.com/onnx/turnkeyml/blob/main/docs/install.md "Check out our instructions")

We are on a mission to make it easy to use the most important tools in the ONNX ecosystem. TurnkeyML accomplishes this by providing no-code CLIs and low-code APIs for both general ONNX workflows as well as LLMs.
We are on a mission to make it easy to use the most important tools in the ONNX ecosystem. TurnkeyML accomplishes this by providing no-code CLIs and low-code APIs for both general ONNX workflows with `turnkey` as well as LLMs with `lemonade`.

| [**Turnkey LLM**](https://github.com/onnx/turnkeyml/tree/main/src/turnkeyml/llm) | [**Turnkey Classic**](https://github.com/onnx/turnkeyml/blob/main/docs/classic_getting_started.md) |
| [**Lemonade**](https://github.com/onnx/turnkeyml/tree/main/src/turnkeyml/llm) | [**Turnkey**](https://github.com/onnx/turnkeyml/blob/main/docs/classic_getting_started.md) |
|:----------------------------------------------: |:-----------------------------------------------------------------: |
| Serve and benchmark LLMs on CPU, GPU, and NPU. <br/> [Click here to get started with turnkey-llm.](https://github.com/onnx/turnkeyml/tree/main/src/turnkeyml/llm) | Export and optimize ONNX models for CNNs, Transformers, and GNNs. <br/> [Click here to get started with turnkey classic.](https://github.com/onnx/turnkeyml/blob/main/docs/classic_getting_started.md) |
| Serve and benchmark LLMs on CPU, GPU, and NPU. <br/> [Click here to get started with `lemonade`.](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade_getting_started.md) | Export and optimize ONNX models for CNNs and Transformers. <br/> [Click here to get started with `turnkey`.](https://github.com/onnx/turnkeyml/blob/main/docs/classic_getting_started.md) |
| <img src="img/llm_demo.png"/> | <img src="img/classic_demo.png"/> |


## How It Works

The `turnkey` (classic) and `turnkey-llm` CLIs provide a set of `Tools` that users can invoke in a `Sequence`. The first `Tool` takes the input (`-i`), performs some action, and passes its state to the next `Tool` in the `Sequence`.
The `turnkey` (CNNs and transformers) and `lemonade` (LLMs) CLIs provide a set of `Tools` that users can invoke in a `Sequence`. The first `Tool` takes the input (`-i`), performs some action, and passes its state to the next `Tool` in the `Sequence`.

You can read the `Sequence` out like a sentence. For example, the demo command above was:

@@ -51,3 +51,4 @@ This project is licensed under the [Apache 2.0 License](https://github.com/onnx/
## Attribution

TurnkeyML used code from other open source projects as a starting point (see [NOTICE.md](NOTICE.md)). Thank you Philip Colangelo, Derek Elkins, Jeremy Fowers, Dan Gard, Victoria Godsoe, Mark Heaps, Daniel Holanda, Brian Kurtz, Mariah Larwood, Philip Lassen, Andrew Ling, Adrian Macias, Gary Malik, Sarah Massengill, Ashwin Murthy, Hatice Ozen, Tim Sears, Sean Settle, Krishna Sivakumar, Aviv Weinstein, Xueli Xao, Bill Xing, and Lev Zlotnik for your contributions to that work.

108 changes: 108 additions & 0 deletions docs/humaneval_accuracy.md
@@ -0,0 +1,108 @@
# Using the HumanEval accuracy test tools

The HumanEval benchmark is a code generation and functional correctness evaluation framework designed to assess language models' ability to generate Python code. It consists of 164 handwritten programming problems, each containing a function signature, docstring, body, and several unit tests. The benchmark focuses on whether a model can generate functionally correct code that passes the test cases, which makes it particularly useful for assessing code generation.

This tool provides an automated way to evaluate language models on the HumanEval benchmark. It handles the process of downloading the dataset, generating code completions, executing them in a secure environment, and calculating pass@k metrics.

## Dataset

The HumanEval dataset is automatically downloaded from [OpenAI's human-eval repository](https://github.com/openai/human-eval) when you first run the benchmark. The dataset contains programming problems that test various aspects of Python programming, including:

- Basic programming operations
- String manipulation
- Mathematical computations
- List operations
- Algorithm implementation
- Data structure manipulation
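
For illustration, each problem pairs a prompt (a function signature plus docstring) with unit tests that the generated completion must pass. A simplified, hypothetical entry might look like the following (not an actual dataset record; real entries also include fields such as `task_id` and `entry_point`):

```python
# Hypothetical, simplified illustration of a HumanEval-style problem.

# The prompt given to the model: function signature and docstring only.
prompt = '''
def add_elements(numbers):
    """Return the sum of all numbers in the list.

    >>> add_elements([1, 2, 3])
    6
    """
'''

# Hidden unit tests used to judge the model's completion of the function body.
test = '''
def check(candidate):
    assert candidate([1, 2, 3]) == 6
    assert candidate([]) == 0
'''
```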

## Running the Benchmark

```bash
lemonade -i meta-llama/Llama-3.2-1B oga-load --device igpu --dtype int4 accuracy-humaneval --k-samples 1 --first-n-samples 10 --timeout 30.0
```

### Optional arguments:

`--k-samples`: Number of completions to generate per prompt (default: 1). This parameter determines the k in pass@k metrics. For example:
- `--k-samples 1`: Calculates pass@1 (single attempt per problem)
- `--k-samples 10`: Calculates pass@10 (ten attempts per problem)
- `--k-samples 100`: Calculates pass@100 (one hundred attempts per problem)

Higher k values provide more robust evaluation but take longer to run.
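
For reference, pass@k is typically reported using the unbiased estimator from the HumanEval paper rather than a naive success ratio. A minimal sketch, assuming `n` completions were generated for a problem and `c` of them passed all tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples generated for a problem, 3 passed -> estimated pass@10
print(pass_at_k(n=20, c=3, k=10))
```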

`--first-n-samples`: Evaluate only the first N problems from the dataset (default: entire dataset). Useful for quick testing or when you want to evaluate a subset of problems.

`--timeout`: Maximum time in seconds allowed for each test case execution (default: 30.0). This prevents infinite loops or long-running code from blocking the evaluation.

`--data-dir`: Custom directory for storing the HumanEval dataset (default: "<lemonade_cache_dir>/data/humaneval").
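
For example, the optional arguments can be combined with the command shown above (the model, device, and data directory below are just placeholders):

```bash
# Hypothetical invocation combining the optional arguments described above.
lemonade -i meta-llama/Llama-3.2-1B oga-load --device igpu --dtype int4 \
  accuracy-humaneval --k-samples 10 --first-n-samples 20 --timeout 60.0 \
  --data-dir ./humaneval_data
```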

## How It Works

1. **Dataset Preparation:**
- On first run, the tool downloads the HumanEval dataset (HumanEval.jsonl.gz)
- The dataset contains function signatures, docstrings, and test cases
- Each problem is structured to test specific programming capabilities
- You can evaluate only the first N problems using `--first-n-samples`

2. **Code Generation:**
- For each programming problem, the model is provided with a prompt containing:
- Function signature (e.g., `def sort_numbers(numbers):`)
- Docstring describing the function's purpose and requirements
- The model generates k code completions for the function body (controlled by `--k-samples`)
- These k samples are used to calculate the pass@k metric

3. **Secure Execution:**
- Generated code is executed in a secure sandbox environment maintained by OpenAI's human-eval library. Note that OpenAI's policy is to disable code execution by default; lemonade, however, enables it by automatically setting the environment variable `HF_ALLOW_CODE_EVAL=1`. OpenAI provides the following code execution protections:
- **Process Isolation**: Each code sample runs in a separate process to prevent interference
- **Resource Limits**:
- CPU time limit (controlled by `--timeout`)
- Memory usage restrictions
- Maximum output size restrictions
- **Restricted Access**:
- No network access
- No file system access outside test directory
- No subprocess creation
- No system calls
- **Module Restrictions**:
- Only allows importing standard Python libraries needed for testing
- Blocks potentially dangerous modules (os, sys, subprocess, etc.)
These security measures are implemented through:
- Python's built-in `resource` module for resource limits
- AST (Abstract Syntax Tree) analysis for code validation
- Process-level isolation using `multiprocessing`
- Custom import hooks to restrict module access

4. **Evaluation Metrics:**
- **pass@k**: Percentage of problems solved with k attempts
- pass@1: Success rate with single attempt
- pass@10: Success rate within 10 attempts
- pass@100: Success rate within 100 attempts
- A problem is considered solved if all test cases pass
- Results are normalized to percentages

5. **Output Files:**
The tool generates several output files in the results directory:
- `evaluation_results.csv`: Contains prompts, completions, and expected answers
- `humaneval_predictions.jsonl`: Raw model predictions in JSONL format
- `humaneval_predictions.jsonl_results.jsonl`: Detailed evaluation results

## Example Results Format

The evaluation produces metrics in the following format:
```json
{
"pass@1": 0.25, // 25% success rate with 1 attempt
"pass@10": 0.45, // 45% success rate within 10 attempts
"pass@100": 0.65 // 65% success rate within 100 attempts
}
```
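
The per-problem details in `humaneval_predictions.jsonl_results.jsonl` can be summarized with a short script. A sketch, assuming each line is a JSON record with a boolean `passed` field (field names may vary across human-eval versions):

```python
import json

passed = 0
total = 0
with open("humaneval_predictions.jsonl_results.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        total += 1
        passed += bool(record.get("passed", False))

print(f"{passed}/{total} samples passed ({100 * passed / max(total, 1):.1f}%)")
```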

## Limitations

1. **Resource Requirements**: Generating multiple samples per problem (high k values) can be computationally intensive and time-consuming.
2. **Memory Usage**: Large language models may require significant memory, especially when generating multiple samples.

## References

1. [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374)
2. [OpenAI HumanEval Repository](https://github.com/openai/human-eval)
@@ -79,8 +79,8 @@ Note that the `llm-prompt`, `accuracy-mmlu`, and `serve` tools can all be used w
Lemonade is also available via API. Here's a quick example of how to benchmark an LLM:

```python
import turnkeyml.llm.tools.torch_llm as tl
import turnkeyml.llm.tools.chat as cl
import lemonade.tools.torch_llm as tl
import lemonade.tools.chat as cl
from turnkeyml.state import State

state = State(cache_dir="cache", build_name="test")
18 changes: 18 additions & 0 deletions examples/llm/leap_basic.py
@@ -0,0 +1,18 @@
"""
This example demonstrates how to use the LEAP API to load a model for
inference on CPU using the hf-cpu recipe, and then use it to generate
the response to a prompt.

If you have a discrete GPU, you can try that by changing the recipe
to hf-dgpu. Note: make sure to have torch+cuda installed when trying
hf-dgpu.
"""

from lemonade import leap

model, tokenizer = leap.from_pretrained("facebook/opt-125m", recipe="hf-cpu")

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))
21 changes: 21 additions & 0 deletions examples/llm/leap_ryzenai_npu.py
@@ -0,0 +1,21 @@
"""
This example demonstrates how to use the LEAP API to load a model for
inference on a Ryzen AI NPU using the ryzenai-npu-load recipe,
and then use it to generate the response to a prompt.

Note that this example will only run if the Ryzen AI NPU Private recipe is installed.
See genai/docs/ryzenai_npu.md for instructions.

You can try the same model on CPU by changing the recipe to "hf-cpu".
"""

from lemonade import leap

model, tokenizer = leap.from_pretrained(
"meta-llama/Llama-2-7b-chat-hf", recipe="ryzenai-npu"
)

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))
38 changes: 38 additions & 0 deletions examples/llm/leap_streaming.py
@@ -0,0 +1,38 @@
"""
This example demonstrates how to use the LEAP API to load a model for
inference on CPU using the hf-cpu recipe, and then use a thread to
generate a streaming response to a prompt.

Note: this approach only works with recipes that support TextIteratorStreamer,
i.e., huggingface-based recipes such as hf-cpu and ryzenai-npu.
"""

from threading import Thread
from transformers import TextIteratorStreamer
from lemonade import leap

# Replace the recipe with "ryzenai-npu" to run on the RyzenAI NPU
model, tokenizer = leap.from_pretrained(
"meta-llama/Llama-2-7b-chat-hf", recipe="hf-cpu"
)

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids

streamer = TextIteratorStreamer(
tokenizer,
skip_prompt=True,
)
generation_kwargs = {
"input_ids": input_ids,
"streamer": streamer,
"max_new_tokens": 30,
}

thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

# Generate the response using streaming
for new_text in streamer:
print(new_text)

thread.join()
4 changes: 2 additions & 2 deletions examples/llm/turnkey_llm.ipynb
@@ -85,7 +85,7 @@
"outputs": [],
"source": [
"# Import the turnkey APIs\n",
"from turnkeyml.llm import leap\n",
"from lemonade import leap\n",
"\n",
"# Load the model on to RyzenAI NPU\n",
"# NOTE: this takes a couple of minutes, but after you've done it once\n",
@@ -133,7 +133,7 @@
"outputs": [],
"source": [
"# Import the turnkey APIs\n",
"from turnkeyml.llm import leap\n",
"from lemonade import leap\n",
"\n",
"# Load the model on iGPU\n",
"igpu_model, igpu_tokenizer = leap.from_pretrained(\n",
1 change: 1 addition & 0 deletions examples/readme.md
@@ -3,3 +3,4 @@
This directory contains examples to help you learn how to use the tools. The examples are split up into two sub-directories:
1. `examples/cli`: a tutorial series for the `turnkey` CLI. This is the recommended starting point.
1. `examples/api`: scripts that demonstrate how to use the `turnkey.evaluate_files()` API.
1. `examples/llm`: scripts that demonstrate the `lemonade` CLI for LLMs.
58 changes: 21 additions & 37 deletions setup.py
@@ -3,11 +3,11 @@
with open("src/turnkeyml/version.py", encoding="utf-8") as fp:
version = fp.read().split('"')[1]


setup(
name="turnkeyml",
version=version,
description="TurnkeyML Tools and Models",
author="Jeremy Fowers, Daniel Holanda, Ramakrishnan Sivakumar, Victoria Godsoe",
author_email="[email protected]",
package_dir={"": "src", "turnkeyml_models": "models"},
packages=[
@@ -17,10 +17,10 @@
"turnkeyml.sequence",
"turnkeyml.cli",
"turnkeyml.common",
"turnkeyml.llm",
"turnkeyml.llm.tools",
"turnkeyml.llm.tools.ort_genai",
"turnkeyml.llm.tools.ryzenai_npu",
"lemonade",
"lemonade.tools",
"lemonade.tools.ort_genai",
"lemonade.tools.ryzenai_npu",
"turnkeyml_models",
"turnkeyml_models.graph_convolutions",
"turnkeyml_models.selftest",
@@ -46,77 +46,61 @@
"psutil",
"wmi",
"pytz",
"tqdm",
# Conditional dependencies for ONNXRuntime backends
"onnxruntime >=1.10.1;platform_system=='Linux' and extra != 'llm-oga-cuda'",
"onnxruntime-directml >=1.19.0;platform_system=='Windows' and extra != 'llm-oga-cuda'",
"onnxruntime-gpu >=1.19.1;extra == 'llm-oga-cuda'",
],
extras_require={
"llm": [
"tqdm",
"torch>=2.0.0",
"transformers",
"accelerate",
"py-cpuinfo",
"sentencepiece",
"datasets",
# Install human-eval from a forked repo with Windows support until the
# PR (https://github.com/openai/human-eval/pull/53) is merged
"human-eval @ git+https://github.com/ramkrishna2910/human-eval.git",
"fastapi",
"uvicorn[standard]",
],
"llm-oga-dml": [
"llm-oga-igpu": [
"onnxruntime-genai-directml==0.4.0",
"tqdm",
"torch>=2.0.0,<2.4",
"transformers<4.45.0",
"accelerate",
"py-cpuinfo",
"sentencepiece",
"datasets",
"fastapi",
"uvicorn[standard]",
"turnkeyml[llm]",
],
"llm-oga-cuda": [
"onnxruntime-genai-cuda==0.4.0",
"tqdm",
"torch>=2.0.0,<2.4",
"transformers<4.45.0",
"accelerate",
"py-cpuinfo",
"sentencepiece",
"datasets",
"fastapi",
"uvicorn[standard]",
"turnkeyml[llm]",
],
"llm-oga-npu": [
"transformers",
"torch",
"onnx==1.16.0",
"onnxruntime==1.18.0",
"numpy==1.26.4",
"tqdm",
"accelerate",
"py-cpuinfo",
"sentencepiece",
"datasets",
"fastapi",
"uvicorn[standard]",
"turnkeyml[llm]",
],
"llm-oga-hybrid": [
"transformers",
"torch",
"onnx==1.16.1",
"numpy==1.26.4",
"datasets",
"fastapi",
"uvicorn[standard]",
"turnkeyml[llm]",
],
"cuda": [
"torch @ https://download.pytorch.org/whl/cu118/torch-2.3.1%2Bcu118-cp310-cp310-win_amd64.whl",
"torchvision @ https://download.pytorch.org/whl/cu118/torchvision-0.18.1%2Bcu118-cp310-cp310-win_amd64.whl",
"torchaudio @ https://download.pytorch.org/whl/cu118/torchaudio-2.3.1%2Bcu118-cp310-cp310-win_amd64.whl",
],
},
classifiers=[],
entry_points={
"console_scripts": [
"turnkey=turnkeyml:turnkeycli",
"turnkey-llm=turnkeyml.llm:lemonadecli",
"lemonade=turnkeyml.llm:lemonadecli",
"turnkey-llm=lemonade:lemonadecli",
"lemonade=lemonade:lemonadecli",
]
},
python_requires=">=3.8, <3.12",
File renamed without changes.
File renamed without changes.