diff --git a/docs/docs/integrations/llms/llamafile.ipynb b/docs/docs/integrations/llms/llamafile.ipynb new file mode 100644 index 0000000000000..2b778f4d274b1 --- /dev/null +++ b/docs/docs/integrations/llms/llamafile.ipynb @@ -0,0 +1,133 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Llamafile\n", + "\n", + "[Llamafile](https://github.com/Mozilla-Ocho/llamafile) lets you distribute and run LLMs with a single file.\n", + "\n", + "Llamafile does this by combining [llama.cpp](https://github.com/ggerganov/llama.cpp) with [Cosmopolitan Libc](https://github.com/jart/cosmopolitan) into one framework that collapses all the complexity of LLMs down to a single-file executable (called a \"llamafile\") that runs locally on most computers, with no installation.\n", + "\n", + "## Setup\n", + "\n", + "1. Download a llamafile for the model you'd like to use. You can find many models in llamafile format on [Hugging Face](https://huggingface.co/models?other=llamafile). In this guide, we will download a small one, `TinyLlama-1.1B-Chat-v1.0.Q5_K_M`. Note: if you don't have `wget`, you can just download the model via this [link](https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile?download=true).\n", + "\n", + "```bash\n", + "wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile\n", + "```\n", + "\n", + "2. Make the llamafile executable. First, if you haven't done so already, open a terminal. **If you're using macOS, Linux, or BSD,** you'll need to grant permission for your computer to execute this new file using `chmod` (see below). 
**If you're on Windows,** rename the file by adding \".exe\" to the end (the file should then be named `TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile.exe`).\n", + "\n", + "\n", + "```bash\n", + "chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile # run if you're on macOS, Linux, or BSD\n", + "```\n", + "\n", + "3. Run the llamafile in \"server mode\":\n", + "\n", + "```bash\n", + "./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser\n", + "```\n", + "\n", + "Now you can make calls to the llamafile's REST API. By default, the llamafile server listens at http://localhost:8080. You can find full server documentation [here](https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/server/README.md#api-endpoints). You can interact with the llamafile directly via the REST API, but here we'll show how to interact with it using LangChain.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Usage" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'? \\nI\\'ve got a thing for pink, but you know that.\\n\"Can we not talk about work anymore?\" - What did she say?\\nI don\\'t want to be a burden on you.\\nIt\\'s hard to keep a good thing going.\\nYou can\\'t tell me what I want, I have a life too!'" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_community.llms.llamafile import Llamafile\n", + "\n", + "llm = Llamafile()\n", + "\n", + "llm.invoke(\"Tell me a joke\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To stream tokens, use the `.stream(...)` method:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ".\n", + "- She said, \"I’m tired of my life. What should I do?\"\n", + "- The man replied, \"I hear you. But don’t worry. 
Life is just like a joke. It has its funny parts too.\"\n", + "- The woman looked at him, amazed and happy to hear his wise words. - \"Thank you for your wisdom,\" she said, smiling. - He replied, \"Any time. But it doesn't come easy. You have to laugh and keep moving forward in life.\"\n", + "- She nodded, thanking him again. - The man smiled wryly. \"Life can be tough. Sometimes it seems like you’re never going to get out of your situation.\"\n", + "- He said, \"I know that. But the key is not giving up. Life has many ups and downs, but in the end, it will turn out okay.\"\n", + "- The woman's eyes softened. \"Thank you for your advice. It's so important to keep moving forward in life,\" she said. - He nodded once again. \"You’re welcome. I hope your journey is filled with laughter and joy.\"\n", + "- They both smiled and left the bar, ready to embark on their respective adventures.\n" + ] + } + ], + "source": [ + "query = \"Tell me a joke\"\n", + "\n", + "for chunk in llm.stream(query):\n", + " print(chunk, end=\"\")\n", + "\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To learn more about the LangChain Expression Language and the available methods on an LLM, see the [LCEL Interface](https://python.langchain.com/docs/expression_language/interface)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/libs/community/langchain_community/llms/llamafile.py b/libs/community/langchain_community/llms/llamafile.py new file mode 100644 index 0000000000000..5be6f4f211865 --- /dev/null +++ 
b/libs/community/langchain_community/llms/llamafile.py @@ -0,0 +1,318 @@ +from __future__ import annotations + +import json +from io import StringIO +from typing import Any, Dict, Iterator, List, Optional + +import requests +from langchain_core.callbacks.manager import CallbackManagerForLLMRun +from langchain_core.language_models.llms import LLM +from langchain_core.outputs import GenerationChunk +from langchain_core.pydantic_v1 import Extra +from langchain_core.utils import get_pydantic_field_names + + +class Llamafile(LLM): + """Llamafile lets you distribute and run large language models with a + single file. + + To get started, see: https://github.com/Mozilla-Ocho/llamafile + + To use this class, you will need to first: + + 1. Download a llamafile. + 2. Make the downloaded file executable: `chmod +x path/to/model.llamafile` + 3. Start the llamafile in server mode: + + `./path/to/model.llamafile --server --nobrowser` + + Example: + .. code-block:: python + + from langchain_community.llms import Llamafile + llm = Llamafile() + llm.invoke("Tell me a joke.") + """ + + base_url: str = "http://localhost:8080" + """Base url where the llamafile server is listening.""" + + request_timeout: Optional[int] = None + """Timeout for server requests""" + + streaming: bool = False + """Allows receiving each predicted token in real-time instead of + waiting for the completion to finish. To enable this, set to true.""" + + # Generation options + + seed: int = -1 + """Random Number Generator (RNG) seed. A random seed is used if this is + less than zero. Default: -1""" + + temperature: float = 0.8 + """Temperature. Default: 0.8""" + + top_k: int = 40 + """Limit the next token selection to the K most probable tokens. + Default: 40.""" + + top_p: float = 0.95 + """Limit the next token selection to a subset of tokens with a cumulative + probability above a threshold P. 
Default: 0.95.""" + + min_p: float = 0.05 + """The minimum probability for a token to be considered, relative to + the probability of the most likely token. Default: 0.05.""" + + n_predict: int = -1 + """Set the maximum number of tokens to predict when generating text. + Note: May exceed the set limit slightly if the last token is a partial + multibyte character. When 0, no tokens will be generated but the prompt + is evaluated into the cache. Default: -1 = infinity.""" + + n_keep: int = 0 + """Specify the number of tokens from the prompt to retain when the + context size is exceeded and tokens need to be discarded. By default, + this value is set to 0 (meaning no tokens are kept). Use -1 to retain all + tokens from the prompt.""" + + tfs_z: float = 1.0 + """Enable tail free sampling with parameter z. Default: 1.0 = disabled.""" + + typical_p: float = 1.0 + """Enable locally typical sampling with parameter p. + Default: 1.0 = disabled.""" + + repeat_penalty: float = 1.1 + """Control the repetition of token sequences in the generated text. + Default: 1.1""" + + repeat_last_n: int = 64 + """Last n tokens to consider for penalizing repetition. Default: 64, + 0 = disabled, -1 = ctx-size.""" + + penalize_nl: bool = True + """Penalize newline tokens when applying the repeat penalty. + Default: true.""" + + presence_penalty: float = 0.0 + """Repeat alpha presence penalty. Default: 0.0 = disabled.""" + + frequency_penalty: float = 0.0 + """Repeat alpha frequency penalty. Default: 0.0 = disabled""" + + mirostat: int = 0 + """Enable Mirostat sampling, controlling perplexity during text + generation. 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0. + Default: disabled.""" + + mirostat_tau: float = 5.0 + """Set the Mirostat target entropy, parameter tau. Default: 5.0.""" + + mirostat_eta: float = 0.1 + """Set the Mirostat learning rate, parameter eta. 
Default: 0.1.""" + + class Config: + """Configuration for this pydantic object.""" + + extra = Extra.forbid + + @property + def _llm_type(self) -> str: + return "llamafile" + + @property + def _param_fieldnames(self) -> List[str]: + # Return the list of fieldnames that will be passed as configurable + # generation options to the llamafile server. Exclude 'builtin' fields + # from the BaseLLM class like 'metadata' as well as fields that should + # not be passed in requests (base_url, request_timeout). + ignore_keys = [ + "base_url", + "cache", + "callback_manager", + "callbacks", + "metadata", + "name", + "request_timeout", + "streaming", + "tags", + "verbose", + ] + attrs = [ + k for k in get_pydantic_field_names(self.__class__) if k not in ignore_keys + ] + return attrs + + @property + def _default_params(self) -> Dict[str, Any]: + params = {} + for fieldname in self._param_fieldnames: + params[fieldname] = getattr(self, fieldname) + return params + + def _get_parameters( + self, stop: Optional[List[str]] = None, **kwargs: Any + ) -> Dict[str, Any]: + params = self._default_params + + # Only update keys that are already present in the default config. + # This way, we don't accidentally post unknown/unhandled key/values + # in the request to the llamafile server + for k, v in kwargs.items(): + if k in params: + params[k] = v + + if stop is not None and len(stop) > 0: + params["stop"] = stop + + if self.streaming: + params["stream"] = True + + return params + + def _call( + self, + prompt: str, + stop: Optional[List[str]] = None, + run_manager: Optional[CallbackManagerForLLMRun] = None, + **kwargs: Any, + ) -> str: + """Request prompt completion from the llamafile server and return the + output. + + Args: + prompt: The prompt to use for generation. + stop: A list of strings to stop generation when encountered. + run_manager: + **kwargs: Any additional options to pass as part of the + generation request. + + Returns: + The string generated by the model. 
+ + """ + + if self.streaming: + with StringIO() as buff: + for chunk in self._stream( + prompt, stop=stop, run_manager=run_manager, **kwargs + ): + buff.write(chunk.text) + + text = buff.getvalue() + + return text + + else: + params = self._get_parameters(stop=stop, **kwargs) + payload = {"prompt": prompt, **params} + + try: + response = requests.post( + url=f"{self.base_url}/completion", + headers={ + "Content-Type": "application/json", + }, + json=payload, + stream=False, + timeout=self.request_timeout, + ) + except requests.exceptions.ConnectionError: + raise requests.exceptions.ConnectionError( + f"Could not connect to Llamafile server. Please make sure " + f"that a server is running at {self.base_url}." + ) + + response.raise_for_status() + response.encoding = "utf-8" + + text = response.json()["content"] + + return text + + def _stream( + self, + prompt: str, + stop: Optional[List[str]] = None, + run_manager: Optional[CallbackManagerForLLMRun] = None, + **kwargs: Any, + ) -> Iterator[GenerationChunk]: + """Yields results objects as they are generated in real time. + + It also calls the callback manager's on_llm_new_token event with + similar parameters to the OpenAI LLM class method of the same name. + + Args: + prompt: The prompts to pass into the model. + stop: Optional list of stop words to use when generating. + run_manager: + **kwargs: Any additional options to pass as part of the + generation request. + + Returns: + A generator representing the stream of tokens being generated. + + Yields: + Dictionary-like objects each containing a token + + Example: + .. code-block:: python + + from langchain_community.llms import Llamafile + llm = Llamafile( + temperature = 0.0 + ) + for chunk in llm.stream("Ask 'Hi, how are you?' 
like a pirate:'", + stop=["'","\n"]): + # llm.stream yields plain strings + print(chunk, end='', flush=True) + + """ + params = self._get_parameters(stop=stop, **kwargs) + if "stream" not in params: + params["stream"] = True + + payload = {"prompt": prompt, **params} + + try: + response = requests.post( + url=f"{self.base_url}/completion", + headers={ + "Content-Type": "application/json", + }, + json=payload, + stream=True, + timeout=self.request_timeout, + ) + except requests.exceptions.ConnectionError: + raise requests.exceptions.ConnectionError( + f"Could not connect to Llamafile server. Please make sure " + f"that a server is running at {self.base_url}." + ) + + response.encoding = "utf-8" + + for raw_chunk in response.iter_lines(decode_unicode=True): + content = self._get_chunk_content(raw_chunk) + chunk = GenerationChunk(text=content) + yield chunk + if run_manager: + run_manager.on_llm_new_token(token=chunk.text) + + def _get_chunk_content(self, chunk: str) -> str: + """When streaming is turned on, the llamafile server returns lines like: + + 'data: {"content":" They","multimodal":true,"slot_id":0,"stop":false}' + + Here, we convert this to a dict and return the value of the 'content' + field. + """ + + if chunk.startswith("data:"): + cleaned = chunk[len("data:") :].strip() + data = json.loads(cleaned) + return data["content"] + else: + return chunk diff --git a/libs/community/tests/integration_tests/llms/test_llamafile.py b/libs/community/tests/integration_tests/llms/test_llamafile.py new file mode 100644 index 0000000000000..0f9c66c182296 --- /dev/null +++ b/libs/community/tests/integration_tests/llms/test_llamafile.py @@ -0,0 +1,46 @@ +import os +from typing import Generator + +import pytest +import requests +from requests.exceptions import ConnectionError, HTTPError + +from langchain_community.llms.llamafile import Llamafile + +LLAMAFILE_SERVER_BASE_URL = os.getenv( + "LLAMAFILE_SERVER_BASE_URL", "http://localhost:8080" +) + + +def _ping_llamafile_server() -> 
bool: + try: + response = requests.get(LLAMAFILE_SERVER_BASE_URL) + response.raise_for_status() + except (ConnectionError, HTTPError): + return False + + return True + + +@pytest.mark.skipif( + not _ping_llamafile_server(), + reason=f"unable to find llamafile server at {LLAMAFILE_SERVER_BASE_URL}, " + f"please start one and re-run this test", +) +def test_llamafile_call() -> None: + llm = Llamafile() + output = llm.invoke("Say foo:") + assert isinstance(output, str) + + +@pytest.mark.skipif( + not _ping_llamafile_server(), + reason=f"unable to find llamafile server at {LLAMAFILE_SERVER_BASE_URL}, " + f"please start one and re-run this test", +) +def test_llamafile_streaming() -> None: + llm = Llamafile(streaming=True) + generator = llm.stream("Tell me about Roman dodecahedrons.") + assert isinstance(generator, Generator) + for token in generator: + assert isinstance(token, str) diff --git a/libs/community/tests/unit_tests/llms/test_llamafile.py b/libs/community/tests/unit_tests/llms/test_llamafile.py new file mode 100644 index 0000000000000..10fea66a5ac73 --- /dev/null +++ b/libs/community/tests/unit_tests/llms/test_llamafile.py @@ -0,0 +1,158 @@ +import json +from collections import deque +from typing import Any, Dict + +import pytest +import requests +from pytest import MonkeyPatch + +from langchain_community.llms.llamafile import Llamafile + + +def default_generation_params() -> Dict[str, Any]: + return { + "temperature": 0.8, + "seed": -1, + "top_k": 40, + "top_p": 0.95, + "min_p": 0.05, + "n_predict": -1, + "n_keep": 0, + "tfs_z": 1.0, + "typical_p": 1.0, + "repeat_penalty": 1.1, + "repeat_last_n": 64, + "penalize_nl": True, + "presence_penalty": 0.0, + "frequency_penalty": 0.0, + "mirostat": 0, + "mirostat_tau": 5.0, + "mirostat_eta": 0.1, + } + + +def mock_response() -> requests.Response: + contents = json.dumps({"content": "the quick brown fox"}) + response = requests.Response() + response.status_code = 200 + response._content = str.encode(contents) + 
return response + + +def mock_response_stream(): # type: ignore[no-untyped-def] + mock_response = deque( + [ + b'data: {"content":"the","multimodal":false,"slot_id":0,"stop":false}\n\n', # noqa + b'data: {"content":" quick","multimodal":false,"slot_id":0,"stop":false}\n\n', # noqa + ] + ) + + class MockRaw: + def read(self, chunk_size): # type: ignore[no-untyped-def] + try: + return mock_response.popleft() + except IndexError: + return None + + response = requests.Response() + response.status_code = 200 + response.raw = MockRaw() + return response + + +def test_call(monkeypatch: MonkeyPatch) -> None: + """ + Test basic functionality of the `invoke` method + """ + llm = Llamafile( + base_url="http://llamafile-host:8080", + ) + + def mock_post(url, headers, json, stream, timeout): # type: ignore[no-untyped-def] + assert url == "http://llamafile-host:8080/completion" + assert headers == { + "Content-Type": "application/json", + } + # 'unknown' kwarg should be ignored + assert json == {"prompt": "Test prompt", **default_generation_params()} + assert stream is False + assert timeout is None + return mock_response() + + monkeypatch.setattr(requests, "post", mock_post) + out = llm.invoke("Test prompt") + assert out == "the quick brown fox" + + +def test_call_with_kwargs(monkeypatch: MonkeyPatch) -> None: + """ + Test kwargs passed to `invoke` override the default values and are passed + to the endpoint correctly. Also test that any 'unknown' kwargs that are not + present in the LLM class attrs are ignored. 
+ """ + llm = Llamafile( + base_url="http://llamafile-host:8080", + ) + + def mock_post(url, headers, json, stream, timeout): # type: ignore[no-untyped-def] + assert url == "http://llamafile-host:8080/completion" + assert headers == { + "Content-Type": "application/json", + } + # 'unknown' kwarg should be ignored + expected = {"prompt": "Test prompt", **default_generation_params()} + expected["seed"] = 0 + assert json == expected + assert stream is False + assert timeout is None + return mock_response() + + monkeypatch.setattr(requests, "post", mock_post) + out = llm.invoke( + "Test prompt", + unknown="unknown option", # should be ignored + seed=0, # should override the default + ) + assert out == "the quick brown fox" + + +def test_call_raises_exception_on_missing_server(monkeypatch: MonkeyPatch) -> None: + """ + Test that the LLM raises a ConnectionError when no llamafile server is + listening at the base_url. + """ + llm = Llamafile( + # invalid url, nothing should actually be running here + base_url="http://llamafile-host:8080", + ) + with pytest.raises(requests.exceptions.ConnectionError): + llm.invoke("Test prompt") + + +def test_streaming(monkeypatch: MonkeyPatch) -> None: + """ + Test basic functionality of `invoke` with streaming enabled. + """ + llm = Llamafile( + base_url="http://llamafile-hostname:8080", + streaming=True, + ) + + def mock_post(url, headers, json, stream, timeout): # type: ignore[no-untyped-def] + assert url == "http://llamafile-hostname:8080/completion" + assert headers == { + "Content-Type": "application/json", + } + # 'unknown' kwarg should be ignored + assert "unknown" not in json + expected = {"prompt": "Test prompt", **default_generation_params()} + expected["stream"] = True + assert json == expected + assert stream is True + assert timeout is None + + return mock_response_stream() + + monkeypatch.setattr(requests, "post", mock_post) + out = llm.invoke("Test prompt") + assert out == "the quick"
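
Note on parsing streamed lines: in streaming mode the llamafile server returns lines of the form `data: {"content": ...}`, separated by blank lines. When removing the `data:` prefix, keep in mind that `str.lstrip` takes a set of characters rather than a prefix, so slicing off the exact prefix is the safer approach. The sketch below illustrates this in isolation; `parse_sse_line` and `SSE_PREFIX` are hypothetical names used only for this example and are not part of the PR.

```python
import json

# Prefix of the streamed lines (hypothetical constant for this example).
SSE_PREFIX = "data:"


def parse_sse_line(line: str) -> str:
    """Return the 'content' field of one streamed line.

    Hypothetical helper for illustration; it is not part of the PR.
    """
    if not line.strip():
        # Blank separator lines between events carry no content.
        return ""
    if line.startswith(SSE_PREFIX):
        # Slice off the exact prefix. str.lstrip("data: ") would instead
        # remove any leading run of the characters 'd', 'a', 't', ':', ' '.
        payload = line[len(SSE_PREFIX):].strip()
        return json.loads(payload)["content"]
    # Any other line is passed through unchanged.
    return line


lines = [
    'data: {"content":"the","multimodal":false,"slot_id":0,"stop":false}',
    "",
    'data: {"content":" quick","multimodal":false,"slot_id":0,"stop":false}',
]
text = "".join(parse_sse_line(line) for line in lines)
print(text)
```

Because the JSON payload starts with `{`, character-based stripping happens to work for well-formed lines, but prefix slicing does not depend on what the payload begins with.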