diff --git a/docs/docs/index.md b/docs/docs/index.md
index 32f57563ad..0b15779dd6 100644
--- a/docs/docs/index.md
+++ b/docs/docs/index.md
@@ -73,7 +73,7 @@ DSPy stands for Declarative Self-improving Python. Instead of brittle prompts, y
 > pip install "sglang[all]"
 > pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 
-> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Llama-3.1-8B-Instruct
+> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Llama-3.1-8B-Instruct --grammar-backend xgrammar
 ```
 
 If you don't have access from Meta to download `meta-llama/Llama-3.1-8B-Instruct`, use `Qwen/Qwen2.5-7B-Instruct` for example.
diff --git a/docs/docs/learn/programming/language_models.md b/docs/docs/learn/programming/language_models.md
index 77231c8dd2..ecf0114cf1 100644
--- a/docs/docs/learn/programming/language_models.md
+++ b/docs/docs/learn/programming/language_models.md
@@ -48,7 +48,7 @@ dspy.configure(lm=lm)
 > pip install "sglang[all]"
 > pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 
-> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Meta-Llama-3-8B-Instruct
+> CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 --model-path meta-llama/Meta-Llama-3-8B-Instruct --grammar-backend xgrammar
 ```
 
 Then, connect to it from your DSPy code as an OpenAI-compatible endpoint.
diff --git a/docs/docs/tutorials/agents/index.ipynb b/docs/docs/tutorials/agents/index.ipynb
index 8b5521fb66..9209da8b9d 100644
--- a/docs/docs/tutorials/agents/index.ipynb
+++ b/docs/docs/tutorials/agents/index.ipynb
@@ -58,7 +58,7 @@
     "\n",
     "A model like this is not very reliable out of the box for long or complex agent loops. However, it's extremely fast and cheap to host, as it needs very little RAM.\n",
     "\n",
-    "You might be able to host the 3B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.\n",
+    "You might be able to host the 3B model on your laptop with Ollama, on your GPU server with SGLang (Note: launch with `--grammar-backend xgrammar`), or via a provider that hosts it for you like Databricks or Together.\n",
     "\n",
     "In the snippet below, we'll configure our main LM as `Llama-3.2-3B`. We'll also set up a larger LM, i.e. `GPT-4o`, as a teacher that we'll invoke a very small number of times to help teach the small LM."
    ]
diff --git a/docs/docs/tutorials/classification_finetuning/index.ipynb b/docs/docs/tutorials/classification_finetuning/index.ipynb
index 17120ed87e..34ac378d25 100644
--- a/docs/docs/tutorials/classification_finetuning/index.ipynb
+++ b/docs/docs/tutorials/classification_finetuning/index.ipynb
@@ -14,14 +14,16 @@
     "\n",
     "Install the latest DSPy via `pip install -U --pre dspy` and follow along. This tutorial depends on DSPy 2.6.0 (pre-release).\n",
     "\n",
-    "This tutorial requires a local GPU at the moment for inference, though we plan to support ollama serving for finetuned models as well.\n",
+    "This tutorial requires a local GPU at the moment for inference, though we plan to support Ollama serving for finetuned models as well.\n",
     "\n",
     "You will also need the following dependencies:\n",
     "\n",
     "```shell\n",
     "> pip install \"sglang[all]\"; pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/\n",
     "> pip install -U torch transformers accelerate trl peft\n",
-    "```"
+    "```\n",
+    "\n",
+    "(Note: launch SGLang with `--grammar-backend xgrammar`)"
    ]
  },
  {
diff --git a/docs/docs/tutorials/multihop_search/index.ipynb b/docs/docs/tutorials/multihop_search/index.ipynb
index 1c5998b424..1fde5c27ec 100644
--- a/docs/docs/tutorials/multihop_search/index.ipynb
+++ b/docs/docs/tutorials/multihop_search/index.ipynb
@@ -59,7 +59,7 @@
    "source": [
     "In this tutorial, we'll use a small local LM, Meta's `Llama-3.1-8B-Instruct` which has 8 billion parameters.\n",
     "\n",
-    "You might be able to host the 8B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.\n",
+    "You might be able to host the 8B model on your laptop with Ollama, on your GPU server with SGLang (Note: launch with `--grammar-backend xgrammar`), or via a provider that hosts it for you like Databricks or Together.\n",
     "\n",
     "In the snippet below, we'll configure this small model as our main LM. We'll also set up a larger LM, i.e. `GPT-4o`, as a teacher that we'll invoke a very small number of times to help teach the small LM. This is technically not necessary; the small model can typically teach itself tasks like this in DSPy. But using a larger teacher will give us some peace of mind, where the initial system or optimizer configuration doesn't matter as much."
    ]
diff --git a/dspy/clients/lm_local.py b/dspy/clients/lm_local.py
index d74d079a64..4337583bbf 100644
--- a/dspy/clients/lm_local.py
+++ b/dspy/clients/lm_local.py
@@ -57,7 +57,8 @@ def launch(lm: "LM", launch_kwargs: Optional[Dict[str, Any]] = None):
         )
         port = get_free_port()
         timeout = launch_kwargs.get("timeout", 1800)
-        command = f"python -m sglang.launch_server --model-path {model} --port {port} --host 0.0.0.0"
+        #NOTE - Launched with grammar-backend xgrammar as it is more memory-friendly
+        command = f"python -m sglang.launch_server --model-path {model} --port {port} --host 0.0.0.0 --grammar-backend xgrammar"
 
         # We will manually stream & capture logs.
         process = subprocess.Popen(
diff --git a/examples/migration.ipynb b/examples/migration.ipynb
index 6d63788e52..7095394b3e 100644
--- a/examples/migration.ipynb
+++ b/examples/migration.ipynb
@@ -101,6 +101,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "#NOTE - Launch your SGLang server with `--grammar-backend xgrammar` as it is more memory-friendly\n",
+    "\n",
     "sglang_port = 7501\n",
     "sglang_url = f\"http://localhost:{sglang_port}/v1\"\n",
     "sglang_llama = dspy.LM(\"openai/meta-llama/Meta-Llama-3-8B-Instruct\", api_base=sglang_url)\n",
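Taken together, these changes document one workflow: start SGLang with the `--grammar-backend xgrammar` flag, then point DSPy at the server as an OpenAI-compatible endpoint. Below is a minimal sketch of that flow, assuming a server on `localhost:7501` serving `meta-llama/Meta-Llama-3-8B-Instruct`; the port, model path, and placeholder `api_key` are illustrative values taken from the snippets above, not requirements of DSPy or SGLang.

```python
# Shell (run on the GPU host) - launch SGLang with the xgrammar grammar backend:
#   CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --port 7501 \
#       --model-path meta-llama/Meta-Llama-3-8B-Instruct --grammar-backend xgrammar

import dspy

sglang_port = 7501  # must match the --port passed to sglang.launch_server
sglang_url = f"http://localhost:{sglang_port}/v1"

# Connect to the server as an OpenAI-compatible endpoint. A locally launched
# SGLang server typically does not check the API key, so a placeholder works.
lm = dspy.LM(
    "openai/meta-llama/Meta-Llama-3-8B-Instruct",
    api_base=sglang_url,
    api_key="local",
)
dspy.configure(lm=lm)

# Quick smoke test: dspy.LM is callable and returns a list of completions.
print(lm("Say hello in one short sentence."))
```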