context-free grammars example does not work with vLLM integration #1233
Comments
I actually ran into a similar issue when trying to add CFG support to https://github.com/huggingface/text-generation-inference. Same error message with the same code path in the last 3 function calls (see trace below). Any hints would be appreciated.

2024-10-31T06:56:33.558057Z ERROR text_generation_launcher: Method Decode encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module> sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__ return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 116, in serve server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 218, in Decode
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1968, in generate_token
) = batch.next_token_chooser(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/tokens.py", line 364, in __call__
_scores = self.grammar_processor(_scores, self.fsm_grammar_states)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/logits_process.py", line 597, in __call__
allowed_tokens = fsm.get_next_instruction(fsm_grammar_states[i]).tokens
File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 154, in get_next_instruction
valid_tokens = list(
File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 189, in iter_valid_token_ids
self._get_parser_state_token_applied(state, int(token_id))
File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 241, in _get_parser_state_token_applied
prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]
File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3999, in decode
return self._decode(
File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
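For reference, the failing call at the bottom of that trace can be reproduced in isolation with any fast transformers tokenizer. A minimal sketch (the model name is only an illustrative choice):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer will do

ids = tok.encode("hello")  # a flat list of ints, e.g. [31373]
tok.decode(ids)            # flat list: decodes fine, returns "hello"
tok.decode([ids])          # nested list: raises a TypeError like the one above,
                           # "'list' object cannot be interpreted as an integer"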
It's hard to know if it's an issue on their end or ours. Running the same in
I found the same issue. I think it goes wrong because of this line (outlines/fsm/guide.py:241): the tokenizer does not expect a 2D list. Changing it to pass a flat list instead (see the sketch below) fixes it for me, but I stumble upon another issue after that (could be unrelated).
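The line in question is visible in the traceback earlier in this thread; the change described above is presumably along these lines (a guess based on the 2D-list observation, not a verified patch):

# outlines/fsm/guide.py:241 as it appears in the traceback above
prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]

# presumed change: pass a flat list of ids, which is what a plain
# transformers tokenizer's decode() expects; whether the trailing [0]
# is still wanted depends on whether this tokenizer's decode() returns
# a string or a list of strings
prev_token_str = self.tokenizer.decode([state.prev_token])[0]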
Hi everyone,

Error Message
Code to Reproduce

from vllm import LLM, SamplingParams
llm = LLM(
"neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8",
enable_prefix_caching=True,
block_size=64,
max_num_batched_tokens=15000,
gpu_memory_utilization=0.96,
max_model_len=15000,
use_v2_block_manager=True,
)
arithmetic_grammar = """
?start: expression
?expression: term (("+" | "-") term)*
?term: factor (("*" | "/") factor)*
?factor: NUMBER
| "-" factor
| "(" expression ")"
%import common.NUMBER
"""
from outlines import generate, models
model = models.VLLM(llm)
generator = generate.cfg(model, arithmetic_grammar)
sampling_params = SamplingParams(temperature=0.3, top_p=0.2, max_tokens=20)
sequence = generator(
"Alice had 4 apples and Bob ate 2. Write an expression for Alice's apples:",
sampling_params=sampling_params,
)

Expected Behavior

I expected the code to generate a sequence based on the defined grammar using the generate.cfg generator.

Actual Behavior

The code throws a TypeError instead.

Environment

Additional Context

Is the CFG Logits processor not yet supported for vLLM? Thank you!
Describe the issue as clearly as possible:
When running the provided arithmetic grammar example with vLLM, I get the error TypeError: Error in model execution: argument 'ids': 'list' object cannot be interpreted as an integer. I presume this comes from de-tokenization, but I am still not sure how to fix it. Any suggestions would be welcome, as we have used outlines with vLLM successfully on a number of other use cases and really like the tool!

Steps/code to reproduce the bug:
Expected result:
Error message:
Outlines/Python version information:
Version information
Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
absl-py==1.0.0
accelerate==0.31.0
aiohttp==3.8.5
aiohttp-cors==0.7.0
aiosignal==1.2.0
airportsdata==20241001
annotated-types==0.7.0
anyio==3.5.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
astor==0.8.1
asttokens==2.0.5
astunparse==1.6.3
async-timeout==4.0.2
attrs==24.2.0
audioread==3.0.1
azure-core==1.30.2
azure-cosmos==4.3.1
azure-identity==1.17.1
azure-storage-blob==12.19.1
azure-storage-file-datalake==12.14.0
backcall==0.2.0
bcrypt==3.2.0
beautifulsoup4==4.12.2
black==23.3.0
bleach==4.1.0
blinker==1.4
blis==0.7.11
boto3==1.34.39
botocore==1.34.39
Brotli==1.0.9
cachetools==5.4.0
catalogue==2.0.10
category-encoders==2.6.3
certifi==2023.7.22
cffi==1.15.1
chardet==4.0.0
charset-normalizer==2.0.4
circuitbreaker==1.4.0
click==8.0.4
cloudpathlib==0.16.0
cloudpickle==2.2.1
cmdstanpy==1.2.2
colorful==0.5.6
comm==0.1.2
confection==0.1.4
configparser==5.2.0
contourpy==1.0.5
cryptography==41.0.3
cycler==0.11.0
cymem==2.0.8
Cython==0.29.32
dacite==1.8.1
databricks-automl-runtime==0.2.21
databricks-feature-engineering==0.6.0
databricks-sdk==0.20.0
dataclasses-json==0.6.7
datasets==2.19.1
dbl-tempo==0.1.26
dbus-python==1.2.18
debugpy==1.6.7
decorator==5.1.1
deepspeed==0.14.4
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.6
diskcache==5.6.3
distlib==0.3.8
distro==1.7.0
distro-info==1.1+ubuntu0.2
dm-tree==0.1.8
einops==0.8.0
entrypoints==0.4
evaluate==0.4.2
executing==0.8.3
facets-overview==1.1.1
Farama-Notifications==0.0.4
fastapi==0.115.4
fastjsonschema==2.20.0
fasttext==0.9.2
filelock==3.13.4
flash-attn==2.5.9.post1
Flask==2.2.5
flatbuffers==24.3.25
fonttools==4.25.0
frozenlist==1.3.3
fsspec==2023.5.0
future==0.18.3
gast==0.4.0
gguf==0.10.0
gitdb==4.0.11
GitPython==3.1.27
google-api-core==2.18.0
google-auth==2.21.0
google-auth-oauthlib==1.0.0
google-cloud-core==2.4.1
google-cloud-storage==2.10.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.1
googleapis-common-protos==1.63.0
greenlet==2.0.1
grpcio==1.60.0
grpcio-status==1.60.0
gunicorn==20.1.0
gviz-api==1.10.0
gymnasium==0.28.1
h11==0.14.0
h5py==3.10.0
hjson==3.1.0
holidays==0.45
horovod @ git+https://github.com/wenfeiy-db/horovod.git@d510b1d385628f8ac5770199c0824fd5b7e01394
htmlmin==0.1.12
httpcore==1.0.5
httplib2==0.20.2
httptools==0.6.4
httpx==0.27.0
huggingface-hub==0.23.4
idna==3.4
ImageHash==4.3.1
imageio==2.31.1
imbalanced-learn==0.11.0
importlib-metadata==6.0.0
importlib_resources==6.4.0
interegular==0.3.3
ipyflow-core==0.0.198
ipykernel==6.25.1
ipython==8.15.0
ipython-genutils==0.2.0
ipywidgets @ https://databricks-build-artifacts-manual-staging.s3-accelerate.amazonaws.com/ipywidgets/ipywidgets-7.7.2-2databricksnojsdeps-py2.py3-none-any.whl?AWSAccessKeyId=AKIAX7HWM34HCSVHYQ7M&Expires=2028837235&Signature=gJ%2BjzENPoM6UKsDxe1M3VIrgWco%3D#sha256=903ead20c8d40de671853515fcad2f34b43ebf3eff80e4df3f876b8dd64c903b
isodate==0.6.1
itsdangerous==2.0.1
jax-jumpy==1.0.0
jedi==0.18.1
jeepney==0.7.1
Jinja2==3.1.2
jiter==0.6.1
jmespath==0.10.0
joblib==1.2.0
joblibspark==0.5.1
jsonpatch==1.33
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter-server==1.23.4
jupyter_client==7.4.9
jupyter_core==5.3.0
jupyterlab-pygments==0.1.2
keras==3.2.1
keyring==23.5.0
kiwisolver==1.4.4
langchain==0.1.20
langchain-community==0.0.38
langchain-core==0.1.52
langchain-text-splitters==0.0.2
langcodes==3.4.0
langsmith==0.1.63
language_data==1.2.0
lark==1.2.2
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lazy_loader==0.2
libclang==15.0.6.1
librosa==0.10.1
lightgbm==4.3.0
linkify-it-py==2.0.0
llvmlite==0.43.0
lm-format-enforcer==0.10.6
lxml==4.9.2
lz4==4.3.2
Mako==1.2.0
marisa-trie==1.1.1
Markdown==3.4.1
markdown-it-py==2.2.0
MarkupSafe==2.1.1
marshmallow==3.21.2
matplotlib==3.7.2
matplotlib-inline==0.1.6
mdit-py-plugins==0.3.0
mdurl==0.1.0
memray==1.13.4
mistral_common==1.4.4
mistune==0.8.4
ml-dtypes==0.3.2
mlflow-skinny==2.15.1
more-itertools==8.10.0
mosaicml-streaming==0.7.4
mpmath==1.3.0
msal==1.30.0
msal-extensions==1.2.0
msgpack==1.0.8
msgspec==0.18.6
multidict==6.0.2
multimethod==1.12
multiprocess==0.70.14
murmurhash==1.0.10
mypy-extensions==0.4.3
namex==0.0.8
nbclassic==0.5.5
nbclient==0.5.13
nbconvert==6.5.4
nbformat==5.7.0
nest-asyncio==1.5.6
networkx==3.1
ninja==1.11.1.1
nltk==3.8.1
notebook==6.5.4
notebook_shim==0.2.2
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.555.43
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.82
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.0
oci==2.126.4
openai==1.52.2
opencensus==0.11.4
opencensus-context==0.1.3
opencv-python-headless==4.10.0.84
opentelemetry-api==1.25.0
opentelemetry-sdk==1.25.0
opentelemetry-semantic-conventions==0.46b0
opt-einsum==3.3.0
optree==0.12.1
orjson==3.10.6
outlines==0.1.1
outlines_core==0.1.14
packaging==23.2
pandas==1.5.3
pandocfilters==1.5.0
paramiko==3.4.0
parso==0.8.3
partial-json-parser==0.2.1.1.post4
pathspec==0.10.3
patsy==0.5.3
petastorm==0.12.1
pexpect==4.8.0
phik==0.12.4
pickleshare==0.7.5
pillow==10.4.0
platformdirs==3.10.0
plotly==5.9.0
pmdarima==2.0.4
pooch==1.8.1
portalocker==2.10.1
preshed==3.0.9
prometheus-fastapi-instrumentator==7.0.0
prometheus_client==0.21.0
prompt-toolkit==3.0.36
prophet==1.1.5
proto-plus==1.24.0
protobuf==4.24.1
psutil==5.9.0
psycopg2==2.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==8.0.0
py-spy==0.3.14
pyairports==2.1.1
pyarrow==14.0.1
pyarrow-hotfix==0.6
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.13.1
pyccolo==0.0.52
pycountry==24.6.1
pycparser==2.21
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.15.1
PyGObject==3.42.1
PyJWT==2.3.0
PyNaCl==1.5.0
pyodbc==4.0.38
pyOpenSSL==23.2.0
pyparsing==3.0.9
pyrsistent==0.18.0
pytesseract==0.3.10
python-apt==2.4.0+ubuntu3
python-dateutil==2.8.2
python-dotenv==1.0.1
python-editor==1.0.4
python-lsp-jsonrpc==1.1.1
python-snappy==0.6.1
pytz==2022.7
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==23.2.0
ray==2.35.0
referencing==0.35.1
regex==2022.7.9
requests==2.31.0
requests-oauthlib==1.3.1
rich==13.7.1
rpds-py==0.20.0
rsa==4.9
s3transfer==0.10.2
safetensors==0.4.2
scikit-image==0.20.0
scikit-learn==1.3.0
scipy==1.11.1
seaborn==0.12.2
SecretStorage==3.3.1
Send2Trash==1.8.0
sentence-transformers==2.7.0
sentencepiece==0.2.0
shap==0.44.0
simplejson==3.17.6
six==1.16.0
slicer==0.0.7
smart-open==5.2.1
smmap==5.0.0
sniffio==1.2.0
soundfile==0.12.1
soupsieve==2.4
soxr==0.3.7
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
spark-tensorflow-distributor==1.0.0
SQLAlchemy==1.4.39
sqlparse==0.4.2
srsly==2.4.8
ssh-import-id==5.11
stack-data==0.2.0
stanio==0.5.1
starlette==0.41.2
statsmodels==0.14.0
sympy==1.11.1
tangled-up-in-unicode==0.2.0
tenacity==8.2.2
tensorboard==2.16.2
tensorboard-data-server==0.7.2
tensorboard_plugin_profile==2.15.1
tensorboardX==2.6.2.2
tensorflow==2.16.1
tensorflow-estimator==2.15.0
tensorflow-io-gcs-filesystem==0.37.1
termcolor==2.4.0
terminado==0.17.1
textual==0.63.3
tf_keras==2.16.0
thinc==8.2.3
threadpoolctl==2.2.0
tifffile==2021.7.2
tiktoken==0.7.0
tinycss2==1.2.1
tokenize-rt==4.2.1
tokenizers==0.20.1
torch==2.4.0
torcheval==0.0.7
torchvision==0.19.0
tornado==6.3.2
tqdm==4.65.0
traitlets==5.7.1
transformers==4.46.0
triton==3.0.0
typeguard==2.13.3
typer==0.9.4
typing-inspect==0.9.0
typing_extensions==4.12.2
tzdata==2022.1
uc-micro-py==1.0.1
ujson==5.4.0
unattended-upgrades==0.1
urllib3==1.26.16
uvicorn==0.32.0
uvloop==0.21.0
virtualenv==20.24.2
visions==0.7.5
vllm==0.6.3
wadllib==1.3.6
wasabi==1.1.2
watchfiles==0.24.0
wcwidth==0.2.5
weasel==0.3.4
webencodings==0.5.1
websocket-client==0.58.0
websockets==13.1
Werkzeug==2.2.3
wordcloud==1.9.3
wrapt==1.14.1
xformers==0.0.27.post2
xgboost==2.0.3
xxhash==3.4.1
yarl==1.8.1
ydata-profiling==4.5.1
zipp==3.11.0
zstd==1.5.5.1