# context-free grammars example does not work with vLLM integration #1233

captify-sivakhno opened this issue Oct 28, 2024 · 4 comments

captify-sivakhno commented Oct 28, 2024

Describe the issue as clearly as possible:

When running the provided arithmetic grammar example with vLLM, I get the error TypeError: Error in model execution: argument 'ids': 'list' object cannot be interpreted as an integer. I presume this comes from de-tokenization, but I'm still not sure how to fix it. Any suggestions would be welcome, as we have used outlines with vLLM successfully on a number of other use cases and really like the tool!

Steps/code to reproduce the bug:

from vllm import LLM, SamplingParams

llm = LLM(
    "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8",
    enable_prefix_caching=True,
    block_size=64,
    max_num_batched_tokens=15000,
    gpu_memory_utilization=0.96,
    max_model_len=15000,
    use_v2_block_manager=True,
)

arithmetic_grammar = """
    ?start: expression

    ?expression: term (("+" | "-") term)*

    ?term: factor (("*" | "/") factor)*

    ?factor: NUMBER
           | "-" factor
           | "(" expression ")"

    %import common.NUMBER
"""

from outlines import models, generate
model = models.VLLM(llm)
generator = generate.cfg(model, arithmetic_grammar)
sampling_params = SamplingParams(temperature=0.3, top_p=0.2, max_tokens=20)

sequence = generator("Alice had 4 apples and Bob ate 2. Write an expression for Alice's apples:", sampling_params=sampling_params)

Expected result:

(4-2)

Error message:

TypeError: Error in model execution: argument 'ids': 'list' object cannot be interpreted as an integer
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/worker/model_runner_base.py:116, in dump_input_when_exception.<locals>._inner.<locals>._wrapper(*args, **kwargs)
    115 try:
--> 116     return func(*args, **kwargs)
    117 except Exception as err:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/worker/model_runner.py:1698, in ModelRunner.execute_model(self, model_input, kv_caches, intermediate_tensors, num_steps)
   1696     return hidden_or_intermediate_states
-> 1698 logits = self.model.compute_logits(hidden_or_intermediate_states,
   1699                                    model_input.sampling_metadata)
   1701 if not self.is_driver_worker:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/model_executor/models/llama.py:565, in LlamaForCausalLM.compute_logits(self, hidden_states, sampling_metadata)
    560 def compute_logits(
    561     self,
    562     hidden_states: torch.Tensor,
    563     sampling_metadata: SamplingMetadata,
    564 ) -> Optional[torch.Tensor]:
--> 565     logits = self.logits_processor(self.lm_head, hidden_states,
    566                                    sampling_metadata)
    567     return logits
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py:72, in LogitsProcessor.forward(self, lm_head, hidden_states, sampling_metadata, embedding_bias)
     71     # Apply logits processors (if any).
---> 72     logits = _apply_logits_processors(logits, sampling_metadata)
     74 return logits
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py:142, in _apply_logits_processors(logits, sampling_metadata)
    141     else:
--> 142         logits_row = logits_processor(past_tokens_ids,
    143                                       logits_row)
    145 logits[logits_row_idx] = logits_row
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    115 with ctx_factory():
--> 116     return func(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/outlines/processors/base_logits_processor.py:80, in OutlinesLogitsProcessor.__call__(self, input_ids, logits)
     79 elif len(torch_logits.shape) == 1:
---> 80     processed_logits = self.process_logits(
     81         input_ids.unsqueeze(0), torch_logits.unsqueeze(0)
     82     ).squeeze(0)
     84 # return logits as passed array type
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/outlines/processors/structured.py:239, in CFGLogitsProcessor.process_logits(self, input_ids, logits)
    238 for i, guide_state in enumerate(sequence_states):
--> 239     first_legal_token = next(
    240         self.guide.iter_valid_token_ids(
    241             guide_state, torch.argsort(logits[i], descending=True)
    242         )
    243     )
    244     mask[i, [first_legal_token]] = logits[i, [first_legal_token]]
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/outlines/fsm/guide.py:189, in CFGGuide.iter_valid_token_ids(self, state, candidate_token_ids)
    188 try:
--> 189     self._get_parser_state_token_applied(state, int(token_id))
    190     yield token_id
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/outlines/fsm/guide.py:241, in CFGGuide._get_parser_state_token_applied(self, state, token_id)
    240 else:
--> 241     prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]
    242     combined_token_str = self.tokenizer.decode([[state.prev_token, token_id]])[
    243         0
    244     ]
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:4004, in PreTrainedTokenizerBase.decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   4002 token_ids = to_py_obj(token_ids)
-> 4004 return self._decode(
   4005     token_ids=token_ids,
   4006     skip_special_tokens=skip_special_tokens,
   4007     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
   4008     **kwargs,
   4009 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:654, in PreTrainedTokenizerFast._decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
    653     token_ids = [token_ids]
--> 654 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    656 clean_up_tokenization_spaces = (
    657     clean_up_tokenization_spaces
    658     if clean_up_tokenization_spaces is not None
    659     else self.clean_up_tokenization_spaces
    660 )
TypeError: argument 'ids': 'list' object cannot be interpreted as an integer

The above exception was the direct cause of the following exception:
TypeError                                 Traceback (most recent call last)
File <command-146369106289477>, line 15
      1 arithmetic_grammar = """
      2     ?start: expression
      3 
   (...)
     12     %import common.NUMBER
     13 """
     14 generator1 = generate.cfg(model, arithmetic_grammar)
---> 15 generator1("1+1=")
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/outlines/generate/api.py:504, in SequenceGeneratorAdapter.__call__(self, prompts, max_tokens, stop_at, seed, **model_specific_params)
    498 """Generate text from a prompt of list of prompts."""
    500 generation_params = self.prepare_generation_parameters(
    501     max_tokens, stop_at, seed
    502 )
--> 504 completions = self.model.generate(
    505     prompts,
    506     generation_params,
    507     copy(self.logits_processor),
    508     self.sampling_params,
    509     **model_specific_params,
    510 )
    512 return self._format(completions)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/outlines/models/vllm.py:131, in VLLM.generate(self, prompts, generation_parameters, logits_processor, sampling_parameters, sampling_params, use_tqdm)
    128 if sampler == "beam_search":
    129     sampling_params.use_beam_search = True
--> 131 results = self.model.generate(
    132     prompts,
    133     sampling_params=sampling_params,
    134     lora_request=self.lora_request,
    135     use_tqdm=use_tqdm,
    136 )
    137 results = [[sample.text for sample in batch.outputs] for batch in results]
    139 batch_size = len(results)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/utils.py:1063, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
   1056             msg += f" {additional_message}"
   1058         warnings.warn(
   1059             DeprecationWarning(msg),
   1060             stacklevel=3,  # The inner function takes up one level
   1061         )
-> 1063 return fn(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/entrypoints/llm.py:353, in LLM.generate(self, prompts, sampling_params, prompt_token_ids, use_tqdm, lora_request, prompt_adapter_request, guided_options_request, priority)
    343     sampling_params = SamplingParams()
    345 self._validate_and_add_requests(
    346     prompts=parsed_prompts,
    347     params=sampling_params,
   (...)
    350     guided_options=guided_options_request,
    351     priority=priority)
--> 353 outputs = self._run_engine(use_tqdm=use_tqdm)
    354 return LLMEngine.validate_outputs(outputs, RequestOutput)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/entrypoints/llm.py:879, in LLM._run_engine(self, use_tqdm)
    877 total_out_toks = 0
    878 while self.llm_engine.has_unfinished_requests():
--> 879     step_outputs = self.llm_engine.step()
    880     for output in step_outputs:
    881         if output.finished:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/engine/llm_engine.py:1386, in LLMEngine.step(self)
   1382 if allow_async_output_proc:
   1383     execute_model_req.async_callback = self.async_callbacks[
   1384         virtual_engine]
-> 1386 outputs = self.model_executor.execute_model(
   1387     execute_model_req=execute_model_req)
   1389 # We need to do this here so that last step's sampled_token_ids can
   1390 # be passed to the next iteration for PP.
   1391 if self.scheduler_config.is_multi_step:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/executor/gpu_executor.py:134, in GPUExecutor.execute_model(self, execute_model_req)
    131 def execute_model(
    132     self, execute_model_req: ExecuteModelRequest
    133 ) -> Optional[List[Union[SamplerOutput, PoolerOutput]]]:
--> 134     output = self.driver_worker.execute_model(execute_model_req)
    135     return output
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/worker/worker_base.py:327, in LocalOrDistributedWorkerBase.execute_model(self, execute_model_req)
    322     if (self.observability_config is not None
    323             and self.observability_config.collect_model_execute_time):
    324         orig_model_execute_time = intermediate_tensors.tensors.get(
    325             "model_execute_time", torch.tensor(0)).item()
--> 327 output = self.model_runner.execute_model(
    328     model_input=model_input,
    329     kv_caches=self.kv_cache[worker_input.virtual_engine]
    330     if self.kv_cache is not None else None,
    331     intermediate_tensors=intermediate_tensors,
    332     num_steps=num_steps,
    333     **kwargs,
    334 )
    336 model_execute_time = time.perf_counter() - start_time
    337 if not get_pp_group().is_last_rank:
    338     # output is IntermediateTensors
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-ab7126a9-8a7c-4f1d-82d2-2ac2ce2d17ce/lib/python3.11/site-packages/vllm/worker/model_runner_base.py:146, in dump_input_when_exception.<locals>._inner.<locals>._wrapper(*args, **kwargs)
    142     except Exception as pickle_err:
    143         logger.warning(
    144             "Failed to pickle inputs of failed execution: %s",
    145             str(pickle_err))
--> 146         raise type(err)(f"Error in model execution: "
    147                         f"{str(err)}") from err
    149     logger.info(
    150         "Completed writing input of failed execution to %s.",
    151         filename)
    152 raise type(err)(
    153     f"Error in model execution (input dumped to {filename}): "
    154     f"{str(err)}") from err

Outlines/Python version information:

<details>
<summary>Version information</summary>

```
outlines version = 0.1.1

Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]

absl-py==1.0.0
accelerate==0.31.0
aiohttp==3.8.5
aiohttp-cors==0.7.0
aiosignal==1.2.0
airportsdata==20241001
annotated-types==0.7.0
anyio==3.5.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
astor==0.8.1
asttokens==2.0.5
astunparse==1.6.3
async-timeout==4.0.2
attrs==24.2.0
audioread==3.0.1
azure-core==1.30.2
azure-cosmos==4.3.1
azure-identity==1.17.1
azure-storage-blob==12.19.1
azure-storage-file-datalake==12.14.0
backcall==0.2.0
bcrypt==3.2.0
beautifulsoup4==4.12.2
black==23.3.0
bleach==4.1.0
blinker==1.4
blis==0.7.11
boto3==1.34.39
botocore==1.34.39
Brotli==1.0.9
cachetools==5.4.0
catalogue==2.0.10
category-encoders==2.6.3
certifi==2023.7.22
cffi==1.15.1
chardet==4.0.0
charset-normalizer==2.0.4
circuitbreaker==1.4.0
click==8.0.4
cloudpathlib==0.16.0
cloudpickle==2.2.1
cmdstanpy==1.2.2
colorful==0.5.6
comm==0.1.2
confection==0.1.4
configparser==5.2.0
contourpy==1.0.5
cryptography==41.0.3
cycler==0.11.0
cymem==2.0.8
Cython==0.29.32
dacite==1.8.1
databricks-automl-runtime==0.2.21
databricks-feature-engineering==0.6.0
databricks-sdk==0.20.0
dataclasses-json==0.6.7
datasets==2.19.1
dbl-tempo==0.1.26
dbus-python==1.2.18
debugpy==1.6.7
decorator==5.1.1
deepspeed==0.14.4
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.6
diskcache==5.6.3
distlib==0.3.8
distro==1.7.0
distro-info==1.1+ubuntu0.2
dm-tree==0.1.8
einops==0.8.0
entrypoints==0.4
evaluate==0.4.2
executing==0.8.3
facets-overview==1.1.1
Farama-Notifications==0.0.4
fastapi==0.115.4
fastjsonschema==2.20.0
fasttext==0.9.2
filelock==3.13.4
flash-attn==2.5.9.post1
Flask==2.2.5
flatbuffers==24.3.25
fonttools==4.25.0
frozenlist==1.3.3
fsspec==2023.5.0
future==0.18.3
gast==0.4.0
gguf==0.10.0
gitdb==4.0.11
GitPython==3.1.27
google-api-core==2.18.0
google-auth==2.21.0
google-auth-oauthlib==1.0.0
google-cloud-core==2.4.1
google-cloud-storage==2.10.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.1
googleapis-common-protos==1.63.0
greenlet==2.0.1
grpcio==1.60.0
grpcio-status==1.60.0
gunicorn==20.1.0
gviz-api==1.10.0
gymnasium==0.28.1
h11==0.14.0
h5py==3.10.0
hjson==3.1.0
holidays==0.45
horovod @ git+https://github.com/wenfeiy-db/horovod.git@d510b1d385628f8ac5770199c0824fd5b7e01394
htmlmin==0.1.12
httpcore==1.0.5
httplib2==0.20.2
httptools==0.6.4
httpx==0.27.0
huggingface-hub==0.23.4
idna==3.4
ImageHash==4.3.1
imageio==2.31.1
imbalanced-learn==0.11.0
importlib-metadata==6.0.0
importlib_resources==6.4.0
interegular==0.3.3
ipyflow-core==0.0.198
ipykernel==6.25.1
ipython==8.15.0
ipython-genutils==0.2.0
ipywidgets @ https://databricks-build-artifacts-manual-staging.s3-accelerate.amazonaws.com/ipywidgets/ipywidgets-7.7.2-2databricksnojsdeps-py2.py3-none-any.whl?AWSAccessKeyId=AKIAX7HWM34HCSVHYQ7M&Expires=2028837235&Signature=gJ%2BjzENPoM6UKsDxe1M3VIrgWco%3D#sha256=903ead20c8d40de671853515fcad2f34b43ebf3eff80e4df3f876b8dd64c903b
isodate==0.6.1
itsdangerous==2.0.1
jax-jumpy==1.0.0
jedi==0.18.1
jeepney==0.7.1
Jinja2==3.1.2
jiter==0.6.1
jmespath==0.10.0
joblib==1.2.0
joblibspark==0.5.1
jsonpatch==1.33
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter-server==1.23.4
jupyter_client==7.4.9
jupyter_core==5.3.0
jupyterlab-pygments==0.1.2
keras==3.2.1
keyring==23.5.0
kiwisolver==1.4.4
langchain==0.1.20
langchain-community==0.0.38
langchain-core==0.1.52
langchain-text-splitters==0.0.2
langcodes==3.4.0
langsmith==0.1.63
language_data==1.2.0
lark==1.2.2
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lazy_loader==0.2
libclang==15.0.6.1
librosa==0.10.1
lightgbm==4.3.0
linkify-it-py==2.0.0
llvmlite==0.43.0
lm-format-enforcer==0.10.6
lxml==4.9.2
lz4==4.3.2
Mako==1.2.0
marisa-trie==1.1.1
Markdown==3.4.1
markdown-it-py==2.2.0
MarkupSafe==2.1.1
marshmallow==3.21.2
matplotlib==3.7.2
matplotlib-inline==0.1.6
mdit-py-plugins==0.3.0
mdurl==0.1.0
memray==1.13.4
mistral_common==1.4.4
mistune==0.8.4
ml-dtypes==0.3.2
mlflow-skinny==2.15.1
more-itertools==8.10.0
mosaicml-streaming==0.7.4
mpmath==1.3.0
msal==1.30.0
msal-extensions==1.2.0
msgpack==1.0.8
msgspec==0.18.6
multidict==6.0.2
multimethod==1.12
multiprocess==0.70.14
murmurhash==1.0.10
mypy-extensions==0.4.3
namex==0.0.8
nbclassic==0.5.5
nbclient==0.5.13
nbconvert==6.5.4
nbformat==5.7.0
nest-asyncio==1.5.6
networkx==3.1
ninja==1.11.1.1
nltk==3.8.1
notebook==6.5.4
notebook_shim==0.2.2
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.555.43
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.82
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.0
oci==2.126.4
openai==1.52.2
opencensus==0.11.4
opencensus-context==0.1.3
opencv-python-headless==4.10.0.84
opentelemetry-api==1.25.0
opentelemetry-sdk==1.25.0
opentelemetry-semantic-conventions==0.46b0
opt-einsum==3.3.0
optree==0.12.1
orjson==3.10.6
outlines==0.1.1
outlines_core==0.1.14
packaging==23.2
pandas==1.5.3
pandocfilters==1.5.0
paramiko==3.4.0
parso==0.8.3
partial-json-parser==0.2.1.1.post4
pathspec==0.10.3
patsy==0.5.3
petastorm==0.12.1
pexpect==4.8.0
phik==0.12.4
pickleshare==0.7.5
pillow==10.4.0
platformdirs==3.10.0
plotly==5.9.0
pmdarima==2.0.4
pooch==1.8.1
portalocker==2.10.1
preshed==3.0.9
prometheus-fastapi-instrumentator==7.0.0
prometheus_client==0.21.0
prompt-toolkit==3.0.36
prophet==1.1.5
proto-plus==1.24.0
protobuf==4.24.1
psutil==5.9.0
psycopg2==2.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==8.0.0
py-spy==0.3.14
pyairports==2.1.1
pyarrow==14.0.1
pyarrow-hotfix==0.6
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.13.1
pyccolo==0.0.52
pycountry==24.6.1
pycparser==2.21
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.15.1
PyGObject==3.42.1
PyJWT==2.3.0
PyNaCl==1.5.0
pyodbc==4.0.38
pyOpenSSL==23.2.0
pyparsing==3.0.9
pyrsistent==0.18.0
pytesseract==0.3.10
python-apt==2.4.0+ubuntu3
python-dateutil==2.8.2
python-dotenv==1.0.1
python-editor==1.0.4
python-lsp-jsonrpc==1.1.1
python-snappy==0.6.1
pytz==2022.7
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==23.2.0
ray==2.35.0
referencing==0.35.1
regex==2022.7.9
requests==2.31.0
requests-oauthlib==1.3.1
rich==13.7.1
rpds-py==0.20.0
rsa==4.9
s3transfer==0.10.2
safetensors==0.4.2
scikit-image==0.20.0
scikit-learn==1.3.0
scipy==1.11.1
seaborn==0.12.2
SecretStorage==3.3.1
Send2Trash==1.8.0
sentence-transformers==2.7.0
sentencepiece==0.2.0
shap==0.44.0
simplejson==3.17.6
six==1.16.0
slicer==0.0.7
smart-open==5.2.1
smmap==5.0.0
sniffio==1.2.0
soundfile==0.12.1
soupsieve==2.4
soxr==0.3.7
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
spark-tensorflow-distributor==1.0.0
SQLAlchemy==1.4.39
sqlparse==0.4.2
srsly==2.4.8
ssh-import-id==5.11
stack-data==0.2.0
stanio==0.5.1
starlette==0.41.2
statsmodels==0.14.0
sympy==1.11.1
tangled-up-in-unicode==0.2.0
tenacity==8.2.2
tensorboard==2.16.2
tensorboard-data-server==0.7.2
tensorboard_plugin_profile==2.15.1
tensorboardX==2.6.2.2
tensorflow==2.16.1
tensorflow-estimator==2.15.0
tensorflow-io-gcs-filesystem==0.37.1
termcolor==2.4.0
terminado==0.17.1
textual==0.63.3
tf_keras==2.16.0
thinc==8.2.3
threadpoolctl==2.2.0
tifffile==2021.7.2
tiktoken==0.7.0
tinycss2==1.2.1
tokenize-rt==4.2.1
tokenizers==0.20.1
torch==2.4.0
torcheval==0.0.7
torchvision==0.19.0
tornado==6.3.2
tqdm==4.65.0
traitlets==5.7.1
transformers==4.46.0
triton==3.0.0
typeguard==2.13.3
typer==0.9.4
typing-inspect==0.9.0
typing_extensions==4.12.2
tzdata==2022.1
uc-micro-py==1.0.1
ujson==5.4.0
unattended-upgrades==0.1
urllib3==1.26.16
uvicorn==0.32.0
uvloop==0.21.0
virtualenv==20.24.2
visions==0.7.5
vllm==0.6.3
wadllib==1.3.6
wasabi==1.1.2
watchfiles==0.24.0
wcwidth==0.2.5
weasel==0.3.4
webencodings==0.5.1
websocket-client==0.58.0
websockets==13.1
Werkzeug==2.2.3
wordcloud==1.9.3
wrapt==1.14.1
xformers==0.0.27.post2
xgboost==2.0.3
xxhash==3.4.1
yarl==1.8.1
ydata-profiling==4.5.1
zipp==3.11.0
zstd==1.5.5.1

```

</details>


### Context for the issue:

_No response_

dszeto commented Oct 31, 2024

I actually ran into a similar issue when trying to add CFG support to https://github.com/huggingface/text-generation-inference: the same error message, with the same code path in the last three frames (see the trace below). Any hints would be appreciated.

2024-10-31T06:56:33.558057Z ERROR text_generation_launcher: Method Decode encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 116, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 218, in Decode
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1968, in generate_token
    ) = batch.next_token_chooser(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/tokens.py", line 364, in __call__
    _scores = self.grammar_processor(_scores, self.fsm_grammar_states)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/logits_process.py", line 597, in __call__
    allowed_tokens = fsm.get_next_instruction(fsm_grammar_states[i]).tokens
  File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 154, in get_next_instruction
    valid_tokens = list(
  File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 189, in iter_valid_token_ids
    self._get_parser_state_token_applied(state, int(token_id))
  File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 241, in _get_parser_state_token_applied
    prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]
  File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3999, in decode
    return self._decode(
  File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
TypeError: argument 'ids': 'list' object cannot be interpreted as an integer

rlouf (Member) commented Oct 31, 2024

It's hard to know whether it's an issue on their end or ours. Running the same grammar through outlines directly should tell us.
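
A minimal sketch of that cross-check, using outlines' transformers backend instead of vLLM (the model choice here is arbitrary; the grammar is the one from the issue):

```python
# Sketch: run the same CFG through outlines directly, bypassing vLLM entirely.
# If this also fails, the bug is in outlines' CFGGuide rather than the integration.
from outlines import models, generate

arithmetic_grammar = """
    ?start: expression
    ?expression: term (("+" | "-") term)*
    ?term: factor (("*" | "/") factor)*
    ?factor: NUMBER
           | "-" factor
           | "(" expression ")"
    %import common.NUMBER
"""

model = models.transformers("HuggingFaceTB/SmolLM2-135M-Instruct")  # any small HF model
generator = generate.cfg(model, arithmetic_grammar)
print(generator("1+1="))
```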


CompuIves commented Nov 8, 2024

I found the same issue; I think it goes wrong because of this line (outlines/fsm/guide.py:241):

prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]

The tokenizer does not expect a 2D list. Changing it to:

prev_token_str = self.tokenizer.decode([state.prev_token])[0]

That fixes it for me, but I then run into another issue (which could be unrelated).
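
The shape mismatch is easy to reproduce in isolation: a Hugging Face fast tokenizer's decode accepts a flat list of token ids but rejects a nested one with exactly the error seen above (a minimal sketch; the tokenizer choice is arbitrary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer behaves the same

print(tok.decode([50256]))  # ok: a flat list of token ids
tok.decode([[50256]])       # TypeError: argument 'ids': 'list' object
                            # cannot be interpreted as an integer
```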

@PierreLepagnol

Hi everyone,
I encountered an issue when attempting to use the generate.cfg function with a VLLM model. The code throws a NotImplementedError, indicating that the CFG Logits processor is not available for the VLLM class.

Error Message

Exception has occurred: NotImplementedError
The CFG Logits processor is not available for <class 'outlines.models.vllm.VLLM'>.
  File "/home/lepagnol/Documents/These/format-constrained-for-slu/vllm_test.py", line 30, in <module>
    generator = generate.cfg(model, arithmetic_grammar)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: The CFG Logits processor is not available for <class 'outlines.models.vllm.VLLM'>.

Code to Reproduce

from vllm import LLM, SamplingParams

llm = LLM(
    "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8",
    enable_prefix_caching=True,
    block_size=64,
    max_num_batched_tokens=15000,
    gpu_memory_utilization=0.96,
    max_model_len=15000,
    use_v2_block_manager=True,
)

arithmetic_grammar = """
    ?start: expression

    ?expression: term (("+" | "-") term)*

    ?term: factor (("*" | "/") factor)*

    ?factor: NUMBER
           | "-" factor
           | "(" expression ")"

    %import common.NUMBER
"""

from outlines import generate, models

model = models.VLLM(llm)
generator = generate.cfg(model, arithmetic_grammar)
sampling_params = SamplingParams(temperature=0.3, top_p=0.2, max_tokens=20)

sequence = generator(
    "Alice had 4 apples and Bob ate 2. Write an expression for Alice's apples:",
    sampling_params=sampling_params,
)

Expected Behavior

I expected the code to generate a sequence based on the defined grammar using the VLLM model.

Actual Behavior

The code throws a NotImplementedError, suggesting that the CFG Logits processor is not implemented for the VLLM model.

Environment

  • Python version: 3.12
  • Outlines version: 0.0.46
  • VLLM version: 0.6.4.post2.dev67+g63f1fde2.cpu
  • Model: neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8

Additional Context

Is the CFG Logits processor not yet supported for VLLM, or is there a workaround for this issue? If it's a known limitation, are there any plans to support it in the future?

Thank you!
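
One thing worth checking first: the report above is on outlines 0.0.46, while the tracebacks earlier in this thread come from 0.1.1, where the CFG logits processor lives in outlines.processors.structured. A quick sanity check of what the installed version exposes (a sketch, not a confirmed fix):

```python
from importlib.metadata import version

print(version("outlines"))  # the earlier reports in this thread are on 0.1.1

# On 0.1.x this import succeeds (see the CFGLogitsProcessor frames in the first
# traceback above); if it fails, the installed outlines likely predates the vLLM
# CFG integration, which would explain the NotImplementedError at dispatch time.
from outlines.processors.structured import CFGLogitsProcessor  # noqa: F401
```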
