Good day everyone, I am trying to run the llama agentic system on an RTX 4090 with FP8 quantization for the inference model and meta-llama/Llama-Guard-3-8B-INT8 for the guard. With a sufficiently small max_seq_len everything fits into 24 GB of VRAM and I can start the inference server and the chat app. However, as soon as I send a message in the chat I get the following error: "Error: Failed to initialize the TMA descriptor 801".
(venv) trainer@pc-aiml:~/.llama$ llama inference start --disable-ipv6
/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/utils.py:43: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
initialize(config_path=relative_path)
Loading config from : /home/trainer/.llama/configs/inference.yaml
Yaml config:
------------------------
inference_config:
  impl_config:
    impl_type: inline
    checkpoint_config:
      checkpoint:
        checkpoint_type: pytorch
        checkpoint_dir: /home/trainer/.llama/checkpoints/Meta-Llama-3.1-8B-Instruct/original
        tokenizer_path: /home/trainer/.llama/checkpoints/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model
        model_parallel_size: 1
        quantization_format: bf16
    quantization:
      type: fp8
    torch_seed: null
    max_seq_len: 2048
    max_batch_size: 1
------------------------
Listening on 0.0.0.0:5000
INFO: Started server process [20033]
INFO: Waiting for application startup.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/__init__.py:955: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:432.)
_C._set_default_tensor_type(t)
Using efficient FP8 operators in FBGEMM.
Quantizing fp8 weights from bf16...
Loaded in 7.05 seconds
Finished model load YES READY
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO: 127.0.0.1:55838 - "POST /inference/chat_completion HTTP/1.1" 200 OK
TMA Desc Addr: 0x7ffdd6221440
format 0
dim 3
gmem_address 0x7eb74f4bde00
globalDim (4096,53,1,1,1)
globalStrides (1,4096,0,0,0)
boxDim (128,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 801
TMA Desc Addr: 0x7ffdd6221440
format 0
dim 3
gmem_address 0x7eb3ea000000
globalDim (4096,14336,1,1,1)
globalStrides (1,4096,0,0,0)
boxDim (128,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 801
TMA Desc Addr: 0x7ffdd6221440
format 9
dim 3
gmem_address 0x7eb3e9c00000
globalDim (14336,53,1,1,1)
globalStrides (2,28672,0,0,0)
boxDim (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 2
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 801
TMA Desc Addr: 0x7ffdd6221440
format 9
dim 3
gmem_address 0x7eb3e9c00000
globalDim (14336,53,1,1,1)
globalStrides (2,28672,0,0,0)
boxDim (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 2
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 801
[debug] got exception cutlass cannot initialize
Traceback (most recent call last):
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/parallel_utils.py", line 80, in retrieve_requests
for obj in out:
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/generation.py", line 287, in chat_completion
yield from self.generate(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
response = gen.send(None)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/generation.py", line 205, in generate
logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_models/llama3_1/api/model.py", line 321, in forward
h = layer(h, start_pos, freqs_cis, mask)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_models/llama3_1/api/model.py", line 268, in forward
out = h + self.feed_forward(self.ffn_norm(h))
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/loader.py", line 43, in swiglu_wrapper
out = ffn_swiglu(x, self.w1.weight, self.w3.weight, self.w2.weight)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/fp8_impls.py", line 62, in ffn_swiglu
return ffn_swiglu_fp8_dynamic(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/fp8_impls.py", line 165, in ffn_swiglu_fp8_dynamic
x1 = fc_fp8_dynamic(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/fp8_impls.py", line 146, in fc_fp8_dynamic
y = torch.ops.fbgemm.f8f8bf16_rowwise(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/_ops.py", line 1061, in __call__
return self_._op(*args, **(kwargs or {}))
RuntimeError: cutlass cannot initialize
[debug] got exception cutlass cannot initialize
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 265, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
await func()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7ebc8462f0d0
During handling of the above exception, another exception occurred:
+ Exception Group Traceback (most recent call last):
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
| return await self.app(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
| raise exc
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
| await self.app(scope, receive, _send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
| await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
| raise exc
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| await app(scope, receive, sender)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
| await route.handle(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
| await self.app(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
| await wrap_app_handling_exceptions(app, request)(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
| raise exc
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| await app(scope, receive, sender)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 75, in app
| await response(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 258, in __call__
| async with anyio.create_task_group() as task_group:
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
| raise BaseExceptionGroup(
| exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
| await func()
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 250, in stream_response
| async for chunk in self.body_iterator:
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/server.py", line 84, in sse_generator
| async for event in event_gen:
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/server.py", line 94, in event_gen
| async for event in InferenceApiInstance.chat_completion(exec_request):
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/inference.py", line 58, in chat_completion
| for token_result in self.generator.chat_completion(
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/model_parallel.py", line 104, in chat_completion
| yield from gen
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/parallel_utils.py", line 255, in run_inference
| raise obj
| RuntimeError: cutlass cannot initialize
+------------------------------------
^CW0729 16:50:42.785000 139352429928448 torch/distributed/elastic/agent/server/api.py:688] Received Signals.SIGINT death signal, shutting down workers
W0729 16:50:42.785000 139352429928448 torch/distributed/elastic/multiprocessing/api.py:734] Closing process 20066 via signal SIGINT
Exception ignored in: <function Context.__del__ at 0x7ebc85d2e950>
Traceback (most recent call last):
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/zmq/sugar/context.py", line 142, in __del__
self.destroy()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/zmq/sugar/context.py", line 324, in destroy
self.term()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/zmq/sugar/context.py", line 266, in term
super().term()
File "_zmq.py", line 545, in zmq.backend.cython._zmq.Context.term
File "_zmq.py", line 141, in zmq.backend.cython._zmq._check_rc
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 20066 got signal: 2
INFO: Shutting down
Process ForkProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/parallel_utils.py", line 175, in launch_dist_group
elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
result = agent.run()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
result = self._invoke_run(role)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 835, in _invoke_run
time.sleep(monitor_interval)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 20064 got signal: 2
INFO: Waiting for application shutdown.
shutting down
INFO: Application shutdown complete.
INFO: Finished server process [20033]
SIGINT or CTRL-C detected. Exiting gracefully (2, <frame at 0x7ebc8644bc40, file '/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/server.py', line 328, code capture_signals>)
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/trainer/.llama/venv/bin/llama", line 8, in <module>
sys.exit(main())
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/cli/llama.py", line 54, in main
parser.run(args)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/cli/llama.py", line 48, in run
args.func(args)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/cli/inference/start.py", line 53, in _run_inference_start_cmd
inference_server_init(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/server.py", line 115, in main
uvicorn.run(app, host=listen_host, port=port)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/main.py", line 577, in run
server.run()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/server.py", line 65, in run
return asyncio.run(self.serve(sockets=sockets))
File "/usr/lib/python3.10/asyncio/runners.py", line 48, in run
loop.run_until_complete(loop.shutdown_asyncgens())
File "uvloop/loop.pyx", line 1515, in uvloop.loop.Loop.run_until_complete
RuntimeError: Event loop stopped before Future completed.
I would appreciate any help and suggestions. Thank you in advance.
This tends to be a symptom of things going OOM when you make the request. To isolate this, can you disable safety first, run inference only, and see if things run successfully?
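For reference, a rough back-of-envelope budget is consistent with the OOM theory: an 8B-parameter model at FP8 is roughly 8 GB of weights, the INT8 guard model is roughly another 8 GB, and the remaining ~8 GB of a 24 GB card has to hold the KV cache, activations, and CUDA/FBGEMM workspace, so a single request can tip it over. A minimal way to confirm is to watch VRAM while reproducing the error (a sketch, assuming nvidia-smi is on the PATH; the 0.5 s interval is arbitrary):

```bash
# Hedged sketch: poll GPU memory every 0.5 s while sending the chat request.
# If memory.used climbs toward memory.total right as the request arrives,
# the "cutlass cannot initialize" / TMA descriptor errors are likely an OOM.
watch -n 0.5 nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv
```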
Also, note that agentic-system and toolchain interfaces / contracts have now changed. Please take a look at the updates (specifically using the llama distribution start CLI command, etc.)
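If you have upgraded the packages, the old llama inference start entrypoint may behave differently or no longer exist, so the safest starting point is the built-in help (a sketch, assuming the CLI keeps standard argparse-style help; exact sub-commands and flags depend on the installed version):

```bash
# Hedged sketch: discover the current distribution-based commands and flags
# instead of reusing old invocations.
llama --help
llama distribution start --help
```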