System Info

While testing the TGI Docker image on 2x A40 GPUs, loading Llama-3.1-70B with eetq quantization, I ran into a CUDA illegal memory access error.
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
Run the Docker container with the following TGI arguments:

--model-id meta-llama/Llama-3.1-70B-Instruct --quantize eetq --max-total-tokens 5000 --num-shard 2 --max-input-tokens 3600 --max-batch-prefill-tokens 3600 --port 8010
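For completeness, the full launch command was along these lines (the container-side flags, image tag, and volume mount shown here are approximations; only the TGI arguments after the image name are exactly those above):

# Hypothetical docker invocation; image tag, volume mount, and shm size are assumptions
docker run --gpus all --shm-size 1g \
  -p 8010:8010 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct --quantize eetq \
  --num-shard 2 --max-total-tokens 5000 --max-input-tokens 3600 \
  --max-batch-prefill-tokens 3600 --port 8010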
The model loads and the webserver connects:
2024-11-20T14:37:16.307700574Z shard_uds_path: "/tmp/text-generation-server",
2024-11-20T14:37:16.307705327Z master_addr: "localhost",
2024-11-20T14:37:16.307709620Z master_port: 29500,
2024-11-20T14:37:16.307713867Z huggingface_hub_cache: None,
2024-11-20T14:37:16.307723983Z weights_cache_override: None,
2024-11-20T14:37:16.307728183Z disable_custom_kernels: false,
2024-11-20T14:37:16.307732404Z cuda_memory_fraction: 1.0,
2024-11-20T14:37:16.307736494Z rope_scaling: None,
2024-11-20T14:37:16.307740543Z rope_factor: None,
2024-11-20T14:37:16.307744724Z json_output: false,
2024-11-20T14:37:16.307750164Z otlp_endpoint: None,
2024-11-20T14:37:16.307754647Z otlp_service_name: "text-generation-inference.router",
2024-11-20T14:37:16.307758823Z cors_allow_origin: [],
2024-11-20T14:37:16.307762890Z api_key: None,
2024-11-20T14:37:16.307767914Z watermark_gamma: None,
2024-11-20T14:37:16.307772014Z watermark_delta: None,
2024-11-20T14:37:16.307776120Z ngrok: false,
2024-11-20T14:37:16.307780153Z ngrok_authtoken: None,
2024-11-20T14:37:16.307784313Z ngrok_edge: None,
2024-11-20T14:37:16.307792724Z tokenizer_config_path: None,
2024-11-20T14:37:16.307796893Z disable_grammar_support: false,
2024-11-20T14:37:16.307801247Z env: false,
2024-11-20T14:37:16.307805717Z max_client_batch_size: 4,
2024-11-20T14:37:16.307810014Z lora_adapters: None,
2024-11-20T14:37:16.307814093Z usage_stats: On,
2024-11-20T14:37:16.307818180Z }
2024-11-20T14:37:16.307822804Z 2024-11-20T14:37:16.307146Z INFO hf_hub: Token file not found "/data/token"
2024-11-20T14:37:18.096338967Z 2024-11-20T14:37:18.096215Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-11-20T14:37:18.096361238Z 2024-11-20T14:37:18.096237Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-11-20T14:37:18.096367908Z 2024-11-20T14:37:18.096240Z INFO text_generation_launcher: Sharding model on 2 processes
2024-11-20T14:48:21.626673860Z 2024-11-20T14:48:21.626574Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:48:31.637392292Z 2024-11-20T14:48:31.636998Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-11-20T14:48:31.637450527Z 2024-11-20T14:48:31.637200Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:48:41.648744837Z 2024-11-20T14:48:41.648407Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-11-20T14:48:41.648799649Z 2024-11-20T14:48:41.648444Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:48:51.659683331Z 2024-11-20T14:48:51.659423Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-11-20T14:48:51.659742391Z 2024-11-20T14:48:51.659534Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:49:01.670074897Z 2024-11-20T14:49:01.669799Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-11-20T14:49:01.670802699Z 2024-11-20T14:49:01.670676Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:49:08.917281370Z 2024-11-20T14:49:08.916960Z INFO text_generation_launcher: Using experimental prefill chunking = False
2024-11-20T14:49:09.885724200Z 2024-11-20T14:49:09.885562Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-11-20T14:49:09.978852158Z 2024-11-20T14:49:09.978651Z INFO shard-manager: text_generation_launcher: Shard ready in 528.928728549s rank=0
2024-11-20T14:49:10.147624439Z 2024-11-20T14:49:10.147354Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-11-20T14:49:10.180444763Z 2024-11-20T14:49:10.180174Z INFO shard-manager: text_generation_launcher: Shard ready in 529.124012995s rank=1
2024-11-20T14:49:10.189117339Z 2024-11-20T14:49:10.188842Z INFO text_generation_launcher: Starting Webserver
2024-11-20T14:49:10.253657653Z 2024-11-20T14:49:10.253383Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-11-20T14:49:10.292598176Z 2024-11-20T14:49:10.292416Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-11-20T14:49:17.103097980Z 2024-11-20T14:49:17.102845Z INFO text_generation_launcher: KV-cache blocks: 23677, size: 1
2024-11-20T14:49:17.160879057Z 2024-11-20T14:49:17.160595Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-11-20T14:49:19.251790998Z 2024-11-20T14:49:19.251316Z INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 23677
2024-11-20T14:49:19.251833781Z 2024-11-20T14:49:19.251381Z INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2024-11-20T14:49:19.251840354Z 2024-11-20T14:49:19.251425Z INFO text_generation_router::server: router/src/server.rs:1730: Using the Hugging Face API
2024-11-20T14:49:19.251845201Z 2024-11-20T14:49:19.251471Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/data/token"
2024-11-20T14:49:19.417317634Z 2024-11-20T14:49:19.416925Z INFO text_generation_router::server: router/src/server.rs:2427: Serving revision 945c8663693130f8be2ee66210e062158b2a9693 of model meta-llama/Llama-3.1-70B-Instruct
2024-11-20T14:49:23.180377525Z 2024-11-20T14:49:23.179916Z INFO text_generation_router::server: router/src/server.rs:1863: Using config Some(Llama)
2024-11-20T14:49:23.411177365Z 2024-11-20T14:49:23.410741Z WARN text_generation_router::server: router/src/server.rs:2003: Invalid hostname, defaulting to 0.0.0.0
2024-11-20T14:49:23.512079371Z 2024-11-20T14:49:23.511694Z INFO text_generation_router::server: router/src/server.rs:2389: Connected
Hitting the webserver (even just visiting the URL) results in a CUDA illegal memory access error; a minimal request that triggers it and the resulting crash log follow.
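For reference, even a bare health check is enough to trigger the crash (assuming the server is reachable on localhost:8010):

# Single health probe; the health check's internal prefill is what fails in the log below
curl -v http://localhost:8010/health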
2024-11-20T14:49:01.670074897Z 2024-11-20T14:49:01.669799Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-11-20T14:49:01.670802699Z 2024-11-20T14:49:01.670676Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:49:08.917281370Z 2024-11-20T14:49:08.916960Z INFO text_generation_launcher: Using experimental prefill chunking = False
2024-11-20T14:49:09.885724200Z 2024-11-20T14:49:09.885562Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-11-20T14:49:09.978852158Z 2024-11-20T14:49:09.978651Z INFO shard-manager: text_generation_launcher: Shard ready in 528.928728549s rank=0
2024-11-20T14:49:10.147624439Z 2024-11-20T14:49:10.147354Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-11-20T14:49:10.180444763Z 2024-11-20T14:49:10.180174Z INFO shard-manager: text_generation_launcher: Shard ready in 529.124012995s rank=1
2024-11-20T14:49:10.189117339Z 2024-11-20T14:49:10.188842Z INFO text_generation_launcher: Starting Webserver
2024-11-20T14:49:10.253657653Z 2024-11-20T14:49:10.253383Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-11-20T14:49:10.292598176Z 2024-11-20T14:49:10.292416Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-11-20T14:49:17.103097980Z 2024-11-20T14:49:17.102845Z INFO text_generation_launcher: KV-cache blocks: 23677, size: 1
2024-11-20T14:49:17.160879057Z 2024-11-20T14:49:17.160595Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-11-20T14:49:19.251790998Z 2024-11-20T14:49:19.251316Z INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 23677
2024-11-20T14:49:19.251833781Z 2024-11-20T14:49:19.251381Z INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2024-11-20T14:49:19.251840354Z 2024-11-20T14:49:19.251425Z INFO text_generation_router::server: router/src/server.rs:1730: Using the Hugging Face API
2024-11-20T14:49:19.251845201Z 2024-11-20T14:49:19.251471Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/data/token"
2024-11-20T14:49:19.417317634Z 2024-11-20T14:49:19.416925Z INFO text_generation_router::server: router/src/server.rs:2427: Serving revision 945c8663693130f8be2ee66210e062158b2a9693 of model meta-llama/Llama-3.1-70B-Instruct
2024-11-20T14:49:23.180377525Z 2024-11-20T14:49:23.179916Z INFO text_generation_router::server: router/src/server.rs:1863: Using config Some(Llama)
2024-11-20T14:49:23.411177365Z 2024-11-20T14:49:23.410741Z WARN text_generation_router::server: router/src/server.rs:2003: Invalid hostname, defaulting to 0.0.0.0
2024-11-20T14:49:23.512079371Z 2024-11-20T14:49:23.511694Z INFO text_generation_router::server: router/src/server.rs:2389: Connected
2024-11-20T14:50:03.072008375Z 2024-11-20T14:50:03.071512Z ERROR health:health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: transport error
2024-11-20T14:50:03.241251620Z 2024-11-20T14:50:03.240830Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-11-20T14:50:03.241283441Z 2024-11-20 14:40:22.745 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
2024-11-20T14:50:03.241298591Z /opt/conda/lib/python3.11/site-packages/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.241301894Z @custom_fwd(cast_inputs=torch.float16)
2024-11-20T14:50:03.241304333Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.241306234Z @custom_fwd
2024-11-20T14:50:03.241307682Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.241309522Z @custom_bwd
2024-11-20T14:50:03.241310908Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.241312898Z @custom_fwd
2024-11-20T14:50:03.241314325Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.241316296Z @custom_bwd
2024-11-20T14:50:03.241317659Z [rank0]:[E1120 14:50:01.589782869 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
2024-11-20T14:50:03.241319896Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-11-20T14:50:03.241321391Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2024-11-20T14:50:03.241323663Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-11-20T14:50:03.241328079Z Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-11-20T14:50:03.241329731Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7370874abf86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.241331282Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x73708745ad10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.241333711Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x737087587f08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
2024-11-20T14:50:03.241335322Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7370377eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241338142Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7370377efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241339673Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7370377f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241341215Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7370377f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241342555Z frame #7: <unknown function> + 0xd3b75 (0x7370909e0b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.241344806Z frame #8: <unknown function> + 0x94ac3 (0x737090b84ac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241346206Z frame #9: clone + 0x44 (0x737090c15a04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241349128Z terminate called after throwing an instance of 'c10::DistBackendError'
2024-11-20T14:50:03.241353537Z what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
2024-11-20T14:50:03.241354976Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-11-20T14:50:03.241356503Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2024-11-20T14:50:03.241357908Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-11-20T14:50:03.241360968Z Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-11-20T14:50:03.241362359Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7370874abf86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.241363736Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x73708745ad10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.241365139Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x737087587f08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
2024-11-20T14:50:03.241366487Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7370377eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241368087Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7370377efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241369440Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7370377f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241370839Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7370377f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241372192Z frame #7: <unknown function> + 0xd3b75 (0x7370909e0b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.241373566Z frame #8: <unknown function> + 0x94ac3 (0x737090b84ac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241375177Z frame #9: clone + 0x44 (0x737090c15a04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241378091Z Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
2024-11-20T14:50:03.241379629Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7370874abf86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.241381126Z frame #1: <unknown function> + 0xe3ec34 (0x737037478c34 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241382696Z frame #2: <unknown function> + 0xd3b75 (0x7370909e0b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.241384508Z frame #3: <unknown function> + 0x94ac3 (0x737090b84ac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241385886Z frame #4: clone + 0x44 (0x737090c15a04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241387275Z rank=0
2024-11-20T14:50:03.241389538Z 2024-11-20T14:50:03.240890Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=0
2024-11-20T14:50:03.252138023Z 2024-11-20T14:50:03.251958Z ERROR text_generation_launcher: Shard 0 crashed
2024-11-20T14:50:03.252164406Z 2024-11-20T14:50:03.251982Z INFO text_generation_launcher: Terminating webserver
2024-11-20T14:50:03.252167387Z 2024-11-20T14:50:03.252001Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2024-11-20T14:50:03.252576404Z 2024-11-20T14:50:03.252296Z INFO text_generation_router::server: router/src/server.rs:2481: signal received, starting graceful shutdown
2024-11-20T14:50:03.391542207Z 2024-11-20T14:50:03.391170Z ERROR health:health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: transport error
2024-11-20T14:50:03.643555814Z 2024-11-20T14:50:03.643097Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-11-20T14:50:03.643620480Z 2024-11-20 14:40:22.733 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
2024-11-20T14:50:03.643626729Z /opt/conda/lib/python3.11/site-packages/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.643632980Z @custom_fwd(cast_inputs=torch.float16)
2024-11-20T14:50:03.643638110Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.643642966Z @custom_fwd
2024-11-20T14:50:03.643648639Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.643653913Z @custom_bwd
2024-11-20T14:50:03.643658580Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.643664583Z @custom_fwd
2024-11-20T14:50:03.643669328Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.643673846Z @custom_bwd
2024-11-20T14:50:03.643678130Z [rank1]:[E1120 14:50:01.584143246 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
2024-11-20T14:50:03.643683080Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-11-20T14:50:03.643687450Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2024-11-20T14:50:03.643692308Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-11-20T14:50:03.643704132Z Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-11-20T14:50:03.643708832Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73bbc83b0f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.643713345Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x73bbc835fd10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.643718212Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x73bc1928ff08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
2024-11-20T14:50:03.643722615Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x73bbc95eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643727805Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x73bbc95efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643756878Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x73bbc95f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643761742Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x73bbc95f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643766588Z frame #7: <unknown function> + 0xd3b75 (0x73bc226c7b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.643784819Z frame #8: <unknown function> + 0x94ac3 (0x73bc2286bac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643789509Z frame #9: clone + 0x44 (0x73bc228fca04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643797982Z terminate called after throwing an instance of 'c10::DistBackendError'
2024-11-20T14:50:03.643802349Z what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
2024-11-20T14:50:03.643806939Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-11-20T14:50:03.643811385Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2024-11-20T14:50:03.643815842Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-11-20T14:50:03.643833622Z Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-11-20T14:50:03.643838087Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73bbc83b0f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.643842531Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x73bbc835fd10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.643847179Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x73bc1928ff08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
2024-11-20T14:50:03.643851539Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x73bbc95eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643855971Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x73bbc95efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643860431Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x73bbc95f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643864747Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x73bbc95f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643869141Z frame #7: <unknown function> + 0xd3b75 (0x73bc226c7b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.643873694Z frame #8: <unknown function> + 0x94ac3 (0x73bc2286bac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643877987Z frame #9: clone + 0x44 (0x73bc228fca04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643886361Z Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
2024-11-20T14:50:03.643890841Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73bbc83b0f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.643895274Z frame #1: <unknown function> + 0xe3ec34 (0x73bbc9278c34 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643899824Z frame #2: <unknown function> + 0xd3b75 (0x73bc226c7b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.643904908Z frame #3: <unknown function> + 0x94ac3 (0x73bc2286bac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643925931Z frame #4: clone + 0x44 (0x73bc228fca04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643930477Z rank=1
2024-11-20T14:50:03.643935968Z 2024-11-20T14:50:03.643172Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=1
2024-11-20T14:50:03.852901332Z 2024-11-20T14:50:03.852675Z INFO text_generation_launcher: webserver terminated
2024-11-20T14:50:03.852928944Z 2024-11-20T14:50:03.852707Z INFO text_generation_launcher: Shutting down shards
2024-11-20T14:50:03.852955241Z Error: ShardFailed
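As the trace itself suggests, a follow-up run with CUDA_LAUNCH_BLOCKING=1 should pin the failing kernel to its real call site; a sketch of that re-run (same arguments as above, the extra -e flag is the only change):

# Synchronous launches so the illegal access is reported at the offending call
docker run --gpus all --shm-size 1g -e CUDA_LAUNCH_BLOCKING=1 \
  -p 8010:8010 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct --quantize eetq \
  --num-shard 2 --max-total-tokens 5000 --max-input-tokens 3600 \
  --max-batch-prefill-tokens 3600 --port 8010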
Expected behavior
The model endpoint should accept requests and run inference normally instead of crashing the shards.
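In other words, a standard generation request like the one below (the prompt and parameters are just illustrative) should come back with a generated_text JSON response rather than a shard crash:

# Expected: a JSON body containing "generated_text", no CUDA error
curl http://localhost:8010/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'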