System Info

While testing the TGI Docker image on 2x A40 GPUs, loading Llama-3.1-70B with eetq quantization, I ran into a CUDA illegal memory access error.
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
Run the Docker container with the following TGI arguments:

--model-id meta-llama/Llama-3.1-70B-Instruct --quantize eetq --max-total-tokens 5000 --num-shard 2 --max-input-tokens 3600 --max-batch-prefill-tokens 3600 --port 8010
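For completeness, the full launch command was along these lines (the container-side flags, image tag, and volume mount shown here are approximations; only the TGI arguments after the image name are exactly those above):

# Hypothetical docker invocation; image tag, volume mount, and shm size are assumptions
docker run --gpus all --shm-size 1g \
  -p 8010:8010 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct --quantize eetq \
  --num-shard 2 --max-total-tokens 5000 --max-input-tokens 3600 \
  --max-batch-prefill-tokens 3600 --port 8010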
The model loads and the webserver connects:
2024-11-20T14:37:16.307700574Z shard_uds_path: "/tmp/text-generation-server",
2024-11-20T14:37:16.307705327Z master_addr: "localhost",
2024-11-20T14:37:16.307709620Z master_port: 29500,
2024-11-20T14:37:16.307713867Z huggingface_hub_cache: None,
2024-11-20T14:37:16.307723983Z weights_cache_override: None,
2024-11-20T14:37:16.307728183Z disable_custom_kernels: false,
2024-11-20T14:37:16.307732404Z cuda_memory_fraction: 1.0,
2024-11-20T14:37:16.307736494Z rope_scaling: None,
2024-11-20T14:37:16.307740543Z rope_factor: None,
2024-11-20T14:37:16.307744724Z json_output: false,
2024-11-20T14:37:16.307750164Z otlp_endpoint: None,
2024-11-20T14:37:16.307754647Z otlp_service_name: "text-generation-inference.router",
2024-11-20T14:37:16.307758823Z cors_allow_origin: [],
2024-11-20T14:37:16.307762890Z api_key: None,
2024-11-20T14:37:16.307767914Z watermark_gamma: None,
2024-11-20T14:37:16.307772014Z watermark_delta: None,
2024-11-20T14:37:16.307776120Z ngrok: false,
2024-11-20T14:37:16.307780153Z ngrok_authtoken: None,
2024-11-20T14:37:16.307784313Z ngrok_edge: None,
2024-11-20T14:37:16.307792724Z tokenizer_config_path: None,
2024-11-20T14:37:16.307796893Z disable_grammar_support: false,
2024-11-20T14:37:16.307801247Z env: false,
2024-11-20T14:37:16.307805717Z max_client_batch_size: 4,
2024-11-20T14:37:16.307810014Z lora_adapters: None,
2024-11-20T14:37:16.307814093Z usage_stats: On,
2024-11-20T14:37:16.307818180Z }
2024-11-20T14:37:16.307822804Z 2024-11-20T14:37:16.307146Z INFO hf_hub: Token file not found "/data/token"
2024-11-20T14:37:18.096338967Z 2024-11-20T14:37:18.096215Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-11-20T14:37:18.096361238Z 2024-11-20T14:37:18.096237Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-11-20T14:37:18.096367908Z 2024-11-20T14:37:18.096240Z INFO text_generation_launcher: Sharding model on 2 processes
2024-11-20T14:48:21.626673860Z 2024-11-20T14:48:21.626574Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:48:31.637392292Z 2024-11-20T14:48:31.636998Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-11-20T14:48:31.637450527Z 2024-11-20T14:48:31.637200Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:48:41.648744837Z 2024-11-20T14:48:41.648407Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-11-20T14:48:41.648799649Z 2024-11-20T14:48:41.648444Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:48:51.659683331Z 2024-11-20T14:48:51.659423Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-11-20T14:48:51.659742391Z 2024-11-20T14:48:51.659534Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:49:01.670074897Z 2024-11-20T14:49:01.669799Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-11-20T14:49:01.670802699Z 2024-11-20T14:49:01.670676Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:49:08.917281370Z 2024-11-20T14:49:08.916960Z INFO text_generation_launcher: Using experimental prefill chunking = False
2024-11-20T14:49:09.885724200Z 2024-11-20T14:49:09.885562Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-11-20T14:49:09.978852158Z 2024-11-20T14:49:09.978651Z INFO shard-manager: text_generation_launcher: Shard ready in 528.928728549s rank=0
2024-11-20T14:49:10.147624439Z 2024-11-20T14:49:10.147354Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-11-20T14:49:10.180444763Z 2024-11-20T14:49:10.180174Z INFO shard-manager: text_generation_launcher: Shard ready in 529.124012995s rank=1
2024-11-20T14:49:10.189117339Z 2024-11-20T14:49:10.188842Z INFO text_generation_launcher: Starting Webserver
2024-11-20T14:49:10.253657653Z 2024-11-20T14:49:10.253383Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-11-20T14:49:10.292598176Z 2024-11-20T14:49:10.292416Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-11-20T14:49:17.103097980Z 2024-11-20T14:49:17.102845Z INFO text_generation_launcher: KV-cache blocks: 23677, size: 1
2024-11-20T14:49:17.160879057Z 2024-11-20T14:49:17.160595Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-11-20T14:49:19.251790998Z 2024-11-20T14:49:19.251316Z INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 23677
2024-11-20T14:49:19.251833781Z 2024-11-20T14:49:19.251381Z INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2024-11-20T14:49:19.251840354Z 2024-11-20T14:49:19.251425Z INFO text_generation_router::server: router/src/server.rs:1730: Using the Hugging Face API
2024-11-20T14:49:19.251845201Z 2024-11-20T14:49:19.251471Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/data/token"
2024-11-20T14:49:19.417317634Z 2024-11-20T14:49:19.416925Z INFO text_generation_router::server: router/src/server.rs:2427: Serving revision 945c8663693130f8be2ee66210e062158b2a9693 of model meta-llama/Llama-3.1-70B-Instruct
2024-11-20T14:49:23.180377525Z 2024-11-20T14:49:23.179916Z INFO text_generation_router::server: router/src/server.rs:1863: Using config Some(Llama)
2024-11-20T14:49:23.411177365Z 2024-11-20T14:49:23.410741Z WARN text_generation_router::server: router/src/server.rs:2003: Invalid hostname, defaulting to 0.0.0.0
2024-11-20T14:49:23.512079371Z 2024-11-20T14:49:23.511694Z INFO text_generation_router::server: router/src/server.rs:2389: Connected
Hitting the webserver (even just visiting the URL) results in a CUDA illegal memory access error; a minimal request that triggers it and the resulting crash log follow.
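For reference, even a bare health check is enough to trigger the crash (assuming the server is reachable on localhost:8010):

# Single health probe; the health check's internal prefill is what fails in the log below
curl -v http://localhost:8010/health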
2024-11-20T14:49:01.670074897Z 2024-11-20T14:49:01.669799Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-11-20T14:49:01.670802699Z 2024-11-20T14:49:01.670676Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-11-20T14:49:08.917281370Z 2024-11-20T14:49:08.916960Z INFO text_generation_launcher: Using experimental prefill chunking = False
2024-11-20T14:49:09.885724200Z 2024-11-20T14:49:09.885562Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-11-20T14:49:09.978852158Z 2024-11-20T14:49:09.978651Z INFO shard-manager: text_generation_launcher: Shard ready in 528.928728549s rank=0
2024-11-20T14:49:10.147624439Z 2024-11-20T14:49:10.147354Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-11-20T14:49:10.180444763Z 2024-11-20T14:49:10.180174Z INFO shard-manager: text_generation_launcher: Shard ready in 529.124012995s rank=1
2024-11-20T14:49:10.189117339Z 2024-11-20T14:49:10.188842Z INFO text_generation_launcher: Starting Webserver
2024-11-20T14:49:10.253657653Z 2024-11-20T14:49:10.253383Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-11-20T14:49:10.292598176Z 2024-11-20T14:49:10.292416Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-11-20T14:49:17.103097980Z 2024-11-20T14:49:17.102845Z INFO text_generation_launcher: KV-cache blocks: 23677, size: 1
2024-11-20T14:49:17.160879057Z 2024-11-20T14:49:17.160595Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-11-20T14:49:19.251790998Z 2024-11-20T14:49:19.251316Z INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 23677
2024-11-20T14:49:19.251833781Z 2024-11-20T14:49:19.251381Z INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2024-11-20T14:49:19.251840354Z 2024-11-20T14:49:19.251425Z INFO text_generation_router::server: router/src/server.rs:1730: Using the Hugging Face API
2024-11-20T14:49:19.251845201Z 2024-11-20T14:49:19.251471Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/data/token"
2024-11-20T14:49:19.417317634Z 2024-11-20T14:49:19.416925Z INFO text_generation_router::server: router/src/server.rs:2427: Serving revision 945c8663693130f8be2ee66210e062158b2a9693 of model meta-llama/Llama-3.1-70B-Instruct
2024-11-20T14:49:23.180377525Z 2024-11-20T14:49:23.179916Z INFO text_generation_router::server: router/src/server.rs:1863: Using config Some(Llama)
2024-11-20T14:49:23.411177365Z 2024-11-20T14:49:23.410741Z WARN text_generation_router::server: router/src/server.rs:2003: Invalid hostname, defaulting to 0.0.0.0
2024-11-20T14:49:23.512079371Z 2024-11-20T14:49:23.511694Z INFO text_generation_router::server: router/src/server.rs:2389: Connected
2024-11-20T14:50:03.072008375Z 2024-11-20T14:50:03.071512Z ERROR health:health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: transport error
2024-11-20T14:50:03.241251620Z 2024-11-20T14:50:03.240830Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-11-20T14:50:03.241283441Z 2024-11-20 14:40:22.745 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
2024-11-20T14:50:03.241298591Z /opt/conda/lib/python3.11/site-packages/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.241301894Z @custom_fwd(cast_inputs=torch.float16)
2024-11-20T14:50:03.241304333Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.241306234Z @custom_fwd
2024-11-20T14:50:03.241307682Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.241309522Z @custom_bwd
2024-11-20T14:50:03.241310908Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.241312898Z @custom_fwd
2024-11-20T14:50:03.241314325Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.241316296Z @custom_bwd
2024-11-20T14:50:03.241317659Z [rank0]:[E1120 14:50:01.589782869 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
2024-11-20T14:50:03.241319896Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-11-20T14:50:03.241321391Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2024-11-20T14:50:03.241323663Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-11-20T14:50:03.241328079Z Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-11-20T14:50:03.241329731Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7370874abf86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.241331282Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x73708745ad10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.241333711Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x737087587f08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
2024-11-20T14:50:03.241335322Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7370377eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241338142Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7370377efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241339673Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7370377f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241341215Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7370377f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241342555Z frame #7: <unknown function> + 0xd3b75 (0x7370909e0b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.241344806Z frame #8: <unknown function> + 0x94ac3 (0x737090b84ac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241346206Z frame #9: clone + 0x44 (0x737090c15a04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241349128Z terminate called after throwing an instance of 'c10::DistBackendError'
2024-11-20T14:50:03.241353537Z what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
2024-11-20T14:50:03.241354976Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-11-20T14:50:03.241356503Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2024-11-20T14:50:03.241357908Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-11-20T14:50:03.241360968Z Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-11-20T14:50:03.241362359Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7370874abf86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.241363736Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x73708745ad10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.241365139Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x737087587f08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
2024-11-20T14:50:03.241366487Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7370377eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241368087Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7370377efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241369440Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7370377f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241370839Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7370377f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241372192Z frame #7: <unknown function> + 0xd3b75 (0x7370909e0b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.241373566Z frame #8: <unknown function> + 0x94ac3 (0x737090b84ac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241375177Z frame #9: clone + 0x44 (0x737090c15a04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241378091Z Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
2024-11-20T14:50:03.241379629Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7370874abf86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.241381126Z frame #1: <unknown function> + 0xe3ec34 (0x737037478c34 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.241382696Z frame #2: <unknown function> + 0xd3b75 (0x7370909e0b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.241384508Z frame #3: <unknown function> + 0x94ac3 (0x737090b84ac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241385886Z frame #4: clone + 0x44 (0x737090c15a04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.241387275Z rank=0
2024-11-20T14:50:03.241389538Z 2024-11-20T14:50:03.240890Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=0
2024-11-20T14:50:03.252138023Z 2024-11-20T14:50:03.251958Z ERROR text_generation_launcher: Shard 0 crashed
2024-11-20T14:50:03.252164406Z 2024-11-20T14:50:03.251982Z INFO text_generation_launcher: Terminating webserver
2024-11-20T14:50:03.252167387Z 2024-11-20T14:50:03.252001Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2024-11-20T14:50:03.252576404Z 2024-11-20T14:50:03.252296Z INFO text_generation_router::server: router/src/server.rs:2481: signal received, starting graceful shutdown
2024-11-20T14:50:03.391542207Z 2024-11-20T14:50:03.391170Z ERROR health:health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: transport error
2024-11-20T14:50:03.643555814Z 2024-11-20T14:50:03.643097Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-11-20T14:50:03.643620480Z 2024-11-20 14:40:22.733 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
2024-11-20T14:50:03.643626729Z /opt/conda/lib/python3.11/site-packages/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.643632980Z @custom_fwd(cast_inputs=torch.float16)
2024-11-20T14:50:03.643638110Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.643642966Z @custom_fwd
2024-11-20T14:50:03.643648639Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.643653913Z @custom_bwd
2024-11-20T14:50:03.643658580Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.643664583Z @custom_fwd
2024-11-20T14:50:03.643669328Z /opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
2024-11-20T14:50:03.643673846Z @custom_bwd
2024-11-20T14:50:03.643678130Z [rank1]:[E1120 14:50:01.584143246 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
2024-11-20T14:50:03.643683080Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-11-20T14:50:03.643687450Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2024-11-20T14:50:03.643692308Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-11-20T14:50:03.643704132Z Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-11-20T14:50:03.643708832Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73bbc83b0f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.643713345Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x73bbc835fd10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.643718212Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x73bc1928ff08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
2024-11-20T14:50:03.643722615Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x73bbc95eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643727805Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x73bbc95efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643756878Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x73bbc95f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643761742Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x73bbc95f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643766588Z frame #7: <unknown function> + 0xd3b75 (0x73bc226c7b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.643784819Z frame #8: <unknown function> + 0x94ac3 (0x73bc2286bac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643789509Z frame #9: clone + 0x44 (0x73bc228fca04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643797982Z terminate called after throwing an instance of 'c10::DistBackendError'
2024-11-20T14:50:03.643802349Z what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
2024-11-20T14:50:03.643806939Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-11-20T14:50:03.643811385Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1
2024-11-20T14:50:03.643815842Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-11-20T14:50:03.643833622Z Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-11-20T14:50:03.643838087Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73bbc83b0f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.643842531Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x73bbc835fd10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.643847179Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x73bc1928ff08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
2024-11-20T14:50:03.643851539Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x73bbc95eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643855971Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x73bbc95efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643860431Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x73bbc95f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643864747Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x73bbc95f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643869141Z frame #7: <unknown function> + 0xd3b75 (0x73bc226c7b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.643873694Z frame #8: <unknown function> + 0x94ac3 (0x73bc2286bac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643877987Z frame #9: clone + 0x44 (0x73bc228fca04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643886361Z Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
2024-11-20T14:50:03.643890841Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73bbc83b0f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
2024-11-20T14:50:03.643895274Z frame #1: <unknown function> + 0xe3ec34 (0x73bbc9278c34 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
2024-11-20T14:50:03.643899824Z frame #2: <unknown function> + 0xd3b75 (0x73bc226c7b75 in /opt/conda/bin/../lib/libstdc++.so.6)
2024-11-20T14:50:03.643904908Z frame #3: <unknown function> + 0x94ac3 (0x73bc2286bac3 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643925931Z frame #4: clone + 0x44 (0x73bc228fca04 in /lib/x86_64-linux-gnu/libc.so.6)
2024-11-20T14:50:03.643930477Z rank=1
2024-11-20T14:50:03.643935968Z 2024-11-20T14:50:03.643172Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=1
2024-11-20T14:50:03.852901332Z 2024-11-20T14:50:03.852675Z INFO text_generation_launcher: webserver terminated
2024-11-20T14:50:03.852928944Z 2024-11-20T14:50:03.852707Z INFO text_generation_launcher: Shutting down shards
2024-11-20T14:50:03.852955241Z Error: ShardFailed
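As the trace itself suggests, a follow-up run with CUDA_LAUNCH_BLOCKING=1 should pin the failing kernel to its real call site; a sketch of that re-run (same arguments as above, the extra -e flag is the only change):

# Synchronous launches so the illegal access is reported at the offending call
docker run --gpus all --shm-size 1g -e CUDA_LAUNCH_BLOCKING=1 \
  -p 8010:8010 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct --quantize eetq \
  --num-shard 2 --max-total-tokens 5000 --max-input-tokens 3600 \
  --max-batch-prefill-tokens 3600 --port 8010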
Expected behavior
The model endpoint should accept requests and run inference normally instead of crashing the shards.
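In other words, a standard generation request like the one below (the prompt and parameters are just illustrative) should come back with a generated_text JSON response rather than a shard crash:

# Expected: a JSON body containing "generated_text", no CUDA error
curl http://localhost:8010/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'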