🐛 [Bug] [Dynamic Shapes] Encountered bug when using Torch-TensorRT #3140
Comments
@narendasan can you help me solve this problem? I want to set dynamic shapes for both batch_size and seq_len. |
@narendasan when will torch_executed_modules be supported in dynamo mode? |
Hi @yjjinjie you can set the dynamic shapes and pass in the dynamic inputs using
where the first two ranges, (1, 8, 16) and (1, 2, 3), denote the batch_size and seq_len respectively. Can you try with this and see if you get the same error as above? |
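The snippet referred to above is not reproduced in this thread; a minimal sketch of that style of call, assuming a 2-D input of shape (batch_size, seq_len) and a placeholder module rather than the actual model from this issue:

```python
import torch
import torch_tensorrt

# Placeholder module; the real model in this issue comes from tzrec.
class Toy(torch.nn.Module):
    def forward(self, x):
        return x * 2.0

model = Toy().eval().cuda()

# batch_size ranges over (min=1, opt=8, max=16) and seq_len over (min=1, opt=2, max=3),
# matching the ranges quoted above.
dyn_input = torch_tensorrt.Input(
    min_shape=(1, 1),
    opt_shape=(8, 2),
    max_shape=(16, 3),
    dtype=torch.float32,
)

trt_model = torch_tensorrt.compile(model, ir="dynamo", inputs=[dyn_input])

# Any shape inside the declared ranges should now be accepted at runtime.
out = trt_model(torch.randn(4, 2, device="cuda"))
```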
Yes, I have tried torch_tensorrt.Input, but it hit a new bug.
The error is:
|
I also tried the dynamic_shapes approach: https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html
It has the same problem as torch._dynamo.mark_dynamic(a, 0, min=1, max=8196). |
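For reference, the two documented dynamic-shape routes being discussed here look roughly like the sketch below. This is only a sketch with a stand-in model; the 8196 bound is the value quoted above, and none of this is the exact code from the issue:

```python
import torch
import torch_tensorrt  # registers the "tensorrt" torch.compile backend

model = torch.nn.Linear(16, 4).eval().cuda()
a = torch.randn(4, 16, device="cuda")
b = torch.randn(4, 16, device="cuda")

# Route 1: mark the batch dimension dynamic and go through torch.compile.
torch._dynamo.mark_dynamic(a, 0, min=1, max=8196)
trt_compiled = torch.compile(model, backend="tensorrt")
out1 = trt_compiled(a)

# Route 2: export with a named Dim and compile the ExportedProgram.
batch = torch.export.Dim("batch", min=1, max=8196)
exported = torch.export.export(model, (b,), dynamic_shapes=({0: batch},))
trt_exported = torch_tensorrt.dynamo.compile(exported, inputs=[b])
out2 = trt_exported(b)
```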
@apbose can you help me? |
Yeah sure, let me take a look and get back on this. |
Hi @yjjinjie, may I know where I can find tzrec? It fails with module not found: tzrec. |
@apbose you can just delete the tzrec and mlp code, like this:
|
I do not get the above error when I run the above code. Are you running on the latest branch? I did make a few modifications to the code though-
|
@apbose I use torch_tensorrt 2.4.0 with your code and it hits the same error. Which torch_tensorrt version are you on? |
my env is:
|
@apbose I used pip install --pre torch-tensorrt --index-url https://download.pytorch.org/whl/nightly/cu124 to install torch_tensorrt 2.5.0.dev20240822+cu124, and with that your code works. When will 2.5.0 be released? I cannot install pip install https://download.pytorch.org/whl/nightly/cu124/torch-2.6.0.dev20241013%2Bcu124-cp311-cp311-linux_x86_64.whl because of this error:
|
@apbose in my real code there is another error when I use torch_tensorrt 2.5.0.dev20240822+cu124; when I use torch_tensorrt 2.4.0 with dynamic shapes the error is:
the code is:
Can you help me solve this problem @apbose |
When I use nvcr.io/nvidia/pytorch:24.09-py3, the code works.
Which day's torch_tensorrt build does 2.5.0a0 correspond to? The docker image's system is incompatible with my project, so when will the new 2.5.0 version be released? |
Hi @yjjinjie you can find the release wheels here- https://download.pytorch.org/whl/test/torch-tensorrt/. The torchTRT 2.5 release artifacts were pushed officially yesterday. |
@apbose hello, when I install torch_tensorrt==2.5.0 it also has an error.
When I use nvcr.io/nvidia/pytorch:24.09-py3 the code works (torch 2.5.0a0+b465a5843b.nv24.9). Which day's torch_tensorrt build is 2.5.0a0? Can you update the 2.5.0 version? I want to install torch_tensorrt in my project. |
Can you try with a new virtual env and install torch-tensorrt from here- https://download.pytorch.org/whl/test/torch-tensorrt/, using the wheel torch_tensorrt-2.5.0+cu124-cp310-cp310-linux_x86_64.whl? This will give you torch-tensorrt 2.5 and torch 2.5. Then let me know what the error is. |
@apbose I created a new virtual env and installed torch_tensorrt-2.5.0+cu124-cp310-cp310-linux_x86_64.whl. It has the same error. I only run:
and run collect_env:
the result:
the code is:
the error:
|
@apbose can you help me solve this problem? |
Yes, taking a look. |
I did not get a chance to look at this one yet, but let me get back to you soon regarding this |
I could repro the error-
on torchTRT 2.4. I have yet to try torchTRT 2.5 and 2.6; will try that and update here.
|
Yes. In torchTRT 2.4 it has the error ValueError: len() should return >= 0; in the torchTRT 2.5 release it has the error NameError: name 's0' is not defined. |
Hmm, the thing is that in the torchTRT 2.5 docker container I see it passing. It is failing in 2.4 with the error
(truncated profiler table: void genericReformat::copyPackedKernel<float, float, ...> ... Self CPU time total: 2.528ms) load: tensor(0.4938, device='cuda:0') |
@apbose hello, I use the image docker pull ghcr.io/pytorch/tensorrt/torch_tensorrt:release_2.5 and it has the same error. Please use the code below; your code may not match mine, because my new code's output is multi-dimensional.
the error:
When I use nvcr.io/nvidia/pytorch:24.09-py3 the code is correct, and the output is
torch-trt 2.5 and its image hit the error; please give me a release whl that I can install in my project. |
@apbose I found the reason for the accuracy issue. When I use multiple threads to run TRT model prediction, the accuracy is randomly incorrect. I think it may be related to dynamic shapes and multi-threading. torch_tensorrt.runtime.set_multi_device_safe_mode(True) and GSP may only affect program execution speed: they slow down model prediction, and then the multi-thread input data is ready in time. When I set threads=1, turn GSP off, and do not set torch_tensorrt.runtime.set_multi_device_safe_mode(True), the accuracy is correct. You can get the model and data from http://automl-nni.oss-cn-beijing.aliyuncs.com/trt/test_demo/test_demo.tar.gz. When the accuracy is correct there is no error; when the accuracy is incorrect, it has the same error:
But when I deploy the TRT model, we must use multiple threads to handle requests for speed. How can I use multi-threading to run TRT model prediction? The original model (without TRT conversion) works fine with multiple threads. |
@apbose when I use multiple threads to run TRT model prediction, the accuracy is randomly incorrect in resnet too.
|
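A minimal repro sketch for this multi-threaded accuracy report, under assumptions: resnet18 as the stand-in model, a fixed static input shape, and a simple comparison against a single-threaded reference; this is not the script used in this thread:

```python
import threading

import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18(weights=None).eval().cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch_tensorrt.Input((8, 3, 224, 224), dtype=torch.float32)],
)

# Single-threaded reference output.
with torch.no_grad():
    reference = trt_model(x).clone()

mismatches = []

def worker():
    # Each thread runs the same compiled module on the same input.
    with torch.no_grad():
        out = trt_model(x)
    if not torch.allclose(out, reference, atol=1e-3):
        mismatches.append(out)

# Optionally: torch_tensorrt.runtime.set_multi_device_safe_mode(True)
threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{len(mismatches)} of {len(threads)} threads produced mismatched outputs")
```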
Could you fill in this table so that I have a better understanding? |
On the VM (ECS) that I run, I can turn GSP on and off. On the physical machine, many people use it, so GSP is always on.
|
@apbose please help me solve this problem... |
@apbose in my real code I can set the number of prediction threads; when I set threads=1 it just runs sequentially: https://github.com/alibaba/TorchEasyRec/blob/master/tzrec/main.py#L1138. You can reproduce the multi-thread issue with that code; the tzrec model is the one discussed above: http://automl-nni.oss-cn-beijing.aliyuncs.com/trt/test_demo/test_demo.tar.gz. You can reproduce the resnet case with test_resnet from this issue. |
Thanks for the info @yjjinjie. Can you try @keehyuna's PR https://github.com/pytorch/TensorRT/pull/3310/files and see the results? |
@apbose the code is C++, so I need to compile it to get the whl? Can you give me a python 3.11.10 whl? |
Please try building the wheel with |
@keehyuna my bazel setup has some problems; can you provide a 3.11.10 + cu121 torch_tensorrt whl? |
@yjjinjie , Can you try with it? |
Hello, my torch is 2.5.0; can you give me a whl built with your PR? Thanks very much. |
@keehyuna can you give me a python 3.11 build of the v2.5.0 branch with your PR? |
@yjjinjie Please, check with it. |
my torch is
can you build torch_tensorrt with python setup.py bdist_wheel, without cxx_abi? Thanks. |
@apbose @keehyuna when I use v2.5.0, change cuda 12.4 -> cuda 12.1 (cu124 -> cu121), and add PR 3310,
then when I import torch_tensorrt there is an error: |
Hi @yjjinjie can you try this wheel in your py3.11 and cuda 12.1 env, to build the wheel with your changes?
|
@apbose thanks. Your whl needs torch >=2.6.0dev and <2.7.0, but that's OK: I built the whl, my resnet/tzrec works, and I will run some experiments to validate the accuracy and dynamic shapes. |
@apbose when I use cudagraphs, torch_tensorrt.runtime.set_cudagraphs_mode(True), the program has an occasional issue:
|
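For context, a minimal sketch of how that cudagraphs toggle is used; the tiny model here is a stand-in, not the code from this report:

```python
import torch
import torch_tensorrt

model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU()).eval().cuda()
x = torch.randn(16, 32, device="cuda")

trt_model = torch_tensorrt.compile(model, ir="dynamo", inputs=[x])

# Globally enable CUDA-graph capture for Torch-TensorRT runtime modules.
torch_tensorrt.runtime.set_cudagraphs_mode(True)
with torch.no_grad():
    out = trt_model(x)

# Disable it again when the captured-graph path is not wanted.
torch_tensorrt.runtime.set_cudagraphs_mode(False)
```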
@keehyuna ok, thanks. I will try adding this PR to the 2.5 release and build a new wheel. |
Thanks @yjjinjie. Could you provide simple repro steps? |
@keehyuna my code is quite large, so I need some time to simplify it. But I find that this problem occurs during multithreaded prediction, while there are no issues in single-threaded mode. Can you solve it? |
|
Bug Description
When I use dynamic shapes in TRT, it raises an error;
the static shape is OK, just delete these:
To Reproduce
Steps to reproduce the behavior:
the env: