
Questions about the two execution policies of the TensorRT backend, DEVICE_BLOCKING mode and BLOCKING mode #7831

Open
Will-Chou-5722 opened this issue Nov 25, 2024 · 1 comment
Labels: module: backends, performance, question, TensorRT

Comments

@Will-Chou-5722

Will-Chou-5722 commented Nov 25, 2024

I noticed that the TensorRT backend has two execution policies: DEVICE_BLOCKING mode and BLOCKING mode.
I have two questions that need your help.

Q1:
I tested the performance using the perf_analyzer tool; the results were as follows:

  • DEVICE_BLOCKING

    throughput: 935 inf/s, server latency: 24 ms

  • BLOCKING

    throughput: 430 inf/s, server latency: 42 ms

DEVICE_BLOCKING mode performs better than BLOCKING mode. In what situations is it appropriate to use BLOCKING mode?
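For reference, the two measurements above can be reproduced by launching the server once per policy and switching the `--backend-config` flag. This is a sketch, not the exact commands used: the model name, repository path, and port mirror the reproduction steps in Q2 and are assumptions, and the fixed `sleep` is a crude stand-in for a readiness check.

```shell
# Sketch: measure each TensorRT execution policy with perf_analyzer.
# Model name, repository path, and port are assumptions.
for policy in DEVICE_BLOCKING BLOCKING; do
  ./tritonserver \
      --model-repository=../docs/examples/model_repository/ \
      --http-port=8010 \
      --backend-config=tensorrt,execution-policy="${policy}" &
  SERVER_PID=$!
  sleep 10   # crude wait for model load; poll /v2/health/ready in practice
  ./perf_analyzer -m resnext101_sparse_int8 -i http -u localhost:8010 \
      -b 1 --concurrency-range 1
  kill "${SERVER_PID}"; wait "${SERVER_PID}" 2>/dev/null
done
```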

Q2:
Can two different models be merged into one TensorRT thread when executing two different models simultaneously in DEVICE_BLOCKING mode?

For example:

  • model-repository
    resnext101_sparse_int8
    resnext101_sparse_2_int8

  • Triton Server
    ./tritonserver --model-repository=../docs/examples/model_repository/ --grpc-port=8011 --http-port=8010 --metrics-port=8012 --pinned-memory-pool-byte-size=6442450944 --cuda-memory-pool-byte-size 0:2147483648 --disable-auto-complete-config --backend-config=tensorrt,execution-policy=DEVICE_BLOCKING

  • Client (concurrent execution)
    ./perf_analyzer -m resnext101_sparse_2_int8 --service-kind=triton -i http -u localhost:8010 -b 1 --concurrency-range 1
    ./perf_analyzer -m resnext101_sparse_int8 --service-kind=triton -i http -u localhost:8010 -b 1 --concurrency-range

  • Observing the execution results, there are two TensorRT threads:
    [screenshot: two TensorRT threads]
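One way to verify the thread count while both models are executing (an assumption about how the screenshot was produced; thread names are implementation details and may vary across Triton versions) is to list the threads of the running server process:

```shell
# Sketch: list all threads of the running tritonserver process.
# Requires a running server; thread naming is version-dependent.
ps -T -p "$(pgrep -x tritonserver)" -o spid,comm
```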

Information
Hardware: NVIDIA Jetson AGX Orin, JetPack 6.1
Triton Server version: 2.51.0

@rmccorm4 added the performance, module: backends, and TensorRT labels on Jan 22, 2025
@rmccorm4
Contributor

CC @tanmayv25

@rmccorm4 added the question label on Jan 23, 2025