Questions For MVP:
MVP is done, correct? @gs-olive
Autocast
TL;DR
Allow a broader range of valid input types to Torch-TRT-compiled engines, enabling support for Torch operations which require Int64-type inputs (generally index-related, like aten::scatter).
Goal(s)
Currently, if a Torch-TRT graph is given Int64-type tensors as input, compilation fails. Additionally, if Int64-type tensors are provided at runtime, inference can crash unexpectedly. The same holds true for Double inputs, among others. Torch-TensorRT should allow input types that TensorRT does not natively support (e.g. torch.long) and optionally augment the provided user graph with automatic data-type casting to ensure input tensors of reasonable type flow seamlessly through the graph. Torch-TRT would mimic Torch execution in a sense, as inputs would only change type for TensorRT engines and would be cast back to their original type afterward. Ultimately, the goal is to ensure inserted type-casting is used minimally and undone/reverted where applicable, to avoid modifying user-provided tensor inputs.
See Issues #1121, #1346, #1546, #1543
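To make the current limitation concrete, here is a minimal sketch of a graph that hits it; the module, shapes, and error behavior are illustrative assumptions rather than reproductions of the linked issues:

```python
import torch
import torch_tensorrt

class ScatterModule(torch.nn.Module):
    def forward(self, src, index):
        # aten::scatter requires Int64 (torch.long) indices
        return torch.zeros_like(src).scatter(0, index, src)

model = torch.jit.script(ScatterModule().eval().cuda())

# Today, declaring an Int64 input is rejected at compile time, and feeding an
# Int64 tensor to an engine built for Int32 inputs can crash at runtime.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[
        torch_tensorrt.Input(shape=[8], dtype=torch.float32),
        torch_tensorrt.Input(shape=[8], dtype=torch.long),  # currently unsupported
    ],
)
```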
Use Cases
There are two key use cases to be aware of in this situation:
1. The user specifies a TensorRT-unsupported data type (e.g. torch.long), and provides that data type at runtime.
In this case, the autocast feature is implicit, as compilation cannot proceed without casting inputs to a type compatible with TensorRT engines. For example, consider a graph with Int64 inputs %x and %y, and a single Int64 output, %z. Additionally, assume the entire graph can be run in TensorRT, and is thus marked to run in TensorRT. Then, the tensors %x and %y would be downcast to Int32 by an auxiliary Torch engine, run through the TensorRT engine, and the outputs would be upcast to Int64 by another auxiliary Torch engine.
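In eager terms, that cast sequence would look roughly like the following sketch, where trt_engine stands in for the fully TensorRT-compiled section (an assumed placeholder, not an actual API):

```python
import torch

def run_with_autocast(x, y, trt_engine):
    # Auxiliary Torch block: downcast the Int64 inputs to a TensorRT-supported type
    x_i32 = x.to(torch.int32)
    y_i32 = y.to(torch.int32)

    # The fully TensorRT-compiled section runs in Int32
    z_i32 = trt_engine(x_i32, y_i32)

    # Auxiliary Torch block: upcast the result back to the user-visible Int64 type
    return z_i32.to(torch.int64)
```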
2. The user specifies a TensorRT-supported data type at compilation time, but provides another at runtime.
This case is more challenging, as it is difficult for Torch-TRT to mimic Torch behavior in such a scenario. The reason is that during partitioning, we cannot reasonably determine what the output dtype of an engine should be for an arbitrary input dtype; we can only determine what the output dtype should be for the specified (or inferred) dtype.
If the user specifies Int32 at compilation time, for example, but provides an Int64 or Float input at runtime, the autocast feature could do one of two things:
If a user specifies an Int64 input with autocast enabled, inputs will only be cast for TensorRT engines and not for Torch engines.
Edge Case: A user specifies an Int64 input at compile time but also require_full_compilation=True, and provides an Int64 input at runtime.
Potentially Problematic Edge Case: A user specifies an Int64 input at compile time, provides an Int64 input at runtime, but a TensorRT engine uses in-place operations to modify inputs which it does not later return (see the sketch below).
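A minimal sketch of that problematic pattern, using a hypothetical module (the names and the specific op are assumptions):

```python
import torch

class InPlaceUpdate(torch.nn.Module):
    def forward(self, counts, x):
        # counts is an Int64 tensor that is updated in place but never returned
        counts.add_(1)
        return x * 2

# Under autocast, counts would be cast (copied) to Int32 before the TensorRT engine
# runs, so the in-place add_ would modify the copy rather than the user's original
# Int64 tensor, diverging from eager Torch behavior.
```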
Proposed APIs / UX
This feature would be enabled via a flag in the compilation arguments, autocast=True, and would allow users to provide a larger set of data types as inputs, with type-casting ops added automatically.
Example Workflow
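A minimal sketch of the intended usage, assuming the proposed autocast flag is accepted by torch_tensorrt.compile; the module, shapes, and input declarations are illustrative assumptions:

```python
import torch
import torch_tensorrt

class ScatterModule(torch.nn.Module):
    def forward(self, src, index):
        return torch.zeros_like(src).scatter(0, index, src)

model = torch.jit.script(ScatterModule().eval().cuda())

trt_model = torch_tensorrt.compile(
    model,
    inputs=[
        torch_tensorrt.Input(shape=[8], dtype=torch.float32),
        torch_tensorrt.Input(shape=[8], dtype=torch.long),  # Int64 input declared directly
    ],
    autocast=True,  # proposed flag: casting ops are inserted automatically
)

src = torch.randn(8).cuda()
index = torch.randperm(8).cuda()  # torch.long by default
out = trt_model(src, index)       # Int64 casts happen only around TensorRT engines
```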
Limitations
This feature does not make Torch-TRT perfectly mimic Torch's handling of data types, and it cannot reasonably do so, as mentioned above. It also does not make TensorRT "compatible" with Int64, Double, or other currently-unsupported data types. It simply abstracts away the casting necessary to switch between data types for compatibility with TensorRT.
Note: Additionally, there are some challenges arising from the use of in-place operations. If, for example, an input to an engine is modified within said engine, but not returned as an output, autocast will not be able to detect this change and correctly cast the input back to its original type.
Internal Implementation
Design
The structure is as follows. Assume %x and %y are inputs, and %z is the output of a segmented block determined to run in TensorRT by the partitioning module. We currently record only the shapes of the inputs and outputs, but we are also interested in the types of the inputs and outputs after completion of the computation. See the diagram below for an example tensor-trace diagram.
Extensions Required to Core API implementations
Key input-checking required for the combination of autocast=True and require_full_compilation=True
Updates to Partitioning needed to insert aten::to operations to properly cast inputs to all engines
Data Structures
The SegmentedBlock data structure is already sufficient for determining the quantity of inputs and outputs across segmented blocks, but an additional data structure would be helpful in determining type constraints for those inputs/outputs. Specifically, during the course of the forward-pass dry run in partitioning, it would be helpful to store the output types of Torch blocks and ensure these are properly cast as inputs to subsequent TensorRT blocks.
Each TensorRT block will need 2 auxiliary Torch blocks surrounding it. The first Torch block casts the tensors to valid types for the TensorRT block, potentially making a copy so as not to modify user-provided inputs. The second Torch block takes the output of the TensorRT block and casts the outputs to the necessary type for the next block.
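A Python-flavored sketch of that extra bookkeeping, purely illustrative of the idea (the actual structure would live alongside SegmentedBlock in the C++ partitioning code; the names and cast policy here are assumptions):

```python
from dataclasses import dataclass, field
from typing import List
import torch

@dataclass
class BlockTypeInfo:
    """Per-SegmentedBlock dtype record collected during the partitioning dry run."""
    target: str                                     # "torch" or "tensorrt"
    input_dtypes: List[torch.dtype] = field(default_factory=list)
    output_dtypes: List[torch.dtype] = field(default_factory=list)

def required_input_casts(incoming_dtypes: List[torch.dtype], block: BlockTypeInfo):
    """Decide what each incoming tensor should be cast to before entering `block`."""
    trt_safe = {torch.int64: torch.int32, torch.float64: torch.float32}
    casts = []
    for dtype in incoming_dtypes:
        if block.target == "tensorrt" and dtype in trt_safe:
            casts.append(trt_safe[dtype])   # downcast for the TensorRT engine
        else:
            casts.append(dtype)             # Torch blocks receive the recorded dtype
    return casts
```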
A straightforward way to complete all of the above is simply to track the data types into, and out of, each block when run in Torch, then perform casting for all blocks to ensure inputs have the correct type. For TensorRT-executed blocks, we can augment the cast to ensure the data types fed in are valid TRT types. Each Torch block will then begin with aten::to casts of all of its inputs, and each TensorRT block will be prepended with a Torch block (or a section of a previous Torch block) casting its inputs to a compatible type.
We may need to add up to 2 auxiliary Torch blocks in the code (one at the beginning of the graph, one at the end), to ensure casting is performed before the first TensorRT block and after the last.
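Schematically, for a Torch → TensorRT → Torch partition, the inserted casts order execution roughly as in the following sketch (the block callables are assumed placeholders, not actual APIs):

```python
import torch

def run_partitioned(x, torch_block_1, trt_block, torch_block_2):
    h = torch_block_1(x)                                        # Torch block: begins with its own aten::to casts
    h_in = h.to(torch.int32) if h.dtype == torch.int64 else h   # prepended cast for the TensorRT block
    h_out = trt_block(h_in)                                     # TensorRT engine
    return torch_block_2(h_out.to(h.dtype))                     # restore the recorded dtype for the next block
```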
Details specific for TorchScript Support
See above for the TorchScript details
Details specific for FX support
Since fx2trt does not employ partitioning in the same way TorchScript does, this feature could potentially be extended to perform Python-level casting of inputs to TRT operators and fusions automatically.
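As an illustration of that Python-level approach, a hypothetical wrapper around an fx2trt-lowered submodule could cast unsupported input dtypes before the call; the class name and cast policy are assumptions, not an existing fx2trt API:

```python
import torch

class AutocastTRTWrapper(torch.nn.Module):
    """Casts TensorRT-unsupported input dtypes before calling a lowered submodule."""

    CAST_MAP = {torch.int64: torch.int32, torch.float64: torch.float32}

    def __init__(self, trt_submodule):
        super().__init__()
        self.trt_submodule = trt_submodule

    def forward(self, *args):
        cast_args = [
            a.to(self.CAST_MAP[a.dtype])
            if torch.is_tensor(a) and a.dtype in self.CAST_MAP
            else a
            for a in args
        ]
        # Restoring output dtypes would rely on the recorded per-block type
        # information described in the Data Structures section above.
        return self.trt_submodule(*cast_args)
```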
Implementation Phases
Prototype - L (complete)
MVP (1.4.0/1.5.0) - M (complete)
Support dtype=torch.long for Python TorchScript and C++ APIs.
Extension Phase 1 - S
Extension Phase 2 - M
Add autocast optional argument to compile in Python TorchScript and C++ APIs.