Replies: 2 comments
- Regarding https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#migrating-plugins: I think you can take the necessary changes from https://github.com/NVIDIA/TensorRT/blob/main/samples/python/python_plugin/circ_pad_plugin_multi_tactic.py.
TL;DR
We want to automatically generate the plugin and converter given the kernel and host code, so that users can include a custom kernel in a TensorRT engine through Torch-TensorRT without writing the plugin themselves. Torch-TensorRT does the rest.
Goal(s)
Allow users to use custom kernels in Torch-TensorRT engines without the effort of writing a TensorRT plugin. Improve the performance of models that currently suffer graph breaks by keeping those operators inside the engine.
Use Cases
Proposed APIs/UX
As demonstrated in the TensorRT plugin tutorial, the usual workflow is for the user to provide the kernel code and then write a plugin around it by hand. What we would do here is:
- Introduce code generation utilities in Torch-TensorRT. The plugin example from the tutorial above, also demonstrated in the TensorRT repo, can serve as a template: once the kernel code is provided, Torch-TensorRT analyzes the input tensor shapes, output tensor shapes, data types, etc., and then instantiates the template to generate a plugin for that kernel. Techniques such as PyTorch fake tensors and running inference in PyTorch can be applied in this process (see the sketch after this list).
- Introduce a tensor shape parsing system in Torch-TensorRT that extracts the information required to generate the plugin.
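As a minimal sketch of the shape-analysis step mentioned above (assuming the custom op already has a fake/meta implementation registered; the helper name and `my_lib::my_op` are placeholders), the generator could run the op under PyTorch's `FakeTensorMode` to recover output shapes and dtypes without launching the real kernel:

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

def infer_output_metadata(op, example_inputs):
    """Run `op` on fake tensors to recover the output shape and dtype
    without executing the underlying kernel."""
    with FakeTensorMode() as mode:
        fake_inputs = [mode.from_tensor(t) for t in example_inputs]
        fake_out = op(*fake_inputs)
    return fake_out.shape, fake_out.dtype

# e.g. infer_output_metadata(torch.ops.my_lib.my_op.default,
#                            [torch.randn(1, 3, 32, 32)])
```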
Example Workflow
We start by assuming that the user has some custom PyTorch operator that calls a custom kernel. This operator has been registered with PyTorch and has a fake tensor implementation. To generate the plugin and the converter, the user would add an additional decorator, and this decorator would construct the plugin class, the plugin constructor class, and the converter.
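A minimal sketch of this workflow, assuming PyTorch's `torch.library.custom_op` API (PyTorch >= 2.4); the op, its kernel stub, and the decorator name `auto_generate_plugin` are illustrative placeholders, not existing Torch-TensorRT APIs:

```python
import torch
import torch.nn.functional as F

# Custom operator backed by a user kernel. The body here stands in for
# a real custom CUDA kernel launch.
@torch.library.custom_op("my_lib::circular_pad", mutates_args=())
def circular_pad(x: torch.Tensor, pad: int) -> torch.Tensor:
    return F.pad(x, (pad, pad, pad, pad), mode="circular")

# Fake-tensor implementation: tells PyTorch (and the plugin generator)
# the output shape/dtype without running the kernel.
@circular_pad.register_fake
def _(x: torch.Tensor, pad: int) -> torch.Tensor:
    n, c, h, w = x.shape
    return x.new_empty(n, c, h + 2 * pad, w + 2 * pad)

# The additional decorator proposed by this RFC (name hypothetical)
# would be stacked on the op definition and would construct the plugin
# class, the plugin constructor class, and the converter:
#
#   @torch_tensorrt.dynamo.auto_generate_plugin
#   @torch.library.custom_op("my_lib::circular_pad", mutates_args=())
#   def circular_pad(x: torch.Tensor, pad: int) -> torch.Tensor: ...
```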
Limitations
Internal Implementation
Design
We need to generate 3 objects:
- Plugin Class
- Plugin Constructor Class: mostly generic code
- Converter Generator: we know the schema of the custom op since it is a PyTorch custom op. This lets us generate the code that takes node inputs and packs them for the plugin, and that formats the node outputs. A sketch follows this list.
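A sketch of what a generated converter could look like for the placeholder op above. `trt.PluginField`, `trt.PluginFieldCollection`, and `add_plugin_v2` are TensorRT Python APIs, and `dynamo_tensorrt_converter` is Torch-TensorRT's converter-registry decorator; `get_autogenerated_creator` is a hypothetical helper standing in for the generated plugin-creator lookup:

```python
import numpy as np
import tensorrt as trt
import torch
from torch_tensorrt.dynamo.conversion import dynamo_tensorrt_converter

@dynamo_tensorrt_converter(torch.ops.my_lib.circular_pad.default)
def convert_circular_pad(ctx, target, args, kwargs, name):
    # The op schema fixes the order and types of the arguments:
    # args[0] is already a TensorRT ITensor, args[1] a Python int.
    x, pad = args

    # Pack the non-tensor argument into plugin fields.
    fields = trt.PluginFieldCollection([
        trt.PluginField("pad", np.array([pad], dtype=np.int32),
                        trt.PluginFieldType.INT32),
    ])

    # Hypothetical: look up the auto-generated plugin creator for this op.
    creator = get_autogenerated_creator("circular_pad")
    plugin = creator.create_plugin("circular_pad", fields)

    layer = ctx.net.add_plugin_v2([x], plugin)
    layer.name = name
    return layer.get_output(0)
```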
Extensions Required to Core API implementations
Data Structures
Details specific for TorchScript Support
Details specific for FX support
Implementation Phases
Prototype -
MVP
(<TARGET RELEASE VERSION>)
Extension Phase 1
(<TARGET RELEASE VERSION>)
Use auto-generated plugins to pull in single-op graph breaks like aten::embedding_bag
Extension Phase 2
(<TARGET RELEASE VERSION>)