Follow the NVIDIA GA100 example below, which describes a node of four GA100 devices connected with NVLink. Most of the attributes are self-explanatory:
{
    "name": "NVIDIA A100(80GB)x4",
    "device_count": 4,  # how many devices in a node
    "interconnect": {
        "link": {
            "name": "NVLink3",
            "bandwidth_per_direction_byte": 25e9,
            "bandwidth_both_directions_byte": 50e9,
            "latency_second": 8.92e-6,
            "flit_size_byte": 16,
            "header_size_byte": 16,
            "max_payload_size_byte": 256
        },
        "link_count_per_device": 12,
        "topology": "FC"  # currently support FC (fully-connected) and RING
    },
    "device": {
        "frequency_Hz": 1410e6,
        "compute_chiplet_count": 1,
        "compute_chiplet": {
            "physical_core_count": 128,  # used for area model
            "core_count": 128,  # used for performance model
            "process_node": "7nm",  # currently support 7nm, 6nm, 5nm
            "core": {
                "sublane_count": 4,
                "systolic_array": {
                    "array_width": 16,
                    "array_height": 16,
                    "data_type": "fp16",
                    "mac_per_cycle": 1
                },
                "vector_unit": {
                    "vector_width": 32,
                    "flop_per_cycle": 4,  # 32*4=128 flops per cycle per vector unit
                    "data_type": "fp16",
                    "int32_count": 16,  # the number of int32 ALUs, used for area model
                    "fp16_count": 0,
                    "fp32_count": 16,
                    "fp64_count": 8
                },
                "register_file": {
                    "num_reg_files": 1,
                    "num_registers": 16384,
                    "register_bitwidth": 32,
                    "num_rdwr_ports": 4
                },
                "SRAM_KB": 192
            }
        },
        "memory_protocol": "HBM2e",
        "_memory_protocol_list": [
            "HBM2e",
            "DDR4",
            "DDR5",
            "PCIe4",
            "PCIe5"
        ],
        "io": {
            "process_node": "7nm",
            "global_buffer_MB": 48,
            "physical_global_buffer_MB": 48,
            "global_buffer_bandwidth_per_cycle_byte": 5120,
            "memory_channel_physical_count": 6,  # used for area model
            "memory_channel_active_count": 5,  # used for performance model
            "pin_count_per_channel": 1024,
            "bandwidth_per_pin_bit": 3.2e9
        },
        "memory": {
            "total_capacity_GB": 80
        }
    }
}
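As a sanity check on these numbers, the systolic-array fields can be combined into a rough per-device peak-throughput figure. The snippet below is only a back-of-envelope sketch based on the template fields above; it is not the LLMCompass performance model, and it ignores the vector units, memory bandwidth, and utilization.

# Back-of-envelope peak fp16 matmul throughput per device, using only the
# template fields above (NOT the LLMCompass performance model).
frequency_hz  = 1410e6  # "frequency_Hz"
core_count    = 128     # "core_count" (performance model)
sublane_count = 4       # "sublane_count"
array_width   = 16      # "systolic_array.array_width"
array_height  = 16      # "systolic_array.array_height"
mac_per_cycle = 1       # "systolic_array.mac_per_cycle"

macs_per_core_per_cycle = sublane_count * array_width * array_height * mac_per_cycle  # 1024
flops_per_core_per_cycle = 2 * macs_per_core_per_cycle                                # 1 MAC = 2 FLOPs
peak_tflops = frequency_hz * core_count * flops_per_core_per_cycle / 1e12
print(f"peak fp16 systolic-array throughput ~= {peak_tflops:.0f} TFLOPS")             # ~= 370

Note that the template models all 128 physical cores as active; a real A100 ships with 108 of its 128 SMs enabled, which is why its datasheet peak (312 fp16 Tensor Core TFLOPS) is lower.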
Transformer blocks are provided in transformer.py, covering both Initial Computation (also called the Prefill or Context stage) and Auto Regression (also called the Decoding or Generation stage), with Tensor Parallelism support (automatically turned off if the system has only one device).
The user needs to provide these parameters:
- d_model: the hidden dimension, 12288 for GPT-3
- n_heads: the number of heads, 96 for GPT-3
- device_count: the tensor parallelism degree
- data_type: int8, fp16, or fp32
The user can also build their own computational graph following the transformer.py example, using the provided operators: matmul, softmax, layernorm, gelu, and allreduce.
The user needs to define a new class by inheriting the Operator class and configuring these fields (a sketch follows the list):
- __init__: define the needed operators in the constructor
- __call__: build the computational graph. The shapes of the Tensors will be automatically calculated and used for simulation.
- compile_and_simulate: simulate all the operators and get the total latency as well as other runtime statistics.
- roofline_model (optional): a roofline model analysis.
- run_on_gpu (optional): run the computational graph on real-world GPUs with PyTorch.
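Below is a hypothetical sketch of such a class, a simple two-layer MLP. The import paths, constructor arguments, and the way the target hardware is obtained from the system object are assumptions patterned after transformer.py; check that file for the exact APIs before reusing this structure.

# Hypothetical user-defined graph, patterned after transformer.py.
# All import paths, constructor arguments, and operator call signatures
# below are assumptions for illustration.
from software_model.operators import Operator            # assumed import path
from software_model.matmul import Matmul                 # assumed import path
from software_model.utils import Tensor, data_type_dict  # assumed import path

class TwoLayerMLP(Operator):
    def __init__(self, d_model, d_hidden, data_type):
        super().__init__(data_type)  # base-class constructor signature assumed
        # __init__: declare the operators (and weights) the graph will use
        self.fc1 = Matmul(data_type)
        self.fc2 = Matmul(data_type)
        self.w1 = Tensor([d_model, d_hidden], data_type)
        self.w2 = Tensor([d_hidden, d_model], data_type)

    def __call__(self, x: Tensor) -> Tensor:
        # __call__: build the computational graph; tensor shapes are
        # propagated automatically through each operator
        h = self.fc1(x, self.w1)
        return self.fc2(h, self.w2)

    def compile_and_simulate(self, system, compile_mode):
        # compile_and_simulate: simulate every operator and sum the latencies.
        # Passing system.device to each operator is an assumption; see
        # transformer.py for how the hardware target is actually passed down.
        device = system.device  # assumed attribute
        self.latency = (self.fc1.compile_and_simulate(device, compile_mode)
                        + self.fc2.compile_and_simulate(device, compile_mode))
        return self.latency

mlp = TwoLayerMLP(12288, 4 * 12288, data_type_dict["fp16"])
_ = mlp(Tensor([8, 2048, 12288], data_type_dict["fp16"]))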
First, read the hardware configuration and parse it into an LLMCompass system description:
from design_space_exploration.dse import template_to_system, read_architecture_template
specs = read_architecture_template("PATH/TO/YOUR/JSON")
system = template_to_system(specs)
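Because the template is plain JSON, design points can be swept before conversion. The loop below is a hypothetical sketch; it assumes read_architecture_template returns a dict mirroring the JSON structure shown above.

# Hypothetical design-space sweep over the number of active cores.
# Assumes `specs` is a plain dict mirroring the JSON template above.
for core_count in [64, 96, 128]:
    specs = read_architecture_template("PATH/TO/YOUR/JSON")
    specs["device"]["compute_chiplet"]["core_count"] = core_count
    system = template_to_system(specs)
    # ... run the model simulation below on each candidate system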
Next, instantiate an LLM as in this example, where bs is the batch size and seq_len is the current context length:
model_auto_regression = TransformerBlockAutoRegressionTP(
d_model=12288,
n_heads=96,
device_count=1,
data_type=data_type_dict["fp16"],
)
_ = model_auto_regression(
Tensor([bs, 1, 12288], data_type_dict["fp16"]),
seq_len,
)
Finally, run the simulation:
auto_regression_latency_simulated = model_auto_regression.compile_and_simulate(
system, "heuristic-GPU"
)
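The Initial Computation (prefill) stage can be simulated in the same way. The snippet below is a sketch that assumes the prefill block mirrors the auto-regression class above (the class name and call signature follow transformer.py's naming pattern and should be verified there); system, Tensor, data_type_dict, and auto_regression_latency_simulated come from the earlier snippets.

from software_model.transformer import TransformerBlockInitComputationTP  # assumed import path

bs, seq_len = 8, 2048  # illustrative batch size and prompt length

model_prefill = TransformerBlockInitComputationTP(
    d_model=12288,
    n_heads=96,
    device_count=1,
    data_type=data_type_dict["fp16"],
)
_ = model_prefill(
    Tensor([bs, seq_len, 12288], data_type_dict["fp16"])  # whole prompt in one pass
)
prefill_latency_simulated = model_prefill.compile_and_simulate(
    system, "heuristic-GPU"
)
print("prefill latency (per block):", prefill_latency_simulated)
print("auto-regression latency (per block):", auto_regression_latency_simulated)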