Chakra Converter

Joongun Park edited this page Sep 18, 2024 · 4 revisions

Introduction

The PyTorchConverter class is responsible for converting Chakra host and device execution traces from JSON format into the Chakra protobuf format. The converted traces are suitable for simulation or further analysis. This class handles parsing the JSON traces, converting nodes, establishing dependencies, and writing the final protobuf trace to an output file.

Conversion Process

The conversion process involves several key steps:

  1. Loading and Parsing JSON Trace: The converter loads the JSON trace and parses it to create a mapping of node IDs to PyTorchNode instances. It also extracts metadata and root nodes.

  2. Establishing Relationships: Parent-child relationships are established based on control dependencies present in the trace. The converter handles specific cases for GPU operations, record_param_comms, and NCCL nodes.

  3. Converting Nodes to Protobuf Format: Each PyTorchNode is converted into a ChakraNode (protobuf format). The converter handles node types, attributes, and special cases for communication operations.

  4. Dependency Conversion: Control dependencies are converted into data dependencies, which are necessary for simulation. This step ensures that the execution order is correctly represented.

  5. Removing Dangling Nodes: Nodes with no connections at all (no parents and no children) are removed to simplify the execution graph.

  6. Cyclic Dependency Check: The converter checks for cyclic dependencies to ensure the execution graph is a Directed Acyclic Graph (DAG), which is essential for simulation.

  7. Writing Protobuf Trace: The final execution trace, including metadata and nodes, is written into a protobuf file.

  8. Simulation (Optional): If simulation is enabled through the simulate argument of convert, the converter simulates the execution of the nodes to validate the correctness of the conversion and the dependencies.
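
Steps 1 and 2 above can be pictured with a small, self-contained sketch. It is not the converter's code: the Node class is a hypothetical stand-in for PyTorchNode, and the JSON layout (a "nodes" list whose entries carry "id", "name", and a single parent ID in "ctrl_deps") is an assumption made for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for PyTorchNode; the field names are assumptions."""
    id: int
    name: str
    ctrl_deps: int  # ID of the calling (parent) node
    children: list = field(default_factory=list)


def parse_trace(text):
    """Step 1: build an ID -> Node map from a JSON trace string."""
    trace = json.loads(text)
    return {n["id"]: Node(n["id"], n["name"], n["ctrl_deps"]) for n in trace["nodes"]}


def link_children(node_map):
    """Step 2: attach each node to its caller and return the IDs of root nodes."""
    roots = []
    for node in node_map.values():
        parent = node_map.get(node.ctrl_deps)
        if parent is None:
            roots.append(node.id)  # caller is not in the trace: treat as a root
        else:
            parent.children.append(node.id)
    return roots


# A made-up three-node trace: a root, an operator, and a nested operator.
TRACE = """
{"nodes": [
  {"id": 1, "name": "thread_root", "ctrl_deps": 0},
  {"id": 2, "name": "outer_op", "ctrl_deps": 1},
  {"id": 3, "name": "inner_op", "ctrl_deps": 2}
]}
"""

node_map = parse_trace(TRACE)
root_ids = link_children(node_map)
print(root_ids, node_map[1].children, node_map[2].children)
```

The real converter additionally special-cases GPU operations, record_param_comms, and NCCL nodes when linking parents and children; the sketch leaves those cases out.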

Key Functions and Their Roles

  • convert(input_filename, output_filename, simulate): The main entry point. It runs the full conversion from JSON to protobuf format and, when simulate is set, the optional simulation step.

  • load_json_execution_traces(input_filename): Loads the JSON execution traces from a file.

  • parse_json_trace(json_trace): Parses the JSON trace and creates PyTorchNode instances.

  • establish_parent_child_relationships(json_node_map, json_node_root_nids): Establishes parent-child relationships among nodes based on control dependencies.

  • convert_json_to_protobuf_nodes(json_node_map, protobuf_node_map): Converts PyTorchNode instances into protobuf ChakraNode instances.

  • convert_ctrl_dep_to_data_dep(json_node_map, protobuf_node_map, chakra_node): Converts control dependencies into data dependencies suitable for simulation.

  • remove_dangling_nodes(protobuf_node_map): Removes nodes that are not connected to the execution graph.

  • identify_cyclic_dependencies(protobuf_node_map): Checks for cyclic dependencies in the execution graph to ensure it is a DAG.

  • write_protobuf_execution_trace(output_filename, json_metadata, protobuf_node_map): Writes the converted execution trace into a protobuf file.

  • simulate_execution(json_node_map, protobuf_node_map, parent_to_children_map): Simulates the execution of the nodes to validate the correctness of the conversion and dependencies.
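
To make the idea behind convert_ctrl_dep_to_data_dep more concrete, here is a deliberately simplified sketch. It assumes a single thread and a call tree given as a plain dictionary, and it chains leaf operators in depth-first call order so that each one waits for the previous one. The function name ctrl_to_data_deps and the data layout are hypothetical; the actual method takes different arguments and also deals with GPU and communication nodes, which are ignored here.

```python
def ctrl_to_data_deps(children, root):
    """Turn a caller-callee tree into a chain of data dependencies.

    children maps a node ID to the IDs it calls, in call order.
    Returns {leaf_id: [IDs it must wait for]} for the leaf operators.
    """
    data_deps = {}
    prev_op = None
    stack = [root]
    while stack:
        nid = stack.pop()
        callees = children.get(nid, [])
        if not callees:
            # A leaf does the actual work: it waits for the previous leaf.
            data_deps[nid] = [] if prev_op is None else [prev_op]
            prev_op = nid
        # Push callees in reverse so the first callee is visited first.
        stack.extend(reversed(callees))
    return data_deps


# Made-up call tree: node 1 calls 2 and then 5; node 2 calls 3 and then 4.
CALL_TREE = {1: [2, 5], 2: [3, 4]}
deps = ctrl_to_data_deps(CALL_TREE, root=1)
print(deps)  # {3: [], 4: [3], 5: [4]}
```

In the output, the callers (1 and 2) disappear from the dependency picture and the leaves form the chain 3, 4, 5, which is the execution order a simulator needs.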

Important Considerations

  • Control vs. Data Dependencies: In the original Chakra host execution traces, control dependencies represent the caller-callee relationships. However, for simulation purposes, these need to be converted into data dependencies to reflect the actual execution order.

  • Inter-Thread Dependencies: The converter handles inter-thread dependencies to ensure accurate simulation of operations that span multiple threads.

  • Communication Operations: Special attention is given to communication operations like NCCL calls. The converter identifies collective communication types and handles attributes like communication size and process group names.

  • Cyclic Dependencies: The converter ensures that the final execution graph is acyclic. Cyclic dependencies can cause simulations to hang or fail, so they are detected and reported.

  • Dangling Nodes: A node with no parents and no children is considered dangling. Such nodes are removed to simplify the execution graph.
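
The last two considerations can be illustrated on a toy dependency map. The sketch below is an assumption-laden illustration, not the converter's implementation: the graph is a plain dictionary of node ID to the IDs it depends on, every dependency is assumed to be a key of that dictionary, and the cycle check uses Kahn's algorithm, which may differ from what identify_cyclic_dependencies does internally. The function names are hypothetical.

```python
def remove_dangling(deps):
    """Drop nodes that depend on nothing and that nothing depends on."""
    referenced = {d for ds in deps.values() for d in ds}
    return {n: ds for n, ds in deps.items() if ds or n in referenced}


def find_cycle_nodes(deps):
    """Return the nodes that can never become ready (empty set for a DAG).

    Kahn's algorithm: repeatedly complete nodes with no pending dependencies.
    Whatever is left over is part of, or blocked behind, a cycle.
    """
    remaining = {n: set(ds) for n, ds in deps.items()}
    ready = [n for n, ds in remaining.items() if not ds]
    while ready:
        done = ready.pop()
        del remaining[done]
        for n, ds in remaining.items():
            if done in ds:
                ds.discard(done)
                if not ds:
                    ready.append(n)
    return set(remaining)


# Node 4 has no dependencies and no dependents, so it is dangling.
graph = {1: [], 2: [1], 3: [2], 4: []}
clean = remove_dangling(graph)
print(clean)                    # {1: [], 2: [1], 3: [2]}
print(find_cycle_nodes(clean))  # set()

# 1 waits for 3, 3 waits for 2, 2 waits for 1: no node can ever start.
cyclic = {1: [3], 2: [1], 3: [2]}
print(find_cycle_nodes(cyclic))  # {1, 2, 3}
```

The cyclic case shows why such graphs make a simulation hang: no node ever has all of its dependencies satisfied, so nothing can be issued.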
