All notable changes to this project will be documented in this file. The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Add trace format validator
- Added multiple trace filter classes and demos.
- Added enhanced trace call stack graph implementation.
- Added memory timeline view.
- Added support for trace parser customization.
- Added support for H100 traces.
- (Experimental) Support for reading a PyTorch Execution Trace and correlating it with the PyTorch Profiler trace.
- (Experimental) Added lightweight critical path analysis feature.
- (Experimental) Critical path analysis features: event attribution and `summary()`.
- (Experimental) Critical path analysis fixes: fixed async memcpy and added GPU-to-CPU event-based synchronization.
- (Experimental) Added save and restore feature for critical path graph.
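As a rough illustration of how these experimental critical path items fit together, here is a minimal sketch. The `TraceAnalysis` entry point is the documented one, but the exact method names and signatures used below (`critical_path_analysis`, `summary`, `overlay_critical_path_analysis`) and the trace paths are assumptions and may differ from the released API.

```python
# Minimal sketch of the experimental critical path analysis flow.
# Method names, signatures, and paths are assumptions; check the HTA docs.
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir="~/traces/my_job")  # hypothetical trace directory

# Build the critical path graph for one rank and one annotated step instance.
cp_graph, ok = analyzer.critical_path_analysis(
    rank=0, annotation="ProfilerStep", instance_id=0
)

if ok:
    # Event attribution / summary of time spent on the critical path.
    print(cp_graph.summary())

    # Overlay the critical path back onto the trace for visualization.
    analyzer.overlay_critical_path_analysis(0, cp_graph, output_dir="~/traces/overlaid")
```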
- Add nccl collective fields to parser config
- Queue length analysis: added a feature to compute the time blocked on a stream when it hits its maximum queue length.
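For context, a minimal sketch of how queue length analysis is typically invoked. The method names follow the documented queue-length APIs, but the `ranks` parameter, the return shapes, and how the new blocked-time metric surfaces in the output are assumptions.

```python
# Sketch: queue length analysis via the TraceAnalysis API.
# Parameter names, return shapes, and the blocked-time columns are assumptions.
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir="~/traces/my_job")  # hypothetical path

# Per-stream queue length summary statistics.
ql_summary = analyzer.get_queue_length_summary(ranks=[0])
print(ql_summary.head())

# Time series of outstanding GPU operations per stream, useful for spotting
# intervals where a stream sits at its maximum queue length.
ql_series = analyzer.get_queue_length_time_series(ranks=[0])
print(ql_series[0].head())
```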
- Add `kernel_backend` to the parser config for Triton / torch.compile() support.
- Change the test data path in unit tests from a relative path to a real path to support running tests within IDEs.
- Add a workaround for overlapping events when using ns resolution traces (pytorch/pytorch#122425)
- Better handling of CUDA sync events with stream = -1
- Fix ijson metadata parser for some corner cases
- Add an option for nanosecond rounding and apply it to ijson loading.
- Updated the `Trace()` API to accept a list of files and automatically infer the ranks.
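A minimal sketch of that loading flow is below. The module path, constructor parameter names (`trace_files`, `trace_dir`), the file names, and the `parse_traces()` call are assumptions; check the `Trace` docstring for the exact signature.

```python
# Sketch: constructing a Trace from an explicit list of files.
# Module path, parameter names, and methods are assumptions.
from hta.common.trace import Trace

trace_files = [
    "rank-0.json.gz",  # hypothetical per-rank trace files
    "rank-1.json.gz",
]

# With the updated API the ranks are inferred from the files themselves,
# so no explicit rank-to-file mapping is required.
t = Trace(trace_files=trace_files, trace_dir="~/traces/my_job")
t.parse_traces()
```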
- Deprecated 'call_stack'; use 'trace_call_stack' and 'trace_call_graph' instead.
- Fixed issue #65 to handle floating point counter values in cupti_counter_analysis.
- Fixed a bug in critical path analysis related to listing the edges on the critical path.
- Updated critical path analysis with edge attribution.
- (Experimental) Added CUPTI Counter analyzer feature to parse kernel and operator level counter statistics.
- Improved loading time by parallelizing reads in the `create_rank_to_trace_dict` function.
- Fix the unit of time in the `get_gpu_kernel_breakdown` API (see the example below).
- Optimized the multiprocessing strategy to handle OOMs when the process pool is too large.
- Split requirements.txt into two files: `requirements.txt` and `requirements-dev.txt`. The former does not contain `kaleido` and `coverage`, as they are required for development purposes only.
- Coverage tests for graph visualization
- Dependency on Matplotlib for Venn diagram generation
- LICENSE type in setup.py
- Queue length summary handles missing data and negative values
- Use `.get` for key lookup in dictionaries (see the sketch below).
- Typos in README.md
- Initial release
- Test release
- Test release