diff --git a/Readme.txt b/Readme.txt
index 9008165a..e83b410c 100644
--- a/Readme.txt
+++ b/Readme.txt
@@ -1,54 +1,199 @@
-ROC Profiler library.
-Profiling with metrics and traces based on perfcounters (PMC) and traces (SPM).
-Implementation is based on AqlProfile HSA extension.
-Library supports GFX8/GFX9.
+# ROC-profiler
+ROC profiler library. Profiling with perf-counters and derived metrics. Library supports GFX8/GFX9.
-The library source tree:
- - doc - Documentation
+HW specific low-level performance analysis interface for profiling of GPU compute applications. The
+profiling includes HW performance counters with complex performance metrics.
+
+To use the rocProfiler API you need the API header and to link your application with the rocprofiler .so library:
+ - the API header: /opt/rocm/rocprofiler/include/rocprofiler.h
+ - the .so library: /opt/rocm/lib/librocprofiler64.so
+
+## Documentation
+- ['rocprof' cmdline tool specification](doc/rocprof.md)
+- ['rocprofiler' profiling C API specification](doc/rocprofiler_spec.md)
+
+## Metrics
+[The link to profiler default metrics XML specification](test/tool/metrics.xml)
+
+
+## Source tree
+```
+ - bin
+   - rocprof - Profiling tool run script
+ - doc - Documentation
  - inc/rocprofiler.h - Library public API
  - src - Library sources
    - core - Library API sources
    - util - Library utils sources
    - xml - XML parser
  - test - Library test suite
+   - tool - Profiling tool
+     - tool.cpp - tool sources
+     - metrics.xml - metrics config file
    - ctrl - Test controll
    - util - Test utils
    - simple_convolution - Simple convolution test kernel
+```
+
+## Build environment:
+```
+  export CMAKE_PREFIX_PATH=:
+  export CMAKE_BUILD_TYPE= # release by default
+  export CMAKE_DEBUG_TRACE=1 # to enable debug tracing
+```
+
+## To build with the current installed ROCM:
+```
+ - ROCm is required.
+   ROCr-runtime and roctracer are needed
+
+ - Python is required.
+   The required modules: CppHeaderParser, argparse, sqlite3
+   To install:
+   sudo pip install CppHeaderParser argparse sqlite3
+
+ - To build and install to /opt/rocm/rocprofiler
+   Please use release branches/tags, or the 'amd-master' branch for the development version.
+
+   export CMAKE_PREFIX_PATH=/opt/rocm/include/hsa:/opt/rocm
+
+   cd .../rocprofiler
+   ./build.sh
+```
+
+## Internal 'simple_convolution' test run script:
+```
+  cd .../rocprofiler/build
+  make mytest
+  run.sh
+```
+
+## To enable error messages logging to '/tmp/rocprofiler_log.txt':
+```
+  export ROCPROFILER_LOG=1
+```
+
+## To enable verbose tracing:
+```
+  export ROCPROFILER_TRACE=1
+```
+
+## Profiling utility usage:
+```
+rocprof [-h] [--list-basic] [--list-derived] [-i ] [-o ]
+
+Options:
+  -h - this help
+  --verbose - verbose mode, dumping all base counters used in the input metrics
+  --list-basic - to print the list of basic HW counters
+  --list-derived - to print the list of derived metrics with formulas
+  --cmd-qts - quoting profiled cmd-line [on]
+
+  -i <.txt|.xml file> - input file
+      Input file .txt format; the application is automatically rerun for every pmc line:
+
+        # Perf counters group 1
+        pmc : Wavefronts VALUInsts SALUInsts SFetchInsts FlatVMemInsts LDSInsts FlatLDSInsts GDSInsts FetchSize
+        # Perf counters group 2
+        pmc : VALUUtilization,WriteSize L2CacheHit
+        # Filter by dispatches range, GPU index and kernel names
+        # supported range formats: "3:9", "3:", "3"
+        range: 1 : 4
+        gpu: 0 1 2 3
+        kernel: simple Pass1 simpleConvolutionPass2
-Build environment:
+      Input file .xml format, for single profiling run:
-$ export CMAKE_PREFIX_PATH=:
-$ export CMAKE_BUILD_TYPE= # release by default
-$ export CMAKE_DEBUG_TRACE=1 # 1 to enable debug tracing
+        # Metrics list definition, also the form ":" can be used
+        # All defined metrics can be found in the 'metrics.xml'
+        # There are basic metrics for raw HW counters and high-level metrics for derived counters
+
-To build with the current installed ROCM:
+        # Filter by dispatches range, GPU index and kernel names
+
-$ cd .../rocprofiler
-$ export CMAKE_PREFIX_PATH=/opt/rocm/include/hsa:/opt/rocm
-$ mkdir build
-$ cd build
-$ cmake ..
-$ make
+  -o - output CSV file [.csv]
+      Meaning of the output CSV file columns, in column order:
+        Index - kernels dispatch order index
+        KernelName - the dispatched kernel name
+        gpu-id - GPU id the kernel was submitted to
+        queue-id - the ROCm queue unique id the kernel was submitted to
+        queue-index - the ROCm queue write index for the submitted AQL packet
+        tid - system application thread id which submitted the kernel
+        grd - the kernel's grid size
+        wgr - the kernel's work group size
+        lds - the kernel's LDS memory size
+        scr - the kernel's scratch memory size
+        vgpr - the kernel's VGPR size
+        sgpr - the kernel's SGPR size
+        fbar - the kernel's barriers limitation
+        sig - the kernel's completion signal
+        ... - the columns with the counter values per kernel dispatch
+        DispatchNs/BeginNs/EndNs/CompleteNs - timestamp columns if time-stamping was enabled
+
+  -d - directory where the profiler stores profiling data, including thread traces [/tmp]
+      The data directory is removed automatically if it matches the temporary directory, which is the default.
+  -t - to change the temporary directory [/tmp]
+      By changing the temporary directory you can prevent removal of the profiling data from /tmp, or enable removal from a directory other than '/tmp'.
-To run the test:
+  --basenames - to turn on/off truncating of the kernel full function names till the base ones [off]
+  --timestamp - to turn on/off the kernel dispatches timestamps, dispatch/begin/end/complete [off]
+      Four kernel timestamps in nanoseconds are reported:
+        DispatchNs - the time when the kernel AQL dispatch packet was written to the queue
+        BeginNs - the kernel execution begin time
+        EndNs - the kernel execution end time
+        CompleteNs - the time when the completion signal of the AQL dispatch packet was received
-$ cd .../rocprofiler/build
-$ export LD_LIBRARY_PATH=.: # paths to ROC profiler and oher libraries
-$ export HSA_TOOLS_LIB=librocprofiler64.so # ROC profiler library loaded by HSA runtime
-$ export ROCP_TOOL_LIB=test/libtool.so # tool library loaded by ROC profiler
-$ export ROCP_METRICS=metrics.xml # ROC profiler metrics config file
-$ export ROCP_INPUT=input.xml # input file for the tool library
-$ export ROCP_OUTPUT_DIR=./ # output directory for the tool library, for metrics results file 'results.txt' and trace files
-$
+  --ctx-limit - maximum number of outstanding contexts [0 - unlimited]
+  --heartbeat - to print progress heartbeats [0 - disabled]
+  --obj-tracking - to turn on/off kernels code objects tracking [on]
+      To support V3 code-object.
-Internal 'simple_convolution' test run script:
-$ cd .../rocprofiler/build
-$ run.sh
+  --stats - generating kernel execution stats, file .stats.csv
+
+  --roctx-trace - to enable rocTX application code annotation trace, "Markers and Ranges" JSON trace section.
+  --sys-trace - to trace HIP/HSA APIs and GPU activity, generates stats and JSON trace chrome-tracing compatible
+  --hip-trace - to trace HIP, generates API execution stats and JSON file chrome-tracing compatible
+  --hsa-trace - to trace HSA, generates API execution stats and JSON file chrome-tracing compatible
+  --kfd-trace - to trace KFD, generates API execution stats and JSON file chrome-tracing compatible
+      Generated files: ._stats.txt .json
+      Traced API list can be set by input .txt or .xml files.
+      Input .txt:
+        hsa: hsa_queue_create hsa_amd_memory_pool_allocate
+      Input .xml:
+
+
+
+
-To enabled error messages logging to '/tmp/rocprofiler_log.txt':
+  --trace-start - to enable tracing on start [on]
+  --trace-period - to enable trace with initial delay, with periodic sample length and rate
+      Supported time formats:
-$ export ROCPROFILER_LOG=1
+Configuration file:
+  You can set your default parameter preferences in the configuration file 'rpl_rc.xml'. The search path sequence: .:$HOME:
+  The configuration file is looked up first in the current directory, then in your home directory, and then in the package directory.
+  Configurable options: 'basenames', 'timestamp', 'ctx-limit', 'heartbeat', 'obj-tracking'.
+  An example of 'rpl_rc.xml':
+
+```
-To enable verbose tracing:
-$ export ROCPROFILER_TRACE=1
+## Known Issues:
+- For workloads where the HIP application might make more than 10 million HIP API calls, the application might crash with the error "Profiling data corrupted".
+  - Suggested workaround: instead of profiling the complete run, profile in parts using the --trace-period option.
+- When the same kernel is launched back to back multiple times on a GPU, the cache hit rate from rocprofiler is reported as 0% or very low. This also makes FETCH_SIZE unusable for repeated kernels.
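
As a rough sketch of the input .txt format described in the usage text above, the shell snippet below writes a small counter file with two `pmc` groups and shows the `rocprof` command that would consume it. The file names `input.txt` and `results.csv` and the application `./my_app` are placeholders chosen for illustration, not names from the repository; the counter names, the `range:`/`gpu:` filters, and the `-i`/`-o` options are taken from the usage text.

```shell
#!/bin/sh
# Sketch only: input.txt, results.csv and ./my_app are hypothetical placeholders.
set -eu

input_file="input.txt"      # hypothetical input file name
output_file="results.csv"   # hypothetical output CSV name

# Two pmc lines; per the usage text, the application is rerun for every pmc line.
cat > "$input_file" <<'EOF'
# Perf counters group 1
pmc : Wavefronts VALUInsts SALUInsts
# Perf counters group 2
pmc : L2CacheHit
# Filter by dispatches range and GPU index
range: 1 : 4
gpu: 0
EOF

# Count the pmc groups, i.e. how many application reruns to expect.
groups=$(grep -c '^pmc' "$input_file")
echo "counter groups: $groups"

# Run the profiler only if both rocprof and the placeholder app are present;
# otherwise just print the command that would be run.
cmd="rocprof -i $input_file -o $output_file ./my_app"
if command -v rocprof >/dev/null 2>&1 && [ -x ./my_app ]; then
  $cmd
else
  echo "would run: $cmd"
fi
```

Keeping each `pmc` line short limits the number of counters collected per run, at the cost of one extra application rerun per additional line.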