Releases: HPCToolkit/hpctoolkit
Release-2022.05.15
Bugfixes since 2022.04.15
- hide the symbols from XED; this avoids interference with
symbols from Intel GTPin
- fix a typo in the renamestruct.sh script
- fix a problem with fork() on Cray
Release-2022.04.15
HPCToolkit Release Notes
Enhancements:
- Hpcrun includes initial support for using the OpenMP OMPT interface for
profiling and tracing of OpenMP TARGET operations on AMD GPUs in code
generated by ROCm 5.1's clang-based AOMP compiler.
- Hpcrun supports profiling of kernels on AMD GPUs using publicly available
hardware counters with AMD's rocprofiler API.
- Hpcrun obtains binaries for code that executes on AMD GPUs using AMD's
Roctracer API instead of ROCm Debug API.
- Hpcrun emits a better error message when an application unexpectedly closes
hpcrun's log file.
- Hpcrun now uses an embedded implementation of an MD5 hash function for
naming CPU and GPU binaries revealed in memory.
- Hpcstruct now supports caching of structure files from binaries it analyzes.
A cache greatly reduces the time to analyze binaries for executions as the
cache will almost always contain up-to-date analysis results for commonly
used shared platform libraries, e.g. libc, libm, as well as libraries for MPI.
When a binary changes, results in the cache are updated as needed.
- Hpcstruct no longer pretty-prints its output by default. Omitting leading
blanks due to pretty printing reduced the output size by over 15%, which was
quite significant when analyzing multi-gigabyte binaries.
- When applied to a measurements directory, hpcstruct uses a mix of parallelism
and concurrency to analyze only the CPU and GPU binaries that were measured
in the execution. Binaries that did not get any profile hits are not analyzed.
- Hpcstruct's parallel efficiency has been improved. Changes that contributed
to that improvement include enhancements to parallelism in Dyninst’s
finalization of binary analysis and parallel assembly of hpcstruct's output
file.
- Hpcstruct supports analysis of CUDA binaries from CUDA 11.5+, accommodating
a change to NVIDIA's nvdisasm output format.
- When measuring hardware counter metrics for kernels on AMD GPUs, kernel
measurement with Roctracer is disabled because it gives an incorrect timestamp
for the first kernel. The timestamp is wrong by a mile and destroys the
accuracy of kernel profiles and traces.
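As an illustration of the structure-file caching and hash-based naming enhancements listed above, here is a minimal Python sketch of a content-keyed cache. It is not hpcstruct's implementation or cache layout: the function names, the `.struct` suffix, and the `analyze` callback are all hypothetical. The only idea taken from the notes is that a binary is named by a hash of its contents, so an unchanged binary hits the cache and a changed one triggers a fresh analysis.

```python
import hashlib
import tempfile
from pathlib import Path


def content_key(binary: Path) -> str:
    """Name a binary by the MD5 digest of its bytes (naming only, not security)."""
    return hashlib.md5(binary.read_bytes()).hexdigest()


def cached_structure(binary: Path, cache_dir: Path, analyze):
    """Return (structure_text, cache_hit). `analyze` is a hypothetical callback."""
    entry = cache_dir / (content_key(binary) + ".struct")
    if entry.exists():
        return entry.read_text(), True
    text = analyze(binary)  # cache miss: run the (expensive) analysis
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(text)
    return text, False


def demo():
    """Analyze one file twice, change it, and analyze it again."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        binary = root / "libdemo.so"
        binary.write_bytes(b"\x7fELF original contents")

        def analyze(path):
            # Stand-in for real binary analysis.
            return "structure for %d bytes" % len(path.read_bytes())

        hits = [cached_structure(binary, root / "cache", analyze)[1] for _ in range(2)]
        binary.write_bytes(b"\x7fELF rebuilt with different contents")
        hits.append(cached_structure(binary, root / "cache", analyze)[1])
        return hits


if __name__ == "__main__":
    print(demo())
```

In this sketch the second lookup is a hit and the lookup after the file changes is a miss, because the new contents produce a different key.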
Bug fixes:
- Adjust tracing for ROCm GPU activities to correct alignment between CPU and
GPU timelines.
- Fix use of Dyninst by hpcstruct so that it sees inlining info in Intel GPU
binaries.
Infrastructure improvements:
- Code for hpcrun's use of LD_AUDIT has been streamlined.
- Fixed recording of program path names as part of metadata in hpcrun's
output files.
Dependency changes:
- Deletions
  - Mbedtls - superseded by internal MD5 hash implementation
  - ROCm Debug API - obtain GPU binaries using Roctracer API instead
  - Gotcha - unused and removed
- Additions
  - Rocprofiler API - included for a spack '+rocm' install to provide access
    to hardware counters on AMD GPUs
  - HSA - included for a spack '+rocm' install to support rocprofiler
Known Issues:
- Profile measurements and traces for AMD GPUs, which are new for ROCm 5.1,
should be viewed with some skepticism.
Also, the elapsed time for copies seems too large for executions that we've
measured. For a 96-thread run of miniqmc, the aggregate time for copies
reported by AMD's OMPT implementation for its GPUs was almost 100x longer
than the real time of the execution. If timestamps are incorrect for
OpenMP events on AMD GPUs, this will affect the accuracy of both profile
and trace views.
Furthermore, trace items for OpenMP events on AMD GPUs are known to
overlap. For that reason, having hpcviewer render them on a single
trace line, which it does, is problematic. As a result, overlapping
trace items will cause incorrect statistics in trace view. In such
cases, the profile view will accurately represent the aggregate values
reported by OMPT for AMD GPUs.
- In some cases, attribution of exclusive metrics for BLOCKTIME and
CTXT SWTCH to call paths within the Linux kernel may be missing
even though inclusive costs for these metrics are attributed properly.
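To see why overlapping trace items distort statistics when they are drawn on a single trace line, consider the following small Python sketch. It does not reflect how hpcviewer computes anything; the function names are hypothetical and the (start, end) pairs are made-up sample data. It only contrasts the sum of item durations with the time actually covered on one line.

```python
def summed_time(items):
    """Total of individual durations; counts overlapped time more than once."""
    return sum(end - start for start, end in items)


def covered_time(items):
    """Length of the union of the intervals, i.e. time covered on one trace line."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(items):
        if cur_end is None or start > cur_end:
            # Gap before this item: close out the previous merged interval.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps (or touches) the current merged interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


if __name__ == "__main__":
    sample = [(0, 10), (5, 15), (20, 25)]  # the first two items overlap
    print(summed_time(sample), covered_time(sample))
```

Any statistic that assumes one item at a time per line will disagree with the summed durations whenever the two numbers differ.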
HPCViewer Release Notes
Enhancements:
- Improved call site icons
- Double buffering x and y axis in the trace view
- Simplify metric number in derived metrics
- Set maximum database history to 20
- Set the default GPU trace exposure to true
Release-2021.05.15
Primarily a bug-fix release on top of 2021.03.01. For a full
description, see the README.ReleaseNotes file.
Improvements for hpcviewer
- Improve the performance of hot-path operation by not re-revealing the tree path.
- Default window size is 1400x1000 or the screen size
- Trace view: Move depth field into a separate pane so users can change the depth easily
even when call stack view is not visible.
- Reduce memory consumption.
- Use Java XML parser to slightly improve XML parsing performance and avoid using
the old Apache xerces.
- Code clean-up, remove dead code and remove unused variables
- Issue 77: Add support for different color mapping policy in the trace view.
Default: procedure-name color instead of random color.
- Warn users when filtering is enabled
- Default is to build with Eclipse 4.19 (2021.03) except for Linux
ppc64le (built with Eclipse 4.16). Some fixes include improved dark color theme.
Bug fixes
hpcrun
CPU issues
- avoid deadlock by not sampling an OpenMP thread before it finishes
setting up TLS
- avoid having the UCX communication library used by MPI terminate
a program when an unwind fails rather than just dropping a
sample
- fix initialization of control knobs when a process forks but
does not exec
- add a timeout to interrupt a hung cuptiActivityFlushAll so that a program
can terminate and write out all performance data already collected.
Intel GPUs
- always dump Intel GPU binaries so we can extract kernel names
even if not using GTPin binary instrumentation
NVIDIA GPUs
- avoid introducing kernel serialization while using coarse-grain
measurement by monitoring CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL rather
than CUPTI_ACTIVITY_KIND_KERNEL
hpcstruct
- correct reconstruction of loop nests for Intel GPU binaries
hpcviewer
- Fix issues #80 and #81 (null pointer exception for empty databases)
- Fix issue #79 (CCT filter on the trace view, preserve tree expansion)
- Fix issue #73 (sort direction is not shown on Linux for the first appearance)
- Fix issue #75 (closing only a window in multiple windows mode)
- Fix issue #74 (no sort direction on Linux/GTK)
- Fix issue #85 (keyboard shortcut to minimize the window)
- Fix filtering CCT nodes for thread views
- Fix hot path to select the child node instead of the parent
- Fix merging GPU databases which contain aggregate and derived metrics
by deep copying the metric descriptors.
- Fix build script to include notarization for mac
- Fix storing recent open database: store the absolute path, not the relative one.
- Fix SWT resource leaks
- Fix flickering issue on Windows when splitting the hpcviewer window.
- Fix trace view’s color map changes to also refresh other panes and windows
- Fix Find dialog layout on Linux/GTK
- Fix merging GPU databases
- Fix a procedure-color mapping bug in the trace view
- Partial fix for issue #42: Fix a performance bug when sorting a table
Release-2020.08
Support for NVIDIA GPUs and CUDA, including PC sampling on GPUs.
New fnbounds server that is faster and has a much smaller memory
footprint.
New format for hpcviewer databases. Note: the new viewer supports
reading old databases, but the current hpcprof requires the latest
hpcviewer.
Hpcstruct supports thread-level parallelism; use '-j ' to
run with multiple threads.
Important bug fixes to improve powerpc unwinding.
Bug fixes to better handle DOE applications: adagio, kull, pytorch
Updates to the manual and man pages.
Release 2018.09
Visible Improvements
hpcrun
Significantly enhanced robustness of HPCToolkit's measurement infrastructure to better support profiling of highly multithreaded applications.
- Overhauled initialization to support profiling of applications that create threads in constructors that must synchronize before main is entered.
- Improved call stack unwinding on all platforms.
- Improved support for collecting call path samples that include frames in the Linux kernel.
- Refined handling for precise hardware events in the Linux perf sample source.
- Refined scripts to avoid interactions with Darshan that can cause deadlock.
- Note jobid in jsrun jobs.
hpcstruct
Improved program structure recovery of loops, inlined code, and outlined code using binary analysis of highly-optimized code.
- Improved attribution of loops to source lines.
- Made hpcstruct's output deterministic for irreducible loops.
- Improved attribution for PLT stubs.
- Improved name demangling.
hpcprof/hpcprof-mpi
Significantly enhanced robustness of hpcprof-mpi.
- Emit a warning and proceed with analysis if measurement data is salvageable rather than aborting with a fatal error.
- Tolerate missing load modules. Generate a placeholder if necessary and emit a warning rather than triggering a fatal error.
- Tolerate cases where some ranks in hpcprof-mpi are not assigned any profiles to analyze.
- Avoid unnecessary per-rank duplication of informational messages.
hpcviewer
- Calculate costs for inlined functions in bottom-up view and flat view as one would if they were actual functions.
Documentation
- Updated man pages and manuals.
Streamline user view
Migrated developer-centric functionality out of HPCToolkit's bin directory.
- Migrated hpcsummary to libexec/hpctoolkit.
- Removed support for creating DOT files from hpcstruct. Created a separate executable for developer use in libexec/hpctoolkit.
- Updated hpcproftt to remove stale command-line options. Migrated hpcproftt to libexec/hpctoolkit.
Bug Fixes
- Updated build system to automake 1.15.1 to handle newer Linux software stacks.
- Fixed latex2man script for perl 5.26.1.
- Fixed configuration to skip kernel sampling and disable support for BLOCKTIME for older Linux kernels.
- Fixed bugs related to handling Linux perf_events at runtime.
- Fixed race conditions that arise where samples arrive after shutting down a sample source or when monitoring ends while processing a sample.
- Corrected handling in HPCToolkit's measurement infrastructure for dlclose, which is frequently used by OpenMPI.
- Corrected support for libunwind to properly terminate unwinds on ARM when compilers put DWARF FDEs in .debug_frame rather than .eh_frame segments.
- Adjusted unwinder support for Power architectures to avoid libunwind.
- Adjusted support for integrating libunwind and binary analysis on x86_64 architectures.
- When measuring an execution, if hpcfnbounds quits, wait for it to finish to avoid zombie processes.
- Corrected hpcprof-mpi to handle the corner case where an MPI rank is assigned no profiles to analyze.
- Added comprehensive error handling in hpcprof-mpi when writing files, especially to handle disk full or quota exceeded errors.
- Fixed selection of an alternate output directory in hpcprof-mpi.
- Hpcstruct now reports an error if it is run on anything other than an ELF binary.
- Corrected handling of pseudo-roots, such as those in hpcviewer's flat view.
Release 2017.10
Principal Technical Improvements
Support for Linux perf events.
Linux perf events provides a powerful interface that supports measurement of both application execution and kernel activity. Using perf events, one can measure both hardware and software events. Using a processor’s hardware performance monitoring unit (PMU), the perf events interface can measure an execution using any hardware counter supported by the PMU.
Frequency-based sampling.
Rather than picking a sample period for a hardware counter, the Linux perf events interface enables one to specify the desired sampling frequency and have the kernel automatically select and adjust the period to try to achieve the desired sampling frequency.
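To make the idea concrete, here is a minimal Python sketch of frequency-based period selection. It is not the kernel's algorithm; it only assumes that the period for the next interval is chosen from the event rate observed over the previous one. The function name `next_period` and its parameters are hypothetical.

```python
def next_period(events_seen, interval_s, target_hz, min_period=1):
    """Pick the period that would have yielded target_hz samples per second
    at the event rate observed over the last interval."""
    event_rate = events_seen / interval_s  # events per second
    return max(min_period, round(event_rate / target_hz))


if __name__ == "__main__":
    # A busy phase: 3e9 events in one second, aiming for 300 samples/second.
    print(next_period(3_000_000_000, 1.0, 300))
    # The event rate halves, so the period is halved to keep the same frequency.
    print(next_period(1_500_000_000, 1.0, 300))
```

The point of the sketch is that the user states a frequency once, and the period follows the program's behavior rather than being fixed in advance.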
Multiplexing.
Using multiplexing enables one to monitor more events in a single execution than the number of hardware counters a processor can support for each thread. The number of events that can be monitored in a single execution is only limited by the maximum number of concurrent events that the kernel will allow a user to multiplex using the perf events interface.
When more events are specified than can be monitored simultaneously using a thread’s hardware counters, the kernel will employ multiplexing and divide the set of events to be monitored into groups, monitor only one group of events at a time, and cycle repeatedly through the groups as a program executes.
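The grouping and cycling described above can be sketched in a few lines of Python. This is an illustrative model, not kernel code: the helper names and event names are hypothetical, and the scaling step assumes the common perf events convention of extrapolating a raw count by the ratio of the time an event was enabled to the time it was actually running on a counter.

```python
def make_groups(events, counters):
    """Split the requested events into groups that fit the available counters."""
    return [events[i:i + counters] for i in range(0, len(events), counters)]


def schedule(groups, slices):
    """Round-robin: which group is monitored during each time slice."""
    return [groups[t % len(groups)] for t in range(slices)]


def scaled_estimate(raw_count, time_enabled, time_running):
    """Extrapolate a count observed during only part of the execution."""
    if time_running == 0:
        return 0.0
    return raw_count * time_enabled / time_running


if __name__ == "__main__":
    events = ["cycles", "instructions", "cache-misses", "branches", "branch-misses"]
    groups = make_groups(events, counters=2)
    for t, group in enumerate(schedule(groups, slices=6)):
        print(t, group)
    # An event counted 1000 times while running for 1 of 3 enabled seconds.
    print(scaled_estimate(1000, time_enabled=3.0, time_running=1.0))
```

Because each group is observed for only a fraction of the run, the scaled values are estimates; they are less reliable for events whose rate varies sharply between time slices.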
Kernel sampling
Collects calling contexts that extend into the kernel using perf_events, adding support for extending user-level program contexts with kernel calling contexts. Interpreting kernel call chains requires /proc/sys/kernel/kptr_restrict to be 0 and /proc/sys/kernel/perf_event_paranoid to be 1 or 0.
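A quick way to check these two settings before measuring is sketched below in Python. This is a convenience sketch, not part of HPCToolkit: the helper names are hypothetical, and the check simply encodes the values stated above (kptr_restrict of 0, perf_event_paranoid of 1 or 0).

```python
from pathlib import Path
from typing import Optional

SYSCTL_DIR = Path("/proc/sys/kernel")


def read_setting(name: str) -> Optional[int]:
    """Return the integer value of a kernel setting, or None if unreadable."""
    try:
        return int((SYSCTL_DIR / name).read_text().strip())
    except (OSError, ValueError):
        return None


def kernel_sampling_ready(kptr_restrict, perf_event_paranoid) -> bool:
    """Apply the requirement from the notes above to the two values."""
    return kptr_restrict == 0 and perf_event_paranoid in (0, 1)


if __name__ == "__main__":
    kptr = read_setting("kptr_restrict")
    paranoid = read_setting("perf_event_paranoid")
    print("kptr_restrict =", kptr)
    print("perf_event_paranoid =", paranoid)
    print("kernel call chains usable:", kernel_sampling_ready(kptr, paranoid))
```

On systems where either file cannot be read (for example, outside Linux), `read_setting` returns None and the check reports False.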
Thread blocking.
When a program executes, a thread may block waiting for the kernel to complete some operation on its behalf. Example operations include waiting for a read operation to complete or having the kernel service a page fault or zero-fill a page.
On systems running Linux 4.3 or newer, one can use the perf events sample source to monitor how much time a thread is blocked and where the blocking occurs.
Improvements to call stack unwinding
Members of the project team fixed bugs identified by our testing of libunwind in the context of HPCToolkit's measurement infrastructure and helped refine libunwind to enable an external tool, e.g., HPCToolkit's hpcrun, to cache libunwind recipes for a procedure to avoid the need to recompute them on demand later.
hpctoolkit-externals includes a snapshot of libunwind as of 2 October 2017.
Improved binary analysis
This release of HPCToolkit benefits from refinements to Dyninst that improve hpcstruct's ability to reconstruct control flow graphs for procedures in the presence of jump tables.
hpctoolkit-externals includes Dyninst 9.3.2 supplemented with patches that include important but unreleased improvements.
Release 2017.06
Technical Improvements
- Updated the ompt branch to provide better scalability to large thread counts as found on KNL and Power8. This branch, together with the LLVM OpenMP runtime library provides the OMP_IDLE metric to unify the presentation of worker and main threads in OpenMP regions.
- Updated Dyninst to version 9.3.2 in hpctoolkit-externals, plus a patch for better binary analysis of functions that use jump tables.
- Updated the use of atomic operations in hpcrun with C11 atomics.
- Updated hpcstruct to handle a new ABI on Power/LE architectures with both internal and external interfaces for functions.
Bug Fixes
- Improved analysis for call stack unwinding on x86-64, including a bug fix to track stack frame allocation and deallocation using the load effective address (LEA) instruction and an enhancement that improves call stack unwinding for procedures that realign their stack pointer upon function entry.
- Fixed bug in hpcrun to correct data reinitialization after fork(). This bug prevented using hpcrun to profile programs launched with shell scripts.
- Fixed bug in hpcstruct in getRealPath() that caused hpcstruct to sometimes report incorrect file names.
Known Problems
- Some types of applications on x86-64 architectures generate a significant number of 'partial unwinds,' making it harder to use the top-down view in hpcviewer. A partial workaround is to use the bottom-up and flat views in hpcviewer.
Release 2016.12
Notable Platform Changes
- Added support for Intel Knights Landing (KNL), treated as an x86-64 flavor.
- Added preliminary measurement, analysis, attribution, and GUI support for Power8/LE.
- Added preliminary measurement, analysis, and attribution support for ARM64.
Principal Technical Improvements
- Overhauled data structures for managing shared state (binary analysis results) in hpcrun to avoid mutual exclusion in the common case. This improves manycore scalability.
- Overhauled binary analysis to better attribute performance to highly-optimized code that involves inlined functions, inlined templates, and outlined OpenMP functions.
Infrastructure Improvements in hpctoolkit-externals
- Removed dependence on a locally-modified copy of binutils and switched to use binutils 2.27, which supports Power8/LE and ARM64.
- Updated build infrastructure to use newer versions of autotools that recognize the Power8 little-endian system type.
  - autoconf 2.69.
  - automake 1.15.
  - libtool 2.4.6.
- Use boost version 1.59.0.
- Use dyninst version 9.3.0.
- Use elfutils version 0.167.
- Use libdwarf version 2016-11-24.
- Use libmonitor version from Sept 15, 2016.
- Use libunwind version from Feb 29, 2016.
Known Problems
- HPCToolkit's GUIs are not yet available for ARM64 platforms.
- Binary analysis on Power8/LE may fail to fully analyze routines that contain switch tables. This has several effects.
  - Samples attributed to code regions in a routine that are overlooked by the binary analyzer will be attributed to the first source line of the enclosing routine.
  - Loops in code regions that are overlooked will not be reported in hpcviewer.
pre-newarch
another set of cleanup edits for environment variables in the manual.