Releases: ovis-hpc/ldms

Release OVIS-4.3.5

14 Dec 17:00
This is the OVIS-4.3.5 G/A Release

This release includes the following features and fixes:

* Compatibility with OVIS-4.3.3 and OVIS-4.3.4
* Support for the Maestro load balancer
* Allow root user to access ldmsd configuration objects
  regardless of euid/egid of the process
* Zap socket performance improvements
* Zap fabric performance and resiliency improvements
* Zap RDMA support for OmniPath
* Zap uGNI resiliency improvements
* Fix LDMS Streams Service data loss on process exit
* Metric set permission handling improvements
* Fixes for memory leaks and uninitialized data found by
  static analysis tools
* Numerous build and packaging improvements

Release OVIS-4.3.4

07 Nov 16:07
This is the OVIS-4.3.4 G/A Release

Significant testing on the socket, RDMA, and uGNI transports has been
done with Socket and uGNI scaling to three levels of aggregation and
30,000 sets in the aggregate.

The RDMA transport has been tested to a few thousand sets.

The fabric transport should be considered Alpha and is suitable
for development, but not deployment at this time.

This release includes the following new features:

* LDMS Transport performance statistics (ldmsd_controller xprt_stats command)
* Zap Thread utilization tracking (ldmsd_controller thread_stats command)
* uGNI resiliency improvements to aid with resource error handling
* Packaging updates and github automation to help with tarball generation and release tagging
* A reference counting service has been implemented that supports 'named references'. In debug mode (when REF_TRACK is defined), references are tracked (function name, and line number) when they are taken and when they are released, and individual reference counts are kept for each name. This makes it easier to debug reference tracking during development.
* The new ref_t reference counting mechanism has been added to struct ldms_set and struct ldms_rbuf_desc in support of a robust set-delete capability
* An "end-to-end" protocol has been added for deleting metric sets. When an ldmsd deletes a set, each peer that has a memory handle on the set is notified. The set resources are not freed until all peers acknowledge that they have received the delete notification.
* A service (zap_zerr2errno) has been added to consistently map Zap errors to Unix errno
* Updates to the lustre2_client sampler to support newer versions of Lustre
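
The named-reference idea described above can be modelled in a few lines. This is a minimal sketch in Python rather than the C implementation: the class `NamedRef`, its methods, and the `track` flag (standing in for a build with REF_TRACK defined) are all hypothetical names chosen for illustration and are not the ref_t API.

```python
import inspect
from collections import defaultdict


class NamedRef:
    """Toy reference counter with named references (illustrative only)."""

    def __init__(self, track=False):
        self.track = track              # stands in for building with REF_TRACK
        self.counts = defaultdict(int)  # one reference count per name
        self.history = []               # (op, name, function, line) when tracking

    def _record(self, op, name):
        if self.track:
            caller = inspect.stack()[2]  # the frame that called get()/put()
            self.history.append((op, name, caller.function, caller.lineno))

    def get(self, name):
        """Take a reference under the given name."""
        self.counts[name] += 1
        self._record("get", name)

    def put(self, name):
        """Release a named reference; True when the last one is dropped."""
        if self.counts[name] <= 0:
            raise RuntimeError(f"put of '{name}' without a matching get")
        self.counts[name] -= 1
        self._record("put", name)
        return self.total() == 0

    def total(self):
        return sum(self.counts.values())


# Usage: find which name still holds a reference.
ref = NamedRef(track=True)
ref.get("xprt_lookup")
ref.get("rbd")
ref.put("rbd")
leaked = {name: n for name, n in ref.counts.items() if n > 0}
```

Here `leaked` contains only `xprt_lookup`, which is the kind of per-name answer that a single anonymous counter cannot give.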

Release OVIS-4.3.4-beta.1

18 Oct 23:34
This is the OVIS-4.3.4 release tag

Release OVIS-4.3.4-alpha.1

21 Jul 14:41
This release includes the following updates and fixes:

* Packaging updates and github automation to help with tarball generation and release tagging
* Fixes for issues found by static analysis tools
* Fix a memory leak in the JSON parser that, on the socket transport, could leak as much as 1 MB per message
* A service (zap_zerr2errno) has been added to consistently map Zap errors to Unix errno
* A reference counting service has been implemented that supports 'named references'. In debug mode (when REF_TRACK is defined), references are tracked (function name, and line number) when they are taken and when they are released, and individual reference counts are kept for each name. This makes it easier to debug reference tracking during development.
* The new ref_t reference counting mechanism has been added to struct ldms_set and struct ldms_rbuf_desc in support of a robust set-delete capability
* An "end-to-end" protocol has been added for deleting metric sets. When an ldmsd deletes a set, each peer that has a memory handle on the set is notified. The set resources are not freed until all peers acknowledge that they have received the delete notification.
* LDMS transport 'telemetry' data has been added that tracks statistics on the primary transport operations DIR, LOOKUP, UPDATE, SEND, and RECV. The intent is to determine when an ldmsd becomes overloaded, underutilized, etc.
* Zap uGNI Transport fixes
  * Ensure socket is closed in uGNI transport
  * Destroy the Cdm in the uGNI transport
  * Refactor Zap uGNI disconnect path
  * Aggressively flush incomplete RdmaPost descriptors.
  * Add more detailed error handling in Zap uGNI
  * Added a thread to subscribe to and report errors on the uGNI transport.
  * Make certain that GNI_EpUnbind does not fail. This ensures that NTT resources held by the endpoint are released.
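
The "end-to-end" set-delete protocol listed above (notify every peer holding a handle, free the set only once all of them have acknowledged) can be sketched as a small state machine. This is an illustrative Python model under stated assumptions: `MetricSet`, `delete`, `ack`, and the `notify` callback are hypothetical names, and the real protocol lives in the C transport code.

```python
class MetricSet:
    """Toy model: resources are freed only after every peer acknowledges."""

    def __init__(self, name, peers):
        self.name = name
        self.pending = set(peers)  # peers that hold a handle on the set
        self.deleting = False
        self.freed = False

    def delete(self, notify):
        """Start deletion by sending a delete notification to each peer."""
        self.deleting = True
        for peer in sorted(self.pending):
            notify(peer, self.name)
        self._maybe_free()         # frees at once if there were no peers

    def ack(self, peer):
        """Record one peer's acknowledgement of the delete notification."""
        if not self.deleting:
            raise RuntimeError("ack received before delete was requested")
        self.pending.discard(peer)
        self._maybe_free()

    def _maybe_free(self):
        if self.deleting and not self.pending:
            self.freed = True      # stand-in for releasing set memory


# Usage: two aggregators hold the set; it is freed after the second ack.
sent = []
mset = MetricSet("node1/meminfo", ["agg1", "agg2"])
mset.delete(lambda peer, name: sent.append((peer, name)))
mset.ack("agg1")
freed_after_first_ack = mset.freed
mset.ack("agg2")
```

Deferring the free until the pending set is empty is what prevents a peer from touching set memory that has already been released.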

OVIS-4.3.3 G/A

09 Dec 19:17
Fix compilation warnings for `-O3 -Wall -Werror`

OVIS 4.3.3 Release Candidate 1

24 Nov 17:17
f0d3f79
Add area to ldms_set_hdr for compatible updates (#103)

Reserve an area in the set hdr to accommodate changes
that may affect this structure but still support backward
compatibility.
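
The general technique of reserving header space can be illustrated with Python's `struct` module. The layout below is entirely hypothetical (it is not the actual ldms_set_hdr layout, and the field names are invented); it only shows why a reserved, zero-filled area lets a later revision add a field without changing the header size.

```python
import struct

# Hypothetical v1 header: version (u32), size (u32), 24 reserved pad bytes.
HDR_V1 = struct.Struct("<II24x")
# Hypothetical v2 header: claims 8 of the reserved bytes for a new u64 field.
HDR_V2 = struct.Struct("<IIQ16x")

# Both revisions occupy the same number of bytes on the wire.
same_size = HDR_V1.size == HDR_V2.size

# A header written by old code is readable by new code; the new field
# reads as zero because pad bytes are packed as zeros.
old_bytes = HDR_V1.pack(1, 4096)
version, size, new_field = HDR_V2.unpack(old_bytes)

# A header written by new code is still readable by old code, which
# simply ignores the bytes it considers reserved.
new_bytes = HDR_V2.pack(2, 4096, 7)
old_view = HDR_V1.unpack(new_bytes)
```

The design choice is that offsets of existing fields and the total header size never move, so peers built against either revision can exchange headers.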

OVIS-4.3.3-beta

04 Nov 17:40
OVIS-4.3.3-beta Pre-release

This is a release that tracks OVIS-4.3.3-beta

OVIS-4.3.2

18 Oct 18:50
OVIS-4.3.2

OVIS-4.3.1

10 Oct 20:34

OVIS Version 4.3.1

LDMS v4.3_beta release schedule and high-level overview of new features

LDMS features

  • Metric sets are now removed by ldms_set_delete
  • ldms_xprt_dir now conveys set metadata, including size and set_info information
  • libfabrics LDMS transport plugin

LDMSD features

  • ldmsd stream service:
    • A publish/subscribe service in ldmsd that allows external programs to send data (events) over an LDMS Transport to ldmsd plugins
    • Improvements to prdcr performance
  • ldms_ls provides summary set size information as an aid to ldmsd aggregator memory configuration
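
The publish/subscribe pattern behind the stream service can be sketched in-process as follows. This is a conceptual Python model only: `StreamBus`, `subscribe`, and `publish` are hypothetical names, the stream name and payload are made up, and the real service delivers data over an LDMS Transport rather than through direct function calls.

```python
from collections import defaultdict


class StreamBus:
    """Toy in-process publish/subscribe hub keyed by stream name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, stream, callback):
        """Register a callback (e.g. a plugin) for one named stream."""
        self._subscribers[stream].append(callback)

    def publish(self, stream, data):
        """Deliver data to every subscriber; return how many received it."""
        callbacks = self._subscribers.get(stream, ())
        for callback in callbacks:
            callback(stream, data)
        return len(callbacks)


# Usage: a sampler-like subscriber receives a job event.
received = []
bus = StreamBus()
bus.subscribe("slurm", lambda stream, data: received.append(data))
delivered = bus.publish("slurm", {"event": "start", "job_id": 42})
ignored = bus.publish("unrelated", {"event": "noop"})
```

Publishers and subscribers only share the stream name, which is what lets an external program such as a job scheduler hook feed events to plugins without knowing which plugins are loaded.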

New sampler plugins

  • SPANK slurm_notifier: a Slurm SPANK plugin that uses ldmsd_stream to notify subscribers (plugins) of job events (e.g. start/stop).
    • Used by slurm_sampler, papi_sampler, and syspapi_sampler
  • slurm_sampler:
    • Multi-tenant capable slurm job information sampler
  • PAPI Job Sampler (papi_sampler):
    • Collects hardware event counters per-process for all processes of a job
    • Receives configuration from a job's environment via the slurm stream
  • PAPI System Sampler (syspapi_sampler):
    • Collects hardware event counters per-core, system wide
    • Uses libpfm for sampling and libpapi for event name to event-mask mapping
      • Allows consistent configuration to be used between syspapi and papi samplers.
    • Samples hardware performance counters on a per-core/uncore basis
  • IBM OCC sampler (ibm_occ)

New store plugins

  • slurm_store:
    • SOS store plugin that converts multi-tenant job information into a form more suitable for analysis
  • papi_store:
    • SOS store plugin that converts PAPI job information into a form more suitable for analysis

OVIS-4.2.3

13 Jun 20:54

This update fixes the following problem in 4.2.2:
The outstanding-update condition was being tested before the set-matching condition. As a result, sets that did not match the regex but were being updated as set group members were incorrectly marked as "outstanding update". The 4.2.2 release may therefore log spurious warnings that pollute the log file, but the collected data is not affected.
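
The ordering problem described above can be illustrated with a small Python model. The function names, return strings, and set names are hypothetical and do not come from the ldmsd source; the sketch only shows how swapping the two tests changes the outcome for a non-matching set that is being updated as a group member.

```python
import re


def classify_4_2_2(set_name, pattern, update_in_flight):
    """Ordering as described for 4.2.2: in-flight test comes first."""
    if update_in_flight:
        return "outstanding update"   # warning logged, even for non-matches
    if not re.search(pattern, set_name):
        return "skip"
    return "schedule update"


def classify_4_2_3(set_name, pattern, update_in_flight):
    """Fixed ordering: the set-matching test comes first."""
    if not re.search(pattern, set_name):
        return "skip"                 # non-matching sets never warn
    if update_in_flight:
        return "outstanding update"
    return "schedule update"


# A set that does not match the regex but is mid-update as a group member.
before = classify_4_2_2("node1/vmstat", r"meminfo$", update_in_flight=True)
after = classify_4_2_3("node1/vmstat", r"meminfo$", update_in_flight=True)
```

Under these assumptions `before` is the spurious warning case and `after` is a silent skip, while sets that do match the regex behave identically in both versions.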