Releases: mercury-hpc/mercury

mercury 2.3.0rc4

30 Mar 22:54
v2.3.0rc4
Pre-release

Summary

This version brings bug fixes and updates to our v2.0.0 release.

New features

Added in rc4

  • [HG]
    • Add HG_Context_unpost() / HG_Core_context_unpost() for optional
      2-step context shutdown

Added in rc2

  • [HG Test]
    • Perf test now supports multi-client / multi-server workloads
    • Add BUILD_TESTING_UNIT and BUILD_TESTING_PERF CMake options
  • [NA OFI]
    • Add support for libfabric log redirection
      • Requires libfabric >= 1.16.0, disabled if FI_LOG_LEVEL is set
      • Add libfabric log subsys (off by default)
      • Bump FI_VERSION to 1.13 when log redirection is supported
  • [HG util]
    • Add HG_LOG_WRITE_FUNC() macro to pass func/line info
    • Also add module / no_return parameters to hg_log_write()
    • Remove HG_ATOMIC_VAR_INIT (deprecated)

Added in rc1

  • [HG]
    • Add support for multi-recv operations (OFI plugin only)
      • Currently disable multi-recv when auto SM is on
      • Posted recv operations are in that case decoupled from pool of RPC
        handles
      • Add release_input_early init info flag to attempt to release input
        buffers early once input is decoded
      • Add HG_Release_input_buf() to manually release input buffer.
      • Also add no_multi_recv init info option to force disabling
        multi-recv
    • Make use of subsys logs (cls, ctx, addr, rpc, poll) to control
      log output
    • Add init info struct versioning
  • [HG bulk]
    • Update to new logging system through bulk subsys log.
  • [HG proc]
    • Update to new logging system through proc subsys log.
  • [HG Test]
    • Refactor tests to separate perf tests from unit tests
    • Add NA/HG test common library
    • Add hg_rate / hg_bw_write and hg_bw_read perf tests
    • Install perf tests if BUILD_TESTING is ON
  • [NA]
    • Add support for multi-recv operations
      • Add NA_Msg_multi_recv_unexpected() and
        na_cb_info_multi_recv_unexpected cb info
      • Add flags parameter to NA_Op_create() and NA_Msg_buf_alloc()
      • Add NA_Has_opt_feature() to query multi recv capability
    • Remove int return type from NA callbacks and return void
    • Remove unused timeout parameter from NA_Trigger()
    • NA_Addr_free() / NA_Mem_handle_free() and NA_Op_destroy() now
      return void
    • Change na_mem_handle_t and na_addr_t types to no longer include pointer type
    • Add NA_PLUGIN_PATH env variable to optionally control plugin loading
      path
    • Add NA_DEFAULT_PLUGIN_PATH CMake option to control default plugin path
      (default is lib install path)
    • Add NA_USE_DYNAMIC_PLUGINS CMake option (OFF by default)
    • Bump NA library version to 4.0.0
  • [NA OFI]
    • Add support for multi-recv operations and use FI_MSG
    • Allocate multi-recv buffers using hugepages when available
    • Switch to using fi_senddata() with immediate data for unexpected msgs
      • NA_OFI_UNEXPECTED_TAG_MSG can be set to switch back to former
        behavior that uses tagged messages instead
    • Remove support for deprecated psm provider
    • Control CQ interrupt signaling with FI_AFFINITY (only used if thread is
      bound to a single CPU ID)
    • Enable cxi provider to use FI_WAIT_FD
    • Add NA_OFI_OP_RETRY_TIMEOUT and NA_OFI_OP_RETRY_PERIOD
      • Once NA_OFI_OP_RETRY_TIMEOUT milliseconds elapse, retry is stopped
        and operation is aborted (default is 120000ms)
      • When NA_OFI_OP_RETRY_PERIOD is set, operations are retried only
        every NA_OFI_OP_RETRY_PERIOD milliseconds (default is 0)
    • Add support for tcp with and without ofi_rxm
      • tcp defaults to tcp;ofi_rxm for libfabric < 1.18
    • Enable plugin to be built as a dynamic plugin
  • [NA UCX]
    • Attempt to disable UCX backtrace if UCX_HANDLE_ERRORS is not set
    • Add support for UCP_EP_PARAM_FIELD_LOCAL_SOCK_ADDR
      • With UCX >= 1.13 local src address information can now be specified
        on client to use specific interface and port
    • Set CM_REUSEADDR by default to enable reuse of existing listener addr
      after a listener exits abnormally
    • Attempt to reconnect EP if disconnected
      • This concerns cases where a peer would have reappeared after a
        previous disconnection
    • Enable plugin to be built as a dynamic plugin
  • [NA Test]
    • Update NA test perf to use multi-recv feature
    • Update perf test to use hugepages
    • Add support for multi-targets and add lookup test
    • Install perf tests if BUILD_TESTING is ON
  • [HG util]
    • Change return type of hg_time_less() to be bool
    • Add support for hugepage allocations
    • Use isb for cpu_spinwait on aarch64
    • Add mercury_dl to support dynamically loaded modules
    • Bump HG util version to 4.0.0
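
The interaction between NA_OFI_OP_RETRY_TIMEOUT and NA_OFI_OP_RETRY_PERIOD described above can be pictured with a small decision function. This is only a sketch of the documented behavior, not the plugin's code: the type and function names (retry_policy, retry_decide, RETRY_*) are hypothetical, and treating the timeout boundary as inclusive is an assumption.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the NA OFI plugin. */
enum retry_action { RETRY_NOW, RETRY_WAIT, RETRY_ABORT };

struct retry_policy {
    uint64_t timeout_ms; /* cf. NA_OFI_OP_RETRY_TIMEOUT (default 120000) */
    uint64_t period_ms;  /* cf. NA_OFI_OP_RETRY_PERIOD (default 0) */
};

/* Decide what to do with an operation that has to be retried.
 * first_ms: time of the first attempt
 * last_ms:  time of the most recent attempt
 * now_ms:   current time (all in milliseconds, same clock) */
static enum retry_action
retry_decide(const struct retry_policy *policy, uint64_t first_ms,
    uint64_t last_ms, uint64_t now_ms)
{
    /* Once the timeout has elapsed, stop retrying and abort (assumed >=). */
    if (now_ms - first_ms >= policy->timeout_ms)
        return RETRY_ABORT;

    /* With a non-zero period, retry at most once per period. */
    if (policy->period_ms != 0 && now_ms - last_ms < policy->period_ms)
        return RETRY_WAIT;

    /* Period of 0 (the default): retry as soon as possible. */
    return RETRY_NOW;
}
```

With the defaults (timeout 120000, period 0) every call returns RETRY_NOW until two minutes have passed since the first attempt, after which it returns RETRY_ABORT.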

Bug fixes

Added in rc4

  • [NA OFI]
    • Add runtime version check
      • Ensure that runtime version is greater than min version
      • Replace prov/tcp compile check by runtime check
  • [NA SM]
    • Fix issue where an expected msg that is no longer posted arrives
      • In that particular case just drop the incoming msg
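
As a rough illustration of a runtime version check like the one listed above, the sketch below packs major/minor numbers into one integer and compares them. All names (EX_VERSION, ex_runtime_version_ok, ex_tcp_needs_rxm) are hypothetical and are not the plugin's or libfabric's API. Accepting a runtime version equal to the minimum, and tying the tcp check to the "tcp defaults to tcp;ofi_rxm for libfabric < 1.18" note, are both assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: pack major/minor into one comparable integer. */
#define EX_VERSION(major, minor)                                               \
    (((uint32_t) (major) << 16) | (uint32_t) (minor))

/* True if the version found at runtime satisfies the minimum requirement
 * (assumption: a runtime version equal to the minimum is accepted). */
static bool
ex_runtime_version_ok(uint32_t runtime, uint32_t min_required)
{
    return runtime >= min_required;
}

/* Decide at runtime, rather than at compile time, whether "tcp" should be
 * mapped to "tcp;ofi_rxm" (assumption: applies to versions below 1.18). */
static bool
ex_tcp_needs_rxm(uint32_t runtime)
{
    return runtime < EX_VERSION(1, 18);
}
```

The point of a runtime check is that the library version a binary was compiled against may differ from the one it is loaded with, so a compile-time test alone can pick the wrong behavior.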

Added in rc3

  • [NA OFI]
    • Log redirection requires libfabric >= 1.16.0

Added in rc2

  • [HG/NA]
    • Ensure init info version is compatible
  • [NA OFI]
    • Fix handling of extra caps to not always follow advertised caps
    • Pass FI_COMPLETION to RMA ops as flag is currently not ignored
      (prov/opx tmp fix)
  • [CMake]
    • Ensure VERSION/SOVERSION is not set on MODULE libraries
    • Allow for in-source builds (RPM support)
    • Add missing DL lib dependency
    • Fix object target linking on CMake < 3.12
    • Ensure we build with PIC and PIE when available

Added in rc1

  • [HG]
    • Clean up and refactoring fixes
    • Fix race condition in hg_core_forward with debug enabled
    • Simplify RPC map and fix hashing for RPC IDs larger than 32-bit integer
    • Refactor context pools and cleanup
    • Fix potential leak on ack buffer
    • Ensure list of created RPC handles is empty before closing context
    • Bump pre-allocated requests to 512 to make use of 2M hugepages
    • Add extra error checking to prevent class mismatch
    • Fix potential race when sending one-way RPCs to ourself
  • [HG Bulk]
    • Add extra error checking to prevent class mismatch
  • [HG Test]
    • Refactor test_rpc to correctly handle timeout return values
  • [NA OFI]
    • Force sockets provider to use shared domains
      • This prevents a performance regression when multiple classes are
        being used (FI_THREAD_DOMAIN is therefore disabled for this provider)
    • Refactor unexpected and expected sends, retry of OFI operations, handling
      of RMA operations
    • Always include FI_DIRECTED_RECV in primary caps
    • Remove NA_OFI_SOURCE_MSG flag that was matching FI_SOURCE_ERR
    • Fix potential refcount race when sharing domains
    • Check domain's optimal MR count if non-zero
    • Fix potential double free of src_addr info
    • Refactor auth key parsing code to build without extension headers
    • Merge latest changes required for opx provider enablement
  • [NA SM]
    • Fix handling of 0-size messages when no receive has been posted
  • [NA UCX]
    • Fix handling of UCS return types to match NA types
  • [NA BMI]
    • Clean up and fix some coverity warnings
  • [NA MPI]
    • Clean up and fix some coverity warnings
  • [HG util]
    • Clean up logging and set log root to hg_all
      • hg_all subsys can now be set to turn on logging in all subsystems
    • Set log subsys to hg_all if log level env is set
    • Fixes to support WIN32 builds

⚠️ Known Issues

  • [NA OFI]
    • [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE
      to be set.
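
One way an application might work around the issue above is to export FI_UNIVERSE_SIZE itself when it expects many peers. The sketch below is a minimal, hypothetical helper (ex_set_universe_size is not a mercury function) that uses only POSIX getenv()/setenv(). It assumes the variable has to be in the environment before the NA/HG layer is initialized, and that the expected peer count is a sensible value to use.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: export FI_UNIVERSE_SIZE when more than 256 peers are
 * expected. Returns 0 on success (or when nothing had to be done) and -1 if
 * setenv() fails. Assumed to be called before NA/HG initialization. */
static int
ex_set_universe_size(unsigned int expected_peers)
{
    char value[16];

    if (expected_peers <= 256)
        return 0; /* within the limit named in the known issue */

    if (getenv("FI_UNIVERSE_SIZE") != NULL)
        return 0; /* respect a value already chosen by the user */

    snprintf(value, sizeof(value), "%u", expected_peers);
    return setenv("FI_UNIVERSE_SIZE", value, 0);
}
```

Setting the variable in the job launcher's environment achieves the same thing without code changes.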

mercury 2.3.0rc3

10 Mar 18:24
v2.3.0rc3
Pre-release

Summary

This version brings bug fixes and updates to our v2.0.0 release.

New features

Added in rc2

  • [HG Test]
    • Perf test now supports multi-client / multi-server workloads
    • Add BUILD_TESTING_UNIT and BUILD_TESTING_PERF CMake options
  • [NA OFI]
    • Add support for libfabric log redirection
      • Requires libfabric >= 1.16.0, disabled if FI_LOG_LEVEL is set
      • Add libfabric log subsys (off by default)
      • Bump FI_VERSION to 1.13 when log redirection is supported
  • [HG util]
    • Add HG_LOG_WRITE_FUNC() macro to pass func/line info
    • Also add module / no_return parameters to hg_log_write()
    • Remove HG_ATOMIC_VAR_INIT (deprecated)

Added in rc1

  • [HG]
    • Add support for multi-recv operations (OFI plugin only)
      • Currently disable multi-recv when auto SM is on
      • Posted recv operations are in that case decoupled from pool of RPC handles
      • Add release_input_early init info flag to attempt to release input buffers early once input is decoded
      • Add HG_Release_input_buf() to manually release input buffer.
      • Also add no_multi_recv init info option to force disabling multi-recv
    • Make use of subsys logs (cls, ctx, addr, rpc, poll) to control log output
    • Add init info struct versioning
  • [HG bulk]
    • Update to new logging system through bulk subsys log.
  • [HG proc]
    • Update to new logging system through proc subsys log.
  • [HG Test]
    • Refactor tests to separate perf tests from unit tests
    • Add NA/HG test common library
    • Add hg_rate / hg_bw_write and hg_bw_read perf tests
    • Install perf tests if BUILD_TESTING is ON
  • [NA]
    • Add support for multi-recv operations
      • Add NA_Msg_multi_recv_unexpected() and na_cb_info_multi_recv_unexpected cb info
      • Add flags parameter to NA_Op_create() and NA_Msg_buf_alloc()
      • Add NA_Has_opt_feature() to query multi recv capability
    • Remove int return type from NA callbacks and return void
    • Remove unused timeout parameter from NA_Trigger()
    • NA_Addr_free() / NA_Mem_handle_free() and NA_Op_destroy() now return void
    • Change na_mem_handle_t and na_addr_t types to no longer include pointer type
    • Add NA_PLUGIN_PATH env variable to optionally control plugin loading path
    • Add NA_DEFAULT_PLUGIN_PATH CMake option to control default plugin path (default is lib install path)
    • Add NA_USE_DYNAMIC_PLUGINS CMake option (OFF by default)
    • Bump NA library version to 4.0.0
  • [NA OFI]
    • Add support for multi-recv operations and use FI_MSG
    • Allocate multi-recv buffers using hugepages when available
    • Switch to using fi_senddata() with immediate data for unexpected msgs
      • NA_OFI_UNEXPECTED_TAG_MSG can be set to switch back to former behavior that uses tagged messages instead
    • Remove support for deprecated psm provider
    • Control CQ interrupt signaling with FI_AFFINITY (only used if thread is bound to a single CPU ID)
    • Enable cxi provider to use FI_WAIT_FD
    • Add NA_OFI_OP_RETRY_TIMEOUT and NA_OFI_OP_RETRY_PERIOD
      • Once NA_OFI_OP_RETRY_TIMEOUT milliseconds elapse, retry is stopped and operation is aborted (default is 120000ms)
      • When NA_OFI_OP_RETRY_PERIOD is set, operations are retried only every NA_OFI_OP_RETRY_PERIOD milliseconds (default is 0)
    • Add support for tcp with and without ofi_rxm
      • tcp defaults to tcp;ofi_rxm for libfabric < 1.18
    • Enable plugin to be built as a dynamic plugin
  • [NA UCX]
    • Attempt to disable UCX backtrace if UCX_HANDLE_ERRORS is not set
    • Add support for UCP_EP_PARAM_FIELD_LOCAL_SOCK_ADDR
      • With UCX >= 1.13 local src address information can now be specified on client to use specific interface and port
    • Set CM_REUSEADDR by default to enable reuse of existing listener addr after a listener exits abnormally
    • Attempt to reconnect EP if disconnected
      • This concerns cases where a peer would have reappeared after a previous disconnection
    • Enable plugin to be built as a dynamic plugin
  • [NA Test]
    • Update NA test perf to use multi-recv feature
    • Update perf test to use hugepages
    • Add support for multi-targets and add lookup test
    • Install perf tests if BUILD_TESTING is ON
  • [HG util]
    • Change return type of hg_time_less() to be bool
    • Add support for hugepage allocations
    • Use isb for cpu_spinwait on aarch64
    • Add mercury_dl to support dynamically loaded modules
    • Bump HG util version to 4.0.0

Bug fixes

Added in rc3

  • [NA OFI]
    • Log redirection requires libfabric >= 1.16.0

Added in rc2

  • [HG/NA]
    • Ensure init info version is compatible
  • [NA OFI]
    • Fix handling of extra caps to not always follow advertised caps
    • Pass FI_COMPLETION to RMA ops as flag is currently not ignored (prov/opx tmp fix)
  • [CMake]
    • Ensure VERSION/SOVERSION is not set on MODULE libraries
    • Allow for in-source builds (RPM support)
    • Add missing DL lib dependency
    • Fix object target linking on CMake < 3.12
    • Ensure we build with PIC and PIE when available

Added in rc1

  • [HG]
    • Clean up and refactoring fixes
    • Fix race condition in hg_core_forward with debug enabled
    • Simplify RPC map and fix hashing for RPC IDs larger than 32-bit integer
    • Refactor context pools and cleanup
    • Fix potential leak on ack buffer
    • Ensure list of created RPC handles is empty before closing context
    • Bump pre-allocated requests to 512 to make use of 2M hugepages
    • Add extra error checking to prevent class mismatch
    • Fix potential race when sending one-way RPCs to ourself
  • [HG Bulk]
    • Add extra error checking to prevent class mismatch
  • [HG Test]
    • Refactor test_rpc to correctly handle timeout return values
  • [NA OFI]
    • Force sockets provider to use shared domains
      • This prevents a performance regression when multiple classes are being used (FI_THREAD_DOMAIN is therefore disabled for this provider)
    • Refactor unexpected and expected sends, retry of OFI operations, handling of RMA operations
    • Always include FI_DIRECTED_RECV in primary caps
    • Remove NA_OFI_SOURCE_MSG flag that was matching FI_SOURCE_ERR
    • Fix potential refcount race when sharing domains
    • Check domain's optimal MR count if non-zero
    • Fix potential double free of src_addr info
    • Refactor auth key parsing code to build without extension headers
    • Merge latest changes required for opx provider enablement
  • [NA SM]
    • Fix handling of 0-size messages when no receive has been posted
  • [NA UCX]
    • Fix handling of UCS return types to match NA types
  • [NA BMI]
    • Clean up and fix some coverity warnings
  • [NA MPI]
    • Clean up and fix some coverity warnings
  • [HG util]
    • Clean up logging and set log root to hg_all
      • hg_all subsys can now be set to turn on logging in all subsystems
    • Set log subsys to hg_all if log level env is set
    • Fixes to support WIN32 builds

⚠️ Known Issues

  • [NA OFI]
    • [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE to be set.

mercury 2.3.0rc2

10 Mar 17:38
v2.3.0rc2
Pre-release

Summary

This version brings bug fixes and updates to our v2.0.0 release.

New features

Added in rc2

  • [HG Test]
    • Perf test now supports multi-client / multi-server workloads
    • Add BUILD_TESTING_UNIT and BUILD_TESTING_PERF CMake options
  • [NA OFI]
    • Add support for libfabric log redirection
      • Requires libfabric >= 1.15.0, disabled if FI_LOG_LEVEL is set
      • Add libfabric log subsys (off by default)
      • Bump FI_VERSION to 1.13 when log redirection is supported
  • [HG util]
    • Add HG_LOG_WRITE_FUNC() macro to pass func/line info
    • Also add module / no_return parameters to hg_log_write()
    • Remove HG_ATOMIC_VAR_INIT (deprecated)

Added in rc1

  • [HG]
    • Add support for multi-recv operations (OFI plugin only)
      • Currently disable multi-recv when auto SM is on
      • Posted recv operations are in that case decoupled from pool of RPC handles
      • Add release_input_early init info flag to attempt to release input buffers early once input is decoded
      • Add HG_Release_input_buf() to manually release input buffer.
      • Also add no_multi_recv init info option to force disabling multi-recv
    • Make use of subsys logs (cls, ctx, addr, rpc, poll) to control log output
    • Add init info struct versioning
  • [HG bulk]
    • Update to new logging system through bulk subsys log.
  • [HG proc]
    • Update to new logging system through proc subsys log.
  • [HG Test]
    • Refactor tests to separate perf tests from unit tests
    • Add NA/HG test common library
    • Add hg_rate / hg_bw_write and hg_bw_read perf tests
    • Install perf tests if BUILD_TESTING is ON
  • [NA]
    • Add support for multi-recv operations
      • Add NA_Msg_multi_recv_unexpected() and na_cb_info_multi_recv_unexpected cb info
      • Add flags parameter to NA_Op_create() and NA_Msg_buf_alloc()
      • Add NA_Has_opt_feature() to query multi recv capability
    • Remove int return type from NA callbacks and return void
    • Remove unused timeout parameter from NA_Trigger()
    • NA_Addr_free() / NA_Mem_handle_free() and NA_Op_destroy() now return void
    • Change na_mem_handle_t and na_addr_t types to no longer include pointer type
    • Add NA_PLUGIN_PATH env variable to optionally control plugin loading path
    • Add NA_DEFAULT_PLUGIN_PATH CMake option to control default plugin path (default is lib install path)
    • Add NA_USE_DYNAMIC_PLUGINS CMake option (OFF by default)
    • Bump NA library version to 4.0.0
  • [NA OFI]
    • Add support for multi-recv operations and use FI_MSG
    • Allocate multi-recv buffers using hugepages when available
    • Switch to using fi_senddata() with immediate data for unexpected msgs
      • NA_OFI_UNEXPECTED_TAG_MSG can be set to switch back to former behavior that uses tagged messages instead
    • Remove support for deprecated psm provider
    • Control CQ interrupt signaling with FI_AFFINITY (only used if thread is bound to a single CPU ID)
    • Enable cxi provider to use FI_WAIT_FD
    • Add NA_OFI_OP_RETRY_TIMEOUT and NA_OFI_OP_RETRY_PERIOD
      • Once NA_OFI_OP_RETRY_TIMEOUT milliseconds elapse, retry is stopped and operation is aborted (default is 120000ms)
      • When NA_OFI_OP_RETRY_PERIOD is set, operations are retried only every NA_OFI_OP_RETRY_PERIOD milliseconds (default is 0)
    • Add support for tcp with and without ofi_rxm
      • tcp defaults to tcp;ofi_rxm for libfabric < 1.18
    • Enable plugin to be built as a dynamic plugin
  • [NA UCX]
    • Attempt to disable UCX backtrace if UCX_HANDLE_ERRORS is not set
    • Add support for UCP_EP_PARAM_FIELD_LOCAL_SOCK_ADDR
      • With UCX >= 1.13 local src address information can now be specified on client to use specific interface and port
    • Set CM_REUSEADDR by default to enable reuse of existing listener addr after a listener exits abnormally
    • Attempt to reconnect EP if disconnected
      • This concerns cases where a peer would have reappeared after a previous disconnection
    • Enable plugin to be built as a dynamic plugin
  • [NA Test]
    • Update NA test perf to use multi-recv feature
    • Update perf test to use hugepages
    • Add support for multi-targets and add lookup test
    • Install perf tests if BUILD_TESTING is ON
  • [HG util]
    • Change return type of hg_time_less() to be bool
    • Add support for hugepage allocations
    • Use isb for cpu_spinwait on aarch64
    • Add mercury_dl to support dynamically loaded modules
    • Bump HG util version to 4.0.0

Bug fixes

Added in rc2

  • [HG/NA]
    • Ensure init info version is compatible
  • [NA OFI]
    • Fix handling of extra caps to not always follow advertised caps
    • Pass FI_COMPLETION to RMA ops as flag is currently not ignored (prov/opx tmp fix)
  • [CMake]
    • Ensure VERSION/SOVERSION is not set on MODULE libraries
    • Allow for in-source builds (RPM support)
    • Add missing DL lib dependency
    • Fix object target linking on CMake < 3.12
    • Ensure we build with PIC and PIE when available

Added in rc1

  • [HG]
    • Clean up and refactoring fixes
    • Fix race condition in hg_core_forward with debug enabled
    • Simplify RPC map and fix hashing for RPC IDs larger than 32-bit integer
    • Refactor context pools and cleanup
    • Fix potential leak on ack buffer
    • Ensure list of created RPC handles is empty before closing context
    • Bump pre-allocated requests to 512 to make use of 2M hugepages
    • Add extra error checking to prevent class mismatch
    • Fix potential race when sending one-way RPCs to ourself
  • [HG Bulk]
    • Add extra error checking to prevent class mismatch
  • [HG Test]
    • Refactor test_rpc to correctly handle timeout return values
  • [NA OFI]
    • Force sockets provider to use shared domains
      • This prevents a performance regression when multiple classes are being used (FI_THREAD_DOMAIN is therefore disabled for this provider)
    • Refactor unexpected and expected sends, retry of OFI operations, handling of RMA operations
    • Always include FI_DIRECTED_RECV in primary caps
    • Remove NA_OFI_SOURCE_MSG flag that was matching FI_SOURCE_ERR
    • Fix potential refcount race when sharing domains
    • Check domain's optimal MR count if non-zero
    • Fix potential double free of src_addr info
    • Refactor auth key parsing code to build without extension headers
    • Merge latest changes required for opx provider enablement
  • [NA SM]
    • Fix handling of 0-size messages when no receive has been posted
  • [NA UCX]
    • Fix handling of UCS return types to match NA types
  • [NA BMI]
    • Clean up and fix some coverity warnings
  • [NA MPI]
    • Clean up and fix some coverity warnings
  • [HG util]
    • Clean up logging and set log root to hg_all
      • hg_all subsys can now be set to turn on logging in all subsystems
    • Set log subsys to hg_all if log level env is set
    • Fixes to support WIN32 builds

⚠️ Known Issues

  • [NA OFI]
    • [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE to be set.

mercury 2.3.0rc1

21 Feb 21:01
v2.3.0rc1
Pre-release

Summary

This version brings bug fixes and updates to our v2.0.0 release.

New features

  • [HG]
    • Add support for multi-recv operations (OFI plugin only)
      • Currently disable multi-recv when auto SM is on
      • Posted recv operations are in that case decoupled from pool of RPC handles
      • Add release_input_early init info flag to attempt to release input buffers early once input is decoded
      • Add HG_Release_input_buf() to manually release input buffer.
      • Also add no_multi_recv init info option to force disabling multi-recv
    • Make use of subsys logs (cls, ctx, addr, rpc, poll) to control log output
    • Add init info struct versioning
  • [HG bulk]
    • Update to new logging system through bulk subsys log.
  • [HG proc]
    • Update to new logging system through proc subsys log.
  • [HG Test]
    • Refactor tests to separate perf tests from unit tests
    • Add NA/HG test common library
    • Add hg_rate / hg_bw_write and hg_bw_read perf tests
    • Install perf tests if BUILD_TESTING is ON
  • [NA]
    • Add support for multi-recv operations
      • Add NA_Msg_multi_recv_unexpected() and na_cb_info_multi_recv_unexpected cb info
      • Add flags parameter to NA_Op_create() and NA_Msg_buf_alloc()
      • Add NA_Has_opt_feature() to query multi recv capability
    • Remove int return type from NA callbacks and return void
    • Remove unused timeout parameter from NA_Trigger()
    • NA_Addr_free() / NA_Mem_handle_free() and NA_Op_destroy() now return void
    • Change na_mem_handle_t and na_addr_t types to no longer include pointer type
    • Add NA_PLUGIN_PATH env variable to optionally control plugin loading path
    • Add NA_DEFAULT_PLUGIN_PATH CMake option to control default plugin path (default is lib install path)
    • Add NA_USE_DYNAMIC_PLUGINS CMake option (OFF by default)
    • Bump NA library version to 4.0.0
  • [NA OFI]
    • Add support for multi-recv operations and use FI_MSG
    • Allocate multi-recv buffers using hugepages when available
    • Switch to using fi_senddata() with immediate data for unexpected msgs
      • NA_OFI_UNEXPECTED_TAG_MSG can be set to switch back to former behavior that uses tagged messages instead
    • Remove support for deprecated psm provider
    • Control CQ interrupt signaling with FI_AFFINITY (only used if thread is bound to a single CPU ID)
    • Enable cxi provider to use FI_WAIT_FD
    • Add NA_OFI_OP_RETRY_TIMEOUT and NA_OFI_OP_RETRY_PERIOD
      • Once NA_OFI_OP_RETRY_TIMEOUT milliseconds elapse, retry is stopped and operation is aborted (default is 120000ms)
      • When NA_OFI_OP_RETRY_PERIOD is set, operations are retried only every NA_OFI_OP_RETRY_PERIOD milliseconds (default is 0)
    • Add support for tcp with and without ofi_rxm
      • tcp defaults to tcp;ofi_rxm for libfabric < 1.18
    • Enable plugin to be built as a dynamic plugin
  • [NA UCX]
    • Attempt to disable UCX backtrace if UCX_HANDLE_ERRORS is not set
    • Add support for UCP_EP_PARAM_FIELD_LOCAL_SOCK_ADDR
      • With UCX >= 1.13 local src address information can now be specified on client to use specific interface and port
    • Set CM_REUSEADDR by default to enable reuse of existing listener addr after a listener exits abnormally
    • Attempt to reconnect EP if disconnected
      • This concerns cases where a peer would have reappeared after a previous disconnection
    • Enable plugin to be built as a dynamic plugin
  • [NA Test]
    • Update NA test perf to use multi-recv feature
    • Update perf test to use hugepages
    • Add support for multi-targets and add lookup test
    • Install perf tests if BUILD_TESTING is ON
  • [HG util]
    • Change return type of hg_time_less() to be bool
    • Add support for hugepage allocations
    • Use isb for cpu_spinwait on aarch64
    • Add mercury_dl to support dynamically loaded modules
    • Bump HG util version to 4.0.0

Bug fixes

  • [HG]
    • Clean up and refactoring fixes
    • Fix race condition in hg_core_forward with debug enabled
    • Simplify RPC map and fix hashing for RPC IDs larger than 32-bit integer
    • Refactor context pools and cleanup
    • Fix potential leak on ack buffer
    • Ensure list of created RPC handles is empty before closing context
    • Bump pre-allocated requests to 512 to make use of 2M hugepages
    • Add extra error checking to prevent class mismatch
    • Fix potential race when sending one-way RPCs to ourself
  • [HG Bulk]
    • Add extra error checking to prevent class mismatch
  • [HG Test]
    • Refactor test_rpc to correctly handle timeout return values
  • [NA OFI]
    • Force sockets provider to use shared domains
      • This prevents a performance regression when multiple classes are being used (FI_THREAD_DOMAIN is therefore disabled for this provider)
    • Refactor unexpected and expected sends, retry of OFI operations, handling of RMA operations
    • Always include FI_DIRECTED_RECV in primary caps
    • Remove NA_OFI_SOURCE_MSG flag that was matching FI_SOURCE_ERR
    • Fix potential refcount race when sharing domains
    • Check domain's optimal MR count if non-zero
    • Fix potential double free of src_addr info
    • Refactor auth key parsing code to build without extension headers
    • Merge latest changes required for opx provider enablement
  • [NA SM]
    • Fix handling of 0-size messages when no receive has been posted
  • [NA UCX]
    • Fix handling of UCS return types to match NA types
  • [NA BMI]
    • Clean up and fix some coverity warnings
  • [NA MPI]
    • Clean up and fix some coverity warnings
  • [HG util]
    • Clean up logging and set log root to hg_all
      • hg_all subsys can now be set to turn on logging in all subsystems
    • Set log subsys to hg_all if log level env is set
    • Fixes to support WIN32 builds

⚠️ Known Issues

  • [NA OFI]
    • [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE to be set.

mercury 2.2.0

05 Aug 20:50
v2.2.0

Summary

This version brings bug fixes and updates to our v2.0.0 release.

New features

  • [NA OFI]
    • Choose addr format dynamically based on user preferences
    • Add support for IPv6
    • Add support for FI_SOCKADDR_IB
    • Add support for FI_ADDR_STR and shm provider
    • Add support for FI_ADDR_OPX and opx provider
    • Add support for HPE cxi provider;
      init info format for cxi is:
      • NIC:PID (both or only one may be passed), NIC is cxi[0-9], PID is [0-510]
    • Use hwloc to select interface to use if NIC information is available
      (only supported by cxi at the moment)
    • Support device memory types and FI_HMEM for verbs and cxi providers
    • Add support for FI_THREAD_DOMAIN
      • Passing NA_THREAD_MODE_SINGLE will relax default FI_THREAD_SAFE
        thread mode and use FI_THREAD_DOMAIN instead.
    • Update min required version to libfabric 1.9
    • Improve debug output to print verbose FI info of selected provider
  • [NA UCX]
    • Use active messaging UCP_FEATURE_AM for unexpected messages (only); this
      allows for removal of address resolution and retry on first message to
      exchange connection IDs
    • Turn on mempool by default
    • Support device memory types
    • Bump min required version to 1.10
  • [NA PSM]
    • Add mercury NA plugin for the qlogic/intel PSM interface
      • Also support PSM2 (Intel OmniPath) through the PSM NA plugin
  • [NA SM]
    • Add support for 0-size messages
  • [NA]
    • Add na_addr_format init info
    • Add request_mem_device init info when GPU support is requested
    • Update NA_Mem_register() API call to support memory types (e.g., CUDA, ROCm, ZE) and device IDs
    • Add na_loc module for hwloc detection
    • Remove na_uint, na_int, na_bool_t and na_size_t types
    • Use separate versioning for library and update to v3.0.0
  • [NA IP]
    • Refactor na_ip_check_interface() to only use getaddrinfo() and getifaddrs()
    • Add family argument to force detection of IPv4/IPv6 addresses
    • Add ip debug log
  • [NA Test]
    • Introduce new perf tests to measure msg latency, put / get bandwidth. These
      benchmarks produce results that are comparable with OSU benchmarks.
  • [HG util]
    • Add mercury_byteswap.h for bswap macros
    • Add mercury_inet.h for htonll and ntohll routines
    • Add mercury_param.h to use sys/param.h or MIN/MAX macros etc
    • Add alternative log names: err, warn, trace, dbg
    • Use separate versioning for library and update to v3.0.0
  • [HG bulk]
    • Add support for memory attributes through a new HG_Bulk_create_attr() routine (support CUDA, ROCm, ZE)
  • [HG]
    • Remove MERCURY_ENABLE_STATS CMake option and use 'diag' log subsys instead
      • Modify behavior of stats field to turn on diagnostics
      • Refactor existing counters (used only if debug is on)
    • Add checksum levels that can be manually controlled at runtime (disabled by default, HG_CHECKSUM_NONE level)
    • Update to mchecksum v2.0
    • Add HG_Set_log_func() and HG_Set_log_stream() to control log output
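
The cxi init info format mentioned under [NA OFI] above (NIC:PID, with NIC as cxi[0-9] and PID in [0-510]) can be validated with a few lines of C. The parser below is a hypothetical sketch, not mercury code: the struct and function names are invented, and it assumes the accepted spellings are "cxiN", "cxiN:PID" and ":PID" (the last form, for a PID-only string, is a guess).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: -1 means "not given". */
struct ex_cxi_info {
    int nic; /* 0-9 */
    int pid; /* 0-510 */
};

/* Parse "cxiN", "cxiN:PID" or ":PID" (assumed spellings). Returns false on
 * malformed or out-of-range input. */
static bool
ex_cxi_parse(const char *str, struct ex_cxi_info *out)
{
    const char *p = str;
    char *end;
    long pid;

    out->nic = -1;
    out->pid = -1;

    /* Optional NIC part: "cxi" followed by exactly one digit. */
    if (strncmp(p, "cxi", 3) == 0) {
        if (!isdigit((unsigned char) p[3]))
            return false;
        out->nic = p[3] - '0';
        p += 4;
    }

    /* NIC only (an empty string is rejected). */
    if (*p == '\0')
        return out->nic >= 0;

    /* Otherwise a ':' followed by a decimal PID must come next. */
    if (*p != ':' || !isdigit((unsigned char) p[1]))
        return false;

    pid = strtol(p + 1, &end, 10);
    if (*end != '\0' || pid > 510)
        return false;

    out->pid = (int) pid;
    return true;
}
```

For instance, "cxi1:42" yields nic 1 and pid 42, while "cxi0:511" and "cxi12" are rejected.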

Bug fixes

  • [NA OFI]
    • Switch tcp provider to FI_PROGRESS_MANUAL
    • Prevent empty authorization keys from being passed
    • Check max MR key used when FI_MR_PROV_KEY is not set
    • New implementation of address management
      • Fix duplicate addresses on multithreaded lookups
      • Redefine address keys and raw addresses to prevent allocations
      • Use FI addr map to lookup by FI addr
      • Improve serialization and deserialization of addresses
    • Fix provider table and use EP proto
    • Refactor and clean up plugin initialization
      • Clean up ip and domain checking
      • Ensure interface name is not used as domain name for verbs etc
      • Use NA IP module and add missing NA_OFI_VERIFY_PROV_DOM for tcp provider
      • Rework handling of fi_info to open fabric/domain/endpoint
      • Separate fabric from domain and keep single domain per NA class
      • Refactor handling of scalable vs standard endpoints
    • Improve handling of retries after FI_EAGAIN return code
      • Abort retried ops after default 90s timeout
      • Abort ops to a target being retried after first NA_HOSTUNREACH error in CQ
  • [NA UCX]
    • Fix potential error not returned correctly on conn_insert()
    • Fix potential double free of worker_addr
    • Remove use of unified mode
    • Ensure address key is correctly reset
    • Fix hostname / net device parsing to allow for multiple net devices
  • [HG util]
    • Round up ms time conversion so that small timeouts do not result in a
      busy spin.
    • Use sched_yield() instead of deprecated pthread_yield()
    • Fix 'none' log level not recognized
    • Fix external logging facility
    • Let mercury log print counters on exit when debug outlet is on
  • [HG proc]
    • Prevent call to save_ptr()/restore_ptr() during HG_FREE
  • [HG Bulk]
    • Remove some NA_CANCELED event warnings.
  • [HG]
    • Properly handle error when overflow bulk transfer is interrupted. Previously the RPC callback was triggered regardless, potentially causing issues.
  • [CMake]
    • Correctly set INSTALL_RPATH for target libraries
    • Split mercury.pc pkg config file into multiple .pc files for
      mercury_util and na to prevent overlinking against those libraries
      when using pkg config.
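
The [HG util] item above about rounding up the ms time conversion can be illustrated with a short sketch. It assumes the source value is in microseconds, which the notes do not state, and the function names are hypothetical rather than Mercury's own. With truncating division, any timeout below 1 ms becomes 0 ms, so a wait call returns immediately and a caller looping on it spins. Rounding up keeps such timeouts at 1 ms.

```c
#include <stdint.h>

/* Truncating conversion: any value below 1000 us collapses to 0 ms. */
uint64_t
sketch_usec_to_ms_floor(uint64_t usec)
{
    return usec / 1000;
}

/* Rounding-up conversion: any non-zero timeout yields at least 1 ms. */
uint64_t
sketch_usec_to_ms_ceil(uint64_t usec)
{
    return usec / 1000 + (usec % 1000 != 0 ? 1 : 0);
}
```

A zero timeout still maps to 0 ms in both versions, so a caller that explicitly asks for a non-blocking poll is unaffected.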

⚠️ Known Issues

  • [NA OFI]
    • [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE to be set.
  • [NA UCX]
    • NA_Addr_to_string() cannot be used on non-listening processes to convert a self-address to a string.
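
For the FI_UNIVERSE_SIZE known issue above, one option is to set the variable from the application before the network layer is initialized. The sketch below is an illustration only: the helper name is hypothetical, and it assumes the variable is read when the libfabric provider is initialized, so it would have to run before any Mercury / NA initialization call. Exporting the variable in the job environment achieves the same thing.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#include <stdio.h>
#include <stdlib.h>

/*
 * Set FI_UNIVERSE_SIZE to the expected number of peers unless it is
 * already present in the environment.
 * Returns 0 on success and -1 on failure (as setenv() does).
 */
int
sketch_set_universe_size(unsigned int expected_peers)
{
    char value[16];

    snprintf(value, sizeof(value), "%u", expected_peers);
    /* Last argument 0: keep a value already chosen by the user. */
    return setenv("FI_UNIVERSE_SIZE", value, 0);
}
```

Passing 0 as the overwrite argument means a value exported by the user or the job launcher takes precedence over the application default.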

mercury 2.2.0rc6

27 Jun 21:32
v2.2.0rc6
mercury 2.2.0rc6 Pre-release

Summary

This version brings bug fixes and updates to our v2.0.0 release.

New features

  • [NA OFI]
    • Choose addr format dynamically based on user preferences
    • Add support for IPv6
    • Add support for FI_SOCKADDR_IB
    • Add support for FI_ADDR_STR and shm provider
    • Add support for FI_ADDR_OPX and opx provider
    • Add support for HPE cxi provider,
      init info format for cxi is:
      • NIC:PID (both or only one may be passed), NIC is cxi[0-9], PID is [0-510]
    • Use hwloc to select interface to use if NIC information is available
      (only supported by cxi at the moment)
    • Support device memory types and FI_HMEM for verbs and cxi providers
    • Update min required version to libfabric 1.9
    • Improve debug output to print verbose FI info of selected provider
  • [NA UCX]
    • Use active messaging (UCP_FEATURE_AM) for unexpected messages only; this
      allows removing address resolution and the retry on first message used
      to exchange connection IDs
    • Turn on mempool by default
    • Support device memory types
    • Bump min required version to 1.10
  • [NA PSM]
    • Add mercury NA plugin for the qlogic/intel PSM interface
      • Also support PSM2 (Intel OmniPath) through the PSM NA plugin
  • [NA]
    • Add na_addr_format init info
    • Add request_mem_device init info when GPU support is requested
    • Update NA_Mem_register() API call to support memory types (e.g., CUDA, ROCm, ZE) and device IDs
    • Add na_loc module for hwloc detection
    • Remove na_uint, na_int, na_bool_t and na_size_t types
    • Use separate versioning for library and update to v3.0.0
  • [NA IP]
    • Refactor na_ip_check_interface() to only use getaddrinfo() and getifaddrs()
    • Add family argument to force detection of IPv4/IPv6 addresses
    • Add ip debug log
  • [HG util]
    • Add mercury_byteswap.h for bswap macros
    • Add mercury_inet.h for htonll and ntohll routines
    • Add mercury_param.h to use sys/param.h or MIN/MAX macros etc
    • Use separate versioning for library and update to v3.0.0
  • [HG bulk]
    • Add support for memory attributes through a new HG_Bulk_create_attr() routine (support CUDA, ROCm, ZE)
  • [HG]
    • Remove MERCURY_ENABLE_STATS CMake option and use 'diag' log subsys instead
      • Modify behavior of stats field to turn on diagnostics
      • Refactor existing counters (used only if debug is on)
    • Add checksum levels that can be manually controlled at runtime (disabled by default, HG_CHECKSUM_NONE level)
    • Update to mchecksum v2.0
    • Add HG_Set_log_func() and HG_Set_log_stream() to control log output

Bug fixes

  • [NA OFI]
    • Switch tcp provider to FI_PROGRESS_MANUAL
    • Prevent empty authorization keys from being passed
    • Check max MR key used when FI_MR_PROV_KEY is not set
    • New implementation of address management
      • Fix duplicate addresses on multithreaded lookups
      • Redefine address keys and raw addresses to prevent allocations
      • Use FI addr map to lookup by FI addr
      • Improve serialization and deserialization of addresses
    • Fix provider table and use EP proto
    • Refactor and clean up plugin initialization
      • Clean up ip and domain checking
      • Ensure interface name is not used as domain name for verbs etc
      • Use NA IP module and add missing NA_OFI_VERIFY_PROV_DOM for tcp provider
      • Rework handling of fi_info to open fabric/domain/endpoint
      • Separate fabric from domain and keep single domain per NA class
      • Refactor handling of scalable vs standard endpoints
    • Improve handling of retries after FI_EAGAIN return code
      • Abort retried ops after default 90s timeout
      • Abort ops to a target being retried after first NA_HOSTUNREACH error in CQ
  • [NA UCX]
    • Fix potential error not returned correctly on conn_insert()
    • Fix potential double free of worker_addr
    • Remove use of unified mode
    • Ensure address key is correctly reset
    • Fix hostname / net device parsing to allow for multiple net devices
  • [HG util]
    • Round up ms time conversion so that small timeouts do not result in a
      busy spin.
    • Use sched_yield() instead of deprecated pthread_yield()
    • Fix 'none' log level not recognized
    • Fix external logging facility
    • Let mercury log print counters on exit when debug outlet is on
  • [HG proc]
    • Prevent call to save_ptr()/restore_ptr() during HG_FREE
  • [HG Bulk]
    • Remove some NA_CANCELED event warnings.
  • [HG]
    • Properly handle error when overflow bulk transfer is interrupted. Previously the RPC callback was triggered regardless, potentially causing issues.
  • [CMake]
    • Correctly set INSTALL_RPATH for target libraries
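
The [HG util] switch from pthread_yield() to sched_yield() noted above concerns code that gives up the CPU while waiting. A minimal sketch of such a wait loop follows. The function name and the bounded spin count are assumptions made for illustration and do not reflect how Mercury itself uses the call.

```c
#include <sched.h>
#include <stdatomic.h>

/*
 * Wait until *flag becomes non-zero, yielding the CPU between checks.
 * Gives up after max_spins checks.
 * Returns 1 if the flag was observed set, 0 otherwise.
 */
int
sketch_wait_flag(atomic_int *flag, unsigned int max_spins)
{
    unsigned int i;

    for (i = 0; i < max_spins; i++) {
        if (atomic_load(flag))
            return 1;
        /* POSIX call; replaces the non-standard pthread_yield(). */
        sched_yield();
    }
    return 0;
}
```

sched_yield() is specified by POSIX, whereas pthread_yield() was a non-standard extension that glibc has since deprecated, so the replacement is also the more portable choice.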

⚠️ Known Issues

  • [NA OFI]
    • [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE to be set.
  • [NA UCX]
    • NA_Addr_to_string() cannot be used on non-listening processes to convert a self-address to a string.

mercury 2.2.0rc5

17 Jun 15:17
v2.2.0rc5
mercury 2.2.0rc5 Pre-release

Summary

This version brings bug fixes and updates to our v2.0.0 release.

New features

  • [NA OFI]
    • Choose addr format dynamically based on user preferences
    • Add support for IPv6
    • Add support for FI_SOCKADDR_IB
    • Add support for FI_ADDR_STR and shm provider
    • Add support for FI_ADDR_OPX and opx provider
    • Add support for HPE cxi provider,
      init info format for cxi is:
      • NIC:PID (both or only one may be passed), NIC is cxi[0-9], PID is [0-510]
    • Use hwloc to select interface to use if NIC information is available
      (only supported by cxi at the moment)
    • Support device memory types and FI_HMEM for verbs and cxi providers
    • Update min required version to libfabric 1.9
    • Improve debug output to print verbose FI info of selected provider
  • [NA UCX]
    • Use active messaging (UCP_FEATURE_AM) for unexpected messages only; this
      allows removing address resolution and the retry on first message used
      to exchange connection IDs
    • Turn on mempool by default
    • Support device memory types
    • Bump min required version to 1.10
  • [NA PSM]
    • Add mercury NA plugin for the qlogic/intel PSM interface
      • Also support PSM2 (Intel OmniPath) through the PSM NA plugin
  • [NA]
    • Add na_addr_format init info
    • Add request_mem_device init info when GPU support is requested
    • Update NA_Mem_register() API call to support memory types (e.g., CUDA, ROCm, ZE) and device IDs
    • Add na_loc module for hwloc detection
    • Remove na_uint, na_int, na_bool_t and na_size_t types
    • Use separate versioning for library and update to v3.0.0
  • [NA IP]
    • Refactor na_ip_check_interface() to only use getaddrinfo() and getifaddrs()
    • Add family argument to force detection of IPv4/IPv6 addresses
    • Add ip debug log
  • [HG util]
    • Add mercury_byteswap.h for bswap macros
    • Add mercury_inet.h for htonll and ntohll routines
    • Add mercury_param.h to use sys/param.h or MIN/MAX macros etc
    • Use separate versioning for library and update to v3.0.0
  • [HG bulk]
    • Add support for memory attributes through a new HG_Bulk_create_attr() routine (support CUDA, ROCm, ZE)
  • [HG]
    • Remove MERCURY_ENABLE_STATS CMake option and use 'diag' log subsys instead
      • Modify behavior of stats field to turn on diagnostics
      • Refactor existing counters (used only if debug is on)
    • Add checksum levels that can be manually controlled at runtime (disabled by default, HG_CHECKSUM_NONE level)
    • Update to mchecksum v2.0

Bug fixes

  • [NA OFI]
    • Switch tcp provider to FI_PROGRESS_MANUAL
    • Prevent empty authorization keys from being passed
    • Check max MR key used when FI_MR_PROV_KEY is not set
    • New implementation of address management
      • Fix duplicate addresses on multithreaded lookups
      • Redefine address keys and raw addresses to prevent allocations
      • Use FI addr map to lookup by FI addr
      • Improve serialization and deserialization of addresses
    • Fix provider table and use EP proto
    • Refactor and clean up plugin initialization
      • Clean up ip and domain checking
      • Ensure interface name is not used as domain name for verbs etc
      • Use NA IP module and add missing NA_OFI_VERIFY_PROV_DOM for tcp provider
      • Rework handling of fi_info to open fabric/domain/endpoint
      • Separate fabric from domain and keep single domain per NA class
      • Refactor handling of scalable vs standard endpoints
    • Improve handling of retries after FI_EAGAIN return code
      • Abort retried ops after default 90s timeout
      • Abort ops to a target being retried after first NA_HOSTUNREACH error in CQ
  • [NA UCX]
    • Fix potential error not returned correctly on conn_insert()
    • Fix potential double free of worker_addr
    • Remove use of unified mode
    • Ensure address key is correctly reset
  • [HG util]
    • Round up ms time conversion so that small timeouts do not result in a
      busy spin.
    • Use sched_yield() instead of deprecated pthread_yield()
    • Fix 'none' log level not recognized
    • Fix external logging facility
    • Let mercury log print counters on exit when debug outlet is on
  • [HG proc]
    • Prevent call to save_ptr()/restore_ptr() during HG_FREE
  • [CMake]
    • Correctly set INSTALL_RPATH for target libraries
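
The [NA OFI] items above on retrying after FI_EAGAIN and aborting after a timeout follow a common pattern: retry while the operation asks for it, and give up once a deadline has passed. The sketch below shows that pattern in isolation. All names are hypothetical, errno's EAGAIN stands in for the libfabric return code, and the blocking loop is a simplification; the plugin's actual retry handling is not described in these notes.

```c
#define _POSIX_C_SOURCE 200112L /* for clock_gettime() */
#include <errno.h>
#include <time.h>

/* Hypothetical operation type: 0 on success, -EAGAIN to request a retry. */
typedef int (*sketch_op_t)(void *arg);

/* Current monotonic time in seconds. */
static double
sketch_now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

/*
 * Call op until it stops returning -EAGAIN or timeout_sec has elapsed.
 * Returns the op result, or -ETIMEDOUT when the deadline is reached.
 */
int
sketch_retry_until(sketch_op_t op, void *arg, double timeout_sec)
{
    double deadline = sketch_now_sec() + timeout_sec;
    int rc;

    for (;;) {
        rc = op(arg);
        if (rc != -EAGAIN)
            return rc; /* success or a non-retryable error */
        if (sketch_now_sec() >= deadline)
            return -ETIMEDOUT; /* abort the retried operation */
    }
}

/* Example operation: requests a retry until its counter reaches zero. */
int
sketch_flaky_op(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0) {
        (*remaining)--;
        return -EAGAIN;
    }
    return 0;
}
```

With the 90s default mentioned above, a call would look like sketch_retry_until(op, arg, 90.0). Using a monotonic clock keeps the deadline stable if the wall-clock time is adjusted while retrying.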

⚠️ Known Issues

  • [NA OFI]
    • [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE to be set.
  • [NA UCX]
    • NA_Addr_to_string() cannot be used on non-listening processes to convert a self-address to a string.

mercury 2.2.0rc4

14 Jun 00:11
v2.2.0rc4
mercury 2.2.0rc4 Pre-release

Summary

This version brings bug fixes and updates to our v2.0.0 release.

New features

  • [NA OFI]
    • Choose addr format dynamically based on user preferences
    • Add support for IPv6
    • Add support for FI_SOCKADDR_IB
    • Add support for FI_ADDR_STR and shm provider
    • Add support for FI_ADDR_OPX and opx provider
    • Add support for HPE cxi provider,
      init info format for cxi is:
      • NIC:PID (both or only one may be passed), NIC is cxi[0-9], PID is [0-510]
    • Use hwloc to select interface to use if NIC information is available
      (only supported by cxi at the moment)
    • Support device memory types and FI_HMEM for verbs and cxi providers
    • Update min required version to libfabric 1.9
    • Improve debug output to print verbose FI info of selected provider
  • [NA UCX]
    • Use active messaging (UCP_FEATURE_AM) for unexpected messages only; this
      allows removing address resolution and the retry on first message used
      to exchange connection IDs
    • Turn on mempool by default
    • Support device memory types
    • Bump min required version to 1.10
  • [NA PSM]
    • Add mercury NA plugin for the qlogic/intel PSM interface
      • Also support PSM2 (Intel OmniPath) through the PSM NA plugin
  • [NA]
    • Add na_addr_format init info
    • Add request_mem_device init info when GPU support is requested
    • Update NA_Mem_register() API call to support memory types (e.g., CUDA, ROCm, ZE) and device IDs
    • Add na_loc module for hwloc detection
    • Remove na_uint, na_int, na_bool_t and na_size_t types
    • Use separate versioning for library and update to v3.0.0
  • [NA IP]
    • Refactor na_ip_check_interface() to only use getaddrinfo() and getifaddrs()
    • Add family argument to force detection of IPv4/IPv6 addresses
    • Add ip debug log
  • [HG util]
    • Add mercury_byteswap.h for bswap macros
    • Add mercury_inet.h for htonll and ntohll routines
    • Add mercury_param.h to use sys/param.h or MIN/MAX macros etc
    • Use separate versioning for library and update to v3.0.0
  • [HG bulk]
    • Add support for memory attributes through a new HG_Bulk_create_attr() routine (support CUDA, ROCm, ZE)
  • [HG]
    • Remove MERCURY_ENABLE_STATS CMake option and use 'diag' log subsys instead
      • Modify behavior of stats field to turn on diagnostics
      • Refactor existing counters (used only if debug is on)
    • Add checksum levels that can be manually controlled at runtime (disabled by default, HG_CHECKSUM_NONE level)
    • Update to mchecksum v2.0

Bug fixes

  • [NA OFI]
    • Switch tcp provider to FI_PROGRESS_MANUAL
    • Prevent empty authorization keys from being passed
    • Check max MR key used when FI_MR_PROV_KEY is not set
    • New implementation of address management
      • Fix duplicate addresses on multithreaded lookups
      • Redefine address keys and raw addresses to prevent allocations
      • Use FI addr map to lookup by FI addr
      • Improve serialization and deserialization of addresses
    • Fix provider table and use EP proto
    • Refactor and clean up plugin initialization
      • Clean up ip and domain checking
      • Ensure interface name is not used as domain name for verbs etc
      • Use NA IP module and add missing NA_OFI_VERIFY_PROV_DOM for tcp provider
      • Rework handling of fi_info to open fabric/domain/endpoint
      • Separate fabric from domain and keep single domain per NA class
      • Refactor handling of scalable vs standard endpoints
    • Improve handling of retries after FI_EAGAIN return code
      • Abort retried ops after default 90s timeout
      • Abort ops to a target being retried after first NA_HOSTUNREACH error in CQ
  • [NA UCX]
    • Fix potential error not returned correctly on conn_insert()
    • Fix potential double free of worker_addr
    • Remove use of unified mode
    • Ensure address key is correctly reset
  • [HG util]
    • Round up ms time conversion so that small timeouts do not result in a
      busy spin.
    • Use sched_yield() instead of deprecated pthread_yield()
    • Fix 'none' log level not recognized
    • Fix external logging facility
    • Let mercury log print counters on exit when debug outlet is on
  • [HG proc]
    • Prevent call to save_ptr()/restore_ptr() during HG_FREE
  • [CMake]
    • Correctly set INSTALL_RPATH for target libraries
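
The [NA OFI] fix above for duplicate addresses on multithreaded lookups addresses a classic race: two threads look up the same key, both find it missing, and both insert it. One common remedy, sketched below, is to perform the search and the insert under a single lock. The table layout and all names are hypothetical; the notes do not describe how the plugin's new address management is implemented.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ADDRS 64

/* Hypothetical fixed-size address table protected by one mutex. */
struct sketch_addr_table {
    pthread_mutex_t lock;
    uint64_t keys[SKETCH_MAX_ADDRS];
    size_t count;
};

void
sketch_table_init(struct sketch_addr_table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    t->count = 0;
}

/*
 * Return the index of the entry for key, inserting it if absent.
 * Search and insert happen under the same lock, so two threads looking
 * up the same key cannot both insert it. Returns -1 if the table is full.
 */
int
sketch_lookup_or_insert(struct sketch_addr_table *t, uint64_t key)
{
    int ret = -1;
    size_t i;

    pthread_mutex_lock(&t->lock);
    for (i = 0; i < t->count; i++) {
        if (t->keys[i] == key) {
            ret = (int) i;
            break;
        }
    }
    if (ret < 0 && t->count < SKETCH_MAX_ADDRS) {
        t->keys[t->count] = key;
        ret = (int) t->count++;
    }
    pthread_mutex_unlock(&t->lock);
    return ret;
}
```

Looking up the same key twice returns the same index and leaves a single entry in the table, whichever thread gets there first.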

⚠️ Known Issues

  • [NA OFI]
    • [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE to be set.
  • [NA UCX]
    • NA_Addr_to_string() cannot be used on non-listening processes to convert a self-address to a string.

mercury 2.2.0rc3

10 Jun 16:41
v2.2.0rc3
mercury 2.2.0rc3 Pre-release

Summary

This version brings bug fixes and updates to our v2.0.0 release.

New features

  • [NA OFI]
    • Choose addr format dynamically based on user preferences
    • Add support for IPv6
    • Add support for FI_SOCKADDR_IB
    • Add support for FI_ADDR_STR and shm provider
    • Add support for FI_ADDR_OPX and opx provider
    • Add support for HPE cxi provider,
      init info format for cxi is:
      • NIC:PID (both or only one may be passed), NIC is cxi[0-9], PID is [0-510]
    • Use hwloc to select interface to use if NIC information is available
      (only supported by cxi at the moment)
    • Support device memory types and FI_HMEM for verbs and cxi providers
    • Update min required version to libfabric 1.9
    • Improve debug output to print verbose FI info of selected provider
  • [NA UCX]
    • Use active messaging (UCP_FEATURE_AM) for unexpected messages only; this
      allows removing address resolution and the retry on first message used
      to exchange connection IDs
    • Turn on mempool by default
    • Support device memory types
    • Bump min required version to 1.10
  • [NA PSM]
    • Add mercury NA plugin for the qlogic/intel PSM interface
      • Also support PSM2 (Intel OmniPath) through the PSM NA plugin
  • [NA]
    • Add na_addr_format init info
    • Add request_mem_device init info when GPU support is requested
    • Update NA_Mem_register() API call to support memory types (e.g., CUDA, ROCm, ZE) and device IDs
    • Add na_loc module for hwloc detection
    • Remove na_uint, na_int, na_bool_t and na_size_t types
    • Use separate versioning for library and update to v3.0.0
  • [NA IP]
    • Refactor na_ip_check_interface() to only use getaddrinfo() and getifaddrs()
    • Add family argument to force detection of IPv4/IPv6 addresses
    • Add ip debug log
  • [HG util]
    • Add mercury_byteswap.h for bswap macros
    • Add mercury_inet.h for htonll and ntohll routines
    • Add mercury_param.h to use sys/param.h or MIN/MAX macros etc
    • Use separate versioning for library and update to v3.0.0
  • [HG bulk]
    • Add support for memory attributes through a new HG_Bulk_create_attr() routine (support CUDA, ROCm, ZE)
  • [HG]
    • Remove MERCURY_ENABLE_STATS CMake option and use 'diag' log subsys instead
      • Modify behavior of stats field to turn on diagnostics
      • Refactor existing counters (used only if debug is on)
    • Add checksum levels that can be manually controlled at runtime (disabled by default, HG_CHECKSUM_NONE level)
    • Update to mchecksum v2.0

Bug fixes

  • [NA OFI]
    • Switch tcp provider to FI_PROGRESS_MANUAL
    • Prevent empty authorization keys from being passed
    • Check max MR key used when FI_MR_PROV_KEY is not set
    • New implementation of address management
      • Fix duplicate addresses on multithreaded lookups
      • Redefine address keys and raw addresses to prevent allocations
      • Use FI addr map to lookup by FI addr
      • Improve serialization and deserialization of addresses
    • Fix provider table and use EP proto
    • Refactor and clean up plugin initialization
      • Clean up ip and domain checking
      • Ensure interface name is not used as domain name for verbs etc
      • Use NA IP module and add missing NA_OFI_VERIFY_PROV_DOM for tcp provider
      • Rework handling of fi_info to open fabric/domain/endpoint
      • Separate fabric from domain and keep single domain per NA class
      • Refactor handling of scalable vs standard endpoints
    • Improve handling of retries after FI_EAGAIN return code
      • Abort retried ops after default 90s timeout
      • Abort ops to a target being retried after first NA_HOSTUNREACH error in CQ
  • [NA UCX]
    • Fix potential error not returned correctly on conn_insert()
    • Fix potential double free of worker_addr
    • Remove use of unified mode
    • Ensure address key is correctly reset
  • [HG util]
    • Round up ms time conversion so that small timeouts do not result in a
      busy spin.
    • Use sched_yield() instead of deprecated pthread_yield()
    • Fix 'none' log level not recognized
    • Fix external logging facility
    • Let mercury log print counters on exit when debug outlet is on
  • [HG proc]
    • Prevent call to save_ptr()/restore_ptr() during HG_FREE
  • [CMake]
    • Correctly set INSTALL_RPATH for target libraries

⚠️ Known Issues

  • [NA OFI]
    • [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE to be set.
  • [NA UCX]
    • NA_Addr_to_string() cannot be used on non-listening processes to convert a self-address to a string.

mercury 2.2.0rc2

09 Jun 22:21
v2.2.0rc2
mercury 2.2.0rc2 Pre-release

Summary

This version brings bug fixes and updates to our v2.0.0 release.

New features

  • [NA OFI]
    • Choose addr format dynamically based on user preferences
    • Add support for IPv6
    • Add support for FI_SOCKADDR_IB
    • Add support for FI_ADDR_STR and shm provider
    • Add support for FI_ADDR_OPX and opx provider
    • Add support for HPE cxi provider,
      init info format for cxi is:
      • NIC:PID (both or only one may be passed), NIC is cxi[0-9], PID is [0-510]
    • Use hwloc to select interface to use if NIC information is available
      (only supported by cxi at the moment)
    • Support device memory types and FI_HMEM for verbs and cxi providers
    • Update min required version to libfabric 1.9
    • Improve debug output to print verbose FI info of selected provider
  • [NA UCX]
    • Use active messaging (UCP_FEATURE_AM) for unexpected messages only; this
      allows removing address resolution and the retry on first message used
      to exchange connection IDs
    • Turn on mempool by default
    • Support device memory types
    • Bump min required version to 1.10
  • [NA PSM]
    • Add mercury NA plugin for the qlogic/intel PSM interface
      • Also support PSM2 (Intel OmniPath) through the PSM NA plugin
  • [NA]
    • Add na_addr_format init info
    • Update NA_Mem_register() API call to support memory types (e.g., CUDA, ROCm, ZE) and device IDs
    • Add na_loc module for hwloc detection
    • Remove na_uint, na_int, na_bool_t and na_size_t types
    • Use separate versioning for library and update to v3.0.0
  • [NA IP]
    • Refactor na_ip_check_interface() to only use getaddrinfo() and getifaddrs()
    • Add family argument to force detection of IPv4/IPv6 addresses
    • Add ip debug log
  • [HG util]
    • Add mercury_byteswap.h for bswap macros
    • Add mercury_inet.h for htonll and ntohll routines
    • Add mercury_param.h to use sys/param.h or MIN/MAX macros etc
    • Use separate versioning for library and update to v3.0.0
  • [HG bulk]
    • Add support for memory attributes through a new HG_Bulk_create_attr() routine (support CUDA, ROCm, ZE)
  • [HG]
    • Remove MERCURY_ENABLE_STATS CMake option and use 'diag' log subsys instead
      • Modify behavior of stats field to turn on diagnostics
      • Refactor existing counters (used only if debug is on)
    • Add checksum levels that can be manually controlled at runtime (disabled by default, HG_CHECKSUM_NONE level)
    • Update to mchecksum v2.0

Bug fixes

  • [NA OFI]
    • Switch tcp provider to FI_PROGRESS_MANUAL
    • Prevent empty authorization keys from being passed
    • Check max MR key used when FI_MR_PROV_KEY is not set
    • New implementation of address management
      • Fix duplicate addresses on multithreaded lookups
      • Redefine address keys and raw addresses to prevent allocations
      • Use FI addr map to lookup by FI addr
      • Improve serialization and deserialization of addresses
    • Fix provider table and use EP proto
    • Refactor and clean up plugin initialization
      • Clean up ip and domain checking
      • Ensure interface name is not used as domain name for verbs etc
      • Use NA IP module and add missing NA_OFI_VERIFY_PROV_DOM for tcp provider
      • Rework handling of fi_info to open fabric/domain/endpoint
      • Separate fabric from domain and keep single domain per NA class
      • Refactor handling of scalable vs standard endpoints
    • Improve handling of retries after FI_EAGAIN return code
      • Abort retried ops after default 90s timeout
      • Abort ops to a target being retried after first NA_HOSTUNREACH error in CQ
  • [NA UCX]
    • Fix potential error not returned correctly on conn_insert()
    • Fix potential double free of worker_addr
    • Remove use of unified mode
    • Ensure address key is correctly reset
  • [HG util]
    • Round up ms time conversion so that small timeouts do not result in a
      busy spin.
    • Use sched_yield() instead of deprecated pthread_yield()
    • Fix 'none' log level not recognized
    • Fix external logging facility
    • Let mercury log print counters on exit when debug outlet is on
  • [HG proc]
    • Prevent call to save_ptr()/restore_ptr() during HG_FREE
  • [CMake]
    • Correctly set INSTALL_RPATH for target libraries

⚠️ Known Issues

  • [NA OFI]
    • [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE to be set.
  • [NA UCX]
    • NA_Addr_to_string() cannot be used on non-listening processes to convert a self-address to a string.