Releases: LLNL/scr
v3.1.0
What's Changed
- shorten resource manager names by @adammoody in #391
- cmake: install python scripts by @adammoody in #396
- add pyfe to pythonpath by @adammoody in #397
- Shorten job launcher names by @adammoody in #394
- avoid import * by @adammoody in #399
- call getjobid method on resource manager by @adammoody in #400
- allow runproc command to be a string by @adammoody in #406
- readme on how to add new resource manager by @adammoody in #408
- add class for scr_flush_file command by @adammoody in #410
- add class for scr_log_event command by @adammoody in #411
- Watchflush by @adammoody in #412
- prerun: style edits by @adammoody in #414
- split command strings with shlex by @adammoody in #417
- Scrhalt by @adammoody in #418
- define functions to return scr dirs in scr_env by @adammoody in #419
- use makedirs exist_ok in scr_halt by @adammoody in #420
- note cmakelists in resource manager and job launcher readmes by @adammoody in #421
- style edits in scr_run, remove unused code by @adammoody in #422
- move fetch from init to have_restart by @adammoody in #480
- Python versions of the SCR scripts by @adammoody in #483
- Add basic Flux support by @adammoody in #484
- Branch for update SCR Config default setup value by @hbchen1984 in #486
- Add shared file testing to test_api by @mcfadden8 in #487
- file descriptors can be zero by @adammoody in #490
- examples: fix include stdint for uint64_t by @adammoody in #491
- examples: make read_checkpoint symmetric with write_checkpoint by @adammoody in #492
- Shared files in global cache by @adammoody in #495
- doc: example to configure SINGLE and XOR by @adammoody in #496
- Parameterize sleep period when polling async flush by @adammoody in #498
- drop debug message from testing async flush by @adammoody in #499
- Removed cmake code duplication in dist by @mcfadden8 in #502
- print commands executed by run_test.sh by @adammoody in #503
- Move axl mpi to axl by @mcfadden8 in #501
- Rename before removal when cleaning up from CI test to work more reliably with NFS by @mcfadden8 in #504
- test_api: usage output missing some newlines by @ofaaland in #509
- slurm.py: Handle case where no end time exists by @ofaaland in #515
- Copy python scripts into build directory by @ofaaland in #516
- SCR_ADD_TEST: prepare to add python script support by @ofaaland in #517
- scr_common.py: Add choose_bindir() to use build binaries by @ofaaland in #518
- Add run_test_py.sh by @ofaaland in #519
- scr_run.py: exit if no jobid found by @ofaaland in #521
- flux launcher: get jobspec using runproc by @ofaaland in #523
- add --endtime to scr_env.py by @ofaaland in #525
- SCR_ADD_TEST: run python based tests for flux by @ofaaland in #526
- Update jobid and endtime based on flux v0.46.1 by @ofaaland in #527
- link to ecp components via cmake config targets by @adammoody in #514
- add scrConfig to dist by @adammoody in #528
- Update min cmake version to 3.14.5 by @mcfadden8 in #512
- add SCR_FETCH_BYPASS config param by @adammoody in #533
- doc: highlight restart with a different number of ranks by @adammoody in #534
- python: default valid to True in complete calls by @adammoody in #538
- doc: alphabetize resource managers in run section by @adammoody in #539
- Added option to bootstrap to build with static libraries by @mcfadden8 in #543
- cmake: link static target to static dependencies by @adammoody in #536
- measure time waiting for async flush by @adammoody in #544
- Docs: Update to use yaml files by @CamStan in #545
- Docs: Fix theme rendering issue by @CamStan in #551
- doc: add flux to build.rst by @ofaaland in #546
- flux: save string form of jobid by @ofaaland in #547
- scr_run.py: validate launcher by @ofaaland in #548
- flux launcher should detect failed jobs by @ofaaland in #549
- scr_run.py: exit with non-zero exit code on failure by @ofaaland in #550
- flux: remove 'mini' command from tests by @jameshcorbett in #552
- tests: fix argument passing to scr_run.py by @jameshcorbett in #554
- yapf: format python files by @adammoody in #555
- scripts: simplify resource manager function names by @adammoody in #556
- python: run docformatter by @adammoody in #557
- scripts: convert to be more pythonic by @adammoody in #558
- scripts: refresh const and check_node by @adammoody in #561
- Docs: Update deps and enable reproducible builds by @CamStan in #562
- scripts: move scripts to libexec by @adammoody in #564
- scripts: renaming and minor refactoring by @adammoody in #565
- scripts: renaming to drop scr references by @adammoody in #566
- scripts: rename scr_const to config by @adammoody in #567
- scripts: rename scr_flux to scr_fluxrun by @adammoody in #568
- scripts: restore scr_run scripts and refresh tests by @adammoody in #569
- scripts: raise exceptions rather than exit from run.py by @adammoody in #570
- scripts: add scr_srun bash example by @adammoody in #571
- scripts: create separate directory to hold node tests by @adammoody in #572
- scripts: add remoteexec directory by @adammoody in #573
- scripts: add RemoteExecResult to standardize rexec output by @adammoody in #574
- Remove perl by @adammoody in #575
- scripts: consolidate hostlist manipulation logic to hostlist.py by @adammoody in #576
- update python test cases after scrjob refactoring by @adammoody in #577
- add --min-nodes to scr_should_exit command by @adammoody in #581
- Test fixes by @jameshcorbett in #582
- use ISO time format in scr_halt by @adammoody in #583
- Remove early build of er that may have been placed in here for debugg… by @mcfadden8 in #586
- Docs: update to latest doc deps by @CamStan in #587
- Flux cleanup by @jameshcorbett in #585
- Add example Flux jobscripts by @jameshcorbett in #589
- simulate different work kernels in test_api by @adammoody in #537
- Added User facing Python Documentations. by @hariharan-devarajan in #593
- Update version numbers for v3.1.0 release by @CamStan in #594
- Fix docs dependency typo by @CamStan in #595
New Contributors
- @jameshcorbett made their first contribution in #552
- @hariharan-devarajan made their first contribution in #593
Full Changelog: v3.0...v3.1.0
v3.0.1
This release provides a few performance improvements over v3.0:
- raises default `SCR_MPI_BUF_SIZE` from 128KiB to 1MiB
- raises default `SCR_FILE_BUF_SIZE` from 1MiB to 32MiB
- adds new `SCR_FLUSH_ASYNC_USLEEP` to configure sleep time while waiting on async flush to complete, and lowers default sleep time from 10 seconds to 1000 microseconds
- update to ER-v0.3.0
  - increases file buffer size from 1MiB to 32MiB
- update AXL-v0.7.0
  - increase file buffer size from 1MiB to 32MiB
  - disables writing to an `_AXL` temporary file to avoid slow rename step on some file systems
v3.0
- Added Python bindings for the SCR library
  - Supports Python 2 and 3
  - Implemented in the `scr.py` module (`import scr`)
  - Uses the C Foreign Function Interface (CFFI) to wrap calls to libscr
  - To use the Python bindings, first install SCR, then follow the steps in the python/README.md
- Improved support for large datasets and shared access to files. Applications can now configure SCR to bypass the cache and access datasets on the global file system:
  - For datasets that are too large to fit in cache or for systems that have no cache available, SCR can use the global file system. This improves portability so that applications can use SCR on any cluster.
  - Since bypass mode is more general, it is enabled by default. To use cache, one must disable bypass mode by setting `SCR_CACHE_BYPASS=0`.
  - For applications that write shared files, SCR can use bypass mode during the SCR Checkpoint/Output API.
  - For applications that write datasets as a file-per-process but require shared access to files during restart, one can write to cache but set `SCR_GLOBAL_RESTART=1`. This rebuilds and flushes cached datasets during `SCR_Init`. It also enables bypass mode for restart so that the application can read its dataset from the global file system using the SCR Restart API.
- Applications can now instruct SCR to load a specific checkpoint by naming it in the `SCR_CURRENT` parameter before calling `SCR_Init`.
- Restart loop:
  - SCR now supports a loop around `SCR_Have_restart`, `SCR_Start_restart`, and `SCR_Complete_restart`. If an application detects a problem during its restart, it can pass `valid=0` to `SCR_Complete_restart`. SCR will then load the next most recent checkpoint, which the application can query with another call to `SCR_Have_restart`. This process can be continued until either a checkpoint is read successfully or all checkpoints have been exhausted.
- `SCR_Need_checkpoint` now returns false unless one has set one of `SCR_CHECKPOINT_INTERVAL/SECONDS/OVERHEAD`
- Restored watchdog support on SLURM systems
- New build options:
  - Added support for static-only builds with `-DBUILD_SHARED_LIBS=OFF`
  - Added CMake options to disable portions of the build including `-DENABLE_EXAMPLES=[ON/OFF]` and `-DENABLE_TESTS=[ON/OFF]`
  - Added support to specify the number of trailing underscores for Fortran bindings with `-DENABLE_FORTRAN_TRAILING_UNDERSCORES=[AUTO/ON/OFF]`
- New API calls:
  - `SCR_Config(const char* config)` to set and query SCR configuration parameters before `SCR_Init()`, and query parameters after `SCR_Init()`.
  - `SCR_Configf(const char* config, ...)` a version of `SCR_Config` that supports printf-style formatting.
  - `SCR_Current(const char* name)` enables an application that reads its checkpoint without using the SCR Restart API to inform SCR about which checkpoint it loaded so that SCR can still track the proper ordering of checkpoints
  - `SCR_Delete(const char* name)` to ask SCR to delete a dataset
  - `SCR_Drop(const char* name)` to ask SCR to drop a dataset from the index without deleting the underlying data files
- Improved flush methods
  - Added IBM BB API (https://github.com/IBM/CAST), e.g., `SCR_FLUSH_TYPE=BBAPI`
  - Added pthreads, e.g., `SCR_FLUSH_TYPE=PTHREAD`
  - Added support for multiple outstanding asynchronous flushes
  - Initial support for `scr_poststage` of BBAPI transfers after completion of allocation (beta)
- New redundancy scheme:
  - Reed-Solomon encoding (`SCR_COPY_TYPE=RS`) allows a configurable number of failures per group, from 1 to N-1 where N is the set size. Use `SCR_SET_SIZE` to specify the group size and `SCR_SET_FAILURES` to specify the number of failures per group.
- SCR configuration parameters now support interpolation of environment variables in configuration files, e.g.,
    >>: cat .scrconf
    SCR_CACHE_BASE=$BBPATH
- Default path for SCR system configuration file moved from `/etc/scr.conf` to `<install>/etc/scr.conf`
- SCR now preserves file metadata including atime, mtime, uid, gid, and mode bits
- New logging options:
  - text file - written to the SCR prefix directory (`SCR_LOG_TXT_ENABLE=1`)
  - syslog - one can configure the syslog prefix, facility, and level to be used (`SCR_LOG_SYSLOG_ENABLE=1`)
- Apps can now configure SCR to maintain a sliding window of checkpoints on the parallel file system with an `SCR_PREFIX_SIZE` parameter. After flushing a new checkpoint, SCR will delete older checkpoints
- Default cache and control directories have been moved from `/tmp` to `/dev/shm` on Linux systems
- Assists for application developers when integrating the SCR API
  - A new `SCR_CACHE_PURGE` parameter configures SCR to delete datasets from cache in new runs
  - A new `SCR_PREFIX_PURGE` parameter similarly deletes datasets from the prefix directory in new runs
  - Added internal checks to warn developers about incorrect API usage
- Refactored code base to use ECP-VeloC components https://github.com/ecp-veloc/
  - Improves code modularity and reuse
  - Improved testing
- New release tarball packages source for SCR and many of its components to simplify direct builds, e.g.,
    wget https://github.com/LLNL/scr/releases/download/v3.0/scr-v3.0.tgz
    tar -xzf scr-v3.0.tgz
    cd scr-v3.0
    mkdir build
    cd build
    cmake -DCMAKE_INSTALL_PREFIX=../install -DSCR_RESOURCE_MANAGER=SLURM ../
    make -j install
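
The restart loop described in the v3.0 notes above can be pictured with a small model. The sketch below does not call SCR. The function name `restart_from_newest_valid` and the `try_read` callback are hypothetical stand-ins, and the comments only mark where `SCR_Have_restart`, `SCR_Start_restart`, and `SCR_Complete_restart` would sit in a real application.

```python
from typing import Callable, List, Optional


def restart_from_newest_valid(
    checkpoints: List[str],
    try_read: Callable[[str], bool],
) -> Optional[str]:
    """Try checkpoints newest-first until one reads cleanly.

    `checkpoints` is ordered oldest to newest. `try_read` stands in for the
    application's own read logic and returns True when the data is usable.
    """
    remaining = list(checkpoints)
    while remaining:                # models SCR_Have_restart reporting a checkpoint
        name = remaining.pop()      # the most recent checkpoint still available
        valid = try_read(name)      # models SCR_Start_restart plus the application's reads
        if valid:                   # models SCR_Complete_restart(valid=1)
            return name
        # valid=0: fall through so the next most recent checkpoint is offered
    return None                     # all checkpoints exhausted


# Usage: the newest checkpoint is treated as corrupt, so the loop falls back.
ckpts = ["ckpt.1", "ckpt.2", "ckpt.3"]
loaded = restart_from_newest_valid(ckpts, lambda name: name != "ckpt.3")
print(loaded)
```

An application that gets `None` back from such a loop would typically start from its initial state rather than from a checkpoint.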
v3.0rc2
This is the second release candidate for v3.0. This adds the following new features and bug fixes on top of items listed in v3.0rc1.
New Features:
- Added support for multiple outstanding asynchronous flushes
- Added support to set `SCR_PREFIX` through `SCR_Config`
- Enable queries with `SCR_Config` after `SCR_Init` has been called
- Changed `SCR_Config` behavior so that SCR assumes default values for all parameters on each run, rather than reading the `app.conf` file to use values set by `SCR_Config` in a previous run
- `SCR_Need_checkpoint` now returns false unless one has set one of `SCR_CHECKPOINT_INTERVAL/SECONDS/OVERHEAD`
- Added support to specify the number of trailing underscores for Fortran bindings with `-DENABLE_FORTRAN_TRAILING_UNDERSCORES=[AUTO/ON/OFF]`
- Restored watchdog support on SLURM systems
- Initial support for `scr_poststage` of BBAPI transfers after completion of allocation (beta)
- Added support for static-only builds with `-DBUILD_SHARED_LIBS=OFF`
- Added CMake options to disable portions of the build including `-DENABLE_EXAMPLES=[ON/OFF]` and `-DENABLE_TESTS=[ON/OFF]`
- Release tarball scr-top has been refactored to merge SCR and its immediate dependencies into a single library (libscr) for a faster build and a simplified link step
Bug fixes:
- Auto define store descriptors for default cache and control directories
- Use proper cache directories during scavenge when control directory and cache directory are different
- Update `SCR_FLUSH_ASYNC_TYPE=PTHREAD` to allow asynchronous flush
- Enable use of `=` characters in `SCR_Config` values
v3.0rc1
This is a release candidate for v3.0.
- Improved support for large datasets and shared access to files. Applications can now configure SCR to bypass the cache and access datasets on the global file system:
  - Since bypass mode is more general, it is enabled by default. To use cache, one must disable bypass mode (`SCR_CACHE_BYPASS=0`).
  - For datasets that are too large to fit in cache or for systems that have no cache available, SCR can use the global file system. This improves portability so that applications can use SCR on any cluster.
  - For applications that write shared files, SCR can use bypass mode during the SCR Checkpoint/Output API.
  - For applications that write datasets as a file-per-process but require shared access to files during restart, one can write to cache but enable `SCR_GLOBAL_RESTART`. This rebuilds and flushes cached datasets during `SCR_Init`. It also enables bypass mode for restart, so an application can read its dataset from the global file system using the SCR Restart API.
- Applications can now instruct SCR to load a specific checkpoint by naming it in the `SCR_CURRENT` parameter before calling `SCR_Init`.
- Restart loop:
  - SCR now supports a loop around `SCR_Have_restart`, `SCR_Start_restart`, and `SCR_Complete_restart`. If an application detects a problem during its restart, it can pass `valid=0` to `SCR_Complete_restart`. SCR will then load the next most recent checkpoint, which the application can query with another call to `SCR_Have_restart`.
- New API calls:
  - `SCR_Config(const char* config)` to set and query SCR configuration parameters before `SCR_Init()`
  - `SCR_Current(const char* name)` enables an application that reads its checkpoint without using the SCR Restart API to inform SCR about which checkpoint it loaded so that SCR can still track the proper ordering of checkpoints
  - `SCR_Delete(const char* name)` to ask SCR to delete a dataset
  - `SCR_Drop(const char* name)` to ask SCR to drop a dataset from the index without deleting the underlying data files
- New flush methods
  - Added IBM BB API (https://github.com/IBM/CAST), e.g., `SCR_FLUSH_TYPE=BBAPI`
  - Added pthreads, e.g., `SCR_FLUSH_TYPE=PTHREAD`
- New redundancy scheme:
  - Reed-Solomon encoding (`SCR_COPY_TYPE=RS`) allows a configurable number of failures per group, from 1 to N-1 where N is the set size. Use `SCR_SET_SIZE` to specify the group size and `SCR_SET_FAILURES` to specify the number of failures per group.
- SCR configuration parameters now support interpolation of environment variables in configuration files, e.g.,
    >>: cat .scrconf
    SCR_CACHE_BASE=$BBPATH
- SCR now preserves file metadata including atime, mtime, uid, gid, and mode bits
- New logging options:
  - text file - written to the SCR prefix directory (`SCR_LOG_TXT_ENABLE=1`)
  - syslog - one can configure the syslog prefix, facility, and level to be used (`SCR_LOG_SYSLOG_ENABLE=1`)
- Apps can now configure SCR to maintain a sliding window of checkpoints on the parallel file system with an `SCR_PREFIX_SIZE` parameter. After flushing a new checkpoint, SCR will delete older checkpoints
- Default cache and control directories have been moved from `/tmp` to `/dev/shm` on Linux systems
- Assists for application developers when integrating the SCR API
  - A new `SCR_CACHE_PURGE` parameter configures SCR to delete datasets from cache in new runs
  - A new `SCR_PREFIX_PURGE` parameter similarly deletes datasets from the prefix directory in new runs
  - Added internal checks to warn developers about incorrect API usage
- Added Python bindings for SCR library (beta)
  - Implemented in an `scr.py` module (`import scr`)
  - Uses C Foreign Function Interface (CFFI) to wrap C functions in libscr
  - Supports Python 2 and 3
- Refactored code base to use ECP-VeloC components https://github.com/ecp-veloc/
  - Improves code modularity and reuse
  - Improved testing
- scr-top package (https://github.com/llnl/scr-top) includes source for SCR and its ECP-VeloC components to simplify direct installs, e.g.,
    tar -xzf scr-top-v3.0rc1.tgz
    cd scr-top-v3.0rc1
    mkdir build install
    cd build
    cmake -DCMAKE_INSTALL_PREFIX=../install -DSCR_RESOURCE_MANAGER=SLURM ../
    make install
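
The sliding window of checkpoints controlled by `SCR_PREFIX_SIZE`, mentioned in the notes above, can be illustrated with a toy model. This is only a sketch of the bookkeeping, not SCR's implementation: the `PrefixWindow` class and its attributes are hypothetical, it does not touch the file system, and it assumes the window size is a positive count.

```python
from collections import deque
from typing import Deque, List


class PrefixWindow:
    """Toy model of a sliding window of checkpoints in the prefix directory."""

    def __init__(self, size: int) -> None:
        if size < 1:
            raise ValueError("window size must be at least 1")
        self.size = size
        self.kept: Deque[str] = deque()   # checkpoints still present, oldest first
        self.deleted: List[str] = []      # checkpoints removed so far, in order

    def flush(self, name: str) -> None:
        """Record a newly flushed checkpoint, then drop the oldest beyond the window."""
        self.kept.append(name)
        while len(self.kept) > self.size:
            self.deleted.append(self.kept.popleft())


# Usage: with a window of 2, only the two newest checkpoints remain.
window = PrefixWindow(size=2)
for step in range(1, 5):
    window.flush(f"ckpt.{step}")

print(list(window.kept))
print(window.deleted)
```

Deleting only after the new checkpoint has been flushed, as in `flush` here, matches the ordering stated in the notes: the newest checkpoint is safely on the parallel file system before older ones are removed.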
SCR v2.0.0
🎉 SCR Version 2.0 🎉
This release marks a milestone in SCR's long history of bringing dependable, scalable file set management to multiple HPC platforms.
Some highlights include:
- Support for multiple platform specific hardware technologies, including Cray DataWarp
- Portability across many HPC centers via scheduler integration
- Scalable checkpoint resilience and restart capabilities
SCR v1.2.2
Updates the SCR command `SCR_Route_file` to always be successful. In the case where `SCR_Route_file` is called outside of a start/complete pair or when SCR is disabled, the original file path is simply copied to the return string.
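
The pass-through behavior described above can be modeled in a few lines. This is a simplified sketch, not the library code: the function `route_file`, its parameters, and the rule that a routed path is the cache directory joined with the file's base name are all assumptions made for illustration.

```python
import posixpath


def route_file(path: str, scr_enabled: bool, in_phase: bool, cache_dir: str) -> str:
    """Return the path an application should use for `path`.

    `in_phase` is True between a start call and its matching complete call.
    Outside such a pair, or when SCR is disabled, the original path is
    returned unchanged, so the call always succeeds.
    """
    if not scr_enabled or not in_phase:
        return path
    # Assumed routing rule for this sketch: place the file in the cache directory.
    return posixpath.join(cache_dir, posixpath.basename(path))


# Usage: the same call works whether or not SCR is active.
original = "/p/lustre/run1/rank_0.ckpt"
print(route_file(original, scr_enabled=False, in_phase=True, cache_dir="/dev/shm/scr"))
print(route_file(original, scr_enabled=True, in_phase=False, cache_dir="/dev/shm/scr"))
print(route_file(original, scr_enabled=True, in_phase=True, cache_dir="/dev/shm/scr"))
```

Because the fallback returns a usable path, application code can call the routing function unconditionally and open whatever path comes back.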
SCR v1.2.1
This release includes a refresh of the SCR documentation which accompanies version 1.2.0:
- NEW: SCR user documentation is now live and always up-to-date at scr.rtfd.io.
- We've updated the various in-repo references to and copies of the user manual.
SCR v1.2.0
This release includes many new features for SCR. Details can be found in the latest user manual: SCRv1.2-User-Manual.pdf.
New API Features:
- Added `char * SCR_Get_version (void)`. SCR's version information also appears in the `scr.h` header file.
- We now have support for arbitrary file set outputs (not just checkpoints). Users can call `int SCR_Start_output (const char* name, int flags)` and `int SCR_Complete_output (int valid)` to wrap both checkpoint and arbitrary output write phases of their applications. The `flags` parameter should be used to describe the file set: `SCR_FLAG_NONE`, `SCR_FLAG_CHECKPOINT`, `SCR_FLAG_OUTPUT`. These flags can be combined with bit-wise or, `|`, for a single file set.
- SCR has added functions for marking a restart phase of an application. Users can call `int SCR_Have_restart (int* flag, char* name)` to check if a checkpoint is available for application restart. Users can then use `int SCR_Start_restart (char* name)` and `int SCR_Complete_restart (int valid)` to mark the restart phase of the application.
- SCR now allows for user-defined directories. That is, users are able to dictate the layout of their files on the PFS; SCR no longer requires that all files for a checkpoint exist in the same directory.
Other Changes:
- We have upgraded our build system to CMake. This includes some preliminary testing available via `make test`.
- SCR now supports Cray DataWarp burst buffer architectures! Users can trigger static linking (default on Cray systems) via the CMake option `-DSCR_LINK_STATIC=ON`.
. - The SCR Spack package has been updated to support all options and configurations, including some smart defaults for Cray machines.
- We now support building SCR on a Mac.
- We now have initial implementations of the SCR command line interface for interacting with the LSF and PMIx resource managers.
- We've changed the behavior of SCR during a restart if `SCR_Finalize()` was called.
SCR v1.2.0 Release Candidate 1
Pre-release of Version 1.2.0.