update some man pages
adammoody committed Sep 28, 2015
1 parent e83ee5a commit dd4a388
Showing 2 changed files with 40 additions and 35 deletions.
27 changes: 15 additions & 12 deletions man/scr.1.in
@@ -1,18 +1,21 @@
.TH scr 1 "@META_DATE@" "@META_ALIAS@" "@META_NAME@"

.SH DESCRIPTION
The Scalable Checkpoint / Restart (SCR) library provides an interface that
codes may use to write out and read in application-level checkpoints in a
scalable fashion. In the current implementation, checkpoint files are
cached in local storage (hard disk or RAM disk) on the compute nodes.
This technique provides scalable aggregate bandwidth and uses storage
resources that are fully dedicated to the job. This approach addresses
the two common drawbacks of checkpointing a large-scale application to
a shared parallel file system, namely, limited bandwidth and file system
contention. In fact, on current platforms, SCR scales linearly with the
number of compute nodes. It has been benchmarked as high as 720GB/s
on 1094 nodes of Atlas, which is nearly two orders of magnitude faster
than the parallel file system.
The Scalable Checkpoint / Restart (SCR) library enables MPI applications
to utilize distributed storage on Linux clusters to attain high file I/O
bandwidth for checkpointing and restarting large-scale jobs. With SCR,
jobs run more efficiently, recompute less work upon a failure, and reduce
load on critical shared resources such as the parallel file system.

SCR caches checkpoint files in storage local to the compute nodes, and it
applies a redundancy scheme such that files can be recovered in the event
of a failure. When a failure occurs, the current run is killed. If there
are sufficient spare nodes and time remaining in the resource allocation,
another run may be restarted from the cached checkpoint files. Otherwise,
the cached checkpoint files are flushed to the parallel file system, and
the resource allocation is released. With this approach, file I/O
bandwidth scales linearly with the number of compute nodes. On large
clusters, SCR is often 100x to 1000x faster than the parallel file system.

.SH SEE ALSO
\fISCR User Manual\fR
48 changes: 25 additions & 23 deletions man/scr_index.1.in
@@ -1,12 +1,12 @@
.TH scr_index 1 "@META_DATE@" "@META_ALIAS@" "@META_NAME@"
.SH NAME
scr_index \- manages status of checkpoints stored on parallel file system
scr_index \- manages status of datasets stored on parallel file system

.SH SYNOPSIS
.B "scr_index [options]"

.SH DESCRIPTION
The \fBscr_index\fR command manages checkpoints stored on the parallel
The \fBscr_index\fR command manages datasets stored on the parallel
file system.

.SH OPTIONS
@@ -15,10 +15,13 @@ file system.
List contents of index (default behavior).
.TP
.BI "-a, --add " DIR
Add checkpoint directory to index.
Add dataset directory to index.
.TP
.BI "-r, --remove " DIR
Remove checkpoint directory from index. Does not delete corresponding files.
Remove dataset directory from index. Does not delete corresponding files.
.TP
.BI "-c, --current " DIR
Specify directory as restart directory.
.TP
.BI "-p, --prefix"
Specify prefix directory (defaults to current working directory).
@@ -28,39 +31,38 @@ Print usage.

.LP
When adding a directory to the index, the command checks
that the checkpoint directory on the parallel file system contains a
complete and valid set of checkpoint files, it rebuilds missing
checkpoint files if possible, and it writes an \fIscr_summary.txt\fR
that the dataset directory on the parallel file system contains a
complete and valid set of files, it rebuilds missing
files if possible, and it writes a \fIsummary\fR
file, which is used by the SCR library to fetch and redistribute files
to appropriate ranks upon a restart.

One may invoke the command outside of a SLURM job allocation, which is
useful to check and rebuild a checkpoint set in which \fBscr_postrun\fR
useful to check and rebuild a dataset for which \fBscr_postrun\fR
may have failed to complete its internal call to \fBscr_index\fR.

When listing checkpoints, the internal SCR checkpoint id is shown,
along with the checkpoint directory name. In addition, the time
the checkpoint was flushed is shown as well as a set of flags.
The flags are shown as a series of columns consisting of single characters.
The first column represents whether the checkpoint is complete.
An 'x' in the first column implies the checkpoint is incomplete while 'c' means the checkpoint is complete.
An 'f' in the second column indicates that SCR has marked the checkpoint as failed.
During a restart, SCR will not attempt to fetch a checkpoint that is marked as incomplete or failed.
When listing datasets, the internal SCR dataset id is shown,
along with a flag denoting whether the dataset is valid,
the time the dataset was flushed, and the dataset directory name.
The dataset marked as \fBcurrent\fR is denoted with a * symbol.
During a restart, SCR will attempt to fetch the most recent
checkpoint starting from the current dataset.
SCR will not attempt to fetch any dataset marked as invalid.

.SH EXAMPLES
.TP
(1) List checkpoint directories in index file:
(1) List dataset directories in index file:
.nf
>> scr_index --list
FLAGS FLUSHED CKPT DIRECTORY
c- 2011-04-11T17:59:27 18 scr.2011-04-11_17:59:27.427417.18
c- 2011-04-11T17:59:24 12 scr.2011-04-11_17:59:24.427417.12
c- 2011-04-11T17:59:21 6 scr.2011-04-11_17:59:21.427417.6
DSET VALID FLUSHED DIRECTORY
* 18 YES 2015-09-28T16:46:22 scr.dataset.18
12 YES 2015-09-28T16:43:40 scr.dataset.12
6 YES 2015-09-28T16:43:02 scr.dataset.6
.fi
.TP
(2) Add checkpoint directory to index file, rebuild missing checkpoint files if necessary, and create summary file:
(2) Add dataset directory to index file, rebuild missing files if necessary, and create summary file:
.nf
>> scr_index --add scr.2008-10-20_14:22:10.23167.50
>> scr_index --add scr.dataset.20
.fi

.SH SEE ALSO
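The restart rule described in the updated scr_index text (start from the dataset marked current, skip any dataset marked invalid) can be illustrated with a small Python sketch that parses the new listing format. This is not SCR code. The names Dataset, parse_listing, and pick_restart are hypothetical, the sketch assumes a higher dataset id means a more recent dataset, and it treats any VALID value other than YES as invalid because the man page only shows YES.

```python
from dataclasses import dataclass
from typing import List, Optional

# Sample listing copied from the man page example; column spacing is approximate.
LISTING = """\
DSET VALID FLUSHED             DIRECTORY
* 18 YES   2015-09-28T16:46:22 scr.dataset.18
  12 YES   2015-09-28T16:43:40 scr.dataset.12
   6 YES   2015-09-28T16:43:02 scr.dataset.6
"""


@dataclass
class Dataset:
    dset_id: int
    valid: bool
    flushed: str
    directory: str
    current: bool


def parse_listing(text: str) -> List[Dataset]:
    """Parse `scr_index --list` style output into Dataset records."""
    datasets = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "DSET":
            continue  # skip blank lines and the header row
        current = fields[0] == "*"  # a leading * marks the current dataset
        if current:
            fields = fields[1:]
        if len(fields) != 4 or not fields[0].isdigit():
            continue  # skip anything that is not a data row
        dset_id, valid, flushed, directory = fields
        # Assumption: only the literal YES means valid.
        datasets.append(
            Dataset(int(dset_id), valid == "YES", flushed, directory, current)
        )
    return datasets


def pick_restart(datasets: List[Dataset]) -> Optional[Dataset]:
    """Return the first valid dataset at or below the current one, newest first."""
    # Assumption: a higher dataset id means a more recent dataset.
    ordered = sorted(datasets, key=lambda d: d.dset_id, reverse=True)
    # Start at the current dataset; fall back to the newest if none is marked.
    start = next((i for i, d in enumerate(ordered) if d.current), 0)
    for d in ordered[start:]:
        if d.valid:
            return d
    return None


if __name__ == "__main__":
    choice = pick_restart(parse_listing(LISTING))
    print(choice.directory if choice else "no valid dataset")
```

In practice the listing text would come from running scr_index --list in the prefix directory. The sketch only models the selection order described in the man page, not how the SCR library actually performs the fetch.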
