Usability Improvements #80

Open · 34 tasks

wvaske (Contributor) opened this issue Nov 4, 2024 · 0 comments

After the 1.0 submission, we found that the usability of the benchmark can be greatly improved. This issue tracks the 'sub-issues' we intend to address for the 2.0 release.

Please add any items in the comments and I will update this top-level comment. Feel free to attend the sub-working-group meeting (bi-weekly on Wednesday mornings, starting Nov 20th). Join the MLPerf Storage working group for the invite, or message me.

Tasks

Rules Document

  • Define filesystem caching rules in detail
  • Define the system.json schema and creation process (an illustrative sketch follows this list)
  • Define the allowed time between runs
  • Define rules for using local SSDs to cache data
  • Define rules for hyperconverged systems and local caches
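
Most of these rules items are policy definitions, but for the system.json item above a rough sketch may help frame the discussion. Everything below is illustrative only: the field names and the client/storage split are assumptions, since defining the real schema is exactly what the task asks for.

```python
# Hypothetical shape of a system.json; all field names are illustrative only.
import json
import platform
import socket

system_description = {
    "client": {                       # plausibly captured automatically on each host
        "hostname": socket.gethostname(),
        "os": platform.platform(),
        "kernel": platform.release(),
        "cpu": platform.processor(),
    },
    "storage": {                      # would have to be filled in manually by the submitter
        "vendor": "<vendor>",
        "model": "<model>",
        "protocol": "<e.g. NFS, S3, local NVMe>",
    },
}

with open("system.json", "w") as f:
    json.dump(system_description, f, indent=2)
```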

benchmark[.py | .sh] script

  • Unique names for result files and directories, structured by benchmark, accelerator, count, run sequence, and run number (see the path sketch after this list)
  • Better installer that manages dependencies
  • Containerization
    • Ease of deployment of the benchmark (just get it working)
    • cgroups and resource limits (better cache management)
  • Flush the cache before a run (see the flush sketch after this list)
  • Validate inputs for --closed runs (e.g., don’t allow runs against datasets that are too small)
  • Reportgen should run validation against outputs
  • Add better system.json creation to automate the system description for consistency
    • Add a JSON schema checker for the system documents that submitters create (see the schema-check sketch after this list)
  • Automate execution of multiple runs
  • Add support for code changes in closed submissions for supported categories (data loader, S3 connector, etc.)
    • Add a patches directory that gets applied before execution
  • Add runtime estimation and a --what-if or --dry-run flag
  • Automate selection of minimum required dataset
  • Determine if batch sizes in MLPerf Training are representative of batch sizes for realistically sized datasets
  • Split system.json into automatically capturable (clients) and manually provided (storage) sections
  • Define the system.json schema and add a schema checker to the tool for reportgen
  • Add a report-dir CSV of results from tests as they are run (see the CSV sketch after this list)
  • Collect the versions of all prerequisite packages for the storage benchmark and DLIO
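
For the unique-naming item, a minimal sketch of one possible results layout. The ordering of the path components and the timestamp suffix are assumptions, not an agreed convention.

```python
# Illustrative results-path builder; the layout is an assumption, not a spec.
from datetime import datetime, timezone
from pathlib import Path

def results_dir(base, benchmark, accelerator, count, run_sequence, run_number):
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return (Path(base) / benchmark / f"{accelerator}x{count}"
            / f"seq{run_sequence}" / f"run{run_number}-{stamp}")

print(results_dir("results", "unet3d", "h100", 8, 1, 3))
# e.g. results/unet3d/h100x8/seq1/run3-20241104T120000Z
```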
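
For the cache-flush item, a minimal sketch assuming Linux clients and root privileges; a real implementation would need to run this on every client host before the timed run.

```python
# Blunt Linux page-cache flush (drops page cache, dentries, and inodes); needs root.
import subprocess

def flush_page_cache():
    subprocess.run(["sync"], check=True)          # write dirty pages out first
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")                            # 3 = pagecache + dentries + inodes
```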
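
For the schema-check items, a sketch assuming the (still to be defined) schema ships with the tool as system.schema.json and that the third-party jsonschema package is an acceptable dependency.

```python
# Validate a submitter-provided system.json against a bundled JSON Schema.
import json

import jsonschema

def validate_system_json(path, schema_path="system.schema.json"):
    with open(path) as f:
        instance = json.load(f)
    with open(schema_path) as f:
        schema = json.load(f)
    # Raises jsonschema.ValidationError with a useful message on failure.
    jsonschema.validate(instance=instance, schema=schema)
```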
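
For the report-dir CSV item, a sketch that appends one row per completed run; the column names are assumptions, not a defined format.

```python
# Append a result row to <report_dir>/results.csv, writing the header on first use.
import csv
from pathlib import Path

FIELDS = ["benchmark", "accelerator", "count", "run", "throughput_MBps", "AU_pct"]

def append_result(report_dir, row):
    path = Path(report_dir) / "results.csv"
    write_header = not path.exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```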

DLIO Improvements

  • Reduce verbosity of logging
  • Add a callback handler for custom monitoring (see the hook sketch after this list)
    • SPECStorage uses a “PRIME_MON_SCRIPT” environment variable whose script is executed at different points in a run
    • Checkpoint_bench uses RPC to trigger execution, which can be wrapped externally
  • Add support for DIRECTIO
  • Add a seed for dataset creation so that the distribution of file sizes is the same for all submitters (file 1 = mean + x bytes, file 2 = mean + y bytes, etc.) (see the seeding sketch after this list)
  • Determine if the global barrier for each batch matches industry behavior
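
For the callback-handler item, a sketch of an environment-variable hook in the spirit of SPECStorage's PRIME_MON_SCRIPT; the variable name and phase labels here are made up for illustration.

```python
# Run an external monitoring script (if configured) at well-known phase boundaries.
import os
import subprocess

def monitor_hook(phase: str) -> None:
    script = os.environ.get("STORAGE_MON_SCRIPT")  # hypothetical variable name
    if script:
        subprocess.run([script, phase], check=False)

monitor_hook("run_start")
# ... timed benchmark work ...
monitor_hook("run_end")
```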
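
For the seeded dataset item, a sketch of how a fixed seed makes the per-file size offsets identical for every submitter; the normal distribution and the specific parameters are assumptions for illustration.

```python
# Deterministic per-file sizes: the same seed yields the same size sequence everywhere.
import numpy as np

def file_sizes(num_files, mean_bytes, stddev_bytes, seed=1234):
    rng = np.random.default_rng(seed)
    sizes = rng.normal(mean_bytes, stddev_bytes, num_files)
    return np.clip(sizes, 1, None).astype(np.int64)   # file i = mean + offset_i for all submitters

print(file_sizes(3, mean_bytes=150_000_000, stddev_bytes=15_000_000))
```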

Results Presentation

  • Better linking and presentation of system diagrams (add working links to system diagrams in the supplementals)
  • Define presentation and rules for hyperconverged systems or systems with a local cache