
AU doesn't meet expectation #55

Open
hanyunfan opened this issue Mar 20, 2024 · 13 comments

@hanyunfan

I am new here; I just ran the storage benchmark for the first time and got this line:

Training_au_meet_expectation = fail

My questions are:

  1. Will this block me from submitting for the next round?
  2. I need some help understanding AU better. To my understanding it is a calculated number, so why does it fail?
  3. How can I adjust things to make it pass?

[screenshot of the failing benchmark output]

@shp776

shp776 commented Mar 21, 2024

Hi. I am also a user going through trial and error by changing parameter values.
How many accelerators did you set per host?
I could only pass the AU threshold when I set 1 accelerator per host...

@hanyunfan
Author

@shp776 thanks. I set 1 accelerator and ran on 1 node; here is my run command:

./benchmark.sh run -s localhost -w unet3d -g h100 -n 1 -r resultsdir -p dataset.num_files_train=1200 -p dataset.data_folder=unet3d_data

@hanyunfan
Author

[screenshot of benchmark output]

./benchmark.sh run -s localhost -w unet3d -g h100 -n 1 -r resultsdir -p dataset.num_files_train=1200 -p dataset.data_folder=unet3d_data -p reader.read_threads=16

@hanyunfan
Author

[METRIC] ==========================================================
[METRIC] Training Accelerator Utilization [AU] (%): 99.3492 (0.0111)
[METRIC] Training Throughput (samples/second): 20.8836 (0.1170)
[METRIC] Training I/O Throughput (MB/second): 2919.7243 (16.3633)
[METRIC] train_au_meet_expectation: success
[METRIC] ==========================================================
./benchmark.sh run -s localhost -w unet3d -g h100 -n 1 -r resultsdir -p dataset.num_files_train=1200 -p dataset.data_folder=unet3d_data -p reader.read_threads=8

@hanyunfan
Author

read_threads=6
[METRIC] ==========================================================
[METRIC] Training Accelerator Utilization [AU] (%): 99.3732 (0.0089)
[METRIC] Training Throughput (samples/second): 20.8935 (0.1140)
[METRIC] Training I/O Throughput (MB/second): 2921.1066 (15.9404)
[METRIC] train_au_meet_expectation: success
[METRIC] ==========================================================

@shp776

shp776 commented Mar 25, 2024

Hi, @hanyunfan
I want to know what value you set for the parameter below in the second step (datagen).

-n, --num-parallel Number of parallel jobs used to generate the dataset

Thank you!

@FileSystemGuy
Contributor

FileSystemGuy commented Mar 25, 2024 via email

@hanyunfan
Author

Hi, @hanyunfan I want to know what value you set for the parameter below in the second step (datagen).

-n, --num-parallel Number of parallel jobs used to generate the dataset

Thank you!

This number doesn't really matter; you can use the default. It just runs 8 or 16 parallel jobs to generate the data.
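For what it's worth, the datagen step then looks roughly like this; the subcommand name and flag spellings here are assumptions pieced together from the help text quoted above, so verify against `./benchmark.sh --help`:

```shell
# Hypothetical datagen invocation (check flags against ./benchmark.sh --help).
# -n / --num-parallel only controls how many jobs write files concurrently;
# it changes generation speed, not the dataset that gets produced.
./benchmark.sh datagen -s localhost -w unet3d -n 8 \
  -p dataset.num_files_train=1200 \
  -p dataset.data_folder=unet3d_data
```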

@shp776

shp776 commented Mar 26, 2024

@FileSystemGuy, @hanyunfan thank you very much, your advice was very helpful to me.
I have one more question.

-m, --client-host-memory-in-gb Memory available in the client where benchmark is run

I want to know whether the above parameter's value should be as close as possible to my DRAM size in order to maximize storage performance.
I'm not sure, but I think I saw this in the MLPerf Storage presentation by Balmau
(if the dataset does not fit in memory (e.g. dataset = 2 * system memory), disk access occurs frequently, and training time increased three-fold because of this).

From what I tested, the larger the above parameter's value, the larger the result (-param dataset.num_files_train) from the datasize stage (step 1).

Is there anything you can tell me about this? : )

@hanyunfan
Author

@shp776 looks like that's by design: you should set it equal to your test system's memory. If you set it larger, more (or larger) files will be generated to meet the 5x rule, so you will see more files; this is expected. Final results for anything larger than 5x the memory size should be similar, because they all remove the client-side cache effect. So setting a larger value only increases your test time, not the throughput at the end, which doesn't seem worth it.

# calculate required minimum samples given host memory to eliminate client-side caching effects

min_samples_host_memory=$(echo "$num_client_hosts * $client_host_memory_per_host_in_gb * $HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / $record_length" | bc)

$HOST_MEMORY_MULTIPLIER is 5 by default here, so looks like it will generate 5x data.

HOST_MEMORY_MULTIPLIER=5

@hanyunfan
Author

Hi Huihuo @zhenghh04, when we test 8 GPUs in one client, the AU starts to fail again even with 128 read threads. Besides reader threads, are there any other parameters we can adjust to keep the GPUs busy?

@zhenghh04
Contributor

zhenghh04 commented Dec 5, 2024 via email

@hanyunfan
Author

Which workload were you testing? You might also need to check the affinity settings of your threads, usually with the --cpu-bind option in your MPI command. Huihuo

It is unet3d, thanks a lot. I will give --cpu-bind a try; I'm not really familiar with mpich, but I'll try it out. Thanks again. -Frank
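On the --cpu-bind point: the exact affinity flag depends on the launcher, so the command lines below are assumptions to verify against your launcher's help output, and `benchmark_command_here` is a placeholder for the actual DLIO/benchmark invocation:

```shell
# MPICH (Hydra launcher): pin ranks so reader threads stay on their cores.
mpiexec -np 8 -bind-to core ./benchmark_command_here

# Open MPI spelling of the same idea:
mpirun -np 8 --bind-to core --map-by socket ./benchmark_command_here

# Slurm's srun uses the --cpu-bind flag mentioned in the email:
srun -n 8 --cpu-bind=cores ./benchmark_command_here
```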
