Replies: 1 comment
-
micro-batch-size is the number of samples processed in a single forward/backward pass on each data-parallel rank; the global-batch-size worth of samples in one iteration is split into such micro-batches.
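To make that decomposition concrete, here is a minimal Python sketch, assuming a Megatron-style setup where each data-parallel rank accumulates gradients over several micro-batches per optimizer step. The function name and the listed micro-batch sizes are illustrative, not Megatron-LM's actual API:

```python
# Illustrative sketch (not Megatron-LM's actual API): how one iteration's
# global batch decomposes into micro-batches per data-parallel rank.
def num_micro_batches(global_batch_size: int,
                      micro_batch_size: int,
                      data_parallel_size: int) -> int:
    """Number of forward/backward passes each rank runs per optimizer step."""
    samples_per_pass = micro_batch_size * data_parallel_size
    assert global_batch_size % samples_per_pass == 0, \
        "global batch must split evenly into micro-batches"
    return global_batch_size // samples_per_pass

# With the setup from the question below (global-batch-size 24, 2 GPUs,
# pure data parallelism), some valid micro-batch sizes:
for mbs in (1, 2, 3, 4, 6, 12):
    n = num_micro_batches(24, mbs, 2)
    print(f"micro-batch-size={mbs:2d} -> {n:2d} micro-batches per iteration")
```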
-
Hi,
I am testing how the micro-batch-size influences throughput per GPU at a constant global-batch-size.
The results show that as the micro-batch-size increases, the throughput per GPU (TFLOP/s/GPU) also increases.
I ran some tests with a 400M-parameter transformer-based model on 2 A40 GPUs, using only data parallelism. Here are the training arguments:
[Image: training arguments]
Between tests I change only the micro-batch-size, training for 100 iterations with seq_len = 1024 and global-batch-size = 24. Here are the results for different micro-batch-sizes:
[Image: throughput per GPU for different micro-batch-sizes]
I log every 5 iterations and compute the average throughput per GPU.
For each iteration the total computational work is the same, yet the throughput per GPU increases with the micro-batch-size. I suspect this is related to GPU cache behavior or arithmetic intensity, but it is not clear to me. Can anyone provide an in-depth explanation?
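One way to probe the arithmetic-intensity hypothesis outside Megatron is a bare matmul benchmark: the work per sample stays fixed, but larger per-pass batches give the GPU bigger kernels to fill its SMs with, so achieved TFLOP/s rises toward the hardware peak. A minimal PyTorch sketch, where the hidden size, dtype, and batch sizes are arbitrary illustrative choices:

```python
import time
import torch

# Sketch: achieved TFLOP/s of one fp16 linear layer as the per-pass batch
# grows. FLOPs per sample are constant; only the batch size (and hence GPU
# utilization) changes -- mirroring a micro-batch-size sweep.
device = "cuda"
d_model, seq_len, iters = 4096, 1024, 20
layer = torch.nn.Linear(d_model, d_model, device=device, dtype=torch.float16)

for batch in (1, 2, 4, 8, 16):
    x = torch.randn(batch, seq_len, d_model, device=device, dtype=torch.float16)
    for _ in range(3):          # warm-up passes
        layer(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        layer(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * batch * seq_len * d_model * d_model * iters  # matmul FLOPs
    print(f"batch={batch:2d}: {flops / elapsed / 1e12:6.2f} TFLOP/s")
```

The same logic would apply inside a training step: a larger micro-batch means fewer, larger kernels per iteration, which better amortizes kernel-launch overhead and memory traffic even though the total FLOPs per iteration are unchanged.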