Which is used for BERT training benchmark #84

Open
LeoZhao-Intel opened this issue May 22, 2019 · 10 comments

Comments

@LeoZhao-Intel

Which script is used for the BERT training benchmark? I see there are two kinds of scripts: one for pre-training (e.g. train.py) and one for fine-tuning (e.g. run_classify.py).
Which one is used for the benchmark?

@LeoZhao-Intel
Author

@luotao1

@luotao1
Collaborator

luotao1 commented May 22, 2019

We use run_classify.py.

@LeoZhao-Intel
Author

LeoZhao-Intel commented May 23, 2019

@luotao1 For ParallelExecutor, how does your QA team calculate the benchmark result? Is it "speed * CPU_NUM" or just speed?

@LeoZhao-Intel
Author

> @luotao1 For ParallelExecutor, how does your QA team calculate the benchmark result? Is it "speed * CPU_NUM" or just speed?

@luotao1 Any feedback on this question?

@luotao1
Collaborator

luotao1 commented May 28, 2019

  • 1 CPU_NUM: speed xxx
  • 16 CPU_NUM: speed xxx

We don't use speed * CPU_NUM; that would be a throughput figure.

@LeoZhao-Intel
Author

Then how do we tell whether the speed is comparable with V100?
e.g. V100: BS=1, speed 3.4 steps/s
Xeon: BS=1, CPU_NUM=8, speed 0.43 steps/s

Are these two measurements equivalent?

@luotao1
Collaborator

luotao1 commented May 28, 2019

They are not equivalent.
Does BS=1, CPU_NUM=8 at 0.43 steps/s mean that BS=1, CPU_NUM=1 runs at 0.43/8 steps/s?
Also, the speed may not scale linearly as CPU_NUM increases.
You can report the result for BS=1, CPU_NUM=ALL.

@LeoZhao-Intel
Author

LeoZhao-Intel commented May 28, 2019

Yes, speed is not linear with CPU_NUM. But I checked the code and found that this speed reflects the iteration execution time, not the number of samples actually processed. That is, in each iteration the number of processed samples is actually batch size * CPU_NUM.
I can confirm this.

So my question is about CPU vs. GPU: we may not be able to directly compare the speed values printed in the log, given that CPU_NUM is a virtual concept for using multiple CPU cores for data parallelism, while a GPU needs additional discrete cards to scale out. This speed is more like latency.

We can report different speeds for different CPU_NUM values, but how to compare them with GPU fairly is what I want to ask.

@luotao1
Collaborator

luotao1 commented May 29, 2019

> but how to compare them with GPU fairly, that is what I want to ask.

How about computing samples/s to compare CPU and GPU?

@LeoZhao-Intel
Author

I see this calculation logic in the benchmark run.sh: it uses samples/s, which accounts for both CPU_NUM and BS.
I think it makes more sense.
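
For reference, the samples/s comparison discussed in this thread can be sketched as follows. This is a minimal illustration, not the logic in run.sh: the function name `samples_per_second` is hypothetical, and it assumes, as described earlier in the thread, that each logged step processes batch size * CPU_NUM samples. It also assumes the V100 figure is for a single card, which the thread does not state.

```python
def samples_per_second(steps_per_sec: float, batch_size: int, num_devices: int) -> float:
    """Convert a logged steps/s value to samples/s.

    Assumption (from the thread, not from the benchmark scripts): every step
    processes `batch_size` samples on each of `num_devices` data-parallel
    workers (CPU_NUM on CPU, number of cards on GPU).
    """
    if batch_size <= 0 or num_devices <= 0:
        raise ValueError("batch_size and num_devices must be positive")
    return steps_per_sec * batch_size * num_devices


# Figures quoted in the thread; the single-card V100 setup is an assumption.
v100 = samples_per_second(3.4, batch_size=1, num_devices=1)
xeon = samples_per_second(0.43, batch_size=1, num_devices=8)

print(f"V100: {v100:.2f} samples/s")
print(f"Xeon (CPU_NUM=8): {xeon:.2f} samples/s")
```

Under these assumptions the two configurations come out to roughly 3.4 and 3.44 samples/s, even though their steps/s values differ by about a factor of eight.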
