Optimize inference performance of ERNIE INT8 on CPU #275
We are currently working with the following two models. Initial performance results using the former (FP32, 4 inputs):
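For context, the four inputs are presumably the standard Ernie classification feeds; a sketch of their typical shapes and dtypes with plain NumPy (the names, batch size, and sequence length below are illustrative, not taken from this thread):

```python
import numpy as np

batch, seq_len = 1, 128  # illustrative values

# Typical Ernie/BERT-style classification inputs:
src_ids = np.zeros((batch, seq_len, 1), dtype=np.int64)    # token ids
sent_ids = np.zeros((batch, seq_len, 1), dtype=np.int64)   # sentence/segment ids
pos_ids = np.arange(seq_len, dtype=np.int64).reshape(batch, seq_len, 1)  # position ids
input_mask = np.ones((batch, seq_len, 1), dtype=np.float32)  # 1.0 for real tokens, 0.0 for padding
```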
I have just noticed the above results were mixed up. They have been updated now.
FP32 results with @GaoWei8's fix (PaddlePaddle/Paddle#20972):
Attached are the baseline profile results:
Schedule plan of ERNIE INT8 optimization:
@luotao1 I got outputs like below:
My question is: am I using the correct file, or is there a process for calculating accuracy?
Below are our latest performance results for Ernie FP32 and INT8 runs. The tests were run with affinity settings on CLX 6248. The INT8 tests were run with the
With the current
With additional commits from PRs, the latency for INT8 (with FC quantized only) was:
With additional optimizations possible to implement, the latency for INT8 (with FC quantized only) was:
Other optimizations, like quantization of more operators and more fuses, are also being investigated.
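As a reference for the affinity settings mentioned above, here is a minimal sketch of the standard Intel OpenMP knobs, set from Python before the inference library is imported; the values below are illustrative assumptions, not the exact configuration used for the CLX 6248 runs:

```python
import os

# Illustrative values; tune per machine.
os.environ["OMP_NUM_THREADS"] = "20"                          # number of OpenMP worker threads
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"   # pin threads to physical cores
os.environ["KMP_BLOCKTIME"] = "1"                             # let idle threads sleep quickly

# These must be set before the inference library is imported,
# otherwise the OpenMP runtime may already be initialized.
# import paddle.fluid as fluid  # create the predictor only after this point
```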
I will repeat my question: should the output labels have a floating-point type?
The output float numbers are the logits of a 3-label classification task, and the label file consists of the class_id of the true label. Accuracy can be computed with a script like this:

```python
import sys

import numpy as np

total, acc = 0, 0
for line in sys.stdin:
    # each line: three logits and the true class id, tab-separated
    a, b, c, d = line.strip().split('\t')
    pred = np.array(list(map(float, (a, b, c)))).argmax()
    if pred == int(d):
        acc += 1
    total += 1
print(float(acc) / total)
```

Run with:
@Meiyim you are right. Thank you for the explanation.
I have updated the results in #275 (comment) above. Results with additionally quantized
@bingyanghuang @wojtuss Could you tell us why the mkldnn update to 1.1 brings such a great improvement (250 ms -> 180 ms)? We would really like to know what changed in mkldnn 1.1.
@luotao1, we are investigating that.
@luotao1,
The latest FP32 results for the current clean develop branch:
After merging the PRs PaddlePaddle/Paddle#21746 and PaddlePaddle/Paddle#21754, the FP32 results are basically the same:
@GaoWei8 Please double-check the FP32 performance.
The latest FP32 results for the current clean develop branch (b9dbde12b3) on SKX 6148, with 4-dimensional input (fp32_model, test_ds):
@GaoWei8 Could you please paste your FP32 baseline profile log in this issue? Let's try to align our baselines.
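One way to collect a comparable per-operator profile is sketched below, assuming the Paddle 1.x Python inference API (AnalysisConfig, PaddleTensor, create_paddle_predictor) and the fluid profiler; the model paths and zero-filled inputs are placeholders, not taken from this thread:

```python
import numpy as np
import paddle.fluid.profiler as profiler
from paddle.fluid.core import AnalysisConfig, PaddleTensor, create_paddle_predictor

# Hypothetical model location; replace with the actual Ernie FP32 model files.
config = AnalysisConfig("ernie_fp32_model/__model__", "ernie_fp32_model/__params__")
config.disable_gpu()
config.enable_mkldnn()
config.set_cpu_math_library_num_threads(20)
config.switch_ir_optim(True)
predictor = create_paddle_predictor(config)

def fake_inputs(batch=1, seq_len=128):
    """Zero-filled stand-ins for the four Ernie inputs (shapes as sketched earlier)."""
    ids = np.zeros((batch, seq_len, 1), dtype=np.int64)
    mask = np.ones((batch, seq_len, 1), dtype=np.float32)
    return [PaddleTensor(ids), PaddleTensor(ids.copy()),
            PaddleTensor(ids.copy()), PaddleTensor(mask)]

inputs = fake_inputs()

# Per-operator CPU profile, sorted by total time, so baselines can be compared.
with profiler.profiler('CPU', 'total'):
    for _ in range(100):
        predictor.run(inputs)
```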
@bingyanghuang
20-thread profile:
The latest FP32 results for the clean develop branch (c7b03d308c, Jan 2nd) on CLX 6248:
Here are our latest performance results for Ernie FP32 and INT8 runs. The tests were run with affinity settings on CLX 6248. With the current develop branch (ad0dfb1) and the original FP32 model, the latency was:
INT8 results with
We have finally resolved the accuracy issue for the Ernie INT8 run.
Background:
Solution:
With the fix for FC INT8 (PR PaddlePaddle/Paddle#22404, branch
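For background on what quantizing an FC involves, below is a generic illustration of symmetric INT8 quantization with an INT32 accumulator and FP32 dequantization. It shows the standard scheme only, not necessarily the exact scales or passes used in the PR above; the tensor shapes are hypothetical:

```python
import numpy as np

def int8_symmetric(x):
    """Symmetric INT8 quantization: the scale maps max |x| to 127."""
    scale = 127.0 / np.abs(x).max()
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale

# Hypothetical FC activation and weight tensors.
act = np.random.randn(1, 768).astype(np.float32)
weight = np.random.randn(768, 768).astype(np.float32)

q_act, s_act = int8_symmetric(act)
q_w, s_w = int8_symmetric(weight)

# INT8 matmul accumulated in INT32, then dequantized back to FP32.
int32_out = q_act.astype(np.int32) @ q_w.astype(np.int32)
fp32_out = int32_out.astype(np.float32) / (s_act * s_w)

print(np.abs(fp32_out - act @ weight).max())  # quantization error
```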
Now, Paddle ERNIE FP32 inference performance on CPU is as below:
single thread: 251.464 ms
20 threads: 29.8818 ms
Our goal is to prove that, with real INT8 kernels, ERNIE can get a performance gain.
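Latency numbers like the ones above are typically collected with a warm-up phase followed by an averaged timing loop; here is a minimal, framework-agnostic sketch where the run() callable stands in for one predictor inference (the iteration counts are illustrative):

```python
import time

def measure_latency_ms(run, warmup=10, iters=100):
    """Average wall-clock latency of run() in milliseconds."""
    for _ in range(warmup):   # let caches, fused kernels, etc. settle
        run()
    start = time.time()
    for _ in range(iters):
        run()
    return (time.time() - start) * 1000.0 / iters

# Usage (hypothetical): measure_latency_ms(lambda: predictor.run(inputs))
```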
@Sand3r- @wojtuss Please update your benchmark progress here.
@wzzju @luotao1 Please track the status here.