Performance improvement is not achieved with AVX #8

Open
sharon-k opened this issue Apr 24, 2018 · 7 comments

Comments

@sharon-k

My processor, an Intel(R) Core(TM) i7-3740QM, supports the AVX instruction set. I created two environments with Anaconda 4.5.0:

  • tf_avx: has tf installed with tensorflow-windows-wheel/1.5.0/py36/CPU/avx/tensorflow-1.5.0-cp36-cp36m-win_amd64.whl

  • tf_wo_simd: has tf installed with pip install tensorflow==1.2.0
    (I selected this version to be sure I was installing a build without SIMD; when the environment is activated, I can see the SIMD warnings printed by tf, as the sketch after this list shows)
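(A quick way to surface those warnings, assuming the TF 1.x API: creating any session runs the CPU feature guard, which logs the SIMD instruction sets the binary was, or was not, compiled to use.)

    import tensorflow as tf

    # In TF 1.x, creating a session runs the CPU feature guard, which logs
    # warnings about SIMD instruction sets (e.g. AVX) that the CPU supports
    # but the binary was not compiled to use.
    with tf.Session() as sess:
        print(sess.run(tf.constant("cpu feature guard check")))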

I ran the same code, evaluating a simple network with two fully connected layers, in each of the environments, and couldn't see any time improvement between the two. I should add that this came after I tried a more complicated network with a few conv layers, where no improvement was visible either.
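(A minimal timing harness along these lines, assuming the TF 1.x API; the layer sizes, batch size, and iteration count are illustrative:)

    import time
    import numpy as np
    import tensorflow as tf

    # Two fully connected layers; sizes are illustrative.
    x = tf.placeholder(tf.float32, [None, 784])
    h = tf.layers.dense(x, 512, activation=tf.nn.relu)
    y = tf.layers.dense(h, 10)

    data = np.random.rand(256, 784).astype(np.float32)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(y, {x: data})  # warm-up run, excluded from timing
        start = time.time()
        for _ in range(100):
            sess.run(y, {x: data})
        # 100 iterations -> elapsed * 1000 / 100 = elapsed * 10 ms each
        print("avg forward pass: %.2f ms" % ((time.time() - start) * 10))

Running the same script in both environments gives a per-iteration number to compare.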

Did I miss something?
Thank you for your help

@fo40225
Owner

fo40225 commented Apr 24, 2018

Indeed.

TensorFlow is currently built with CMake on Windows, and its SIMD configuration has an issue: the options are not applied to the submodules.

You can use this script to compare the speed.

https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
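For example, a CPU-only run (flag names as of the 2018-era script; check its --help for your checkout) might look like:

    python tf_cnn_benchmarks.py --device=cpu --data_format=NHWC --model=alexnet --batch_size=32 --num_batches=100

Running it once per environment and comparing the reported images/sec gives a like-for-like measurement.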

In my observation, there is only a slight difference between the results with and without AVX on Windows, and it does not reach the improvement shown in the following table.

https://www.tensorflow.org/performance/performance_guide#comparing_compiler_optimizations

@sharon-k
Author

Thank you for your reply!

Are there plans to address this issue with CMake? Is there an open bug report? If not, could you briefly describe it; maybe I'd be able to help... :)

@fo40225
Owner

fo40225 commented Apr 26, 2018

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cmake/external

Those CMake files don't pass tensorflow_WIN_CPU_SIMD_OPTIONS when building the libraries.
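Roughly, a fix would mean forwarding those options into each ExternalProject_Add call. A sketch of the idea only, with a hypothetical dependency name and URL, not the actual patch:

    include(ExternalProject)
    # Sketch: forward the SIMD compiler options (e.g. /arch:AVX) into the
    # external project's own CMake invocation, so the sub-library is built
    # with the same instruction set as TensorFlow itself.
    ExternalProject_Add(example_dep  # hypothetical dependency
      URL https://example.com/example_dep.tar.gz
      CMAKE_CACHE_ARGS
        "-DCMAKE_CXX_FLAGS:STRING=${CMAKE_CXX_FLAGS} ${tensorflow_WIN_CPU_SIMD_OPTIONS}")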

@fo40225
Owner

fo40225 commented May 4, 2018

fo40225/tensorflow@0a95e35

I tried building with those changes, but there is still no performance difference from SSE2 to AVX2.

It seems this issue is caused by something else.

@TheRedMudder

TheRedMudder commented Jun 13, 2018

In my testing, I found that the GPU build without AVX2 for the CPU outperformed the GPU build using AVX2! I thought AVX2 would give performance gains for operations that only have CPU implementations, but in my test AVX2 marginally decreased performance when the GPU flag was used. Of course, AVX2 did improve performance when no GPU flag was used. Have you seen similar results?

Benchmark Test

  • 1st place: GPU + No AVX2
  • 2nd place: GPU + AVX2
  • 3rd place: No GPU + AVX2
  • 4th place: No GPU + No AVX2

Test Specifications

The benchmarks were run on a Windows 10 system with an Intel i5-8400, 16 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti with 11 GB of dedicated memory. The code and the video it processed were the same; only the AVX2 and GPU support in the TensorFlow build differed. YOLOv2 with TensorFlow as the backend was used for benchmarking, with the following command: python flow --model cfg/yolo.cfg --load bin/yolo.weights --demo ../video/Ron.mp4 --gpu .8 --saveVideo

Questions

  • I understand why AVX2 outperforms no AVX2 when no GPU is used, but why does AVX2 cause a marginal performance decrease when the GPU is used?
  • Do you see similar marginal performance decreases when using AVX2 with the GPU compared to not using AVX2 with the GPU?

Also, thanks @fo40225 for this repo, it is an awesome time saver!
Edits: Spelling

@fo40225
Owner

fo40225 commented Jun 14, 2018

@TheRedMudder I saw that the "GPU + AVX2" case contains an "out of VRAM" error. You should recheck your benchmark script.

If you can add my SSE2 version to the benchmark, it will clarify the results further (SSE2 vs AVX2, official vs custom build).
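(For reference, darkflow's --gpu .8 maps to TensorFlow's per-process GPU memory fraction; lowering it, or enabling memory growth, is the usual TF 1.x way to avoid out-of-VRAM errors. A sketch, assuming direct use of the session API:)

    import tensorflow as tf

    # Cap TensorFlow at a fraction of GPU memory and let the allocation
    # grow on demand instead of grabbing it all up front.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.8
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)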

@GuyTraveler

Any updates regarding this issue? I was testing the inference speed difference between the multiple optimized binaries for Windows and at best noticed a 20 ms improvement, which certainly does not measure up to expectations.
