Better AMD GPUs support through ROCm/HIP #115

Draft
wants to merge 5 commits into
base: master

GZGavinZhao

  • Enable ROCm/HIP GPU acceleration
  • Update .gitignore for build cache

Owner

@TianZerL TianZerL left a comment

PkgConfig is for Linux.
The headers and libs of VapourSynth need to be manually set and will be checked in ThirdPartyForVS.cmake.

If you want to detect it automatically, it is better to do that in ThirdPartyForVS.cmake, with an "if else" check so that it works on all supported platforms and the paths can still be set manually when pkg-config is missing or the library cannot be found automatically.

@GZGavinZhao
Author

Thanks for the review! I'll address the comments once I fix the performance issue. I have some very bad benchmark results here:

> ./build/bin/Anime4KCPP_CLI -B
Benchmark test under 8-bit integer input and serial processing...

CPU score:
 DVD(480P->960P): 71.4286 FPS
 HD(720P->1440P): 41.0959 FPS
 FHD(1080P->2160P): 18.2927 FPS

OpenCL score: (pID = 0, dID = 0)
 DVD(480P->960P): 1000 FPS
 HD(720P->1440P): 333.333 FPS
 FHD(1080P->2160P): 166.667 FPS

CUDA score: (dID = 0)
 DVD(480P->960P): 62.5 FPS
 HD(720P->1440P): 24.3902 FPS
 FHD(1080P->2160P): 11.1111 FPS

This benchmark was run on an AMD Radeon RX Vega 64 (gfx900). A similar result was reproduced on an AMD Radeon RX 6600M (gfx1032). The build command I used is cmake -GNinja -B build -S . -DCMAKE_BUILD_TYPE=Release -DEnable_HIP=ON -DEnable_OpenCL=ON -DMaximum_Optimization=ON. The ROCm version is 5.5.1.

There's no way that ROCm runs this much slower than OpenCL. I'll continue to investigate this issue. The HIP code is an automatic translation from CUDA using the hipify-perl tool, so I don't know whether that could be the cause.

@GZGavinZhao
Author

Fortunately, I think the benchmark results are misleading. I did a real-world test by upscaling a 4-minute 1080P video (One Room Season 3, Episode 1). The flags used were -q -C avc1 -t 16 -T 16 -x -X -M <cuda|opencl>. Total processing time was 4.39018 minutes with the OpenCL backend and 3.26867 minutes with the ROCm backend.

I profiled the benchmark and saw that the majority of the time is spent in hipStreamCreate and hipStreamSynchronize. I think what happened is that for a single image, ROCm performs badly because of stream overhead (does this also appear when comparing the CUDA and OpenCL backends?), but in video processing streams become a benefit, perhaps due to better parallelization.

@TianZerL
Owner

TianZerL commented Dec 9, 2023

The creation and destruction of streams on CUDA should be low cost. I am using a dynamic "stream" on CUDA, which is created and destroyed on each processing call and makes the code simpler. Maybe it is better to use a static "stream" on ROCm.

There is actually some "warm-up" before benchmarking, which makes the CUDA results normal.


2 participants