
Can't get the GPU utilized on MacBook with M3 Max and 128 GB RAM #946

Open

Gabbelgu opened this issue Oct 8, 2024 · 13 comments

Gabbelgu commented Oct 8, 2024

Describe the bug
I can't get the GPU to be utilized on my MacBook.
Other apps, such as local LLMs, can use up to 70 GB of RAM for the graphics processor.

To Reproduce
Steps to reproduce the behavior:
I've enabled CoreML, set Max. Number of Threads = 18, and enabled GFPGAN and the other processors.
Same problem with Max. Number of Threads = 3, GFPGAN and the other processors.
Same problem with Max. Number of Threads = 8, GFPGAN and the other processors.

My configuration is:

MacBook Pro 16" 2023
M3 Max
128 GB RAM
Python 3.11
The rate is quite low, around 1-2 s per frame, and it frequently stalls, making no progress for 3-5 s before resuming at 1-2 s per frame.

Details
What OS are you using?

  • [ ] Linux
  • [ ] Linux in WSL
  • [ ] Windows
  • [x] Mac

Are you using a GPU?

  • [ ] No. CPU FTW
  • [ ] NVIDIA
  • [ ] AMD
  • [ ] Intel
  • [x] Mac

Which version of roop unleashed are you using?
4.3.1

Screenshots
If applicable, add screenshots to help explain your problem.


BrZHub commented Oct 10, 2024

I had the same issue on a MacBook Air M2 24GB.
It took about 2 seconds per frame.
I upgraded onnxruntime to 1.19.2 and now it does about 20 frames per second.

Just remove these two lines in requirements.txt:

onnxruntime==1.17.1; sys_platform == 'darwin' and platform_machine != 'arm64'
onnxruntime-silicon==1.16.3; sys_platform == 'darwin' and platform_machine == 'arm64'

And add this one:

onnxruntime==1.19.2; sys_platform == 'darwin'

And performance should be a lot better
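
If you want to confirm the upgrade took, a quick generic onnxruntime check (not from this repo) is to ask the installed wheel which execution providers it actually ships:

import onnxruntime as ort

print(ort.__version__)                # should report 1.19.2
print(ort.get_available_providers())  # 'CoreMLExecutionProvider' should be in this list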


Gabbelgu commented Oct 10, 2024

Thank you. I tried removing the two lines and adding the one line in requirements.txt, but it is not working for me.

(Screenshots attached: Bildschirmfoto 2024-10-10 um 23 50 58, Bildschirmfoto 2024-10-10 um 23 42 12)


codecowboy commented Oct 29, 2024

> I upgraded the onnxruntime to 1.19.2 and now it does about 20 frames per second

@BrZHub Can you explain how you upgraded the runtime? python -m pip install onnxruntime==1.19.2? I'm on an M1 Pro with 16GB which is also doing about 2 frames per second. It also seems like platform_machine == 'arm64' would be fairly important?

@codecowboy

@C0untFloyd Any chance you could provide some guidance here? Am happy to do some testing and add to the wiki - have got lots of time on my hands


BrZHub commented Oct 30, 2024

> Can you explain how you upgraded the runtime? python -m pip install onnxruntime==1.19.2? I'm on an M1 Pro with 16GB which is also doing about 2 frames per second. It also seems like platform_machine == 'arm64' would be fairly important?

My requirements.txt file looks like this:

--extra-index-url https://download.pytorch.org/whl/cu118

numpy==1.26.4
gradio==4.44.0
fastapi<0.113.0
opencv-python-headless==4.9.0.80
onnx==1.17.0
insightface==0.7.3
albucore==0.0.16
psutil==5.9.6
torch==2.1.2+cu118; sys_platform != 'darwin'
torch==2.1.2; sys_platform == 'darwin'
torchvision==0.16.2+cu118; sys_platform != 'darwin'
torchvision==0.16.2; sys_platform == 'darwin'
onnxruntime==1.19.2; sys_platform == 'darwin'
onnxruntime-gpu==1.17.1; sys_platform != 'darwin'
tqdm==4.66.4
ftfy
regex
pyvirtualcam

This changed onnx and onnxruntime.
The dependencies listed in this file are installed when you start runMacOS.sh, so it probably overrides anything you install manually using "pip install".

On the settings page I set the provider to "coreml"
(screenshot of the settings page attached)

If I run this test clip and swap all faces without adding any additional filters, it runs at an average of 11.5 FPS:

Processing clip.trim_12-39-03.mp4 took 55.71 secs, 11.52 frames/s

(attached: clip.trim.mp4)

After looking at this further and at CPU/GPU usage, I'm not actually sure it's using CoreML, but there is no chart to see if it is using the NPU...
Upgrading the ONNX libraries did, however, increase performance by 5x on my machine (15" MacBook Air M2).
So there might be more gains to be made.
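
For what it's worth, one way to check whether a given model actually ends up on CoreML (a generic onnxruntime sketch, not roop code; "model.onnx" is a placeholder path) is to inspect the providers the session resolves:

import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder for any of the models roop loads
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
# CoreML should be listed first if the session accepted it
print(session.get_providers())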


codecowboy commented Oct 30, 2024

Many thanks. What do you have your number of execution threads set to in settings? I'm not sure if that refers to the CPU or the GPU.
I've now tried editing requirements.txt as per yours but don't see a performance increase.

I also wondered if we could make use of https://pypi.org/project/onnxruntime-coreml/ somehow.

See also https://onnxruntime.ai/docs/execution-providers/CoreML-ExecutionProvider.html

My Python is pretty rusty, but I'm happy to collaborate with someone on this.
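
One thing worth trying from the linked EP docs (untested here; "model.onnx" is a placeholder): raise the session log level so onnxruntime reports how much of the graph the CoreML EP actually claims.

import onnxruntime as ort

opts = ort.SessionOptions()
opts.log_severity_level = 1  # INFO; 0 = VERBOSE for even more detail

session = ort.InferenceSession(
    "model.onnx",
    sess_options=opts,
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)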

@codecowboy

I've done a bit of digging; the following appears in a number of files that load the models:


# replace Mac mps with cpu for the moment
self.devicename = self.plugin_options["devicename"].replace('mps', 'cpu')

My guess is that no use is being made of the GPU, or at least of the Metal layer. I don't have a deep enough understanding of how CoreML works to know how it all fits together.

@C0untFloyd (Owner)

> Any chance you could provide some guidance here? Am happy to do some testing and add to the wiki - have got lots of time on my hands

Sorry, I'm currently very short on time and I don't own a Mac.

self.devicename = self.plugin_options["devicename"].replace('mps', 'cpu')

You could comment out every line where this is done and see if it makes a difference. I sadly don't remember why there is this fallback to CPU. If it works, this could easily be made into a config setting, e.g. as sketched below.
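
For reference, a rough sketch of what that setting could look like inside the same method (untested; "force_cpu_on_mac" is a hypothetical config key, not an existing roop option):

# Gate the mps -> cpu fallback behind a config flag instead of hardcoding it.
devicename = self.plugin_options["devicename"]
if self.plugin_options.get("force_cpu_on_mac", True):
    # current behaviour: fall back to CPU on Apple Silicon
    devicename = devicename.replace('mps', 'cpu')
self.devicename = devicename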

@codecowboy

Thanks. I’ll create a fork and let you know if I get it working.

@Gabbelgu (Author)

Thanks all for your comments and ideas.

> Thanks. I’ll create a fork and let you know if I get it working.

I can do tests with your fork on my MacBook if it helps.


tookdes commented Nov 26, 2024

> You could comment out every line where this is done, see if it makes a difference. I sadly don't remember why there is this fallback to cpu.

#269

It appears to be because onnxruntime simply does not support device types other than CPU and CUDA, such as MPS. I tested removing the MPS-to-CPU replacement code on a Mac M4, with the results shown below.

onnxruntime_inference_collection.py", line 32, in get_ort_device_type
    raise Exception("Unsupported device type: " + device_type)
Exception: Unsupported device type: mps

It seems that unless the onnxruntime issue is resolved, Mac devices won't be able to use CoreML acceleration for roop.


codecowboy commented Nov 26, 2024

That's not actually the case. There is a CoreML execution provider; it's just that the code as it stands doesn't really make use of it. Newer versions of onnxruntime also directly support Apple Silicon, but the packages in this repo are pinned to earlier versions.
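
To illustrate the distinction (my own sketch, not roop code; "model.onnx" is a placeholder): the "mps" string only fails when passed to onnxruntime as a device name, whereas CoreML acceleration is requested through the providers list, with inputs staying on the CPU.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
inp = session.get_inputs()[0]
# Dynamic dimensions show up as strings/None; substitute 1 just to build a dummy input.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
# Feed a CPU-side numpy array; the CoreML EP handles any device transfer itself.
outputs = session.run(None, {inp.name: np.zeros(shape, dtype=np.float32)})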

I've been experimenting with all this: converting some of the models to CoreML and forcing CoreML where I can in the existing code. I've seen slight improvements in frame rate but nothing spectacular yet. In addition, there are allegedly speed gains to be made in the cv2 code by using UMat instead of Mat (rough sketch below). I'll be trying all this out on an ad hoc basis, so don't hold your breath, but I'll report back if I make significant progress. In the meantime I'm using a GPU cloud instance with an NVIDIA card.
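
For anyone curious, the UMat idea looks roughly like this (an illustration of OpenCV's transparent OpenCL path, with a placeholder input; any gains on Apple Silicon are unverified):

import cv2

frame = cv2.imread("frame.png")   # placeholder input image
uframe = cv2.UMat(frame)          # upload to a UMat; ops may then run via OpenCL
gray = cv2.cvtColor(uframe, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
result = blurred.get()            # download back to a regular numpy array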

@rdastartupguy

> After looking at this further and looking at CPU/GPU usage, I'm not actually sure it's using CoreML, but there is no chart to see if it is using the NPU... But upgrading the ONNX libraries did increase the performance by 5x on my machine. So there might be more gains to make.

Alright, I can confirm: changing the onnx and onnxruntime versions does enable CoreML, and FPS hits 10 to 15 on an M2 Pro. However, this seems to work only on the first run. The second video reverts to CPU and crawls at 0.7 to 2 fps at most. Restarting the app enables CoreML again. Strange.
