benchmarking v0.14.4 release - discussion about performance issues / high latency #1238

Open
mgarbade opened this issue Aug 9, 2024 · 21 comments
Labels
type:support Support issue

Comments

@mgarbade
Contributor

mgarbade commented Aug 9, 2024

Plugin Version or Commit ID

v0.14.4

Unity Version

2022.3.34f1

Your Host OS

Windows 10 Pro

Target Platform

Android

Description

I just downloaded the precompiled 0.14.4 release and ran it through our standardized latency benchmark (explained here). It looks like the performance of the plugin keeps on deteriorating on our benchmarking devices (Samsung Tab S7 and A8).

As can be seen from the table below, the latency is almost double that of a release from 2021 (0.6.2). We will now start to investigate the problem. My gut feeling is that the plugin deteriorated over time because there was no robust benchmark to test for performance, so nobody could "prove" that it actually became slower. There are probably many things that could have impacted the performance.

[Image: latency comparison table (screenshot "2024-08-09 17_09_40-Epic Pen Toolbar", not preserved)]

My next step will be to compile and benchmark pure MediaPipe v0.10.14 and post the results here.
I already benchmarked an older version of pure MediaPipe (probably v0.8.6) and the latency was OK (A8 = 198 +- 14 ms and S7 = 175 +- 10 ms), so I don't think the problem lies with MediaPipe.

I hope this github issue can become a lively discussion about possible reasons.

Code to Reproduce the issue

download the v0.14.4 release (precompiled)

Additional Context

I hope people weigh in on the discussion. The latest plugin has some really interesting features (multi person pose estimation) but unacceptable performance.

@homuler
Owner

homuler commented Aug 10, 2024

I can't say for sure without knowing the details of the benchmark, but since we cannot currently pass the input data on the GPU directly through the Task API, there should be some performance degradation as a result (see #1076).

@mgarbade
Contributor Author

I will explain the benchmark in more detail. I hope that will help instill confidence in these numbers.
The point is that it measures the latency end-to-end, i.e. given a change in the input image, how many ms pass until we see the change in the pose tracking skeleton.

Here are the big "culprits" I can see for now

  • when the plugin had a strong refactoring ("new sample app") in 2021, there was the first notable performance degradation
  • changing / updating Unity might also contribute to performance degradation (~30 ms higher latency on the Samsung S7 when using Unity 2022 instead of 2021), though that finding might still have a simple explanation: I noticed that Unity 2022 apparently introduced a 30 FPS limit. I will benchmark the effect of that throttle again and publish the results here.

May I ask whether you have a method of quantifying the performance / latency of the plugin? Or do you just make a qualitative assessment of the plugin's speed?
What devices do you usually test the plugin on?

I hope we can contribute to improving the plugin's performance in the near future, as the latest release has really interesting features but sadly is too slow for production.

@homuler
Owner

homuler commented Aug 12, 2024

how many ms until we see the change in the pose tracking skeleton.

Could you show me in the code exactly from which point to which point the time is being measured?
I want to rule out the possibility that you are measuring the response time of the sample application, rather than the performance of the plugin.

@mgarbade
Contributor Author

I'm using a high speed camera (1000 FPS) to measure changes on a test device (e.g. Samsung S7) vs changes in a mirror (image of the setup and explanation here). I'm sorry if the explanation is hard to grasp.

Yes, I'm measuring the performance of the sample application. There is no other way to measure the speed of the plugin on Android without writing custom code. So in the end I'm not sure whether the problem lies with the plugin or the sample, as they are coupled. That's what I'm trying to find out.

@homuler
Owner

homuler commented Aug 12, 2024

I mean, please indicate the start and end points of the measurement in the code (= line numbers).
You don't need to explain it in natural language.

For example,
from


to

@mgarbade
Contributor Author

There is no "measurement in code". Here is a 12 s extract of the slow-motion video (shot with 1000 FPS) that shows the measurement:

  • top: mirror
  • bottom left: Samsung Galaxy Tab A8
  • bottom right: Samsung Galaxy Tab S7

In slow motion you can clearly see the end-to-end latency and you can measure it by counting frames.
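
The frame counting described here reduces to simple arithmetic: at 1000 FPS each high-speed frame spans 1 ms, so the number of frames between the motion in the mirror and the matching motion of the skeleton is the latency in milliseconds. A minimal Python sketch of that conversion follows. The helper names and the frame counts are made up for illustration and are not measurements from this thread.

```python
from statistics import mean, stdev


def frames_to_ms(frame_count, camera_fps=1000):
    """Convert a count of high-speed-camera frames into milliseconds."""
    return frame_count * 1000.0 / camera_fps


def latency_stats(frame_counts, camera_fps=1000):
    """Return (mean, sample standard deviation) of the latency in ms."""
    latencies = [frames_to_ms(n, camera_fps) for n in frame_counts]
    return mean(latencies), stdev(latencies)


# Hypothetical frame counts between a movement in the mirror and the
# matching movement of the skeleton on the device screen.
example_counts = [190, 205, 198, 210, 197]
avg, sd = latency_stats(example_counts)
print(f"{avg:.1f} +- {sd:.1f} ms")
```

The same helper works for other camera speeds, e.g. `frames_to_ms(120, camera_fps=240)` for a 240 FPS recording.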

The only change to the code is the color and thickness of the skeleton lines, since it is needed by our automatic evaluation code.

In the above video, both devices have the same apk installed:

  • pure 0.14.4 release of the "MediaPipeUnityPlugin"
  • PoseTracking sample scene
  • running mode: sync
  • model complexity: light

@homuler
Owner

homuler commented Aug 13, 2024

It might be correct as a benchmarking of the sample app, but the sample app itself is not implemented with the best latency in mind. Therefore, I don't think it is appropriate to conclude that the plugin's performance has degraded based on a specific metric of the sample app.

If the performance of your application has noticeably deteriorated due to the plugin update, there may be an issue with the plugin itself. If that is the case, please open a new issue.

@homuler
Owner

homuler commented Aug 13, 2024

It’s a bit cumbersome to explain everything, so I'll just comment on one point:
I believe the latency has increased in the process of sending the input image to MediaPipe (before it starts processing). However, this is an intentional change to avoid spikes. For applications where latency is a concern, there might be better implementations.

@mgarbade
Contributor Author

There is no well-defined / robust way to benchmark the plugin's latency / performance other than benchmarking the pure sample app.

Having a minimal working example of using the plugin in the most performant way possible is the holy grail here. It's not easy to rewrite the samples from scratch.

We tried to port the old samples (from the 0.6.2 release) to a newer release, but the performance was still bad. Since this was the work of a colleague, I cannot confidently exclude any errors during the import. Actually in my struggle to create more performant samples, that will be my first step...

I will follow up with some more latency measurements. Any supervision / tips from your side would be highly appreciated. For example, what was the "intentional change to avoid spikes"? Is there any specific point in code you are referring to or is it more about an architectural change?

@mgarbade
Contributor Author

mgarbade commented Aug 15, 2024

Here are some "in code" latency measurements: 0.6.2 vs 0.14.4

For 0.14.4:

from
MediaPipeUnityPlugin/Assets/MediaPipeUnity/Samples/Scenes/Pose Landmark Detection/PoseLandmarkerRunner.cs

Line 79 in 9251ba5

to
MediaPipeUnityPlugin/Assets/MediaPipeUnity/Samples/Scenes/Pose Landmark Detection/PoseLandmarkerRunner.cs

Line 112 in 9251ba5

For 0.6.2:

from

to just before this line:

RenderAnnotation(screenController, poseTrackingValue);

Here are the results (average + standard deviation, all numbers in milliseconds):
[Image: results table, 0.6.2 vs 0.14.4 (screenshot "2024-08-15 13_53_15-Epic Pen Content Surface_ __ _DISPLAY2", not preserved)]

-> on Samsung A8, 0.6.2 is 4x faster than 0.14.4
-> on Samsung S7, 0.6.2 is 3x faster than 0.14.4

@mgarbade
Contributor Author

And here a comparison of the end-to-end latency (measured with mirror and high speed camera) between

  • pure MediaPipe, 0.10.14, Android example
  • MediaPipeUnityPlugin (MPUP), 0.14.4, PoseTrackingScene

[Image: end-to-end latency table, pure MediaPipe vs MPUP (screenshot "2024-08-15 14_01_56-Epic Pen Content Surface_ __ _DISPLAY1", not preserved)]

-> pure MediaPipe is roughly 2x faster than MPUP on every measurement

@homuler
Owner

homuler commented Aug 15, 2024

Here are some "in code" latency measurements: 0.6.2 vs 0.14.4

To reiterate, the method of generating input images is different.
In the older implementation, the input images were copied within a single frame, resulting in better latency, but due to the longer processing time within the frame, some devices could not maintain FPS.
In the newer implementation, the input images are copied over several frames to avoid blocking the main thread, which has worsened latency.
However, with the latest changes, copying is now done via the GPU, which I believe has led to some improvements.
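
As a rough illustration of this trade-off, the sketch below models the delay added when a copy that starts in one frame only completes some frames later. This is a simplified model under my own assumptions (fixed frame rate, copy finishing exactly on a frame boundary), not the plugin's actual code, and the function names are hypothetical.

```python
def frame_interval_ms(fps):
    """Duration of one rendered frame in milliseconds."""
    return 1000.0 / fps


def added_latency_ms(frames_spanned, fps):
    """Extra delay when an input copy that starts in one frame completes
    only `frames_spanned` frames later (1 = same frame, no extra delay)."""
    if frames_spanned < 1:
        raise ValueError("frames_spanned must be >= 1")
    return (frames_spanned - 1) * frame_interval_ms(fps)


# Assumed 30 FPS render loop; 1 to 3 frames per copy.
for k in (1, 2, 3):
    print(k, round(added_latency_ms(k, fps=30), 1))
```

Under this model, each additional frame spent on the copy costs one full frame interval of latency, which is the price paid for keeping the per-frame work short.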

Having a minimal working example of using the plugin in the most performant way possible is the holy grail here.

I don’t think there’s a need to intentionally slow it down, but I believe it’s important for the sample app to be easy to understand. Moreover, I don’t think the definition of ‘the most performant’ is self-evident in the first place.

pure MediaPipe, 0.10.14, Android example

Will you share the link?

@homuler
Owner

homuler commented Aug 15, 2024

In my opinion, the performance of the plugin should be measured by the time from when the input texture is acquired to when the output result is received.
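
A measurement along these lines could be sketched as follows. This is a language-agnostic illustration written in Python, not code from the plugin; `LatencyRecorder` and its method names are hypothetical, and in a Unity project the same idea would be expressed in C# around the points where the input texture is acquired and the result callback fires.

```python
import time


class LatencyRecorder:
    """Pairs an 'input acquired' mark with a 'result received' mark per frame id."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so the recorder can be tested
        self._started = {}       # frame id -> start time in seconds
        self.samples_ms = []     # completed measurements in milliseconds

    def input_acquired(self, frame_id):
        self._started[frame_id] = self._clock()

    def result_received(self, frame_id):
        start = self._started.pop(frame_id, None)
        if start is None:
            return None          # result for an unknown or dropped frame
        elapsed_ms = (self._clock() - start) * 1000.0
        self.samples_ms.append(elapsed_ms)
        return elapsed_ms


recorder = LatencyRecorder()
recorder.input_acquired(frame_id=0)
time.sleep(0.01)                 # stand-in for copy + inference
print(recorder.result_received(frame_id=0))
```

Keying the marks by frame id matters in asynchronous modes, where a result may arrive several frames after its input or not at all.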

@mgarbade
Contributor Author

mgarbade commented Aug 15, 2024

pure MediaPipe, 0.10.14, Android example

Will you share the link?

Here is the link to the pure MediaPipe 0.10.14 sample,

and the MediaPipe version is specified here.

@mgarbade
Contributor Author

To reiterate, the method of generating input images is different.
In the older implementation, the input images were copied within a single frame, resulting in better latency, but due to the longer processing time within the frame, some devices could not maintain FPS.
In the newer implementation, the input images are copied over several frames to avoid blocking the main thread, which has worsened latency.
However, with the latest changes, copying is now done via the GPU, which I believe has led to some improvements.

That's nice to hear. I'm about to benchmark the "GPU copy feature" in ab78d5c asap and share the results here.

I'm surprised that copying the image within a single frame (frame budget probably ~33 ms) would cause problems on some devices. We also test a weak device (Samsung A8) and a strong device (Samsung S7) here, and both seem to work better with the "copy in a single frame" approach (assuming that is what's happening in 0.6.2).

From my experience: copying an image should be in the range of 1-3 ms, whereas running an image through a TFLite model on Android takes 10 ms (S7) to 20 ms (A8). These numbers were gathered with the official adb-tflite-benchmark tool.
So since running the neural network is an order of magnitude more expensive than copying the image, I think copying the image within a single frame should be feasible.
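
The arithmetic behind this argument can be written out as a small check. It uses only the figures quoted in this comment (3 ms as the upper end of the copy time, 10 ms and 20 ms for inference, a 30 FPS target) and deliberately ignores rendering and any other per-frame work, so it is a lower bound on the real frame cost rather than a prediction. The function names are hypothetical.

```python
def frame_budget_ms(target_fps):
    """Time available per frame at the given frame rate."""
    return 1000.0 / target_fps


def fits_in_one_frame(copy_ms, inference_ms, target_fps):
    """True if copy plus inference alone stay within one frame's budget."""
    return copy_ms + inference_ms <= frame_budget_ms(target_fps)


# Figures quoted above: copy up to 3 ms; inference 10 ms (S7), 20 ms (A8).
for device, inference_ms in (("S7", 10.0), ("A8", 20.0)):
    total = 3.0 + inference_ms
    print(device, total, fits_in_one_frame(3.0, inference_ms, target_fps=30))
```

At 60 FPS the budget drops to about 16.7 ms, and the A8 figures would no longer fit in a single frame.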

Anyway, I'll get back with more numbers soon.

@mgarbade
Contributor Author

So it's a bit hard for me to benchmark the new feature. I just tested the 0.15.0 release. It's definitely much faster, probably twice as fast but I don't trust the numbers yet, since the evaluation is automatic and the pose is flickering a lot, as can be seen from the following video:

https://drive.google.com/file/d/1vl0tUBc3qGyuEDOQO5cyx07EiBdaAMKh/view?usp=sharing

My hunch is that

  • the timestamps that go into MediaPipe are wrong / not increasing monotonically.
  • this might be due to some integer division when computing the timestamps.
    In a quick check in editor mode, I noticed that many packets go into MediaPipe with the exact same timestamp.
    Maybe one has to use microseconds instead of milliseconds for the timestamps.
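
To make the hunch concrete: MediaPipe expects the timestamps on an input stream to increase monotonically, and integer division from a finer unit to milliseconds can collapse nearby frames onto the same value. The Python sketch below only demonstrates that effect with made-up capture times; whether this is what actually happens in the plugin is not verified here, and all helper names are hypothetical.

```python
def micros_to_millis(timestamp_us):
    """Integer division truncates everything below one millisecond."""
    return timestamp_us // 1000


def is_strictly_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))


def make_strictly_increasing(timestamps):
    """Bump any timestamp that does not exceed its predecessor.
    Note: this hides the duplicates but distorts the real timing."""
    fixed = []
    for t in timestamps:
        if fixed and t <= fixed[-1]:
            t = fixed[-1] + 1
        fixed.append(t)
    return fixed


# Hypothetical capture times 0.4 ms apart, e.g. an uncapped editor loop.
capture_us = [10_000, 10_400, 10_800, 11_200]
capture_ms = [micros_to_millis(t) for t in capture_us]  # [10, 10, 10, 11]

print(is_strictly_increasing(capture_us))  # True
print(is_strictly_increasing(capture_ms))  # False
print(make_strictly_increasing(capture_ms))
```

A check like `is_strictly_increasing` on the logged timestamps would be a cheap way to confirm or rule out this explanation before changing any units.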

Also: the sync mode is not working anymore; I could only test the async mode. I guess sync mode is just dead code / a leftover. That's a pity, since the sync mode can be very beneficial on strong devices and is also good for benchmarking.

@homuler
Owner

homuler commented Sep 7, 2024

The sync mode is not working anymore, I could only test the async mode.

When the running mode is IMAGE or VIDEO, inference is executed synchronously.
While there does appear to be some lag when in VIDEO mode, I have not yet confirmed whether this is an issue on MediaPipe's side or a mistake in my implementation.

@homuler
Owner

homuler commented Sep 7, 2024

Since there was no delay when copying the input image on the CPU, it seems to be an issue with my implementation. Maybe this is a situation we've encountered before.

@mgarbade
Contributor Author

mgarbade commented Sep 17, 2024

Have you thought about potential issues with the timestamps (as mentioned earlier)?

I don't know why else the detection of the arms would flicker up and down in the slo-mo video. This has nothing to do with latency.

I currently don't have much time to look into it, since our startup lost funding and I have to look for a new job.

@homuler
Owner

homuler commented Sep 18, 2024

This has nothing to do with latency.

If so, I would like you to create a separate issue.
Regarding the suggestion that it might be a timestamp problem, I haven't investigated it because the likelihood is low. At the very least:

@wangwenfeng0

@mgarbade I recently tested the 0.15.0 plugin on Android on an RK3588, and the tracking delay is still very serious. Have you found a better optimization to improve the detection speed? Looking forward to your reply.
