Per-engine compositors #156

veeenu · 2024-02-24T17:33:53Z

I come to you, dear community, with a PR that would infuse a merciless sense of dread even in the hearts of the greatest practitioners of Zen meditation alive.

This PR is close to a full rewrite of hudhook's core. The plan is to carefully review every single file and run extensive tests before smashing the green button, but the quality has, in my opinion, improved greatly.

Changes

Unified renderer

We stay the course with the single imgui renderer as of the previous PR #143.

The single renderer allows us to handle imgui-specific circumstances in a single spot. This way, UI rendering issues and new features don't have to be spread over four different renderers, and in the future we will be able to leverage rendering concurrency if we so wish.

This is not the case yet: the imgui frame is constructed and rendered right in the middle of the Present hooks, blocking those calls. Luckily hudhook and imgui are pretty slick and can do all that in all four renderers without ever dipping below a solid 100fps on my machine (RTX 3080).

Per-engine compositors

I took a deep plunge in the dark seas of rendering engines and built four full-screen quad renderers from scratch, one per supported engine, which take the off-screen rendered resource and composite it on top of the appropriate backbuffer.

The DirectX 12 compositor works seamlessly.
The DirectX 11 compositor needs to open the off-screen resource as a shared resource.
The DirectX 9 and OpenGL3 renderers do not have an easy means of sharing memory with DirectX 12, so a very bad fix was necessary: after the DirectX 12 rendering is done, the resource is staged and mapped, its data is pulled to the CPU side and reuploaded to the DirectX 9 / OpenGL3 compositors as their own kind of texture. This is all kinds of awful because the data moves from the GPU to the CPU and back to the GPU, which in theory adds a lot of latency. In practice, I never dipped below 100fps in any of my tests despite this, and the DirectX 9 and OpenGL3 games I tested don't seem to bat an eye at all.
If you are using one of these renderers, PLEASE test this branch to confirm that there is no performance penalty.
If you are willing to dig for a solution that doesn't leave the GPU, I'd be forever grateful.
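
To make the CPU round trip above more concrete: a mapped staging resource usually pads each row, so the pull-and-reupload step comes down to a row-by-row copy. The following is a minimal sketch in plain Rust, not hudhook code. It assumes a 4-bytes-per-pixel texture and a source row pitch rounded up to 256 bytes (the Direct3D 12 texture data pitch alignment); `aligned_row_pitch`, `copy_rows` and the byte slices standing in for mapped memory are hypothetical names used only for illustration.

```rust
/// Bytes per pixel for a 32-bit RGBA/BGRA texture (an assumption of this sketch).
const BYTES_PER_PIXEL: usize = 4;
/// Direct3D 12 expects the row pitch of texture copies to be a multiple of 256.
const PITCH_ALIGNMENT: usize = 256;

/// Round a tightly packed row size up to the padded row pitch.
fn aligned_row_pitch(width: usize) -> usize {
    let packed = width * BYTES_PER_PIXEL;
    (packed + PITCH_ALIGNMENT - 1) / PITCH_ALIGNMENT * PITCH_ALIGNMENT
}

/// Copy `height` rows out of a padded buffer into a tightly packed one.
/// `src` stands in for the mapped staging resource; the returned Vec stands in
/// for the data handed to the DirectX 9 / OpenGL3 texture upload.
fn copy_rows(src: &[u8], src_pitch: usize, width: usize, height: usize) -> Vec<u8> {
    let packed = width * BYTES_PER_PIXEL;
    let mut dst = Vec::with_capacity(packed * height);
    for row in src.chunks(src_pitch).take(height) {
        // Keep the pixel bytes, drop the padding at the end of each row.
        dst.extend_from_slice(&row[..packed]);
    }
    dst
}
```

Calling `copy_rows(&mapped, aligned_row_pitch(width), width, height)` yields `width * height * 4` bytes. The copy itself is cheap; the cost discussed above comes from waiting for the GPU before the data can be read and from uploading it again afterwards.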

Asynchronous input processing

There is now a Pipeline object (its design is still a bit dirty, I want to work on it) that is responsible for managing the lifecycle of the rendering engine and the imgui context, and, as part of that, the window procedure. The window procedure proper is now a dummy that only ships its parameters to an mpsc::channel.

This has the advantage that window events are now processed asynchronously just before a frame gets rendered, which should be fine on our end. The drawback is that the render loop can no longer block an event immediately, as it cannot decide right away whether to consume the event without forwarding it. This should not be a big deal: these decisions depend on the state before rendering rather than on any single specific event, and the previous model was no more fine-grained than that.
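
A rough sketch of that flow, assuming only what is described above: the window procedure forwards its arguments over a channel, and the queue is emptied once per frame. Plain integers stand in for the Win32 handle and parameter types, and `wnd_proc` and `drain_before_frame` are hypothetical names, not the actual hudhook functions.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Stand-in for the (hwnd, umsg, wparam, lparam) arguments of a window procedure.
/// The real types come from the Windows API; integers keep the sketch portable.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PipelineMessage(isize, u32, usize, isize);

/// The "dummy" window procedure: it does no work, it only forwards its parameters.
fn wnd_proc(tx: &Sender<PipelineMessage>, hwnd: isize, umsg: u32, wparam: usize, lparam: isize) {
    // A send error only means the receiving side is gone; ignored in this sketch.
    let _ = tx.send(PipelineMessage(hwnd, umsg, wparam, lparam));
}

/// Called once per frame, right before rendering: collect everything queued so far.
/// Returns how many messages were picked up.
fn drain_before_frame(rx: &Receiver<PipelineMessage>, handled: &mut Vec<PipelineMessage>) -> usize {
    let before = handled.len();
    // `try_iter` never blocks: it stops as soon as the queue is empty.
    handled.extend(rx.try_iter());
    handled.len() - before
}
```

Because `wnd_proc` returns before anything inspects the message, the decision to swallow or forward an event has to be made from state that already exists when the event arrives, which is the trade-off described above.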

If you had issues with blocking input, please give this branch a go, it might just inadvertently fix that. Or make things worse. It did not make things worse for my use cases.

Wine support

I have tested the library on and off on Wine with Proton, and it seems to be working pretty well. There's still work to be done to figure things out, but we're getting there.

Todo list

Refactor the Pipeline external API; it is very similar in all renderers and the current solution looks very bad
Refactor the window procedures as they are identical across renderers and can be hidden away
Review the entire codebase
Add documentation and comments where needed

Affected issues

Closes DirectComposition is not implemented on Wine #151
Errors on Resident Evil 2 and Testing on Other Games #155 needs review
Crashes while unhooking #139 needs review
DX9 ImGui unresponsive #137 needs review
Encountered an issue while using the Example Dx11 Hook with Dx11Game Rune Factory 5 #130 needs review
Allow DirectX hooking using existing device pointer #107 needs review
[DX9] Some games do not reset at window resizing #103 needs review
Bug: D3D12CreateDevice on D3D11 Hook #96 needs review
Game input not registering correctly #79 needs review
Performance issues in some games in native windowed borderless mode #78 needs review

Conclusion

All in all, I am very happy about the results of this PR and have been working tirelessly towards it. There are some outstanding issues, but all manageable. I will keep making inconsequential changes while I wait for your feedback, but my plan is to wrap this up sooner rather than later and finally release 0.6.0.

cc @Jakobzs @vars1ty @vsylva @camas

vars1ty · 2024-02-25T09:10:02Z

Tested in Star Stable Online with the example OGL3 DLL, built with MSVC. All tested via Wine.

Average FPS without, and with UI displayed:

Before: ~202 FPS
After: ~60 FPS

So it does tank a bit, with worse FPS than the legacy render engine, but it now works on Wine, unlike the last refactor.

camas · 2024-02-25T13:21:42Z

src/pipeline.rs

+        // TODO find a better alternative than allocating each frame
+        let message_queue = self.rx.try_iter().collect::<Vec<_>>();
+
+        message_queue.into_iter().for_each(|PipelineMessage(hwnd, umsg, wparam, lparam)| {
+            imgui_wnd_proc_impl(hwnd, umsg, wparam, lparam, self);
+        });


Pass directly to for_each?

Suggested change

-        // TODO find a better alternative than allocating each frame
-        let message_queue = self.rx.try_iter().collect::<Vec<_>>();
-
-        message_queue.into_iter().for_each(|PipelineMessage(hwnd, umsg, wparam, lparam)| {
-            imgui_wnd_proc_impl(hwnd, umsg, wparam, lparam, self);
-        });
+        self.rx.try_iter().for_each(|PipelineMessage(hwnd, umsg, wparam, lparam)| {
+            imgui_wnd_proc_impl(hwnd, umsg, wparam, lparam, self);
+        });

veeenu · 2024-02-25

Unfortunately not: self.rx.try_iter() already borrows self (through self.rx) for the whole iteration, and imgui_wnd_proc_impl would also have to mutably borrow self for the same lifetime, which is not possible.

Similarly, allocating a vector in Pipeline and draining/extending it there with the contents of try_iter is also not going to work for the same reason, barring using RefCell which has overhead anyway -- maybe less so than allocating a vector outright, but at the cost of making the code look a lot uglier.
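
A reduced illustration of the conflict, with hypothetical names: `handle` stands in for imgui_wnd_proc_impl and `u32` for PipelineMessage. The direct form is left commented out because the borrow checker rejects it; the collecting form mirrors what the diff under review does.

```rust
use std::sync::mpsc::Receiver;

struct Pipeline {
    rx: Receiver<u32>,
    handled: Vec<u32>,
}

impl Pipeline {
    /// Stand-in for `imgui_wnd_proc_impl`: it needs `&mut self`.
    fn handle(&mut self, msg: u32) {
        self.handled.push(msg);
    }

    fn process_messages(&mut self) {
        // Rejected by the borrow checker: the iterator keeps `self.rx` borrowed
        // while the closure needs exclusive access to `self`.
        // self.rx.try_iter().for_each(|msg| self.handle(msg));

        // Collecting first ends the borrow of `self.rx` before `handle` runs,
        // at the price of building a fresh Vec every frame.
        let queue: Vec<u32> = self.rx.try_iter().collect();
        for msg in queue {
            self.handle(msg);
        }
    }
}
```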

veeenu · 2024-02-26

I have concocted a solution for this:

Store an allocated Vec in a OnceCell.

Pull it out of the OnceCell before this section and use it as a backing storage for rx.try_iter().

Drain it immediately (preserving the capacity) and pass the contents to the wnd proc implementation.

Store the drained Vec back in the OnceCell.

This is a bit ugly, but works around the lifetime issues with little to no performance overhead -- OnceCell::take and OnceCell::set are simple moves, memory is only ever reallocated when needed, a generous buffer could be pre-allocated, and even a large vector won't be that much of a memory waste.
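
The four steps above could look roughly like the following. This is a sketch under assumptions rather than the code that landed: `queue_buffer`, `handle` and the `u32` message type are placeholders, and the initial capacity of 64 is an arbitrary pick.

```rust
use std::cell::OnceCell;
use std::sync::mpsc::Receiver;

struct Pipeline {
    rx: Receiver<u32>,
    /// Reusable backing storage for queued messages.
    queue_buffer: OnceCell<Vec<u32>>,
    handled: Vec<u32>,
}

impl Pipeline {
    fn new(rx: Receiver<u32>) -> Self {
        let queue_buffer = OnceCell::new();
        // Pre-allocate once; the capacity is reused on later frames.
        let _ = queue_buffer.set(Vec::with_capacity(64));
        Self { rx, queue_buffer, handled: Vec::new() }
    }

    /// Stand-in for the window procedure implementation, which needs `&mut self`.
    fn handle(&mut self, msg: u32) {
        self.handled.push(msg);
    }

    fn process_messages(&mut self) {
        // 1. Move the Vec out of the cell, so it no longer borrows `self`.
        let mut buffer = self.queue_buffer.take().unwrap_or_default();
        // 2. Use it as backing storage for the pending messages.
        buffer.extend(self.rx.try_iter());
        // 3. Drain it (the capacity is kept) and pass each message along.
        for msg in buffer.drain(..) {
            self.handle(msg);
        }
        // 4. Store the now-empty Vec back for the next frame.
        let _ = self.queue_buffer.set(buffer);
    }
}
```

The same move-out, move-back pattern would also work with `std::mem::take` on a plain `Vec` field; `OnceCell` is used here only because it is what the comment above describes.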

veeenu · 2024-02-26T19:23:50Z

This PR feels more like the development of a redemption arc rather than good, conscientious software engineering, but here goes.

The performance penalty incurred by @vars1ty is not acceptable, so I don't see this copy-buffers solution as viable.

Reviewing my own code, I realized that the full screen quad renderers which I wrote from scratch are literally 30-40 lines away each from being complete, fully-fledged dear imgui renderers. The difference is literally just uploading data to vertex/index buffers and adding render commands according to an enum match in a loop.
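
For readers unfamiliar with that loop, here is a stripped-down sketch of its shape. The types are local stand-ins loosely modeled on dear imgui draw data, not the definitions from the imgui crate, and `Recorded` only logs what a real engine would submit to the GPU. It also assumes that all draw lists are appended to one shared index buffer, which is why `index_base` is tracked.

```rust
/// Stand-in vertex layout: position, texture coordinates, packed colour.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Vertex {
    pos: [f32; 2],
    uv: [f32; 2],
    col: [u8; 4],
}

/// Stand-in draw command enum.
enum DrawCmd {
    /// Draw `count` indices starting at `idx_offset`, clipped to `clip_rect`.
    Elements { count: usize, idx_offset: usize, clip_rect: [f32; 4] },
    /// Restore the renderer's default pipeline state.
    ResetRenderState,
}

struct DrawList {
    vertices: Vec<Vertex>,
    indices: Vec<u16>,
    commands: Vec<DrawCmd>,
}

/// What the backend would do; a real engine issues GPU calls instead.
#[derive(Debug, PartialEq)]
enum Recorded {
    Upload { vertices: usize, indices: usize },
    Draw { count: usize, first_index: usize },
    Reset,
}

fn render(draw_lists: &[DrawList]) -> Vec<Recorded> {
    let mut out = Vec::new();
    let mut index_base = 0;
    for list in draw_lists {
        // Step 1: upload this list's geometry to the vertex/index buffers.
        out.push(Recorded::Upload { vertices: list.vertices.len(), indices: list.indices.len() });
        // Step 2: turn each command into a render call via an enum match.
        for cmd in &list.commands {
            match cmd {
                DrawCmd::Elements { count, idx_offset, clip_rect } => {
                    // Skip commands whose clip rectangle is empty.
                    if clip_rect[2] <= clip_rect[0] || clip_rect[3] <= clip_rect[1] {
                        continue;
                    }
                    out.push(Recorded::Draw { count: *count, first_index: index_base + idx_offset });
                }
                DrawCmd::ResetRenderState => out.push(Recorded::Reset),
            }
        }
        index_base += list.indices.len();
    }
    out
}
```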

In the process, I have acquired enough knowledge about the various rendering engines and GPUs that I now feel confident maintaining all of these things by myself.

In fact, I went back and checked the renderers' code from 0.5.0 and I shuddered. What the heck was that garbage, @veeenu !

We will thus go back to the previous model, skipping the current PR: one rendering engine per hook type.

The way forward: considerations

The windowing event behavior was much different across hooks and required a lot more code in the legacy way compared to the current solution. All of that behavior salad is now already accounted for, and improved upon, by the pipeline module.
Reverting to per-engine renderers will come at the cost of not being able to do concurrent rendering in opengl3 and dx9, which just don't have the capability (that I know of) of easily working asynchronously. For the time being, neither the dx11 nor the dx12 renderer will have that capability either, though that is true of this current PR as well; it is still potentially viable at some point in the future, should the need arise.
I'm hoping that all in all the new, from-scratch renderers are going to be a big, net decrease in code, and an overall simplification and streamlining of behavior, to the point where there might be no need to touch them at all for maintenance and having four of them won't be burdensome at all.
The imgui context will be managed externally by Pipeline (as in this PR) and only mutably passed to the renderers for a limited lifetime, which is a great improvement in ergonomics and separation of concerns compared to the legacy approach where the renderer created and owned the context altogether.
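
The ownership split in the last point can be sketched as follows. Everything here is a hypothetical stand-in: `Context` replaces `imgui::Context`, and the `RenderEngine` trait and `CountingEngine` exist only to show that the context is lent out for the duration of a single call.

```rust
/// Stand-in for `imgui::Context`; the real type lives in the imgui crate.
#[derive(Default)]
struct Context {
    frame_count: u64,
}

/// Hypothetical engine interface: the context is only borrowed for one call.
trait RenderEngine {
    fn render(&mut self, ctx: &mut Context);
}

/// Owns the context and lends it to whichever engine the hook uses.
struct Pipeline<E: RenderEngine> {
    ctx: Context,
    engine: E,
}

impl<E: RenderEngine> Pipeline<E> {
    fn new(engine: E) -> Self {
        Self { ctx: Context::default(), engine }
    }

    fn render_frame(&mut self) {
        // `engine` and `ctx` are separate fields, so both can be borrowed
        // mutably here; the loan of `ctx` ends when `render` returns.
        self.engine.render(&mut self.ctx);
    }
}

/// Toy engine that counts frames instead of talking to a GPU.
struct CountingEngine {
    rendered: u64,
}

impl RenderEngine for CountingEngine {
    fn render(&mut self, ctx: &mut Context) {
        ctx.frame_count += 1;
        self.rendered += 1;
    }
}
```

With this shape an engine cannot hold on to the context between frames, so the code that owns the context stays in one place regardless of how many engines exist.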

Conclusions

More updates to come soon. I have a plan well laid out in my mind to make good out of this absolute mess that I have willingly put myself in the midst of. February is coming to an end, which means that if everything goes according to plan, I will have a solid solution by the time it will be both feasible and advisable for me to finally go out and touch grass. 🌱

veeenu · 2024-02-29T10:01:32Z

Superseded by #158.

veeenu added 30 commits February 6, 2024 08:51

Heck dcomp ep1

6bf6a35

Non-dcomp DirectX 12 compositor

422c1af

Refactor

722fb15

Replace swap chain with texture render target

ee0b318

Just a checkpoint

2ed07fa

Appropriate resource barrier

14a1a74

I don't even know anymore man

9a352ea

Dx11 compositor working

f847e43

Cleanup and fmt

d823f33

Remove command queue from state

dc80900

Render engine cleanup

2248fec

More cleanup

f131e88

Let's just start over

6491a2a

New render engine

d102934

Implement render engine parts minus render loop

0353ec9

Add barrier helpers

fb3158d

Render draw data and sync

f47d730

New Dx12 compositor

8efa843

Error blob cleanup

275e44b

A bloody mess but it works somehow

d2577fa

Oh my god

46afcd2

Dx11 works as well

f6d889a

Clippy cleanup

6b7f408

Cleanup

53cb57d

Dx9 attempt

614819e

Harnesses: resize and debug interface

1798181

Compute shader attempt

c51dfc0

Compute compositor almost there

8f7dacc

Add demo imgui window examples

c057ac4

Cleanup test harnesses

2b59b5a

veeenu added 10 commits February 24, 2024 10:31

Remove compute compositor

b2266c1

Dx9 shows something

99ed141

Got the copy pitch right

50b8e96

Remove unimportant stuff from the dx9 compositor

f3d2a0a

Refactor

f3b6ce7

There is a weird crash in the dx12 harness but it's not urgent

fe0de85

I need a vacation

b81835a

Add OpenGL3 demo

4d49d6b

Change ImguiRenderLoop to accept imgui::Context again

8482e2e

Readd render engine

8e6def6

veeenu added 2 commits February 25, 2024 11:20

Prepping for GL_EXT_memory_objects_win32

322617e

Backpedal

133e4e4

camas reviewed Feb 25, 2024


veeenu mentioned this pull request Feb 29, 2024

Revamped per-engine renderers #158

Merged

veeenu closed this Feb 29, 2024

veeenu deleted the heck-dcomp branch March 2, 2024 08:12
