Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Per-engine compositors #156

Closed
wants to merge 42 commits into from
Closed

Per-engine compositors #156

wants to merge 42 commits into from

Conversation

veeenu
Copy link
Owner

@veeenu veeenu commented Feb 24, 2024

I come to you, dear community, with a PR that would infuse a merciless sense of dread even in the hearts of the greatest practitioners of Zen meditation alive.

immagine

This PR is close to a full rewrite of hudhook's core. The plan is to carefully review every single file and running extensive tests before smashing the green button, but the quality has in my opinion improved greatly.

2024-02-24_164233
2024-02-24_164331
2024-02-24_164442
2024-02-24_164050

Changes

Unified renderer

We stay the course with the single imgui renderer as of the previous PR #143.

The single renderer allows us to work on imgui specific circumstances from a single unified spot. This way, UI rendering issues and new features don't have to be spread over four different renderers, and we will be able in the future to leverage rendering concurrency if we so wish.

This is currently not so; the imgui frame is constructed and rendered right in the middle of the Present hooks, and blocking those calls. Luckily hudhook and imgui are pretty slick and can do all that in all four renderers without ever dipping below a solid 100fps on my machine (RTX 3080).

Per-engine compositors

I took a deep plunge in the dark seas of rendering engines and built four full-screen quad renderers from scratch, one per supported engine, which take the off-screen rendered resource and composite it on top of the appropriate backbuffer.

  • The DirectX 12 compositor works seamlessly.
  • The DirectX 11 compositor needs to open the off-screen resource as a shared resource.
  • The DirectX 9 and OpenGL3 renderers do not have an easy means of sharing memory with DirectX 12, so a very bad fix was necessary: after the DirectX 12 rendering is done, the resource is staged and mapped, its data is pulled on the CPU side and reuploaded to the DirectX 9 / OpenGL3 compositors as their own kind of texture. This is all manners of awful because we move from the GPU to the CPU back to the GPU which has insane latency in theory. In practice, I never dipped below 100fps in all my tests despite of this, and the DirectX 9 and OpenGL 3 games I tested don't seem to bat an eye at all.
    If you are using one of these renderers, PLEASE test this branch to confirm that there is no performance penalty.
    If you are willing to dig for a solution that doesn't leave the GPU, I'd be forever grateful.

Asynchronous input processing

There is now a Pipeline object (its design is still a bit dirty, I want to work on it) that is responsible for managing the lifecycle of the rendering engine, the imgui context, and among those operations will also manage the window procedure. The window procedure proper is now dummy and only ships its parameters to a mpsc::channel.

This has the advantage that the window events are now processed asynchronously just before a frame gets rendered, which should be fine on our end, but the drawback that an event cannot be blocked immediately should the render loop wish to do that, as it won't be able to decide right away whether to consume the events without forwarding them. This should not be a big deal as these decisions sort of depend on the state before the rendering anyway, rather than on any single specific event, and the previous model is not more fine grained than that.

If you had issues with blocking input, please give this branch a go, it might just inadvertently fix that. Or make things worse. It did not make things worse for my use cases.

Wine support

I have tested on-and-off the library on Wine with Proton, and it seems like it's working pretty well. There's still work to be done to figure things out, but we're getting there.

Todo list

  • Refactor the Pipeline external API; it is very similar in all renderers and the current solution looks very bad
  • Refactor the window procedures as they are identical across renderers and can be hidden away
  • Review the entire codebase
  • Add documentation and comments where needed

Affected issues

Conclusion

All in all, I am very happy about the results of this PR and have been working tirelessly towards it. There are some outstanding issues, but all manageable. I will keep making inconsequential changes while I wait for your feedback, but my plan is to wrap this up sooner rather than later and finally release 0.6.0.

cc @Jakobzs @vars1ty @vsylva @camas

@vars1ty
Copy link
Contributor

vars1ty commented Feb 25, 2024

Tested in Star Stable Online with the example OGL3 DLL, built with MSVC. All tested via Wine.

Average FPS without, and with UI displayed:

  • Before: ~202 FPS
  • After: ~60 FPS

So does tank a bit, worse FPS than the legacy render engine, but does now work on Wine unlike the last refactor.

Comment on lines +91 to +96
// TODO find a better alternative than allocating each frame
let message_queue = self.rx.try_iter().collect::<Vec<_>>();

message_queue.into_iter().for_each(|PipelineMessage(hwnd, umsg, wparam, lparam)| {
imgui_wnd_proc_impl(hwnd, umsg, wparam, lparam, self);
});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass directly to for_each?

Suggested change
// TODO find a better alternative than allocating each frame
let message_queue = self.rx.try_iter().collect::<Vec<_>>();
message_queue.into_iter().for_each(|PipelineMessage(hwnd, umsg, wparam, lparam)| {
imgui_wnd_proc_impl(hwnd, umsg, wparam, lparam, self);
});
self.rx.try_iter().for_each(|PipelineMessage(hwnd, umsg, wparam, lparam)| {
imgui_wnd_proc_impl(hwnd, umsg, wparam, lparam, self);
});

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not: self.rx.try_iter() already mutably borrows self, and imgui_wnd_proc_impl would also have to mutably borrow self for the same lifetime which is not possible.

Similarly, allocating a vector in Pipeline and draining/extending it there with the contents of try_iter is also not going to work for the same reason, barring using RefCell which has overhead anyway -- maybe less so than allocating a vector outright, but at the cost of making the code look a lot uglier.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have concocted a solution for this:

  • Store an allocated Vec in a OnceCell.
  • Pull it out of the OnceCell before this section and use it as a backing storage for rx.try_iter().
  • Drain it immediately (preserving the capacity) and pass the contents to the wnd proc implementation.
  • Store the drained Vec back in the OnceCell.

This is a bit ugly, but works around the lifetime issues with little to no performance overhead -- OnceCell::take and OnceCell::set are simple moves, memory is only ever reallocated when needed, a generous buffer could be pre-allocated, and even a large vector won't be that much of a memory waste.

@veeenu
Copy link
Owner Author

veeenu commented Feb 26, 2024

This PR feels more like the development of a redemption arc rather than good, conscientious software engineering, but here goes.

The performance penalty incurred by @vars1ty is not acceptable, so I don't see this copy-buffers solution as viable.

Reviewing my own code, I realized that the full screen quad renderers which I wrote from scratch are literally 30-40 lines away each from being complete, fully-fledged dear imgui renderers. The difference is literally just uploading data to vertex/index buffers and adding render commands according to an enum match in a loop.

In the process, I have acquired enough knowledge about the various rendering engines and GPUs that I now feel confident maintaining all of these things by myself.

In fact, I went back and checked the renderers' code from 0.5.0 and I shuddered. What the heck was that garbage, @veeenu !

We will thus go back to the previous model, skipping the current PR: one rendering engine per hook type.

The way forward: considerations

  • The windowing event behavior was much different across hooks and required a lot more code in the legacy way compared to the current solution. All of that behavior salad is now already accounted for, and improved upon, by the pipeline module.

  • Reverting to per-renderer renderers will come at the cost of not being able to do concurrent rendering in opengl3 and dx9, which just don't have the capability (that I know of) of easily working asynchronously. For the time being, neither the dx11 and dx12 renderers will have that capability, though that is true of this current PR as well; but it is still potentially viable to do so at some point in the future, should that need arise.

  • I'm hoping that all in all the new, from-scratch renderers are going to be a big, net decrease in code, and an overall simplification and streamlining of behavior, to the point where there might be no need to touch them at all for maintenance and having four of them won't be burdensome at all.

  • The imgui context will be managed externally by Pipeline (like as of this PR) and only mutably passed to the renderers for a limited lifetime, which is a great improvement in ergonomics and separation of concerns compared to the legacy approach where the renderer created and owned the context altogether.

Conclusions

More updates to come soon. I have a plan well laid out in my mind to make good out of this absolute mess that I have willingly put myself in the midst of. February is coming to an end, which means that if everything goes according to plan, I will have a solid solution by the time it will be both feasible and advisable for me to finally go out and touch grass. 🌱

@veeenu
Copy link
Owner Author

veeenu commented Feb 29, 2024

Superseded by #158.

@veeenu veeenu closed this Feb 29, 2024
@veeenu veeenu deleted the heck-dcomp branch March 2, 2024 08:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DirectComposition is not implemented on Wine
3 participants