Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build batches across phases in parallel. #17764

Merged
merged 7 commits into from
Feb 13, 2025

Conversation

pcwalton
Copy link
Contributor

Currently, invocations of batch_and_prepare_binned_render_phase and batch_and_prepare_sorted_render_phase can't run in parallel because they write to scene-global GPU buffers. After PR #17698, batch_and_prepare_binned_render_phase started accounting for the lion's share of the CPU time, causing us to be strongly CPU bound on scenes like Caldera when occlusion culling was on (because of the overhead of batching for the Z-prepass). Although I eventually plan to optimize batch_and_prepare_binned_render_phase, we can obtain significant wins now by parallelizing that system across phases.

This commit splits all GPU buffers that
batch_and_prepare_binned_render_phase and
batch_and_prepare_sorted_render_phase touches into separate buffers for each phase so that the scheduler will run those phases in parallel. At the end of batch preparation, we gather the render phases up into a single resource with a new collection phase. Because we already run mesh preprocessing separately for each phase in order to make occlusion culling work, this is actually a cleaner separation. For example, mesh output indices (the unique ID that identifies each mesh instance on GPU) are now guaranteed to be sequential starting from 0, which will simplify the forthcoming work to remove them in favor of the compute dispatch ID.

On Caldera, this brings the frame time down to approximately 9.1 ms with occlusion culling on.

Screenshot 2025-02-08 210720

@pcwalton pcwalton added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Feb 10, 2025
@pcwalton pcwalton requested a review from superdump February 10, 2025 00:19
Currently, invocations of `batch_and_prepare_binned_render_phase` and
`batch_and_prepare_sorted_render_phase` can't run in parallel because
they write to scene-global GPU buffers. After PR bevyengine#17698,
`batch_and_prepare_binned_render_phase` started accounting for the
lion's share of the CPU time, causing us to be strongly CPU bound on
scenes like Caldera when occlusion culling was on (because of the
overhead of batching for the Z-prepass). Although I eventually plan to
optimize `batch_and_prepare_binned_render_phase`, we can obtain
significant wins now by parallelizing that system across phases.

This commit splits all GPU buffers that
`batch_and_prepare_binned_render_phase` and
`batch_and_prepare_sorted_render_phase` touches into separate buffers
for each phase so that the scheduler will run those phases in parallel.
At the end of batch preparation, we gather the render phases up into a
single resource with a new *collection* phase. Because we already run
mesh preprocessing separately for each phase in order to make occlusion
culling work, this is actually a cleaner separation. For example, mesh
output indices (the unique ID that identifies each mesh instance on GPU)
are now guaranteed to be sequential starting from 0, which will simplify
the forthcoming work to remove them in favor of the compute dispatch ID.

On Caldera, this brings the frame time down to approximately 9.1 ms with
occlusion culling on.
@pcwalton pcwalton force-pushed the parallel-batch-and-prepare branch from 08de0fe to c1f9764 Compare February 10, 2025 00:46
@@ -185,6 +190,10 @@ fn main() {
.set(RenderPlugin {
allow_copies_from_indirect_parameters: true,
..default()
})
.set(PbrPlugin {
allow_copies_from_indirect_parameters: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be reused from the RenderPlugin? Rather than having to set it in two places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate as to how that would work? The problem is that the PbrPlugin can't reach into the RenderPlugin to check its value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine for now but it's a rather particular option and I might suggest subsuming it some kind of broader "debug renderer" setting if we ever have a second instance of this kind of thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. Or maybe a "debug flags"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and switched this to a RenderDebugFlags so that we can have more of them.

Copy link
Contributor

@tychedelia tychedelia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm. Thanks for creating the debug flags, a lot cleaner. Being able to lean on the ECS scheduler for this is nice and clean.

@superdump superdump added this pull request to the merge queue Feb 13, 2025
@alice-i-cecile alice-i-cecile added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Feb 13, 2025
Merged via the queue into bevyengine:main with commit 0ede857 Feb 13, 2025
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants