
Use global binding arrays for bindless resources. #17898

Merged
merged 9 commits into bevyengine:main from pcwalton:global-binding-arrays
Feb 21, 2025

Conversation

pcwalton
Contributor

@pcwalton pcwalton commented Feb 17, 2025

Currently, Bevy's implementation of bindless resources is rather unusual: every binding in an object that implements AsBindGroup (most commonly, a material) becomes its own separate binding array in the shader. This is inefficient for two reasons:

  1. If multiple materials reference the same texture or other resource, the reference to that resource will be duplicated many times. This increases wgpu validation overhead.

  2. It creates many unused binding array slots. This increases wgpu and driver overhead and makes it easier to hit limits on APIs that wgpu currently imposes tight resource limits on, like Metal.

This PR fixes these issues by switching Bevy to use the standard approach in GPU-driven renderers, in which resources are de-duplicated and passed as global arrays, one for each type of resource.

Along the way, this patch introduces per-platform resource limits and bumps them from 16 resources per binding array to 64 resources per bind group on Metal and 2048 resources per bind group on other platforms. (Note that the number of resources per binding array isn't the same as the number of resources per bind group; as it currently stands, if all the PBR features are turned on, Bevy could pack as many as 496 resources into a single slab.) The limits have been increased because wgpu now has universal support for partially-bound binding arrays, which means that we no longer need to fill the binding arrays with fallback resources on Direct3D 12. The #[bindless(LIMIT)] declaration when deriving AsBindGroup can now simply be written #[bindless] in order to have Bevy choose a default limit for the current platform. Custom limits are still available with the new #[bindless(limit(LIMIT))] syntax, e.g. #[bindless(limit(8))].
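For illustration, here is a minimal sketch of the new attribute in use; the material struct, its fields, and the binding numbers are made up for the example and are not Bevy's actual StandardMaterial:

use bevy::prelude::*;
use bevy::render::render_resource::AsBindGroup;

// A hypothetical bindless material.
#[derive(Asset, TypePath, AsBindGroup, Clone)]
#[bindless] // Let Bevy choose the default limit for the current platform.
// #[bindless(limit(8))] // Or request a custom per-slab resource limit.
struct MyMaterial {
    #[texture(0)]
    #[sampler(1)]
    color_texture: Option<Handle<Image>>,
}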

The material bind group allocator has been completely rewritten. Now there are two allocators: one for bindless materials and one for non-bindless materials. The new non-bindless material allocator simply maintains a 1:1 mapping from material to bind group. The new bindless material allocator maintains a list of slabs and allocates materials into slabs on a first-fit basis. This unfortunately makes its performance O(number of resources per object * number of slabs), but the number of slabs is likely to be low, and it's planned to become even lower in the future with wgpu improvements. Resources are de-duplicated within a slab and reference counted. So, for instance, if multiple materials refer to the same texture, that texture will exist only once in the appropriate binding array.
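As a rough sketch of the first-fit, reference-counted scheme described above (the types and names here are simplified stand-ins, not the actual allocator):

use std::collections::HashMap;

// Stand-in for a handle to a texture, sampler, or buffer.
type ResourceId = u64;

struct Slab {
    // Reference counts for the de-duplicated resources in this slab.
    ref_counts: HashMap<ResourceId, u32>,
    capacity: usize,
}

impl Slab {
    // First-fit check: would this material's resources fit in the slab?
    fn can_fit(&self, resources: &[ResourceId]) -> bool {
        let new_resources = resources
            .iter()
            .filter(|id| !self.ref_counts.contains_key(*id))
            .count();
        self.ref_counts.len() + new_resources <= self.capacity
    }

    // Insert resources, bumping reference counts so that a texture shared by
    // several materials occupies only one binding array slot.
    fn insert(&mut self, resources: &[ResourceId]) {
        for &id in resources {
            *self.ref_counts.entry(id).or_insert(0) += 1;
        }
    }
}

// O(resources per object * number of slabs): scan existing slabs first-fit,
// then fall back to opening a new slab.
fn allocate(slabs: &mut Vec<Slab>, resources: &[ResourceId]) -> usize {
    if let Some(index) = slabs.iter().position(|slab| slab.can_fit(resources)) {
        slabs[index].insert(resources);
        return index;
    }
    let mut slab = Slab { ref_counts: HashMap::new(), capacity: 64 };
    slab.insert(resources);
    slabs.push(slab);
    slabs.len() - 1
}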

To support these new features, this patch adds the concept of a bindless descriptor to the AsBindGroup trait. The bindless descriptor allows the material bind group allocator to probe the layout of the material, now that an array of BindGroupLayoutEntry records is insufficient to describe the group. The #[derive(AsBindGroup)] macro has been heavily modified to support the new features. The most important user-facing change to that macro is that the struct-level uniform attribute, #[uniform(BINDING_NUMBER, StandardMaterial)], now reads #[uniform(BINDLESS_INDEX, MATERIAL_UNIFORM_TYPE, binding_array(BINDING_NUMBER))], allowing the material to specify the binding number for the binding array that holds the uniform data.
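In concrete terms, the struct-level attribute migration looks like this (the binding numbers are illustrative):

// Before this PR: binding number, then the uniform type.
#[uniform(0, StandardMaterialUniform)]

// After this PR: bindless index, the uniform type, and the binding number of
// the binding array that holds the uniform data.
#[uniform(0, StandardMaterialUniform, binding_array(10))]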

To make this patch simpler, I removed support for bindless ExtendedMaterials, as well as field-level bindless uniform and storage buffers. I intend to add back support for these as a follow-up. Because they aren't in any released Bevy version yet, I figured this was OK.

Finally, this patch updates StandardMaterial for the new bindless changes. Generally, code throughout the PBR shaders that looked like base_color_texture[slot] now looks like
bindless_2d_textures[material_indices[slot].base_color_texture].

This patch fixes a system hang that I experienced on the Caldera test when running with caldera --random-materials --texture-count 100. The time per frame is around 19.75 ms, down from 154.2 ms in Bevy 0.14: a 7.8× speedup.

@pcwalton pcwalton added the A-Rendering, C-Performance, S-Needs-Review, C-Bug, and P-Crash labels Feb 17, 2025
@pcwalton pcwalton force-pushed the global-binding-arrays branch 2 times, most recently from 25b931b to 78b1009 on February 17, 2025 08:52
@pcwalton pcwalton added this to the 0.16 milestone Feb 17, 2025
@pcwalton pcwalton force-pushed the global-binding-arrays branch from 78b1009 to 3627a5d on February 17, 2025 18:07
pcwalton added a commit to pcwalton/bevy that referenced this pull request Feb 17, 2025
Currently, the specialized pipeline cache maps a (view entity, mesh
entity) tuple to the retained pipeline for that entity. This causes two
problems:

1. Using the view entity is incorrect, because the view entity isn't
   stable from frame to frame.

2. Switching the view entity to a `RetainedViewEntity`, which is
   necessary for correctness, significantly regresses performance of
   `specialize_material_meshes` and `specialize_shadows` because of the
   loss of the fast `EntityHash`.

This patch fixes both problems by switching to a *two-level* hash table.
The outer level of the table maps each `RetainedViewEntity` to an inner
table, which maps each `MainEntity` to its pipeline ID and change tick.
Because we loop over views first and, within that loop, loop over
entities visible from that view, we hoist the slow lookup of the view
entity out of the inner entity loop.

Additionally, this patch fixes a bug whereby pipeline IDs were leaked
when removing the view. We still have a problem with leaking pipeline
IDs for deleted entities, but that won't be fixed until the specialized
pipeline cache is retained.

This patch improves performance of the [Caldera benchmark] from 7.8×
faster than 0.14 to 9.0× faster than 0.14, when applied on top of the
global binding arrays PR, bevyengine#17898.

[Caldera benchmark]: https://github.com/DGriffin91/bevy_caldera_scene
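For illustration, the two-level table described in that commit message has roughly this shape (the key and value types are simplified stand-ins for Bevy's RetainedViewEntity, MainEntity, Tick, and cached pipeline ID):

use std::collections::HashMap;

type RetainedViewEntity = u64;
type MainEntity = u64;
type Tick = u32;
type PipelineId = u32;

// Outer level: keyed by the stable retained view. Inner level: keyed by the
// main-world entity, storing the pipeline ID and change tick.
type SpecializedPipelineCache =
    HashMap<RetainedViewEntity, HashMap<MainEntity, (Tick, PipelineId)>>;

fn lookup_pipelines(
    cache: &SpecializedPipelineCache,
    view: RetainedViewEntity,
    visible: &[MainEntity],
) {
    // One slow outer lookup per view, hoisted out of the entity loop...
    if let Some(per_view) = cache.get(&view) {
        for entity in visible {
            // ...then a fast inner lookup per visible entity.
            let _cached = per_view.get(entity);
        }
    }
}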
@pcwalton pcwalton force-pushed the global-binding-arrays branch from 3627a5d to b3afb48 on February 17, 2025 21:59
Member

@aevyrie aevyrie left a comment


I'm a bit out of my depth here technically. I can't comment on the correctness of the changes to bindless; however, I did review the code and documentation to the best of my ability.

Documentation, code, and comments are all easy to read, though they require some background I am missing.

fn is_full(&self) -> bool {
    self.used_slot_bitmap == (1 << (self.unprepared_bind_groups.len() as u32)) - 1
}

// Release the slot ID.
self.free_slots.push(slot);
Member


If a bad descriptor is passed in somehow and no resources are found in the table, this code to push the slot and decrement the allocation count will still run. I think this is still correct, but wanted to highlight for review in case this is a real edge case.

Contributor Author


The use of ref counting means that double frees are going to cause havoc in general, unfortunately. We just have to be careful.

    Auto,
    Limit(LitInt),
}

pub fn derive_as_bind_group(ast: syn::DeriveInput) -> Result<TokenStream> {
Member


This macro is terrifying.

github-merge-queue bot pushed a commit that referenced this pull request Feb 18, 2025
Comment on lines 285 to 291
/// For example, suppose that the material slot is stored in a variable named
/// `slot`, the bindless index table is named `materials`, and that the first
/// field (index 0) of the bindless index table type is named `material`. Then
/// specifying `#[uniform(0, StandardMaterialUniform, binding_array(10))]` will
/// create a binding array buffer declared in the shader as `var<storage>
/// material: binding_array<StandardMaterialUniform>` and accessible as
/// `material[materials[slot].material]`.
Contributor


Great to have an example. I'd recommend not reusing `material` for the binding name. So:

Suggested change
/// For example, suppose that the material slot is stored in a variable named
/// `slot`, the bindless index table is named `materials`, and that the first
/// field (index 0) of the bindless index table type is named `material`. Then
/// specifying `#[uniform(0, StandardMaterialUniform, binding_array(10))]` will
/// create a binding array buffer declared in the shader as `var<storage>
/// material: binding_array<StandardMaterialUniform>` and accessible as
/// `material[materials[slot].material]`.
/// For example, suppose that the material slot is stored in a variable named
/// `slot`, the bindless index table is named `material_indices`, and that the first
/// field (index 0) of the bindless index table type is named `material`. Then
/// specifying `#[uniform(0, StandardMaterialUniform, binding_array(10))]` will
/// create a binding array buffer declared in the shader as `var<storage>
/// material_array: binding_array<StandardMaterialUniform>` and accessible as
/// `material_array[material_indices[slot].material]`.

Or something like that.

I think this may need explicit code examples in the release notes, and/or a diagram showing how the pieces connect together. Not a criticism of what is done, which is necessary, just recognising that it's not the simplest thing to follow for people who just want to write a shader. :)

Contributor Author


OK, I made these changes:

  • pbr_bindings::material is now pbr_bindings::material_array.
  • pbr_bindings::material_bindings is now pbr_bindings::material_indices.
  • I added an ASCII art diagram to the AsBindGroup documentation to illustrate how this works.

/// This limit primarily exists in order to work around `wgpu` performance
/// problems involving large numbers of bindless resources. Also, some
/// platforms, such as Metal, currently enforce limits on the number of
/// resources in use.
Contributor


On Metal, is this limit exposed through the wgpu Limits? If so, then maybe that one doesn't need to be hard-coded and this can maintain itself by taking the minimum of the appropriate value from Limits and the AUTO_BINDLESS_SLAB_RESOURCE_LIMIT. Then we can later just remove AUTO_BINDLESS_SLAB_RESOURCE_LIMIT when it is no longer relevant because wgpu performance issues for bindless arrays have been addressed, and it will still be correct?

Contributor Author


I added the customizable limit so that applications that are seeing performance problems from large binding arrays have a way to address the issues. The idea is that, if you're seeing problems, you lower the limit until the performance is acceptable. This seems inherently application-specific, so I don't think we can query it from wgpu.

Contributor


Right, and that's a good choice and useful to have. But is Metal's value purely a performance constraint on Metal because these overheads are higher? Or is it because of a limit exposed by wgpu for the current way it is implemented in wgpu?

}

// Mark the buffer as needing to be recreated, in case we grew it.
Contributor


Is it possible to detect whether the buffer grew (by saving the capacity before and comparing after) and only mark it if so?

Contributor Author

@pcwalton pcwalton Feb 21, 2025


Yes, but I'd like to do this in a follow up because the optimal thing to do is actually more complex than that: we can actually avoid recreating the bind group at all in many cases, which is needed to do well on the benchmark that Griffin added.
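For illustration, a minimal sketch of the reviewer's capacity-comparison idea, using a Vec<u8> as a stand-in for the CPU-side copy of the buffer (the function and its role are assumptions, not the actual Bevy code):

// Grows the staging data to the required length and reports whether the
// backing allocation actually grew, which is when the bind group would need
// to be recreated.
fn grow_and_check(staging: &mut Vec<u8>, required_len: usize) -> bool {
    let old_capacity = staging.capacity();
    if staging.len() < required_len {
        staging.resize(required_len, 0);
    }
    staging.capacity() > old_capacity
}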

Contributor

@atlv24 atlv24 left a comment


Nice job! I'm a bit apprehensive about the number of panics and expects; however, I think this is already very common in the rendering codebase. This is something that should be looked into in the future as a separate, more targeted effort, and it is not blocking for this PR.

@pcwalton
Contributor Author

@atlv24 Yeah, originally I had fewer panics and expects, but it really made the code messy to have all these extraneous `error!` calls and return paths.

@pcwalton pcwalton requested a review from superdump February 21, 2025 01:24
/// | 2  | base_color_sampler          |
/// +----+-----------------------------+
/// | .. | ...                         |
/// +----+-----------------------------+
Contributor


Oh wow, very nice!!!

Comment on lines +1377 to +1380
if self.bindings.len() < slot as usize + 1 {
    self.bindings.resize_with(slot as usize + 1, || None);
}
self.bindings[slot as usize] = Some(MaterialBindlessBinding::new(resource));
Contributor


Why not push?

Contributor Author


I think it's slightly less code this way. Otherwise it'd be something like:

if self.bindings.len() < slot as usize + 1 {
    if self.bindings.len() < slot as usize {
        self.bindings.resize_with(slot as usize, || None);
    }
    self.bindings.push(Some(MaterialBindlessBinding::new(resource)));
} else {
    self.bindings[slot as usize] = Some(MaterialBindlessBinding::new(resource));
}

@superdump superdump added this pull request to the merge queue Feb 21, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 21, 2025
@superdump superdump added this pull request to the merge queue Feb 21, 2025
Merged via the queue into bevyengine:main with commit 2844133 Feb 21, 2025
29 checks passed
pcwalton added a commit to pcwalton/bevy that referenced this pull request Feb 21, 2025
…on the frame we first encountered it.

We might not be able to prepare a material on the first frame we
encounter a mesh using it for various reasons, including that the
material hasn't been loaded yet or that preparing the material is
exceeding the per-frame cap on number of bytes to load. When this
happens, we currently try to find the material in the
`MaterialBindGroupAllocator`, fail, and then fall back to group 0, slot
0, the default `MaterialBindGroupId`, which is obviously incorrect.
Worse, we then fail to dirty the mesh and reextract it when we *do*
finish preparing the material, so the mesh will continue to be rendered
with an incorrect material.

This patch fixes both problems. In `collect_meshes_for_gpu_building`, if
we fail to find a mesh's material in the `MaterialBindGroupAllocator`, then
we detect that case, bail out, and add it to a list,
`MeshesToReextractNextFrame`. On subsequent frames, we process all the
meshes in `MeshesToReextractNextFrame` as though they were changed. This
ensures both that we don't render a mesh if its material hasn't been
loaded and that we start rendering the mesh once its material does load.

This was first noticed in the intermittent Pixel Eagle failures in the
`testbed_3d` example in bevyengine#17898, although the problem has actually existed
for some time. I believe it just so happened that the changes to the
allocator in that PR caused the problem to appear more commonly than it
did before.
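For illustration, a rough sketch of the retry flow that commit message describes (the types and signatures are assumptions, not Bevy's actual API):

use std::collections::HashSet;

// Stand-in for Bevy's MainEntity.
type MainEntity = u64;

#[derive(Default)]
struct MeshesToReextractNextFrame(HashSet<MainEntity>);

fn collect_meshes_for_gpu_building(
    changed_this_frame: &[MainEntity],
    material_is_prepared: impl Fn(MainEntity) -> bool,
    retry: &mut MeshesToReextractNextFrame,
) {
    // Treat meshes deferred last frame as though they changed this frame.
    let mut to_process: Vec<MainEntity> = changed_this_frame.to_vec();
    to_process.extend(retry.0.drain());

    for mesh in to_process {
        if material_is_prepared(mesh) {
            // ... extract the mesh and record its material bind group ...
        } else {
            // Bail out rather than falling back to the default bind group,
            // and try this mesh again next frame.
            retry.0.insert(mesh);
        }
    }
}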
Labels
A-Rendering: Drawing game state to the screen
C-Bug: An unexpected or incorrect behavior
C-Performance: A change motivated by improving speed, memory usage or compile times
P-Crash: A sudden unexpected crash
S-Needs-Review: Needs reviewer attention (from anyone!) to move forward
Projects
Status: Done
4 participants