Use global binding arrays for bindless resources. #17898
Conversation
Currently, the specialized pipeline cache maps a (view entity, mesh entity) tuple to the retained pipeline for that entity. This causes two problems:

1. Using the view entity is incorrect, because the view entity isn't stable from frame to frame.
2. Switching the view entity to a `RetainedViewEntity`, which is necessary for correctness, significantly regresses performance of `specialize_material_meshes` and `specialize_shadows` because of the loss of the fast `EntityHash`.

This patch fixes both problems by switching to a *two-level* hash table. The outer level of the table maps each `RetainedViewEntity` to an inner table, which maps each `MainEntity` to its pipeline ID and change tick. Because we loop over views first and, within that loop, loop over entities visible from that view, we hoist the slow lookup of the view entity out of the inner entity loop.

Additionally, this patch fixes a bug whereby pipeline IDs were leaked when removing the view. We still have a problem with leaking pipeline IDs for deleted entities, but that won't be fixed until the specialized pipeline cache is retained.

This patch improves performance of the [Caldera benchmark] from 7.8× faster than 0.14 to 9.0× faster than 0.14, when applied on top of the global binding arrays PR, bevyengine#17898.

[Caldera benchmark]: https://github.com/DGriffin91/bevy_caldera_scene
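The two-level cache structure described above can be sketched in plain Rust. This is a minimal illustration, not Bevy's actual API: the type aliases and method names below are hypothetical stand-ins for `RetainedViewEntity`, `MainEntity`, and the cached pipeline ID plus change tick.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for Bevy's types, for illustration only.
type RetainedViewEntity = u64; // stable across frames, but slower to hash
type MainEntity = u32;         // fast EntityHash-style key
type PipelineId = usize;
type Tick = u32;

/// Sketch of the two-level specialized pipeline cache: the outer map is
/// keyed by the stable retained view, the inner map by the mesh entity.
#[derive(Default)]
struct SpecializedPipelineCache {
    views: HashMap<RetainedViewEntity, HashMap<MainEntity, (PipelineId, Tick)>>,
}

impl SpecializedPipelineCache {
    fn specialize(
        &mut self,
        view: RetainedViewEntity,
        entities: &[(MainEntity, PipelineId, Tick)],
    ) {
        // Hoist the slower view lookup out of the per-entity loop.
        let inner = self.views.entry(view).or_default();
        for &(entity, pipeline, tick) in entities {
            inner.insert(entity, (pipeline, tick));
        }
    }

    /// Removing a view must also surrender the pipeline IDs it held;
    /// otherwise they leak (the bug the patch fixes).
    fn remove_view(&mut self, view: RetainedViewEntity) -> Vec<PipelineId> {
        self.views
            .remove(&view)
            .map(|inner| inner.into_values().map(|(id, _)| id).collect())
            .unwrap_or_default()
    }
}
```

The key point is that the expensive `RetainedViewEntity` hash happens once per view, while the hot inner loop only pays for the fast `MainEntity` hash.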
Currently, Bevy's implementation of bindless resources is rather unusual: every binding in an object that implements `AsBindGroup` (most commonly, a material) becomes its own separate binding array in the shader. This is inefficient for two reasons:

1. If multiple materials reference the same texture or other resource, the reference to that resource will be duplicated many times. This increases `wgpu` validation overhead.
2. It creates many unused binding array slots. This increases `wgpu` and driver overhead and makes it easier to hit limits on APIs that `wgpu` currently imposes tight resource limits on, like Metal.

This PR fixes these issues by switching Bevy to the standard approach in GPU-driven renderers, in which resources are de-duplicated and passed as global arrays, one for each type of resource.

Along the way, this patch introduces per-platform resource limits and bumps them from 16 resources per binding array to 64 resources per bind group on Metal and 2048 resources per bind group on other platforms. (Note that the number of resources per *binding array* isn't the same as the number of resources per *bind group*; as it currently stands, if all the PBR features are turned on, Bevy could pack as many as 496 resources into a single slab.) The limits have been increased because `wgpu` now has universal support for partially-bound binding arrays, which means that we no longer need to fill the binding arrays with fallback resources on Direct3D 12. The `#[bindless(LIMIT)]` declaration when deriving `AsBindGroup` can now simply be written `#[bindless]` in order to have Bevy choose a default limit size for the current platform. Custom limits are still available with the new `#[bindless(limit(LIMIT))]` syntax: e.g. `#[bindless(limit(8))]`.

The material bind group allocator has been completely rewritten. Now there are two allocators: one for bindless materials and one for non-bindless materials. The new non-bindless material allocator simply maintains a 1:1 mapping from material to bind group. The new bindless material allocator maintains a list of slabs and allocates materials into slabs on a first-fit basis. This unfortunately makes its performance O(number of resources per object × number of slabs), but the number of slabs is likely to be low, and it's planned to become even lower in the future with `wgpu` improvements. Resources are de-duplicated within a slab and reference counted. So, for instance, if multiple materials refer to the same texture, that texture will exist only once in the appropriate binding array.

To support these new features, this patch adds the concept of a *bindless descriptor* to the `AsBindGroup` trait. The bindless descriptor allows the material bind group allocator to probe the layout of the material, now that an array of `BindGroupLayoutEntry` records is insufficient to describe the group. The `#[derive(AsBindGroup)]` macro has been heavily modified to support the new features. The most important user-facing change to that macro is that the struct-level `uniform` attribute, `#[uniform(BINDING_NUMBER, StandardMaterial)]`, now reads `#[uniform(BINDLESS_INDEX, MATERIAL_UNIFORM_TYPE, binding_array(BINDING_NUMBER))]`, allowing the material to specify the binding number for the binding array that holds the uniform data.

To make this patch simpler, I removed support for bindless `ExtendedMaterial`s, as well as field-level bindless uniform and storage buffers. I intend to add back support for these as a follow-up. Because they aren't in any released Bevy version yet, I figured this was OK.

Finally, this patch updates `StandardMaterial` for the new bindless changes. Generally, code throughout the PBR shaders that looked like `base_color_texture[slot]` now looks like `bindless_2d_textures[material_indices[slot].base_color_texture]`.

This patch fixes a system hang that I experienced on the [Caldera test] when running with `caldera --random-materials --texture-count 100`. The time per frame is around 19.75 ms, down from 154.2 ms in Bevy 0.14: a 7.8× speedup.
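The first-fit slab allocation with in-slab de-duplication described above can be sketched in plain Rust. This is a simplified model under assumed types (a "resource" is just an ID, and slab capacity is a plain count), not Bevy's actual allocator:

```rust
use std::collections::HashMap;

// Hypothetical simplification: each material is just a list of resource IDs.
type ResourceId = u64;

struct Slab {
    capacity: usize,
    // resource -> reference count; duplicate references share one slot.
    refcounts: HashMap<ResourceId, usize>,
}

impl Slab {
    fn new(capacity: usize) -> Self {
        Slab { capacity, refcounts: HashMap::new() }
    }

    /// Number of *new* slots this material would need; resources already
    /// present in the slab are de-duplicated and cost nothing.
    fn new_slots_needed(&self, resources: &[ResourceId]) -> usize {
        resources.iter().filter(|r| !self.refcounts.contains_key(r)).count()
    }

    fn insert(&mut self, resources: &[ResourceId]) {
        for &r in resources {
            *self.refcounts.entry(r).or_insert(0) += 1;
        }
    }
}

/// First-fit: scan the slabs in order, use the first one with room, and
/// open a new slab otherwise. This is what makes allocation
/// O(resources-per-object x slab-count), as the PR description notes.
fn allocate(slabs: &mut Vec<Slab>, resources: &[ResourceId], slab_capacity: usize) -> usize {
    for (index, slab) in slabs.iter_mut().enumerate() {
        if slab.refcounts.len() + slab.new_slots_needed(resources) <= slab.capacity {
            slab.insert(resources);
            return index;
        }
    }
    let mut slab = Slab::new(slab_capacity);
    slab.insert(resources);
    slabs.push(slab);
    slabs.len() - 1
}
```

Freeing a material would decrement the reference counts and release a slot only when a count reaches zero, which is where the double-free caution raised in review comes from.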
I'm a bit out of my depth here technically. I can't comment on the correctness of the bindless changes; however, I did review the code and documentation to the best of my ability.
Documentation, code, and comments are all easy to read, though they require some background I am missing.
```rust
fn is_full(&self) -> bool {
    self.used_slot_bitmap == (1 << (self.unprepared_bind_groups.len() as u32)) - 1
```

```rust
// Release the slot ID.
self.free_slots.push(slot);
```
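The `is_full` test quoted above can be read in isolation: with `n` slots, `(1 << n) - 1` is the mask with the low `n` bits set, so the slab is full exactly when every slot bit is set. A standalone sketch (names illustrative, detached from `self` for clarity):

```rust
/// Bitmap-full check: true when all `slot_count` low bits are set.
/// Note this sketch assumes `slot_count < 32`, since `1u32 << 32`
/// would overflow.
fn is_full(used_slot_bitmap: u32, slot_count: u32) -> bool {
    used_slot_bitmap == (1u32 << slot_count) - 1
}
```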
If a bad descriptor is somehow passed in and no resources are found in the table, this code to push the slot and decrement the allocation count will still run. I think this is still correct, but I wanted to highlight it for review in case this is a real edge case.
The use of ref counting means that double frees are going to cause havoc in general, unfortunately. We just have to be careful.
```rust
    Auto,
    Limit(LitInt),
}

pub fn derive_as_bind_group(ast: syn::DeriveInput) -> Result<TokenStream> {
```
This macro is terrifying.
```rust
/// For example, suppose that the material slot is stored in a variable named
/// `slot`, the bindless index table is named `materials`, and that the first
/// field (index 0) of the bindless index table type is named `material`. Then
/// specifying `#[uniform(0, StandardMaterialUniform, binding_array(10))]` will
/// create a binding array buffer declared in the shader as `var<storage>
/// material: binding_array<StandardMaterialUniform>` and accessible as
/// `material[materials[slot].material]`.
```
Great to have an example. I'd recommend not reusing `material` for the binding name. So:
Suggested change:

```rust
/// For example, suppose that the material slot is stored in a variable named
/// `slot`, the bindless index table is named `material_indices`, and that the first
/// field (index 0) of the bindless index table type is named `material`. Then
/// specifying `#[uniform(0, StandardMaterialUniform, binding_array(10))]` will
/// create a binding array buffer declared in the shader as `var<storage>
/// material_array: binding_array<StandardMaterialUniform>` and accessible as
/// `material_array[material_indices[slot].material]`.
```
Or something like that.
I think this may need explicit code examples in the release notes, and/or a diagram showing how the pieces connect together. Not a criticism of what is done, which is necessary, just recognising that it's not the simplest thing to follow for people who just want to write a shader. :)
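To see how the pieces connect, the double indirection under discussion can be modeled in plain Rust (every name below is illustrative, mirroring the shader-side `material_indices` and global texture array rather than any real Bevy API): each material slot indexes a bindless index table entry, whose fields in turn index the global, de-duplicated resource arrays.

```rust
/// Illustrative CPU-side model of one bindless index table entry.
struct MaterialIndices {
    material: usize,           // index into the material uniform binding array
    base_color_texture: usize, // index into the global 2D texture array
}

/// Resolve one resource for one material: the shader equivalent is
/// `bindless_2d_textures[material_indices[slot].base_color_texture]`.
fn resolve<'a, T>(
    global_array: &'a [T],
    material_indices: &[MaterialIndices],
    slot: usize,
    field: fn(&MaterialIndices) -> usize,
) -> &'a T {
    &global_array[field(&material_indices[slot])]
}
```

The point of the indirection is that `global_array` is shared by every material, so two materials using the same texture store the same index rather than a duplicate resource.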
OK, I made these changes:

- `pbr_bindings::material` is now `pbr_bindings::material_array`.
- `pbr_bindings::material_bindings` is now `pbr_bindings::material_indices`.
- I added an ASCII art diagram to the `AsBindGroup` documentation to illustrate how this works.
```rust
/// This limit primarily exists in order to work around `wgpu` performance
/// problems involving large numbers of bindless resources. Also, some
/// platforms, such as Metal, currently enforce limits on the number of
/// resources in use.
```
On Metal, is this limit exposed through the `wgpu` `Limits`? If so, then maybe that one doesn't need to be hard-coded, and this can maintain itself by taking the minimum of the appropriate value from `Limits` and the `AUTO_BINDLESS_SLAB_RESOURCE_LIMIT`. Then we can later just remove `AUTO_BINDLESS_SLAB_RESOURCE_LIMIT` when it is no longer relevant because `wgpu` performance issues for bindless arrays have been addressed, and it will still be correct?
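The suggestion above amounts to something like this sketch. The function name, the device-limit parameter, and the constant's value are all hypothetical stand-ins, not actual `wgpu` or Bevy API:

```rust
// Illustrative value only; the PR's real default differs per platform.
const AUTO_BINDLESS_SLAB_RESOURCE_LIMIT: u32 = 2048;

/// Clamp the default slab size by whatever limit the device actually
/// reports, rather than hard-coding the Metal value. If the performance
/// cap were later removed, the device limit alone would still apply.
fn default_slab_resource_limit(device_max_resources_per_bind_group: u32) -> u32 {
    device_max_resources_per_bind_group.min(AUTO_BINDLESS_SLAB_RESOURCE_LIMIT)
}
```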
I added the customizable limit so that applications that are seeing performance problems from large binding arrays have a way to address the issues. The idea is that, if you're seeing problems, you lower the limit until the performance is acceptable. This seems inherently application-specific, so I don't think we can query it from wgpu.
Right, and that's a good choice and useful to have. But is Metal's value purely a performance constraint on Metal because these overheads are higher? Or is it because of a limit exposed by wgpu for the current way it is implemented in wgpu?
```rust
// Mark the buffer as needing to be recreated, in case we grew it.
```
Is it possible to detect whether the buffer grew (by saving the capacity before and comparing after) and only mark it if so?
Yes, but I'd like to do this in a follow-up, because the optimal thing to do is actually more complex than that: we can avoid recreating the bind group at all in many cases, which is needed to do well on the benchmark that Griffin added.
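For reference, the simple version of the optimization being discussed might look like this sketch (the struct and its fields are illustrative stand-ins, not the PR's actual buffer type):

```rust
/// Hypothetical growable CPU-side buffer that tracks whether the last
/// write forced a reallocation.
struct GrowableBuffer {
    data: Vec<u8>,
    bind_group_dirty: bool,
}

impl GrowableBuffer {
    fn write(&mut self, bytes: &[u8]) {
        // Save the capacity before writing, compare after: only a
        // reallocation would invalidate a (hypothetical) bind group
        // pointing at the underlying GPU buffer.
        let capacity_before = self.data.capacity();
        self.data.clear();
        self.data.extend_from_slice(bytes);
        if self.data.capacity() != capacity_before {
            self.bind_group_dirty = true;
        }
    }
}
```

As the reply notes, the fully optimal scheme is more involved, since many writes need no bind group recreation at all.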
Nice job! I'm a bit apprehensive about the amount of panics and expects; however, I think this is already very common in the rendering codebase. This is something that should be looked into in the future as a separate, more targeted effort, and it is not blocking for this PR.
@atlv24 Yeah, originally I had fewer panics and expects, but it really made the code messy to have all these extraneous
```rust
/// | 2  | base_color_sampler          +-------------------------------------------+
/// +----+-----------------------------+
/// | .. | ...                         |
/// +----+-----------------------------+
```
Oh wow, very nice!!!
```rust
if self.bindings.len() < slot as usize + 1 {
    self.bindings.resize_with(slot as usize + 1, || None);
}
self.bindings[slot as usize] = Some(MaterialBindlessBinding::new(resource));
```
Why not push?
I think it's slightly less code this way. Otherwise it'd be something like:

```rust
if self.bindings.len() < slot as usize + 1 {
    if self.bindings.len() < slot as usize {
        self.bindings.resize_with(slot as usize, || None);
    }
    self.bindings.push(Some(MaterialBindlessBinding::new(resource)));
} else {
    self.bindings[slot as usize] = Some(MaterialBindlessBinding::new(resource));
}
```
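The `resize_with`-then-index pattern from this thread is runnable on its own. Here is a self-contained version with the resource simplified to a `u32` (the helper name is mine, not the PR's):

```rust
/// Grow the sparse slot vector with `None` fillers as needed, then
/// write the slot directly. Handles both the append case and the
/// overwrite case without a separate `push` branch.
fn insert_at(bindings: &mut Vec<Option<u32>>, slot: u32, resource: u32) {
    if bindings.len() < slot as usize + 1 {
        bindings.resize_with(slot as usize + 1, || None);
    }
    bindings[slot as usize] = Some(resource);
}
```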
on the frame we first encountered it. We might not be able to prepare a material on the first frame we encounter a mesh using it, for various reasons: the material might not have loaded yet, or preparing the material might exceed the per-frame cap on the number of bytes to load. When this happens, we currently try to find the material in the `MaterialBindGroupAllocator`, fail, and then fall back to group 0, slot 0, the default `MaterialBindGroupId`, which is obviously incorrect. Worse, we then fail to dirty the mesh and re-extract it when we *do* finish preparing the material, so the mesh continues to be rendered with an incorrect material.

This patch fixes both problems. In `collect_meshes_for_gpu_building`, if we fail to find a mesh's material in the `MaterialBindGroupAllocator`, we detect that case, bail out, and add the mesh to a list, `MeshesToReextractNextFrame`. On subsequent frames, we process all the meshes in `MeshesToReextractNextFrame` as though they had changed. This ensures both that we don't render a mesh if its material hasn't been loaded and that we start rendering the mesh once its material does load.

This was first noticed in the intermittent Pixel Eagle failures in the `testbed_3d` patch in bevyengine#17898, although the problem has actually existed for some time. I believe the changes to the allocator in that PR just happened to make the problem appear more commonly than it did before.
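The bail-out-and-retry flow described above can be sketched as follows. `MeshesToReextractNextFrame` and `collect_meshes_for_gpu_building` follow the PR's naming, but the entity type, the prepared-material check, and the function signature are simplified stand-ins:

```rust
use std::collections::HashSet;

type Entity = u32; // illustrative stand-in for Bevy's entity ID

/// Meshes whose materials weren't ready this frame; retried next frame.
#[derive(Default)]
struct MeshesToReextractNextFrame(HashSet<Entity>);

/// Sketch: process this frame's changed meshes plus last frame's
/// deferred ones; a mesh whose material isn't prepared is deferred
/// instead of falling back to the default bind group.
fn collect_meshes_for_gpu_building(
    changed_meshes: &[Entity],
    material_is_prepared: impl Fn(Entity) -> bool,
    retry_queue: &mut MeshesToReextractNextFrame,
) -> Vec<Entity> {
    let candidates: Vec<Entity> = changed_meshes
        .iter()
        .copied()
        .chain(retry_queue.0.drain().collect::<Vec<_>>())
        .collect();

    let mut renderable = Vec::new();
    for mesh in candidates {
        if material_is_prepared(mesh) {
            renderable.push(mesh);
        } else {
            // Bail out and treat this mesh as changed next frame.
            retry_queue.0.insert(mesh);
        }
    }
    renderable
}
```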