Use global binding arrays for bindless resources. #17898
Conversation
Currently, the specialized pipeline cache maps a (view entity, mesh entity) tuple to the retained pipeline for that entity. This causes two problems:

1. Using the view entity is incorrect, because the view entity isn't stable from frame to frame.
2. Switching the view entity to a `RetainedViewEntity`, which is necessary for correctness, significantly regresses performance of `specialize_material_meshes` and `specialize_shadows` because of the loss of the fast `EntityHash`.

This patch fixes both problems by switching to a *two-level* hash table. The outer level of the table maps each `RetainedViewEntity` to an inner table, which maps each `MainEntity` to its pipeline ID and change tick. Because we loop over views first and, within that loop, loop over entities visible from that view, we hoist the slow lookup of the view entity out of the inner entity loop.

Additionally, this patch fixes a bug whereby pipeline IDs were leaked when removing the view. We still have a problem with leaking pipeline IDs for deleted entities, but that won't be fixed until the specialized pipeline cache is retained.

This patch improves performance of the [Caldera benchmark] from 7.8× faster than 0.14 to 9.0× faster than 0.14, when applied on top of the global binding arrays PR, bevyengine#17898.

[Caldera benchmark]: https://github.com/DGriffin91/bevy_caldera_scene
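The two-level cache structure described above can be sketched in plain Rust. This is a minimal illustration, not Bevy's actual API: the type aliases and method names below are hypothetical stand-ins for `RetainedViewEntity`, `MainEntity`, and the cached pipeline ID plus change tick.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for Bevy's types, for illustration only.
type RetainedViewEntity = u64; // stable across frames, but slower to hash
type MainEntity = u32;         // fast EntityHash-style key
type PipelineId = usize;
type Tick = u32;

/// Sketch of the two-level specialized pipeline cache: the outer map is
/// keyed by the stable retained view, the inner map by the mesh entity.
#[derive(Default)]
struct SpecializedPipelineCache {
    views: HashMap<RetainedViewEntity, HashMap<MainEntity, (PipelineId, Tick)>>,
}

impl SpecializedPipelineCache {
    fn specialize(
        &mut self,
        view: RetainedViewEntity,
        entities: &[(MainEntity, PipelineId, Tick)],
    ) {
        // Hoist the slower view lookup out of the per-entity loop.
        let inner = self.views.entry(view).or_default();
        for &(entity, pipeline, tick) in entities {
            inner.insert(entity, (pipeline, tick));
        }
    }

    /// Removing a view must also surrender the pipeline IDs it held;
    /// otherwise they leak (the bug the patch fixes).
    fn remove_view(&mut self, view: RetainedViewEntity) -> Vec<PipelineId> {
        self.views
            .remove(&view)
            .map(|inner| inner.into_values().map(|(id, _)| id).collect())
            .unwrap_or_default()
    }
}
```

The key point is that the expensive `RetainedViewEntity` hash happens once per view, while the hot inner loop only pays for the fast `MainEntity` hash.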
Currently, Bevy's implementation of bindless resources is rather unusual: every binding in an object that implements `AsBindGroup` (most commonly, a material) becomes its own separate binding array in the shader. This is inefficient for two reasons:

1. If multiple materials reference the same texture or other resource, the reference to that resource will be duplicated many times. This increases `wgpu` validation overhead.
2. It creates many unused binding array slots. This increases `wgpu` and driver overhead and makes it easier to hit limits on APIs that `wgpu` currently imposes tight resource limits on, like Metal.

This PR fixes these issues by switching Bevy to the standard approach in GPU-driven renderers, in which resources are de-duplicated and passed as global arrays, one for each type of resource.

Along the way, this patch introduces per-platform resource limits and bumps them from 16 resources per binding array to 64 resources per bind group on Metal and 2048 resources per bind group on other platforms. (Note that the number of resources per *binding array* isn't the same as the number of resources per *bind group*; as it currently stands, if all the PBR features are turned on, Bevy could pack as many as 496 resources into a single slab.) The limits have been increased because `wgpu` now has universal support for partially-bound binding arrays, which means that we no longer need to fill the binding arrays with fallback resources on Direct3D 12. The `#[bindless(LIMIT)]` declaration when deriving `AsBindGroup` can now simply be written `#[bindless]` in order to have Bevy choose a default limit size for the current platform. Custom limits are still available with the new `#[bindless(limit(LIMIT))]` syntax: e.g. `#[bindless(limit(8))]`.

The material bind group allocator has been completely rewritten. Now there are two allocators: one for bindless materials and one for non-bindless materials. The new non-bindless material allocator simply maintains a 1:1 mapping from material to bind group. The new bindless material allocator maintains a list of slabs and allocates materials into slabs on a first-fit basis. This unfortunately makes its performance O(number of resources per object × number of slabs), but the number of slabs is likely to be low, and it's planned to become even lower in the future with `wgpu` improvements. Resources are de-duplicated within a slab and reference counted. So, for instance, if multiple materials refer to the same texture, that texture will exist only once in the appropriate binding array.

To support these new features, this patch adds the concept of a *bindless descriptor* to the `AsBindGroup` trait. The bindless descriptor allows the material bind group allocator to probe the layout of the material, now that an array of `BindGroupLayoutEntry` records is insufficient to describe the group. The `#[derive(AsBindGroup)]` macro has been heavily modified to support the new features. The most important user-facing change to that macro is that the struct-level `uniform` attribute, `#[uniform(BINDING_NUMBER, StandardMaterial)]`, now reads `#[uniform(BINDLESS_INDEX, MATERIAL_UNIFORM_TYPE, binding_array(BINDING_NUMBER))]`, allowing the material to specify the binding number for the binding array that holds the uniform data.

To make this patch simpler, I removed support for bindless `ExtendedMaterial`s, as well as field-level bindless uniform and storage buffers. I intend to add back support for these as a follow-up. Because they aren't in any released Bevy version yet, I figured this was OK.

Finally, this patch updates `StandardMaterial` for the new bindless changes. Generally, code throughout the PBR shaders that looked like `base_color_texture[slot]` now looks like `bindless_2d_textures[material_indices[slot].base_color_texture]`.

This patch fixes a system hang that I experienced on the [Caldera test] when running with `caldera --random-materials --texture-count 100`. The time per frame is around 19.75 ms, down from 154.2 ms in Bevy 0.14: a 7.8× speedup.
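The first-fit slab allocation with in-slab de-duplication described above can be sketched in plain Rust. This is a simplified model under assumed types (a "resource" is just an ID, and slab capacity is a plain count), not Bevy's actual allocator:

```rust
use std::collections::HashMap;

// Hypothetical simplification: each material is just a list of resource IDs.
type ResourceId = u64;

struct Slab {
    capacity: usize,
    // resource -> reference count; duplicate references share one slot.
    refcounts: HashMap<ResourceId, usize>,
}

impl Slab {
    fn new(capacity: usize) -> Self {
        Slab { capacity, refcounts: HashMap::new() }
    }

    /// Number of *new* slots this material would need; resources already
    /// present in the slab are de-duplicated and cost nothing.
    fn new_slots_needed(&self, resources: &[ResourceId]) -> usize {
        resources.iter().filter(|r| !self.refcounts.contains_key(r)).count()
    }

    fn insert(&mut self, resources: &[ResourceId]) {
        for &r in resources {
            *self.refcounts.entry(r).or_insert(0) += 1;
        }
    }
}

/// First-fit: scan the slabs in order, use the first one with room, and
/// open a new slab otherwise. This is what makes allocation
/// O(resources-per-object x slab-count), as the PR description notes.
fn allocate(slabs: &mut Vec<Slab>, resources: &[ResourceId], slab_capacity: usize) -> usize {
    for (index, slab) in slabs.iter_mut().enumerate() {
        if slab.refcounts.len() + slab.new_slots_needed(resources) <= slab.capacity {
            slab.insert(resources);
            return index;
        }
    }
    let mut slab = Slab::new(slab_capacity);
    slab.insert(resources);
    slabs.push(slab);
    slabs.len() - 1
}
```

Freeing a material would decrement the reference counts and release a slot only when a count reaches zero, which is where the double-free caution raised in review comes from.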
I'm a bit out of my depth here technically. I can't comment on the correctness of the bindless changes; however, I did review the code and documentation to the best of my ability.
Documentation, code, and comments are all easy to read, though they require some background I am missing.
```rust
fn is_full(&self) -> bool {
    self.used_slot_bitmap == (1 << (self.unprepared_bind_groups.len() as u32)) - 1
```

```rust
// Release the slot ID.
self.free_slots.push(slot);
```
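The `is_full` test quoted above can be read in isolation: with `n` slots, `(1 << n) - 1` is the mask with the low `n` bits set, so the slab is full exactly when every slot bit is set. A standalone sketch (names illustrative, detached from `self` for clarity):

```rust
/// Bitmap-full check: true when all `slot_count` low bits are set.
/// Note this sketch assumes `slot_count < 32`, since `1u32 << 32`
/// would overflow.
fn is_full(used_slot_bitmap: u32, slot_count: u32) -> bool {
    used_slot_bitmap == (1u32 << slot_count) - 1
}
```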
If a bad descriptor is somehow passed in and no resources are found in the table, this code to push the slot and decrement the allocation count will still run. I think this is still correct, but I wanted to highlight it for review in case this is a real edge case.
The use of ref counting means that double frees are going to cause havoc in general, unfortunately. We just have to be careful.
```rust
    Auto,
    Limit(LitInt),
}

pub fn derive_as_bind_group(ast: syn::DeriveInput) -> Result<TokenStream> {
```
This macro is terrifying.
```rust
/// For example, suppose that the material slot is stored in a variable named
/// `slot`, the bindless index table is named `materials`, and that the first
/// field (index 0) of the bindless index table type is named `material`. Then
/// specifying `#[uniform(0, StandardMaterialUniform, binding_array(10))]` will
/// create a binding array buffer declared in the shader as `var<storage>
/// material: binding_array<StandardMaterialUniform>` and accessible as
/// `material[materials[slot].material]`.
```
Great to have an example. I'd recommend not reusing `material` for the binding name. So:
Suggested change:

```rust
/// For example, suppose that the material slot is stored in a variable named
/// `slot`, the bindless index table is named `material_indices`, and that the first
/// field (index 0) of the bindless index table type is named `material`. Then
/// specifying `#[uniform(0, StandardMaterialUniform, binding_array(10))]` will
/// create a binding array buffer declared in the shader as `var<storage>
/// material_array: binding_array<StandardMaterialUniform>` and accessible as
/// `material_array[material_indices[slot].material]`.
```
Or something like that.
I think this may need explicit code examples in the release notes, and/or a diagram showing how the pieces connect together. Not a criticism of what is done, which is necessary, just recognising that it's not the simplest thing to follow for people who just want to write a shader. :)
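To see how the pieces connect, the double indirection under discussion can be modeled in plain Rust (every name below is illustrative, mirroring the shader-side `material_indices` and global texture array rather than any real Bevy API): each material slot indexes a bindless index table entry, whose fields in turn index the global, de-duplicated resource arrays.

```rust
/// Illustrative CPU-side model of one bindless index table entry.
struct MaterialIndices {
    material: usize,           // index into the material uniform binding array
    base_color_texture: usize, // index into the global 2D texture array
}

/// Resolve one resource for one material: the shader equivalent is
/// `bindless_2d_textures[material_indices[slot].base_color_texture]`.
fn resolve<'a, T>(
    global_array: &'a [T],
    material_indices: &[MaterialIndices],
    slot: usize,
    field: fn(&MaterialIndices) -> usize,
) -> &'a T {
    &global_array[field(&material_indices[slot])]
}
```

The point of the indirection is that `global_array` is shared by every material, so two materials using the same texture store the same index rather than a duplicate resource.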
OK, I made these changes:

- `pbr_bindings::material` is now `pbr_bindings::material_array`.
- `pbr_bindings::material_bindings` is now `pbr_bindings::material_indices`.
- I added an ASCII art diagram to the `AsBindGroup` documentation to illustrate how this works.
```rust
/// This limit primarily exists in order to work around `wgpu` performance
/// problems involving large numbers of bindless resources. Also, some
/// platforms, such as Metal, currently enforce limits on the number of
/// resources in use.
```
On Metal, is this limit exposed through the `wgpu` `Limits`? If so, then maybe that one doesn't need to be hard-coded, and this can maintain itself by taking the minimum of the appropriate value from `Limits` and the `AUTO_BINDLESS_SLAB_RESOURCE_LIMIT`. Then we can later just remove `AUTO_BINDLESS_SLAB_RESOURCE_LIMIT` when it is no longer relevant because `wgpu` performance issues for bindless arrays have been addressed, and it will still be correct?
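The suggestion above amounts to something like this sketch. The function name, the device-limit parameter, and the constant's value are all hypothetical stand-ins, not actual `wgpu` or Bevy API:

```rust
// Illustrative value only; the PR's real default differs per platform.
const AUTO_BINDLESS_SLAB_RESOURCE_LIMIT: u32 = 2048;

/// Clamp the default slab size by whatever limit the device actually
/// reports, rather than hard-coding the Metal value. If the performance
/// cap were later removed, the device limit alone would still apply.
fn default_slab_resource_limit(device_max_resources_per_bind_group: u32) -> u32 {
    device_max_resources_per_bind_group.min(AUTO_BINDLESS_SLAB_RESOURCE_LIMIT)
}
```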
I added the customizable limit so that applications that are seeing performance problems from large binding arrays have a way to address the issues. The idea is that, if you're seeing problems, you lower the limit until the performance is acceptable. This seems inherently application-specific, so I don't think we can query it from wgpu.
Right, and that's a good choice and useful to have. But is Metal's value purely a performance constraint on Metal because these overheads are higher? Or is it because of a limit exposed by wgpu for the current way it is implemented in wgpu?
```rust
// Mark the buffer as needing to be recreated, in case we grew it.
```
Is it possible to detect whether the buffer grew (by saving the capacity before and comparing after) and only mark it if so?
Yes, but I'd like to do this in a follow-up, because the optimal thing to do is actually more complex than that: we can avoid recreating the bind group at all in many cases, which is needed to do well on the benchmark that Griffin added.
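For reference, the simple version of the optimization being discussed might look like this sketch (the struct and its fields are illustrative stand-ins, not the PR's actual buffer type):

```rust
/// Hypothetical growable CPU-side buffer that tracks whether the last
/// write forced a reallocation.
struct GrowableBuffer {
    data: Vec<u8>,
    bind_group_dirty: bool,
}

impl GrowableBuffer {
    fn write(&mut self, bytes: &[u8]) {
        // Save the capacity before writing, compare after: only a
        // reallocation would invalidate a (hypothetical) bind group
        // pointing at the underlying GPU buffer.
        let capacity_before = self.data.capacity();
        self.data.clear();
        self.data.extend_from_slice(bytes);
        if self.data.capacity() != capacity_before {
            self.bind_group_dirty = true;
        }
    }
}
```

As the reply notes, the fully optimal scheme is more involved, since many writes need no bind group recreation at all.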
Nice job! I'm a bit apprehensive about the amount of panics and expects; however, I think this is already very common in the rendering codebase. This is something that should be looked into in the future as a separate, more targeted effort, and it is not blocking for this PR.
@atlv24 Yeah, originally I had fewer panics and expects, but it really made the code messy to have all these extraneous
```rust
/// | 2  | base_color_sampler          +-------------------------------------------+
/// +----+-----------------------------+
/// | .. | ...                         |
/// +----+-----------------------------+
```
Oh wow, very nice!!!
```rust
if self.bindings.len() < slot as usize + 1 {
    self.bindings.resize_with(slot as usize + 1, || None);
}
self.bindings[slot as usize] = Some(MaterialBindlessBinding::new(resource));
```
Why not push?
I think it's slightly less code this way. Otherwise it'd be something like:

```rust
if self.bindings.len() < slot as usize + 1 {
    if self.bindings.len() < slot as usize {
        self.bindings.resize_with(slot as usize, || None);
    }
    self.bindings.push(Some(MaterialBindlessBinding::new(resource)));
} else {
    self.bindings[slot as usize] = Some(MaterialBindlessBinding::new(resource));
}
```
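The `resize_with`-then-index pattern from this thread is runnable on its own. Here is a self-contained version with the resource simplified to a `u32` (the helper name is mine, not the PR's):

```rust
/// Grow the sparse slot vector with `None` fillers as needed, then
/// write the slot directly. Handles both the append case and the
/// overwrite case without a separate `push` branch.
fn insert_at(bindings: &mut Vec<Option<u32>>, slot: u32, resource: u32) {
    if bindings.len() < slot as usize + 1 {
        bindings.resize_with(slot as usize + 1, || None);
    }
    bindings[slot as usize] = Some(resource);
}
```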
on the frame we first encountered it. We might not be able to prepare a material on the first frame we encounter a mesh using it, for various reasons: the material might not have loaded yet, or preparing the material might exceed the per-frame cap on the number of bytes to load. When this happens, we currently try to find the material in the `MaterialBindGroupAllocator`, fail, and then fall back to group 0, slot 0, the default `MaterialBindGroupId`, which is obviously incorrect. Worse, we then fail to dirty the mesh and re-extract it when we *do* finish preparing the material, so the mesh continues to be rendered with an incorrect material.

This patch fixes both problems. In `collect_meshes_for_gpu_building`, if we fail to find a mesh's material in the `MaterialBindGroupAllocator`, we detect that case, bail out, and add the mesh to a list, `MeshesToReextractNextFrame`. On subsequent frames, we process all the meshes in `MeshesToReextractNextFrame` as though they had changed. This ensures both that we don't render a mesh if its material hasn't been loaded and that we start rendering the mesh once its material does load.

This was first noticed in the intermittent Pixel Eagle failures in the `testbed_3d` patch in bevyengine#17898, although the problem has actually existed for some time. I believe the changes to the allocator in that PR just happened to make the problem appear more commonly than it did before.
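The bail-out-and-retry flow described above can be sketched as follows. `MeshesToReextractNextFrame` and `collect_meshes_for_gpu_building` follow the PR's naming, but the entity type, the prepared-material check, and the function signature are simplified stand-ins:

```rust
use std::collections::HashSet;

type Entity = u32; // illustrative stand-in for Bevy's entity ID

/// Meshes whose materials weren't ready this frame; retried next frame.
#[derive(Default)]
struct MeshesToReextractNextFrame(HashSet<Entity>);

/// Sketch: process this frame's changed meshes plus last frame's
/// deferred ones; a mesh whose material isn't prepared is deferred
/// instead of falling back to the default bind group.
fn collect_meshes_for_gpu_building(
    changed_meshes: &[Entity],
    material_is_prepared: impl Fn(Entity) -> bool,
    retry_queue: &mut MeshesToReextractNextFrame,
) -> Vec<Entity> {
    let candidates: Vec<Entity> = changed_meshes
        .iter()
        .copied()
        .chain(retry_queue.0.drain().collect::<Vec<_>>())
        .collect();

    let mut renderable = Vec::new();
    for mesh in candidates {
        if material_is_prepared(mesh) {
            renderable.push(mesh);
        } else {
            // Bail out and treat this mesh as changed next frame.
            retry_queue.0.insert(mesh);
        }
    }
    renderable
}
```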