Retain skins from frame to frame. #17818

pcwalton · 2025-02-12T06:02:35Z

Currently, Bevy rebuilds the buffer containing all the transforms for joints every frame, during the extraction phase. This is inefficient in cases in which many skins are present in the scene and their joints don't move, such as the Caldera test scene.

To address this problem, this commit switches skin extraction to use a set of retained GPU buffers with allocations managed by the offset allocator. I use fine-grained change detection in order to determine which skins need updating. Note that the granularity is on the level of an entire skin, not individual joints. Using the change detection at that level would yield poor performance in common cases in which an entire skin is animated at once. Also, this patch yields additional performance from the fact that changing joint transforms no longer requires the skinned mesh to be re-extracted.

Note that this optimization can be a double-edged sword. In many_foxes, fine-grained change detection regressed the performance of extract_skins by 3.4x. This is because every joint is updated every frame in that example, so change detection is pointless and is pure overhead. Because the many_foxes workload is actually representative of animated scenes, this patch includes a heuristic that disables fine-grained change detection if the number of transformed entities in the frame exceeds a certain fraction of the total number of joints. Currently, this threshold is set to 25%. Note that this is a crude heuristic, because it doesn't distinguish between the number of transformed joints and the number of transformed entities; however, it should be good enough to yield the optimum code path most of the time.

Finally, this patch fixes a bug whereby skinned meshes are actually being incorrectly retained if the buffer offsets of the joints of those skinned meshes changes from frame to frame. To fix this without retaining skins, we would have to re-extract every skinned mesh every frame. Doing this was a significant regression on Caldera. With this PR, by contrast, mesh joints stay at the same buffer offset, so we don't have to update the MeshInputUniform containing the buffer offset every frame. This also makes PR #17717 easier to implement, because that PR uses the buffer offset from the previous frame, and the logic for calculating that is simplified if the previous frame's buffer offset is guaranteed to be identical to that of the current frame.

On Caldera, this patch reduces the time spent in extract_skins from 1.79 ms to near zero. On many_foxes, this patch regresses the performance of extract_skins by approximately 10%-25%, depending on the number of foxes. This has only a small impact on frame rate.

Currently, Bevy rebuilds the buffer containing all the transforms for joints every frame, during the extraction phase. This is inefficient in cases in which many skins are present in the scene and their joints don't move, such as the Caldera test scene. To address this problem, this commit switches skin extraction to use a set of retained GPU buffers with allocations managed by the offset allocator. I use fine-grained change detection in order to determine which skins need updating. Note that the granularity is on the level of an entire skin, not individual joints. Using the change detection at that level would yield poor performance in common cases in which an entire skin is animated at once. Also, this patch yields additional performance from the fact that changing joint transforms no longer requires the skinned mesh to be re-extracted. Note that this optimization can be a double-edged sword. In `many_foxes`, fine-grained change detection regressed the performance of `extract_skins` by 3.4x. This is because every joint is updated every frame in that example, so change detection is pointless and is pure overhead. Because the `many_foxes` workload is actually representative of animated scenes, this patch includes a heuristic that disables fine-grained change detection if the number of transformed entities in the frame exceeds a certain fraction of the total number of joints. Currently, this threshold is set to 25%. Note that this is a crude heuristic, because it doesn't distinguish between the number of transformed *joints* and the number of transformed *entities*; however, it should be good enough to yield the optimum code path most of the time. Finally, this patch fixes a bug whereby skinned meshes are actually being incorrectly retained if the buffer offsets of the joints of those skinned meshes changes from frame to frame. To fix this without retaining skins, we would have to re-extract every skinned mesh every frame. Doing this was a significant regression on Caldera. With this PR, by contrast, mesh joints stay at the same buffer offset, so we don't have to update the `MeshInputUniform` containing the buffer offset every frame. This also makes PR bevyengine#17717 easier to implement, because that PR uses the buffer offset from the previous frame, and the logic for calculating that is simplified if the previous frame's buffer offset is guaranteed to be identical to that of the current frame. On Caldera, this patch reduces the time spent in `extract_skins` from 1.79 ms to near zero. On `many_foxes`, this patch regresses the performance of `extract_skins` by approximately 10%-25%, depending on the number of foxes. This has only a small impact on frame rate.

greeble-dev · 2025-02-12T16:30:03Z

I hit a panic if I force skins_use_uniform_buffers to return true. I'm not sure if it's actually valid to force that on, but flagging it up just in case.

Repro: cargo run --example custom_skinned_mesh. Does not occur in many_foxes or animated_mesh examples. Does not occur if I roll back to before the PR (153ce46).

Caused by:
  In RenderPass::end
    In a set_bind_group command
      Dynamic binding offset index 0 with offset 2048 would overrun the buffer bound to BindGroup with 'skinned_mesh_bind_group' label 1 -> binding 1. Buffer size is 16384 bytes, the binding binds bytes 0..16384, meaning the maximum the binding can be offset is 0 bytes

SystemInfo { os: "Windows 10 Pro", kernel: "19045" }
AdapterInfo { name: "NVIDIA GeForce GT 1030", vendor: 4318, device: 7425, device_type: DiscreteGpu, driver: "NVIDIA", driver_info: "560.94", backend: Vulkan }

crates/bevy_pbr/src/render/skin.rs

greeble-dev · 2025-02-12T15:31:35Z

crates/bevy_pbr/src/render/skin.rs

+///
+/// This is 256 GiB worth of joint matrices, which we will never hit under any
+/// reasonable circumstances.
+const MAX_TOTAL_JOINTS: u32 = 1024 * 1024 * 1024;


Took me a little while to work how this value was derived. Maybe worth doing it explicitly or as a comment?

Suggested change

const MAX_TOTAL_JOINTS: u32 = 1024 * 1024 * 1024;

const MAX_TOTAL_JOINTS: u32 = (2u64.pow(u32::BITS) / (JOINTS_PER_ALLOCATION_UNIT as u64)) as u32;

You're reading more into it than I intended :) It's actually a completely arbitrary number, set to be "higher than you'll ever need but not so high as to be absurd", so as to catch bugs before OOM happens.

Hah, that makes sense. I also misunderstood what the allocator was doing.

I think it will panic a while before this due to byte_offset: u32? But doesn't seem like a pressing issue.

pcwalton · 2025-02-12T20:44:46Z

@greeble-dev Fixed that issue. I needed to leave space at the end of the buffer.

tychedelia

Lgtm other than the note about using the new asset filter.

tychedelia · 2025-02-13T07:46:29Z

crates/bevy_pbr/src/render/skin.rs

+/// fine-grained detection to determine which skins need extraction. If the
+/// number of changed entities is over this threshold, we skip change detection
+/// and simply re-extract the transforms of all joints.
+const JOINT_EXTRACTION_THRESHOLD_FACTOR: f64 = 0.25;


This heuristic intuitively makes sense to me, but I wonder if it should be exposed as configuration. Maybe something for the future with more testing to understand the balance here. I also hope that our change detection (particularly w.r.t. assets) will become cheaper in the future.

Do you mean plugin configuration? I'm a bit wary of having knobs that we have to maintain unless we know we need them.

No, I agree we're fine now, let's just revisit this heuristic if, e.g., asset change detection improves in the future.

crates/bevy_pbr/src/render/skin.rs

greeble-dev

Clicking the approve button with some minor suggestions, but take with salt as I don't have much renderer experience.

In case it's relevant, I'm planning to look at #16929 once things have calmed down, and that will involve adding a test for skinning corner cases (missing joints, adding/removing meshes). So that might shake out any subtle issues with this PR.

EDIT: Forgot to mention that I tested many_foxes and custom_skinned_mesh, also with uniform buffers forced on. Win10/Nvidia/Vulkan.

crates/bevy_pbr/src/render/skin.rs

crates/bevy_pbr/src/render/mesh.rs

pcwalton mentioned this pull request Feb 12, 2025

Fix motion vector computation after #17688. #17717

Open

pcwalton requested review from IceSentry, tychedelia and superdump February 12, 2025 06:03

alice-i-cecile added this to the 0.16 milestone Feb 12, 2025

pcwalton added 2 commits February 11, 2025 22:28

Fix meshlets

3b9c765

Fix ambiguity

d58aae4

greeble-dev reviewed Feb 12, 2025

View reviewed changes

pcwalton added 3 commits February 12, 2025 11:35

Make SkinByteOffset not derive Component

f043ffb

Leave enough space at the end of the buffer if we use a uniform buffer.

c048c79

Merge remote-tracking branch 'origin/main' into retain-skins

c55ce3f

pcwalton requested a review from greeble-dev February 12, 2025 20:44

tychedelia approved these changes Feb 13, 2025

View reviewed changes

greeble-dev approved these changes Feb 13, 2025

View reviewed changes

crates/bevy_pbr/src/render/skin.rs Outdated Show resolved Hide resolved

crates/bevy_pbr/src/render/mesh.rs Outdated Show resolved Hide resolved

crates/bevy_pbr/src/render/mesh.rs Outdated Show resolved Hide resolved

pcwalton added 3 commits February 13, 2025 10:40

Merge remote-tracking branch 'origin/main' into retain-skins

1c8512d

Use AssetChanged

2840876

Address review comments

61202d1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retain skins from frame to frame. #17818

Retain skins from frame to frame. #17818

pcwalton commented Feb 12, 2025

greeble-dev commented Feb 12, 2025

greeble-dev Feb 12, 2025

pcwalton Feb 12, 2025

greeble-dev Feb 13, 2025 •

edited

Loading

pcwalton commented Feb 12, 2025

tychedelia left a comment

tychedelia Feb 13, 2025

pcwalton Feb 13, 2025 •

edited

Loading

tychedelia Feb 13, 2025

greeble-dev left a comment •

edited

Loading

	const MAX_TOTAL_JOINTS: u32 = 1024 * 1024 * 1024;
	const MAX_TOTAL_JOINTS: u32 = (2u64.pow(u32::BITS) / (JOINTS_PER_ALLOCATION_UNIT as u64)) as u32;

Retain skins from frame to frame. #17818

Are you sure you want to change the base?

Retain skins from frame to frame. #17818

Conversation

pcwalton commented Feb 12, 2025

greeble-dev commented Feb 12, 2025

greeble-dev Feb 12, 2025

Choose a reason for hiding this comment

pcwalton Feb 12, 2025

Choose a reason for hiding this comment

greeble-dev Feb 13, 2025 • edited Loading

Choose a reason for hiding this comment

pcwalton commented Feb 12, 2025

tychedelia left a comment

Choose a reason for hiding this comment

tychedelia Feb 13, 2025

Choose a reason for hiding this comment

pcwalton Feb 13, 2025 • edited Loading

Choose a reason for hiding this comment

tychedelia Feb 13, 2025

Choose a reason for hiding this comment

greeble-dev left a comment • edited Loading

Choose a reason for hiding this comment

greeble-dev Feb 13, 2025 •

edited

Loading

pcwalton Feb 13, 2025 •

edited

Loading

greeble-dev left a comment •

edited

Loading