Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sharded counter optimisation #109

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

JackThomson2
Copy link
Collaborator

@JackThomson2 JackThomson2 commented Aug 14, 2022

This is picking up from the great work from @jimvdl in this PR: #101

I've added the crate here: https://github.com/JackThomson2/fast-counter which is using a basic sharded concurrent counter which is identified using a thread_local. It seems at worse case it's slightly slower but with more cores it starts to pull ahead.

These are the results using the conc-bench repo on my 16 core machine:

315967174_6042340619130631_2502339656878077868_n


This change is Reviewable

@codecov
Copy link

codecov bot commented Aug 20, 2022

Codecov Report

Merging #109 (c92dbf2) into master (85ac469) will decrease coverage by 6.39%.
The diff coverage is 90.00%.

Additional details and impacted files
Impacted Files Coverage Δ
src/map.rs 75.72% <88.88%> (-4.86%) ⬇️
src/node.rs 73.84% <100.00%> (-4.69%) ⬇️
src/serde_impls.rs 83.33% <0.00%> (-7.98%) ⬇️
src/rayon_impls.rs 91.66% <0.00%> (-7.00%) ⬇️
src/iter/traverser.rs 82.89% <0.00%> (-6.95%) ⬇️
src/iter/mod.rs 100.00% <0.00%> (ø)
src/lib.rs 16.12% <0.00%> (ø)
src/raw/mod.rs 80.00% <0.00%> (+0.27%) ⬆️

Copy link
Owner

@jonhoo jonhoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for pushing this forward!

I took a look at the code for fast-counter, and had a couple of thoughts:

  • I think THREAD_ID can be a Cell<usize> instead of an UnsafeCell. Should let you get rid of the unsafe block you have in there.
  • On nightly you can use ThreadId::as_u64 rather than keeping THREAD_ID yourself.
  • It's not entirely clear to me why you have stable and nightly separated? The use of #[thread_local] doesn't feel like it's very important, and in terms of performance the difference seems fairly minor.

src/map.rs Outdated Show resolved Hide resolved
@JackThomson2
Copy link
Collaborator Author

Nice, thanks for pushing this forward!

I took a look at the code for fast-counter, and had a couple of thoughts:

  • I think THREAD_ID can be a Cell<usize> instead of an UnsafeCell. Should let you get rid of the unsafe block you have in there.
  • On nightly you can use ThreadId::as_u64 rather than keeping THREAD_ID yourself.
  • It's not entirely clear to me why you have stable and nightly separated? The use of #[thread_local] doesn't feel like it's very important, and in terms of performance the difference seems fairly minor.

Thanks for the feedback, all great points, I'll try address these soon.

The reason for the night / stable just happened as an effect of me starting with the nightly first, and originally the performance difference was greater. I left it was an isize as the previous PR used this approach, happy to update it to a usize 👍

@JackThomson2
Copy link
Collaborator Author

Hey @jonhoo been looking into this and a few thoughts and findings:

  • I tested using the ThreadID::as_u64 this proved to be significantly slower, a quick look in godbolt it seems we have to deal with Arcs etc. The result was 2-3x slower consistently
  • I tried using the Cell over Unsafe Cell. This again made it slower, I'm not sure if the compiler can't optimise as well or what. But from my benchmark it showed the counter being slower than a regular AtomicUsize on every core count.
  • I will look at removing the nightly option it does look like the results are very close now so probably not needed.
  • I looked at holding a usize, the issue I found was that it is possible for a different cell to be decremented i.e. going into negatives. Would you want us to return a usize and take a usize but use the isize internally.

Thanks again for reviewing this

@jonhoo
Copy link
Owner

jonhoo commented Oct 8, 2022

Sorry for the super slow reply!

I tested using the ThreadID::as_u64 this proved to be significantly slower, a quick look in godbolt it seems we have to deal with Arcs etc. The result was 2-3x slower consistently

Looking at the std source, it seems like the Arc is in the Thread type, whereas the ThreadId type is just a plain integer. I'm guessing this ends up slow because you're either calling thread::current() on every access (which is slow) or keeping the Thread around, which means going through the Arc each time to get to the numeric ID. I think the way to fix this is to store the ThreadId in a thread-local. That way you should have super cheap access to it (ThreadId::as_u64 is just a field access) without having to keep an ID counter yourself.

I tried using the Cell over Unsafe Cell. This again made it slower, I'm not sure if the compiler can't optimise as well or what. But from my benchmark it showed the counter being slower than a regular AtomicUsize on every core count.

That's super weird! The code for Cell is really just an UnsafeCell with #[repr(transparent)], and Cell::set just does mem::replace(unsafe { self.0.get() }, v), which should cost exactly as much as just setting the value, and certainly not as much as running atomic operations. Could you give the raw numbers you observed to see if we can figure out what's going on?

I will look at removing the nightly option it does look like the results are very close now so probably not needed.

Nice!

I looked at holding a usize, the issue I found was that it is possible for a different cell to be decremented i.e. going into negatives. Would you want us to return a usize and take a usize but use the isize internally.

Ah, yes, that's a great point, and a totally valid reason for using an isize for the sharded counter. I think I'd still prefer for the API to return a usize because the sum should never be negative, which I think we can simply assert with a debug_assert! (so it doesn't affect release build performance). It's fine to make add take an isize though I think. It's a little weird, but not the end of the world.

Thanks again for reviewing this

Thanks for digging so deep!

@jonhoo
Copy link
Owner

jonhoo commented Oct 8, 2022

Actually, on second thought, in a concurrent setting with lazily updated counters, it's entirely possible that the sum of the counters ends up being negative for some shorter period of time. So I think we should actually leave the return value as an isize 👍

Comment on lines +1158 to +1161
let _saw_bin_length = match resize_hint {
Some(hint) => hint,
None => return
};
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd probably make this an if let Some

@jonhoo
Copy link
Owner

jonhoo commented Dec 4, 2022

Looks like there are some CI failures now?

@jimvdl
Copy link
Contributor

jimvdl commented Dec 4, 2022

I've been following along somewhat closely and happy to see that you guys made good progress. 🎉 If either of you ever need assistance (with anything really) you can always @ me :) (although I'm fairly certain you both got this in the bag). Well done Jack & Jon!

@JackThomson2
Copy link
Collaborator Author

Looks like there are some CI failures now?

Saw these, I'll take a look at resolving them when I get a chance! Thanks for running the workflow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants