Fix checksumming in the presence of large, mostly-zero buffers. #3355
base: master
Conversation
Tools like asan allocate large regions of memory (multi-TB) that are generally sparsely populated via demand paging. Currently, when checksumming a process that has such regions, we attempt to read the entire multi-TB region into a std::vector. This obviously does not work. Switch our checksum method to crc32c for performance, and further improve performance by:

1. Reading the pagemap to detect pages that were allocated but are currently still zero, waiting to be demand-paged in. These can be fast-forwarded using a precomputed crc operator.
2. Capping the amount of memory to be read at once at a reasonable max buffer size to avoid thrashing.

The crc32c implementation here is copied from Julia, which itself is cobbled together from various places around the web. It's not the prettiest, but keeping it aligned with the Julia version will make it easier to port any future hardware acceleration improvements over, if necessary.
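A rough sketch of how these two ideas fit together (this is not the code in this PR: zlib's crc32()/crc32_combine() stand in for the crc32c operator described above, and process_vm_readv() stands in for rr's own tracee-memory access):

```cpp
#include <zlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <cstdint>
#include <cstdio>
#include <vector>

static const size_t kPageSize = 4096;

// CRC of one all-zero page, computed once and reused to fast-forward the
// running checksum over pages we never read.
static uLong zero_page_crc() {
  static const std::vector<uint8_t> zeros(kPageSize, 0);
  static const uLong crc = crc32(crc32(0L, Z_NULL, 0), zeros.data(), kPageSize);
  return crc;
}

// A pagemap entry with neither bit 63 (present in RAM) nor bit 62 (swapped)
// set describes a mapped page that has never been faulted in, i.e. it still
// reads as all zeros.
static bool is_demand_zero(uint64_t pagemap_entry) {
  const uint64_t kPresent = 1ULL << 63;
  const uint64_t kSwapped = 1ULL << 62;
  return (pagemap_entry & (kPresent | kSwapped)) == 0;
}

// Checksum npages of the tracee's memory starting at `start`, skipping the
// read for demand-zero pages.
uint32_t checksum_region(pid_t pid, uintptr_t start, size_t npages) {
  char path[64];
  snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
  FILE* f = fopen(path, "rb");
  if (!f) return 0;
  // pagemap holds one 8-byte entry per virtual page, indexed by page number.
  fseek(f, static_cast<long>((start / kPageSize) * sizeof(uint64_t)), SEEK_SET);

  uLong crc = crc32(0L, Z_NULL, 0);
  std::vector<uint8_t> page(kPageSize);
  for (size_t i = 0; i < npages; ++i) {
    uint64_t entry = 0;
    if (fread(&entry, sizeof(entry), 1, f) != 1) break;
    if (is_demand_zero(entry)) {
      // Fast-forward: derive CRC(A || zero_page) from CRC(A) and the
      // precomputed CRC of a zero page, without touching the tracee.
      crc = crc32_combine(crc, zero_page_crc(), kPageSize);
    } else {
      struct iovec local = { page.data(), kPageSize };
      struct iovec remote = { reinterpret_cast<void*>(start + i * kPageSize),
                              kPageSize };
      if (process_vm_readv(pid, &local, 1, &remote, 1, 0) !=
          static_cast<ssize_t>(kPageSize)) {
        break;
      }
      crc = crc32(crc, page.data(), kPageSize);
    }
  }
  fclose(f);
  return static_cast<uint32_t>(crc);
}
```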
How much of a performance improvement is this fancy crc32 implementation?
You can see the benchmark results I did a while ago when I implemented the ARM versions here: JuliaLang/julia#22385. There isn't any benchmark result from the x86 version from when it was originally added, AFAICT: JuliaLang/julia#18297
/* Table-driven software version as a fall-back. This is about 15 times slower
Actually, the comment suggests 15x faster for the hardware-accelerated version, which is similar to what I've seen on some older ARM cores.
I think the pagemap optimization makes sense because the speedup is nearly infinite in those pathological cases with lots of unreserved memory. I'm less convinced about the benefits of accelerating the CRC implementation for data that we've had to read from the tracee. Wouldn't it be good enough to use the crc32() function from zlib? We could replace the existing crc code in util.cc with that too.

Personally, I almost never use the memory checksumming. Maybe you and Keno use it a lot?
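For what it's worth, zlib's crc32() is already incremental, so it composes naturally with a capped read buffer. A minimal sketch of the chunked usage (the chunk arrays here are just placeholders for however the tracee reads are delivered):

```cpp
#include <zlib.h>
#include <cstddef>
#include <cstdint>

// Feed data that arrives in bounded chunks into one running CRC, so no
// single large buffer is ever materialized.
uint32_t crc32_of_chunks(const uint8_t* const* chunks, const size_t* lens,
                         size_t nchunks) {
  uLong crc = crc32(0L, Z_NULL, 0);  // zlib's initial CRC value
  for (size_t i = 0; i < nchunks; ++i) {
    crc = crc32(crc, chunks[i], static_cast<uInt>(lens[i]));
  }
  return static_cast<uint32_t>(crc);
}
```

Restarting crc32() from the previous return value yields the same result as checksumming the concatenated data in one call, so the cap size only affects memory use, not the checksum.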
Friendly ping on this.
As the zlib crc32 implementation has had ARM optimizations since 2019, I think that's good to use and preferable over (another) third-party implementation.