-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BitShuffle#shuffle for arrays with blocksize 4 (like integers) seems to be broken #296
Comments
I don't know much about the java interface, but on the bitshuffle README the function is That said, I didn't see a function with that name in a quick look at the java wrapper. I'm pretty sure the C-code should be correct, since our unit tests check against a python implementation. I'll need to verify that point. I also wonder if the apparent byte-shuffling behavior might be fooling you and just be an artifact of the very particular test pattern you chose. Can you try something with a bit more entropy? I see you specify big endian in your test, whereas most of our testing has been under little endian architectures. On that note, what architecture/instruction set are you using? |
Thanks for the quick reaction. 👍 The mentioned method I use this wrapper class: https://github.com/xerial/snappy-java/blob/master/src/main/java/org/xerial/snappy/BitShuffle.java I nailed it down to a very simple test and made further findings: The algorithm seems to require a minimum number of elements, otherwise it is not shuffled at all. Could that be the reason? All my unit tests are "handmade" having only a few elements which actually did not reach this "threshold"? Have a look to this very simple tests (obviously only the second test is actually bit shuffled):
Of course I know that bit shuffling makes no sense for very few elements (only costs computing time). But for the pure algorithm I would have expected that it ignores the number of elements and always shuffles (even if there are only two elements given). In other words a meaningful threshold has to be applied in the outer context. Or is this threshold a pre-requirement by the blockwise implementation? |
Just to be sure that I fully understand how the realized bitshuffle algorithm works. Can you please confirm that the following bit sequence is the correct result?
|
Few things: I think you are interpreting Yes, having arrays that are too short could definitely be the issue. You need at least 8 elements ( Ah, the bit ordering always breaks my brain. There is a good chance I'll get this wrong but let me try.
In Bitshuffle, everything is little endian, so A1 is the least significant byte of word A and A1.0 is the least significant bit (LSB) of A1. Bitshuffle puts the LSB first, so the output should be:
This is counter intuitive, because we normally think of the LSB as being the last bit in a byte. But that's actually somewhat a matter of interpretation. If you want to see how confused I was about this while writing bitshuffle, there is this thread on SO: https://stackoverflow.com/questions/24045102/how-does-endianness-work-with-simd-registers |
Actually I think it may be the other way round? :-)
My initial problem definitely results from the minimum words threshold (at least 8 elements of a given size), otherwise it acts in copy-only mode. For the other points (raised during this analysis) I opened further issues: #297 -> byte-order Thank you so much for your quick reaction! By the way, bit-shuffling of small arrays also makes sense if it is not about performance but only about bandwith. We transfer small self-contained data packets to smartphones and bit-shuffling can reduce the amount of data considerably (even for data series with less than 8x total size). If bandwith matters (and not performance) you are willing to go the extra mile. ;-) Example: Once assumed the following data series (each number is stored as a UInt64, thus for each number 7 bytes of 8 are wasted with zero): 1, 2, 1, 0, 0, 1, 2, 1, 2, 0, 1, 0, 0, 1, 1 The total number of elements is 15 and the total array size in bytes is 15 x 8 = 120 bytes in total. Here are the numbers and ratios: Original Size: 120 (bytes) -> Ratio: 100% Obviously even for small messages (e.g. that needs to be transported over the wire) bit-shuffling may have a significant positive effect. This was the reason how we figured out, that the bit-shuffling library does not kick-in here for these short messages. Our LZ4 compression ratio was sticked to 38% in this example. |
Some of our unit tests triggers that
BitShuffle#shuffle()
literally does nothing meaningful for arrays having a wordsize of 4 bytes (e.g. like integers). In contrast shuffling byte arrays (wordsize of 1 byte) works pretty well (unit tests passed) but for integers only the byte order is changed but no bits are re-arranged at all.Since the shuffle and unshuffle methods both reflect this behaviour the bijective API contract is not broken and this effect is only "visible" on deeply inspecting the binary results.
I attached a short unit test allowing to easily reproduce the behaviour (one test pass and one fails).
The failing one is based on an array that contains 2 integers (each having 4 bytes) and the output is as follows:
Since I am not familiar with the internals, I can only guess where the error lies. I suspect something is wrong with the integration / passing to the bitshuffle library.
Since we are very dependent on a fast (native) bit-shuffling algorithm, I would really appreciate some help.
Thanks in advance for any support!
BitShuffleTest-java.txt
The text was updated successfully, but these errors were encountered: