This repository has been archived by the owner on Apr 19, 2020. It is now read-only.

UTF8 string serialization #15

Open
velvia opened this issue Mar 18, 2015 · 8 comments

Comments

@velvia

velvia commented Mar 18, 2015

Strings form a large portion of many objects. Just storing a pointer to the on-heap String object is not a practical way to reduce GC pressure. Instead, how about having a UTF8-based string wrapper class that can offer support for basic operations:

equals()
startsWith()
maybe contains()

Other, more complex methods can be delegated to the native Java/Scala String class by decoding to an on-heap String on demand, but the three above would offer enough support for simple things like HTTP or JSON parsing.

The goal is to allow basic, fast string operations without the expensive conversion and object allocation needed to decode UTF-8-encoded bytes into Java's native UTF-16 String representation.
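
A minimal sketch of what such a wrapper might look like, written in Java against a direct `ByteBuffer` rather than this library's own allocator. The class name `Utf8Slice` and its methods are hypothetical and are not part of any existing API. The sketch relies on two properties of UTF-8: byte-wise equality of two valid encodings implies equality of the code point sequences, and a valid encoded needle can only match at character boundaries, so a plain byte search is safe for `contains`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical read-only view over UTF-8 bytes held in a direct (off-heap) buffer. */
final class Utf8Slice {
    private final ByteBuffer buf; // position..limit delimits the encoded bytes

    private Utf8Slice(ByteBuffer buf) { this.buf = buf; }

    /** Encodes a String once and copies its UTF-8 bytes into off-heap memory. */
    static Utf8Slice of(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
        direct.put(bytes);
        direct.flip();
        return new Utf8Slice(direct);
    }

    int length() { return buf.remaining(); }

    /** True if `other` occurs in this slice starting at byte `offset`. */
    private boolean regionMatches(int offset, Utf8Slice other) {
        if (offset < 0 || offset + other.length() > length()) return false;
        int base = buf.position() + offset;
        int otherBase = other.buf.position();
        for (int i = 0; i < other.length(); i++) {
            if (buf.get(base + i) != other.buf.get(otherBase + i)) return false;
        }
        return true;
    }

    boolean startsWith(Utf8Slice prefix) { return regionMatches(0, prefix); }

    /** Naive byte search; no String is allocated. */
    boolean contains(Utf8Slice needle) {
        for (int i = 0; i + needle.length() <= length(); i++) {
            if (regionMatches(i, needle)) return true;
        }
        return false;
    }

    @Override public boolean equals(Object o) {
        return o instanceof Utf8Slice
            && ((Utf8Slice) o).length() == length()
            && regionMatches(0, (Utf8Slice) o);
    }

    @Override public int hashCode() { return buf.hashCode(); }

    /** Decodes to an on-heap String only when a caller asks for one. */
    @Override public String toString() {
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }
}
```

Only `toString()` pays for decoding and allocation; the three comparison methods work on the encoded bytes directly. The comparison is purely byte-wise, so it does not account for Unicode normalization.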

@densh
Owner

densh commented Mar 18, 2015

I think that having support for offheap strings in the API is a great idea. I'm not sure about details of the implementation yet, but I'll update the issue once I have some more concrete thoughts on the topic.

@andresilva

I agree that the conversion to String is indeed expensive and incurs an unnecessary object allocation. Still, you'll have a hard time beating the performance of String#equals(), since the JVM has an intrinsic method that uses SSE4.2 instructions to do the comparison. You might be able to use Arrays.equals (which is also intrinsic), but then you'd incur an allocation, since you need to create a byte array from off-heap memory. I'm curious to see what you come up with. 😄
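
To make the trade-off concrete, here is a rough sketch of the copy-then-compare route, assuming the off-heap data is reachable through a direct `ByteBuffer`. The class and method names are hypothetical. Each call to `toHeap` allocates a fresh `byte[]`, which is exactly the cost described above.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class CopyThenCompare {
    /** Copies the buffer's remaining bytes on-heap; this is the extra allocation. */
    static byte[] toHeap(ByteBuffer offHeap) {
        byte[] copy = new byte[offHeap.remaining()];
        offHeap.duplicate().get(copy); // duplicate() leaves the caller's position untouched
        return copy;
    }

    /** Compares two off-heap regions by way of Arrays.equals on temporary copies. */
    static boolean sameBytes(ByteBuffer a, ByteBuffer b) {
        return Arrays.equals(toHeap(a), toHeap(b));
    }
}
```

Whether the fast array comparison outweighs two copies and two short-lived allocations per call would need to be measured.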

@velvia
Author

velvia commented Mar 21, 2015

Unsafe has a memcpy equivalent (copyMemory); too bad it doesn't have a memcmp equivalent... :(
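
Without a memcmp-style primitive, one allocation-free option is to compare eight bytes at a time and finish with a byte-wise tail. The sketch below uses absolute reads on direct `ByteBuffer`s so that it runs without internal APIs; the same loop shape should carry over to `Unsafe.getLong(address)` on raw addresses, though that variant is untested here. The names `OffHeapCompare` and `regionEquals` are hypothetical.

```java
import java.nio.ByteBuffer;

final class OffHeapCompare {
    /**
     * Compares len bytes of a (from aOff) with len bytes of b (from bOff).
     * Assumes both buffers use the same byte order, so equal longs mean equal bytes.
     */
    static boolean regionEquals(ByteBuffer a, int aOff, ByteBuffer b, int bOff, int len) {
        int i = 0;
        // Main loop: one 8-byte word per iteration.
        for (; i + Long.BYTES <= len; i += Long.BYTES) {
            if (a.getLong(aOff + i) != b.getLong(bOff + i)) return false;
        }
        // Tail: the remaining 0..7 bytes.
        for (; i < len; i++) {
            if (a.get(aOff + i) != b.get(bOff + i)) return false;
        }
        return true;
    }
}
```

This does not use SIMD instructions explicitly, so it is unlikely to match an intrinsic, but it avoids both the String decode and the temporary array.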



@densh
Owner

densh commented Mar 21, 2015

JNI might be the answer here. Since we don't need to copy any data over (the data is already effectively allocated on the C heap), we shouldn't have much performance overhead. Of course, we need to benchmark to validate this.

@ghost

ghost commented Jun 13, 2015

Hi Denys,

With the jemalloc JNI binding, we can also add utility functions that expose low-level operations or, potentially, SIMD instructions. For the latter, I think we would have to be careful about the chipset families of the target platforms. I can dig into some of the HotSpot code from OpenJDK and check their implementation. For now I can put this work on a parallel branch while we flesh out the jemalloc binding, and just plan to include it in the JNI library that houses jemalloc.

@densh
Owner

densh commented Jun 13, 2015

@arosenberger Please don't use GPL code bases as a reference. We use the Scala license (a 3-clause BSD derivative) for our code and can only borrow implementation ideas from software with a compatible license. Otherwise we might get into legal trouble some day, even if we don't borrow any code. (Note to self: this really needs to be documented somewhere.)

@densh
Owner

densh commented Jun 13, 2015

@arosenberger I think that we need to concentrate on getting 0.1 out before we proceed with this. I'm afraid there are lots of corner cases in string support and it will take a while to get it right.

@ghost

ghost commented Jun 13, 2015

Thanks for the heads up on the GPL. I'll focus on finishing up jemalloc and adding the ArrayOps methods from the other issues. We can revisit this one down the road.
