getfield performance #3588
Not really a performance bug; you're doing 2 loads instead of 1. I will consider this a "request for optimization" (part of #3440). More concerning is how slow the string version is. @StefanKarpinski It is quite unfortunate to validate ASCII strings on every access, and perhaps even worse in the case of UTF-8. If we're that paranoid, we should check on construction, which is just as safe but more efficient. It would still be possible to avoid redundant checking if we've already done a check to determine which kind of string to return.
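The check-on-construction idea can be sketched as follows. This is a hypothetical `CheckedString` wrapper in modern Julia syntax, not actual Base code: validity is verified exactly once, in the inner constructor, so accessors never need to re-check.

```julia
# Hypothetical sketch: validate UTF-8 once at construction time,
# so every later access can skip the check entirely.
struct CheckedString
    s::String
    function CheckedString(bytes::Vector{UInt8})
        str = String(copy(bytes))   # copy: String() takes ownership of the vector
        # isvalid(::String) checks whole-string UTF-8 validity in Base
        isvalid(str) || throw(ArgumentError("invalid UTF-8"))
        new(str)
    end
end
```

Because the inner constructor is the only way to build a `CheckedString`, holding one is a proof that its bytes are valid UTF-8.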
Yeah, the string performance is lousy. We should definitely do something about it. In general, our handling of invalid UTF-8 and ASCII needs improvement. Here are the choices I can see:

1. Reject invalid data outright, throwing an error on input.
2. Sanitize data on input, substituting replacement characters for invalid sequences.
3. Accept invalid data and check validity on each access (what we do now).
Option 1 is unviable because invalid data happens all the time in the real world. Option 3 is what we're doing now and has bad performance overhead for lots of string accesses. The trouble with Option 2 is that if you read data in and just write it out again without ever touching it, you end up changing the data in the process. This is specifically recommended against in various Unicode standards. One possibility would be to check validity on input and produce a different type of object for valid vs. invalid data – then the access performance penalty would only apply to strings with invalid data. I hate to explode the number of string types further, but it's the only win-win option I can see.
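The "different type of object for valid vs. invalid data" idea might look like the following sketch. All names here (`ByteText`, `ValidUTF8`, `RawBytes`, `classify`) are hypothetical, in modern Julia syntax: validity is decided once on input, and dispatch then determines whether accesses need per-access checks.

```julia
# Hypothetical sketch: check validity once on input, then return a
# type that records the result, so only invalid data pays on access.
abstract type ByteText end

struct ValidUTF8 <: ByteText    # validated once; accessors can skip checks
    s::String
end

struct RawBytes <: ByteText     # invalid data, kept verbatim as bytes
    b::Vector{UInt8}
end

function classify(bytes::Vector{UInt8})
    s = String(copy(bytes))                 # copy: String() takes ownership
    isvalid(s) ? ValidUTF8(s) : RawBytes(bytes)
end
```

Methods on `ValidUTF8` can then assume well-formed UTF-8, while methods on `RawBytes` do whatever checking or substitution they need.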
You can certainly have and use invalid string data, as a byte vector. On some level it doesn't make sense to wrap non-UTF-8 data as a UTF8String; for example PCRE might crash on it. Having I/O functions that return strings by default without the API mentioning encodings was probably a mistake. We should use byte vectors more, especially since that's the first step of constructing a ByteString anyway.
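The byte-layer approach described above can be illustrated with standard Base functions (the specific bytes are just an example): I/O stays at the `Vector{UInt8}` level, with no decoding or validation, and converting to a string becomes a separate, explicit step.

```julia
# Sketch: keep I/O at the byte layer; string conversion is explicit.
io = IOBuffer(UInt8[0x68, 0x69, 0xff])  # "hi" plus one invalid byte
bytes = read(io)                        # Vector{UInt8}: no decoding happens

# Converting to a string is where an encoding decision would be made:
s = String(copy(bytes))                 # copy: String() takes ownership
```

Nothing is lost at the byte layer; the invalid `0xff` byte travels through `read` untouched.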
I agree that encodings should be made more visible in Julia: dealing with them was a big headache when I was first writing …
Well, I'm glad we've come around to the same point of view on the encoding business and the need to separate the byte and string layer. We should definitely work on that. However, I do think it's reasonable to insist that this program output the same data as it's fed, regardless of whether that data is valid UTF-8 or not:

```julia
for line in eachline(STDIN)
    print(STDOUT, line)
end
```

I also think that throwing an error on any invalid UTF-8 data rather than yielding replacement characters at the appropriate places is going to be a usability nightmare.
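The round-trip property argued for above can be demonstrated with `IOBuffer`s standing in for `STDIN`/`STDOUT` (in current Julia, `String`s preserve invalid bytes, so this does hold; `keep=true` retains the line terminator):

```julia
# Sketch: a line-by-line copy should preserve invalid UTF-8 exactly.
input  = IOBuffer(UInt8[0x61, 0xff, 0x0a])  # 'a', an invalid byte, newline
output = IOBuffer()
for line in eachline(input; keep=true)      # keep=true: don't strip the '\n'
    print(output, line)
end
```

If strings sanitized or rejected invalid bytes on construction, the copied output would differ from the input, which is exactly the failure mode being objected to.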
moving discussion to #1792
@JeffBezanson it would be great if this could be optimized in the future as part of #3440
What if the `type` were declared `immutable type` instead?
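The distinction asked about above can be sketched in modern syntax, where the 2013-era `type` keyword corresponds to `mutable struct` and `immutable type` to `struct` (the `Seq` names here are hypothetical). The relevance to `getfield`: an immutable field can never be rebound, so the compiler is free to load it once and reuse the value.

```julia
# Modern spelling of the 2013 keywords:
# `type`      -> `mutable struct` (fields may be rebound; loads harder to hoist)
# `immutable` -> `struct`         (fields fixed after construction)
mutable struct MutSeq
    data::Vector{UInt8}
end

struct ImmSeq
    data::Vector{UInt8}
end
```

Note that immutability applies to the field binding, not the vector's contents: the elements of `data` remain mutable in both cases.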
Please be patient. I can only implement optimizations at a finite rate.
There is no hurry :) In fact, I'm not using it right now.
@JeffBezanson is a machine that converts energy bars into LLVM bitcode - but every machine has physical constraints! |
I was trying to build a better abstraction for BioSeq's Sequence objects. At the moment I am using Vectors of bitstypes, but a new type wrapping a vector, more similar to Strings, can be better in some situations and gives me a better abstraction for different sequence encodings.
I was testing indexing performance on these sequences, and I see a slowdown relative to the current code that simply uses Vectors.
This ~1.5x slowdown comes from the call to `getfield` and not from other things, like type conversion. Moving `getfield` out of the loop gives better performance. Benchmark times:
Benchmark code:
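The original benchmark times and code weren't captured above. A minimal sketch of the hoisting effect being described, with a hypothetical `Seq` type in modern syntax (not the original BioSeq benchmark):

```julia
# Hypothetical sketch: hoisting the getfield out of the loop.
mutable struct Seq
    data::Vector{Int}
end

# Field access s.data (a getfield call) happens on every iteration:
function sum_infield(s::Seq)
    t = 0
    for i in 1:length(s.data)
        t += s.data[i]
    end
    t
end

# One getfield up front; the loop indexes a local variable:
function sum_hoisted(s::Seq)
    d = s.data
    t = 0
    for i in eachindex(d)
        t += d[i]
    end
    t
end
```

Both functions compute the same sum; the second performs the field load once instead of once (or twice) per iteration, which is the source of the ~1.5x difference reported above.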