Conversion performance #8
Could you write a summary of the results? Also, I think it would make sense to measure the performance of converting many strings using the same iconv/ICU handler, since this is a more reasonable scenario. I think progressively adding pure-Julia converters, starting with the most common encodings, is a good idea. That would justify changing the name of the package. One difficult point is to get a relatively consistent behavior as regards invalid characters, given that
The test in the gist used different sizes: 1, 16, 32, 64, 256, 1024, 5120. I'm not sure exactly what you want about measuring converting many strings. About the pure-Julia converters, I think it will be pretty easy for me to add all 8-bit encodings, with all of the invalid character behaviors. The big takeaway, which I learned when doing the Unicode conversion code in Base after being pushed by Tony et al. to do it in pure Julia, is that pure Julia rocks! 😀
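To make the pure-Julia approach concrete, here is a minimal sketch of a table-driven 8-bit decoder. This is only an illustration, not code from the package: the names `decode_8bit` and `LATIN1_TABLE` are invented, and it assumes modern Julia syntax.

```julia
# Sketch of a table-driven 8-bit decoder (names are hypothetical).
# `table` maps each byte value 0x00-0xff to a Unicode code point; for
# Latin-1 the mapping is simply the identity.
const LATIN1_TABLE = UInt32[i for i in 0:255]

function decode_8bit(bytes::Vector{UInt8}, table::Vector{UInt32})
    io = IOBuffer()
    for b in bytes
        print(io, Char(table[b + 1]))  # +1: Julia arrays are 1-based
    end
    return String(take!(io))
end

decode_8bit(UInt8[0x48, 0x69, 0xe9], LATIN1_TABLE)  # "Hié"
```

For 8-bit source encodings the whole conversion is one lookup per byte, which is why a pure-Julia loop can plausibly compete with a call out to a C library.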
5120 still isn't that much to offset the cost of creating a handle. But indeed in many use cases that overhead is an issue, so that's fair.
Why do you say that? UnicodeExtras.jl supports
Yeah, though I don't expect this to change the results too much. I wonder why iconv (and ICU to some extent) are so slow. Note that I haven't done any optimization on iconv.jl, so there might be significant performance issues to fix before considering the results as significant.
Makes sense.
Ah, we are using
Yes - I understand iconv.jl isn't optimized yet - we'd really need to benchmark separately
I like to benchmark as many possibilities as possible - so I don't get caught out by some case I didn't benchmark that turns out to be common.
julia-iconv performance will be the platform-native performance plus a small marshalling overhead from the Julia-to-C communication layer. iconv is a very mature library, and no significant code additions have gone in since 2011, which makes it ultra-stable. Any newer library may lack that stability. Unless proven otherwise, it may not be a good idea to move away from iconv.
The problem is not the cost of communication between Julia and C (that cost should be null), it's just that iconv is said to be relatively slow. Julia allows generating very efficient code for common conversions on the fly, which should be worth it at least for simple cases. Anyway we'd only do this if benchmarks show it's really faster. |
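One way Julia can generate efficient per-encoding code "on the fly" is by encoding the character set in a type parameter, so that each encoding gets its own compiled method. The sketch below is a hand-wavy illustration of that idea; `Encoding`, `conversion_table`, and `to_string` are all invented names, not an actual API.

```julia
# Sketch of per-encoding specialization via dispatch (all names invented).
# Because `Encoding{:latin1}` is a distinct type, Julia compiles a dedicated
# `to_string` method for it, so the table lookup can be inlined.
struct Encoding{name} end

conversion_table(::Encoding{:latin1}) = UInt32[i for i in 0:255]

function to_string(bytes::Vector{UInt8}, enc::Encoding)
    table = conversion_table(enc)
    io = IOBuffer()
    @inbounds for b in bytes
        print(io, Char(table[b + 1]))
    end
    return String(take!(io))
end

to_string(UInt8[0x41, 0xe9], Encoding{:latin1}())  # "Aé"
```

In a real implementation the table would be built once and cached rather than reconstructed per call, but the dispatch structure is the point: no handle creation, no per-call setup, just a specialized loop.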
I've written some simple functions that create tables using iconv.jl, and then do the conversions in pure Julia code instead of calling iconv, as well as comparing the performance of
I've made a Gist with benchmark results (using https://github.com/johnmyleswhite/Benchmarks.jl)
along with the code and benchmarking code, at:
https://gist.github.com/ScottPJones/fcd12f675edb3d79b5ce.
The tables created are also very small, at most a couple hundred bytes per character set
(the maximum, if the character set is ASCII compatible, is 256 bytes; if it is an ANSI character set, the max is 192 bytes; and only 64 bytes for CP1252, which would probably be the most used conversion).
Should we move towards using this approach at least for the 8-bit character set conversions?
It would also make it easy to add all of the options that Python 3 has for handling invalid characters (error, remove, replace with a fixed replacement character (default 0xfffd) or string, insert a quoted XML escape sequence, or insert quoted as `\uxxxx` or `\u{xxxx}`).
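A sketch of how those Python-3-style policies might be wired into a pure-Julia 8-bit decoder follows. The function name, the `onerror` keyword, and the use of `0xffffffff` as an invalid-byte sentinel are all assumptions for illustration, not part of any existing package.

```julia
# Sketch of invalid-character policies for an 8-bit decoder (hypothetical API).
# Table entries equal to INVALID mark bytes with no mapping in the source
# encoding; `onerror` selects what to do when one is encountered.
const INVALID = 0xffffffff

function decode_with_policy(bytes::Vector{UInt8}, table::Vector{UInt32};
                            onerror::Symbol = :error)
    io = IOBuffer()
    for b in bytes
        cp = table[b + 1]
        if cp == INVALID
            if onerror == :error
                throw(ArgumentError("invalid byte 0x$(string(b, base=16, pad=2))"))
            elseif onerror == :remove
                continue                          # silently drop the byte
            elseif onerror == :replace
                print(io, '\ufffd')               # U+FFFD REPLACEMENT CHARACTER
            elseif onerror == :escape
                print(io, "\\x", string(b, base=16, pad=2))  # quoted escape
            end
        else
            print(io, Char(cp))
        end
    end
    return String(take!(io))
end
```

The XML-escape and `\u{xxxx}` variants mentioned above would just be additional branches writing a different escape form.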