Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getfield performance #3588

Closed
diegozea opened this issue Jun 30, 2013 · 11 comments
Closed

getfield performance #3588

diegozea opened this issue Jun 30, 2013 · 11 comments

Comments

@diegozea
Copy link
Contributor

I was trying to generate a better abstraction for BioSeq's Sequence objects. At the moment, I was using Vectors of bitstypes. But a new type more similar to Strings wrapping a vector, can be better on some situations and can gives me a better abstraction for different sequence encodings.

I was testing indexing performance on this sequences, and I see a slow down with respect to the actual code using simply Vectors.

This ~ 1.5x slow down comes form call to getfield an not for others things, like conversion of types. Now, move getfield out of the loop gives better performance.

Bechmark times:

                         Function    Elapsed Relative Replications
[1,]                       "_now" 0.00375577      1.0          100
[2,]                   "_new_seq" 0.00567916  1.51212          100
[3,]     "_new_seq_avoid_convert" 0.00567669  1.51146          100
[4,]    "_new_seq_avoid_getfield" 0.00379458  1.01033          100
[5,]                       "_str"  0.0189466  5.04466          100

Benchmark code:

importall Base
using Benchmark
using BioSeq

immutable AA
  byte::Uint8
end

type AASeq <: AbstractVector{AA}
  data::Vector{Uint8}
end

## size & length
size(seq::AASeq) = size( seq.data )
length(seq::AASeq) = length( seq.data )

## getindex
getindex(seq::AASeq,I::Real) = AA( seq.data[I] )
getindex(seq::AASeq,I::Real,J::Real) = AA( seq.data[I,J] )
getindex(seq::AASeq,I) = AASeq( getindex(seq.data,I) )

const now = AMINO_IUPAC[rand(1:29,72000)];
const new_seq = AASeq(uint8(now))
const str = ascii(now);

function _now()
    s = now
    for l in 1:72000
      s[l]
    end
end

function _new_seq()
    s = new_seq
    for l in 1:72000
      s[l]
    end
end

function _new_seq_avoid_convert()
    s = new_seq
    for l in 1:72000
      s.data[l]
    end
end

function _new_seq_avoid_getfield()
    s = new_seq.data
    for l in 1:72000
      AA( s[l] )
    end
end

function _str()
    s = str
    for l in 1:72000
      s[l]
    end
end

compare([_now,_new_seq,_new_seq_avoid_convert,_new_seq_avoid_getfield,_str],100)
compare([_now,_new_seq,_new_seq_avoid_convert,_new_seq_avoid_getfield,_str],100)
@JeffBezanson
Copy link
Member

Not really a performance bug; you're doing 2 loads instead of 1. I will consider this a "request for optimization" (part of #3440).

More concerning is how slow the string version is. @StefanKarpinski It is quite unfortunate to validate ASCII strings on every access, and perhaps even worse in the case of UTF-8. If we're that paranoid, we should check on construction, which is just as safe but more efficient. It would still be possible to avoid redundant checking if we've already done a check to determine which kind of string to return.

@StefanKarpinski
Copy link
Member

Yeah, the string performance is lousy. We should definitely do something about it. In general, our handling of invalid UTF-8 and ASCII needs improvement. Here are the choices I can see:

  1. Validate on input and choke on invalid ASCII and/or UTF-8; you never have to worry about accessing invalid data in normal usage.
  2. Validate on input and modify the data so that it becomes valid, replacing invalid sequences with replacement characters; again, you never have to worry about accessing invalid data.
  3. Don't validate on input, and handle invalid encoded data on access; this is what we're doing now.

Option 1 is unviable because invalid data happens all the time in the real world. Option 3 is what we're doing now and has bad performance overhead for lots of string accesses. The trouble with Option 2 is that if you read data in and just write it out again without ever touching it, you end up changing the data in the process. This is specifically recommended against in various Unicode standards. One possibility would be to check validity on input and produce a different type of object for valid vs. invalid data – then the access performance penalty would only apply to strings with invalid data. I hate to explode the number of string types further, but it's the only win-win option I can see.

@JeffBezanson
Copy link
Member

You can certainly have and use invalid string data, as a byte vector. On some level it doesn't make sense to wrap non-utf8 data as a UTF8String; for example PCRE might crash on it.

Having I/O functions that return strings by default without the API mentioning encodings was probably a mistake. We should use byte vectors more, especially since that's the first step of constructing a ByteString anyway.

@johnmyleswhite
Copy link
Member

I agree that encodings should be made more visible in Julia: dealing with them was a big headache when I was first writing readtable months ago.

@StefanKarpinski
Copy link
Member

Well, I'm glad we've come around to the same point of view on the encoding business and the need to separate the byte and string layer. We should definitely work on that. However, I do think it's reasonable to insist that this program output the same data as it's fed, regardless of whether that data is valid UTF-8 or not:

for line in eachline(STDIN)
  print(STDOUT,line)
end

I also think that throwing an error on any invalid UTF-8 data rather than yielding replacement characters at the appropriate places is going to be a usability nightmare.

@JeffBezanson
Copy link
Member

moving discussion to #1792

@diegozea
Copy link
Contributor Author

diegozea commented Jul 1, 2013

@JeffBezanson would be great if this can be optimized on the future as part of #3440
type_wrapping_a_vector[i] is a very common operation inside loops.
For the moment, I can move out the getfield outside some loops.

@diegozea
Copy link
Contributor Author

diegozea commented Jul 5, 2013

If AASeq is a immutable type, the performance still suffer for the extra getfield operation.
I guess than in immutable types, when the value of the field doesn't change, the compiler can replace the getfiled operation for the real value. Is this an issue @JeffBezanson ?

type
[2,]                   "_new_seq" 0.00583584  1.50466          100
[4,]    "_new_seq_avoid_getfield" 0.00389443   1.0041          100
immutable type
[2,]                   "_new_seq" 0.00582046  1.51721          100
[4,]    "_new_seq_avoid_getfield"  0.0038363      1.0          100

@JeffBezanson
Copy link
Member

Please be patient. I can only implement optimizations at a finite rate.

@diegozea
Copy link
Contributor Author

diegozea commented Jul 5, 2013

There is not hurry :) in fact I'm not using it now
I test the immutable version by curiosity, and I didn't know if It was a known issue.

@ViralBShah
Copy link
Member

@JeffBezanson is a machine that converts energy bars into LLVM bitcode - but every machine has physical constraints!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants