getfield performance #3588

diegozea · 2013-06-30T20:05:33Z

I am trying to build a better abstraction for BioSeq's Sequence objects. At the moment I use Vectors of bitstypes, but a new String-like type that wraps a vector could work better in some situations and give me a cleaner abstraction over different sequence encodings.

I was testing indexing performance on these sequences, and I see a slowdown relative to the current code, which uses plain Vectors.

This ~1.5x slowdown comes from the call to getfield and not from other things, such as type conversion. Moving getfield out of the loop gives better performance.

Benchmark times:

                         Function    Elapsed Relative Replications
[1,]                       "_now" 0.00375577      1.0          100
[2,]                   "_new_seq" 0.00567916  1.51212          100
[3,]     "_new_seq_avoid_convert" 0.00567669  1.51146          100
[4,]    "_new_seq_avoid_getfield" 0.00379458  1.01033          100
[5,]                       "_str"  0.0189466  5.04466          100

Benchmark code:

importall Base
using Benchmark
using BioSeq

immutable AA
  byte::Uint8
end

type AASeq <: AbstractVector{AA}
  data::Vector{Uint8}
end

## size & length
size(seq::AASeq) = size( seq.data )
length(seq::AASeq) = length( seq.data )

## getindex
getindex(seq::AASeq,I::Real) = AA( seq.data[I] )
getindex(seq::AASeq,I::Real,J::Real) = AA( seq.data[I,J] )
getindex(seq::AASeq,I) = AASeq( getindex(seq.data,I) )

const now = AMINO_IUPAC[rand(1:29,72000)];
const new_seq = AASeq(uint8(now))
const str = ascii(now);

function _now()
    s = now
    for l in 1:72000
      s[l]
    end
end

function _new_seq()
    s = new_seq
    for l in 1:72000
      s[l]
    end
end

function _new_seq_avoid_convert()
    s = new_seq
    for l in 1:72000
      s.data[l]
    end
end

function _new_seq_avoid_getfield()
    s = new_seq.data
    for l in 1:72000
      AA( s[l] )
    end
end

function _str()
    s = str
    for l in 1:72000
      s[l]
    end
end

compare([_now,_new_seq,_new_seq_avoid_convert,_new_seq_avoid_getfield,_str],100)
compare([_now,_new_seq,_new_seq_avoid_convert,_new_seq_avoid_getfield,_str],100)


JeffBezanson · 2013-06-30T21:41:06Z

Not really a performance bug; you're doing 2 loads instead of 1. I will consider this a "request for optimization" (part of #3440).

More concerning is how slow the string version is. @StefanKarpinski It is quite unfortunate to validate ASCII strings on every access, and perhaps even worse in the case of UTF-8. If we're that paranoid, we should check on construction, which is just as safe but more efficient. It would still be possible to avoid redundant checking if we've already done a check to determine which kind of string to return.

StefanKarpinski · 2013-06-30T22:16:00Z

Yeah, the string performance is lousy. We should definitely do something about it. In general, our handling of invalid UTF-8 and ASCII needs improvement. Here are the choices I can see:

1. Validate on input and choke on invalid ASCII and/or UTF-8; you never have to worry about accessing invalid data in normal usage.
2. Validate on input and modify the data so that it becomes valid, replacing invalid sequences with replacement characters; again, you never have to worry about accessing invalid data.
3. Don't validate on input, and handle invalid encoded data on access; this is what we're doing now.

Option 1 is unviable because invalid data happens all the time in the real world. Option 3 is what we're doing now and has bad performance overhead for lots of string accesses. The trouble with Option 2 is that if you read data in and just write it out again without ever touching it, you end up changing the data in the process. This is specifically recommended against in various Unicode standards. One possibility would be to check validity on input and produce a different type of object for valid vs. invalid data – then the access performance penalty would only apply to strings with invalid data. I hate to explode the number of string types further, but it's the only win-win option I can see.
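For illustration, the "validate once, then pick a type" idea could be sketched roughly as follows. This is only a sketch in current Julia syntax, not anything that was implemented in Base: the names `ValidUTF8`, `RawBytes`, and `wrap_input` are hypothetical, and the only library call relied on is `isvalid(String, bytes)`.

```julia
# Hypothetical wrapper types (not part of Base): the type itself records
# the outcome of a single validity check made at construction time.
struct ValidUTF8
    data::Vector{UInt8}   # passed the UTF-8 check once, on input
end

struct RawBytes
    data::Vector{UInt8}   # failed the check; kept byte-for-byte
end

# Validate once on input and choose the wrapper accordingly.
function wrap_input(bytes::Vector{UInt8})
    return isvalid(String, bytes) ? ValidUTF8(bytes) : RawBytes(bytes)
end

good = wrap_input(collect(codeunits("héllo")))
bad  = wrap_input(UInt8[0x68, 0xff, 0x69])   # 0xff never occurs in valid UTF-8
```

Under this assumption, indexing methods for `ValidUTF8` could skip per-access checks, while methods for `RawBytes` would keep them; the invalid bytes themselves are never altered.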

JeffBezanson · 2013-06-30T22:46:51Z

You can certainly have and use invalid string data, as a byte vector. On some level it doesn't make sense to wrap non-utf8 data as a UTF8String; for example PCRE might crash on it.

Having I/O functions that return strings by default without the API mentioning encodings was probably a mistake. We should use byte vectors more, especially since that's the first step of constructing a ByteString anyway.

johnmyleswhite · 2013-06-30T22:48:19Z

I agree that encodings should be made more visible in Julia: dealing with them was a big headache when I was first writing readtable months ago.

StefanKarpinski · 2013-06-30T22:52:57Z

Well, I'm glad we've come around to the same point of view on the encoding business and the need to separate the byte and string layer. We should definitely work on that. However, I do think it's reasonable to insist that this program output the same data as it's fed, regardless of whether that data is valid UTF-8 or not:

for line in eachline(STDIN)
  print(STDOUT,line)
end

I also think that throwing an error on any invalid UTF-8 data rather than yielding replacement characters at the appropriate places is going to be a usability nightmare.
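The round-trip property asked for above can be checked with in-memory buffers. This is a minimal sketch that assumes a recent Julia release, where `eachline` takes a `keep` keyword and a `String` can carry arbitrary bytes; it is not the 2013 API used elsewhere in this thread, and the buffer names are made up for the example.

```julia
# Input containing a byte (0xff) that is not valid UTF-8.
bytes = UInt8[0x61, 0xff, 0x62, 0x0a, 0x63, 0x0a]

input = IOBuffer(bytes)   # read-only view over the byte vector
out   = IOBuffer()        # stands in for standard output

# keep=true retains the trailing newline of every line.
for line in eachline(input; keep=true)
    print(out, line)
end

result = take!(out)
println(result == bytes)  # compares output bytes with input bytes
```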

JeffBezanson · 2013-06-30T23:08:15Z

moving discussion to #1792

diegozea · 2013-07-01T03:48:57Z

@JeffBezanson it would be great if this could be optimized in the future as part of #3440.
type_wrapping_a_vector[i] is a very common operation inside loops.
For the moment, I can move the getfield outside some loops.
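The workaround of moving the field access out of the loop might look like the sketch below. It is written in current Julia syntax, and `Wrapped`, `sum_in_loop`, and `sum_hoisted` are hypothetical names that only mirror the shape of `AASeq` from the benchmark. Whether the two versions differ measurably depends on the compiler version, so no timings are claimed here.

```julia
# Hypothetical wrapper around a byte vector, shaped like AASeq above.
struct Wrapped
    data::Vector{UInt8}
end

# Field access written inside the loop: every iteration names w.data
# and then indexes into it.
function sum_in_loop(w::Wrapped)
    total = 0
    for i in 1:length(w.data)
        total += w.data[i]
    end
    return total
end

# Field access hoisted: w.data is bound to a local once, before the loop.
function sum_hoisted(w::Wrapped)
    data = w.data
    total = 0
    for i in eachindex(data)
        total += data[i]
    end
    return total
end

w = Wrapped(rand(UInt8, 72_000))
```

Both functions compute the same sum; the only difference is where the field is read.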

diegozea · 2013-07-05T05:29:11Z

If AASeq is an immutable type, performance still suffers from the extra getfield operation.
I guess that for immutable types, where the value of the field doesn't change, the compiler could replace the getfield operation with the actual value. Is this an issue, @JeffBezanson?

type

[2,]                   "_new_seq" 0.00583584  1.50466          100
[4,]    "_new_seq_avoid_getfield" 0.00389443   1.0041          100

immutable type

[2,]                   "_new_seq" 0.00582046  1.51721          100
[4,]    "_new_seq_avoid_getfield"  0.0038363      1.0          100

JeffBezanson · 2013-07-05T05:31:05Z

Please be patient. I can only implement optimizations at a finite rate.

diegozea · 2013-07-05T05:40:18Z

There is no hurry :) In fact, I'm not using it now.
I tested the immutable version out of curiosity, and I didn't know whether it was a known issue.

ViralBShah · 2013-07-05T06:23:41Z

@JeffBezanson is a machine that converts energy bars into LLVM bitcode - but every machine has physical constraints!

JeffBezanson closed this as completed Jun 30, 2013
