Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use promote_typejoin for Tuple and NamedTuple promotion #25924

Closed
wants to merge 2 commits into from

Conversation

nalimilan
Copy link
Member

Even if we do not want to call promote_type on the entries when promoting tuples (#25553 (comment)), we still have to use promote_typejoin so that e.g. promote_type(Tuple{Int}, Tuple{Missing}) returns Tuple{Union{Int,Missing}}. In particular, this ensures array concatenation uses Union{T,Missing/Nothing} instead of Any, which is consistent with map(identity, x).

Even if we do not want to call promote_type() on the entries when promoting tuples,
we still have to use promote_typejoin() so that e.g. promote_type(Tuple{Int}, Tuple{Missing})
returns Tuple{Union{Int,Missing}}. In particular, this ensures array concatenation uses
Union{T,Missing/Nothing} instead of Any, which is consistent with map(identity, x).
@nalimilan nalimilan added the missing data Base.missing and related functionality label Feb 7, 2018
@vtjnash
Copy link
Member

vtjnash commented Feb 7, 2018

returns Tuple{Union{Int,Missing}}

It's not clear to me why we would want it to be this type

@nalimilan
Copy link
Member Author

It's not clear to me why we would want it to be this type

Because that allows using an efficient storage and fastercode down the line? Storing [(missing, 1), (1, missing)] as a Vector{Tuple{Any,Any}} is really bad in that regard. That's really the same reasoning as the one which prompted us to make [1, missing] return a Vector{Union{Int,Missing}} rather than a Vector{Any}.

This is particularly problematic when using NamedTuple to represent data frame rows. See also #25925.

@vtjnash
Copy link
Member

vtjnash commented Feb 7, 2018

Because that allows using an efficient storage and fastercode down the line?

That part is what is not clear to me, as being likely / soon :)

@davidanthoff
Copy link
Contributor

Wouldn't the minimum requirement for an efficient in-memory representation of Tuple{Union{Int,Missing}} be that it is a concrete type? And wouldn't that require that tuples are not covariant? And that would probably be a breaking change (assuming it would be desirable in the first place, which I'm not convinced off), i.e. if it isn't done for 1.0 it would have to wait at least for 2.0?

@nalimilan
Copy link
Member Author

That part is what is not clear to me, as being likely / soon :)

What do you mean? Tuple{Union{T,Missing}}} is already much more efficient than Tuple{Any}. For example:

# Creates a Vector{Tuple{Union{Missing, Int64}} thanks to my recent changes to map
julia> x = map(_ -> (rand(Bool) ? missing : 1,), 1:10_000);

# This is the kind of vector you get with the array literal syntax
julia> y = convert(Array{Tuple{Any}}, x);

julia> @btime map(identity, x);
  48.031 μs (13 allocations: 78.66 KiB)

julia> @btime map(identity, y);
  3.175 ms (13 allocations: 78.66 KiB)

Wouldn't the minimum requirement for an efficient in-memory representation of Tuple{Union{Int,Missing}} be that it is a concrete type? And wouldn't that require that tuples are not covariant? And that would probably be a breaking change (assuming it would be desirable in the first place, which I'm not convinced off), i.e. if it isn't done for 1.0 it would have to wait at least for 2.0?

I don't know, but as shown above the performance gain is so large that whether or not the storage is more efficient using a more precise type sounds worth it already.

Anyway, even disregarding performance, without this PR we have a weird exception for tuples, which is that array literal syntax gives a less precise element type than map, while for other types it's the contrary (due to promotion, e.g. [1, 2.0] vs map(identity, Any[1, 2.0])). I think the burden of the proof should be on people who advocate for keeping this inconsistency.

@davidanthoff
Copy link
Contributor

I'm not advocating against this PR! I'm just surprised that there is a performance improvement by going from Tuple{Any} to Tuple{Union{T,Missing}}. Given that neither is a a concrete type I would have thought that perf is similar between them. But clearly, that is not the case :)

I wonder how this looks for tuples with multiple elements? Maybe for a tuple with one element the compiler in some clever way just "removes" (whatever that might mean) the tuple wrapper and treats it as the element type, in which case all the small union optimizations might kick in, and that is why there is this perf improvement?

@nalimilan
Copy link
Member Author

I'm not advocating against this PR! I'm just surprised that there is a performance improvement by going from Tuple{Any} to Tuple{Union{T,Missing}}. Given that neither is a a concrete type I would have thought that perf is similar between them. But clearly, that is not the case :)

Looking at the generated code, it appears that knowing the type of the elements allows the compiler to use small-Union optimizations, which makes a big difference.

I wonder how this looks for tuples with multiple elements? Maybe for a tuple with one element the compiler in some clever way just "removes" (whatever that might mean) the tuple wrapper and treats it as the element type, in which case all the small union optimizations might kick in, and that is why there is this perf improvement?

I've just tested with a 4-tuple, and the performance difference is still there, even without heterogeneous types.

@vtjnash
Copy link
Member

vtjnash commented Feb 8, 2018

Tuple{Union{T,Missing}}} is already much more efficient than Tuple{Any}

Isn't that just a side-effect of map not giving Tuple{Any} (or Any) for this case, so it's expending extra work trying to narrow the type for a case that we don't optimize? (I had thought we were avoiding changing this based on the comment at #25553 (comment)). I get the best performance on that example above by removing that promote rule entirely (followed by going back to Tuple{Any}):

Base.promote_typejoin(::Type{Tuple{T}}, ::Type{Tuple{T}}) where {T} = T
Base.promote_typejoin(::Type{Tuple{T}}, ::Type{Tuple{S}}) where {T, S} = Any

@nalimilan
Copy link
Member Author

Interesting. I get this:

julia> Base.promote_typejoin(::Type{Tuple{T}}, ::Type{Tuple{T}}) where {T} = T

julia> Base.promote_typejoin(::Type{Tuple{T}}, ::Type{Tuple{S}}) where {T, S} = Any

julia> x = map(_ -> (rand(Bool) ? missing : 1,), 1:10_000); # ::Vector{Any}

julia> y = convert(Array{Tuple{Any}}, x);

julia> z = convert(Array{Tuple{Union{Int,Missing}}}, x);

julia> @btime map(identity, x);
  244.888 μs (5041 allocations: 235.11 KiB)

julia> @btime map(identity, y);
  40.441 μs (9 allocations: 156.48 KiB)

julia> @btime map(identity, z);
  34.031 μs (9 allocations: 156.48 KiB)

# And for a more complex operation involving a call
julia> @btime map(x -> sin(x[1]), x);
  1.504 ms (5039 allocations: 244.89 KiB)

julia> @btime map(x -> sin(x[1]), y);
  1.517 ms (5039 allocations: 244.89 KiB)

julia> @btime map(x -> sin(x[1]), z);
  306.159 μs (5039 allocations: 244.89 KiB)

So AFAICT the narrowest type still gives better performance, in particular when a function call is involved (I guess because of dynamic dispatch when the type isn't known statically).

I'm surprised that the promote_typejoin call reduces performance so much for the Tuple{Any} case in the identity benchmark from my previous post, though. AFAIK it should only be called once (the first time encountering a missing or a valid value). But I guess for short operations like that any inference operation has a significant cost.

(I had thought we were avoiding changing this based on the comment at #25553 (comment))

The change that this comment and #25553 (comment) referred to was defining promote_type on tuples to call promote_type on the element types, which was more radical. Here I just propose calling promote_typejoin on the element types, which only affects Union{T, Missing/Nothing}.

@JeffBezanson
Copy link
Member

I suppose this makes sense for the sake of synchronizing cat with map, but we're not quite there yet. The remaining problem is that, for containers of containers, we don't generally want to convert all the inner containers to have abstract element types. For example, we do this:

julia> [[1], [:a]]
2-element Array{Array{T,1} where T,1}:
 [1]       
 Symbol[:a]

Notice: we returned a Vector{Vector}, not a Vector{Vector{Any}}. If we returned a Vector{Vector{Any}}, that would require copying both inner arrays, and boxing every element of the [1] array. If it had a million elements, that would be bad. Unfortunately, we do that for NamedTuples:

julia> map(x->(rand(Bool) ? (a=1,) : (a=:a,)), 1:10)
10-element Array{NamedTuple{(:a,),Tuple{Any}},1}:
 NamedTuple{(:a,),Tuple{Any}}((1,)) 

Solving that seems quite tricky though.

@nalimilan
Copy link
Member Author

OK. So should we merge this then?

Regarding the fact that Base.promote_typejoin(typeof((a=1,)), typeof((a=:a,))) returns NamedTuple{(:a,),Tuple{Any}} (already on master), AFAICT it's relatively logical if we want promote_typejoin to return a named tuple type with Union{T, Missing} fields where it makes sense. Is that really a problem? Contrary to arrays, named tuples are immutable, so does changing their type require any additional computations? If that's a problem, I guess we could use a custom function similar to promote_typejoin which would leave unspecified the type of fields which would end up not being concrete or Union{T, Missing/Nothing} with T concrete.

vtjnash added a commit that referenced this pull request Feb 19, 2018
This also helps to re-synchronize `cat` and `map`

Closes #25924
@nalimilan nalimilan added the triage This should be discussed on a triage call label Feb 21, 2018
@vtjnash vtjnash closed this Feb 22, 2018
@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Feb 22, 2018
@JeffBezanson
Copy link
Member

Discussed on triage, and I think I agree with @vtjnash that map should just collect the exact values you give it and not meddle with their types. The union type issue is better solved upstream, by making named tuples with the element types you want in the first place.

@nalimilan
Copy link
Member Author

Wait, this PR isn't about map at all, it's about promotion and cat. As I said, I don't care a lot about this PR. OTOH #26109 is about map, and the things it reverts seem much more problematic to me.

vtjnash added a commit that referenced this pull request Feb 27, 2018
This also helps to re-synchronize `cat` and `map`

Closes #25924
@bramtayl
Copy link
Contributor

bramtayl commented Mar 17, 2018

I'm confused here. I've been working on a package called KeyedTables because I'd been given to understand that pure tuple implementations of namedtuples would be better able to handle inference. I'd gotten to this point:

julia> julia> z = keyed_tuple(x = [1, 2, missing], y = [missing, "a", "b"])
((.x, Union{Missing, Int64}[1, 2, missing]), (.y, Union{Missing, String}[missing, "a", "b"]))

julia> collect(rows(z))
3-element Array{Tuple{Tuple{Key{:x},Any},Tuple{Key{:y},Any}},1}:
 ((.x, 1), (.y, missing))
 ((.x, 2), (.y, "a"))
 ((.x, missing), (.y, "b"))

So I find myself in strong agreement with my personal downvote fairy @ararslan above.

This is especially confusing because

julia> Base.@default_eltype rows(z)
Tuple{Tuple{Key{:x},Union{Missing, Int64}},Tuple{Key{:y},Union{Missing, String}}}

@bramtayl
Copy link
Contributor

bramtayl commented Mar 17, 2018

What would people think of something like this?

function typejoin(@nospecialize(a), @nospecialize(b); recur = typejoin)
    # same but use recur instead of typejoin when recurring
end


promote_typejoin(@nospecialize(a), @nospecialize(b)) = typejoin(a, b, recur = promote_typejoin)

@vtjnash
Copy link
Member

vtjnash commented Mar 17, 2018

Tuple{Tuple{Key{:x},Union{Missing, Int64}},Tuple{Key{:y},Union{Missing, String}}}

I think you want Tuple{Pair{Key{:x}, Union{Missing, Int64}}, Pair{Key{:y}, Union{Missing, String}}} here. Is that possible?

@bramtayl
Copy link
Contributor

Um sure I can try with pairs would that improve inference?

@bramtayl
Copy link
Contributor

bramtayl commented Mar 17, 2018

Doesn't seem to really help:

julia>  z = keyed_tuple(x = [1, 2, missing], y = [missing, "a", "b"])
(.x => Union{Missing, Int64}[1, 2, missing], .y => Union{Missing, String}[missing, "a", "b"])

julia> collect(rows(z))
3-element Array{Tuple{Pair{Keys.Key{:x},B} where B,Pair{Keys.Key{:y},B} where B},1}:
 (.x => 1, .y => missing)
 (.x => 2, .y => "a")
 (.x => missing, .y => "b")

EDIT: Nope, pairs make things worse:

julia> Base.@default_eltype rows(z)
Tuple{Union{Pair{Keys.Key{:x},Missing}, Pair{Keys.Key{:x},Int64}},Union{Pair{Keys.Key{:y},Missing}, Pair{Keys.Key{:y},String}}}

@vtjnash
Copy link
Member

vtjnash commented Mar 17, 2018

rows(z) should also be returning an object of that type (Pair{Key{:x}, Union{T, Missing}}(key, val))

@bramtayl
Copy link
Contributor

Currently it's returning a generator?

@vtjnash
Copy link
Member

vtjnash commented Mar 18, 2018

Sure – I meant that it should generate objects of that specific type

@bramtayl
Copy link
Contributor

Oh yes, it should. That would definitely be nice, not currently happening in 0.7 tho.

@bramtayl
Copy link
Contributor

bramtayl commented Mar 18, 2018

Sorry don't mean to derail yet another thread but here's my function for rows

function rows(k)
    keys = map(key, k)
    Generator(
        values -> map(tuple, keys, values),
        zip(map(value, k)...)
    )
end

where key is just x -> x[1] and value is x -> x[2]

or with the pairs version

function rows(k)
    keys = map(key, k)
    Generator(
        values -> map(Pair, keys, values),
        zip(map(value, k)...)
    )
end

@vtjnash
Copy link
Member

vtjnash commented Mar 18, 2018

    values -> map(Pair, keys, values),

These are wrong: they drop the type parameters of the source. It's probably easier/clear (and more efficient) to implement this as a simple view-wrapper(*), rather than eagerly copying the elements to a (named)tuple. But here's the eager version anyways:

function rows(k)
    keys = map(key, k) # list of keys
    values_t = map(value, k) # list of Arrays of values
    Vs = map(eltype, values_t) # list of value Types
    element(V, key, value) = Pair{typeof(key), V}(key, value) # element constructor
    return (map(element, Vs, keys, values) for values in zip(values_t...))
end

(* if the original table was able to indexed with [row, :col], then this is perhaps just Transpose?)

@bramtayl
Copy link
Contributor

Oooo got it so if I ensure each element is the same type then I won't have to worry about promote_typejoin at all

@goretkin
Copy link
Contributor

Was this PR about removing the following "discrepancy"?

julia> [1, missing]
2-element Vector{Union{Missing, Int64}}:
 1
  missing

julia> [(1,), (missing,)]
2-element Vector{Tuple{Any}}:
 (1,)
 (missing,)

I don't understand why that's the behavior, instead of producing a Vector{Tuple{Union{Missing, Int64}}}.

@nalimilan
Copy link
Member Author

Yes that was the goal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
missing data Base.missing and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants