Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base.unzip() #13942

Open
ZacCranko opened this issue Nov 10, 2015 · 85 comments · May be fixed by #33515
Open

Base.unzip() #13942

ZacCranko opened this issue Nov 10, 2015 · 85 comments · May be fixed by #33515
Labels
feature Indicates new feature / enhancement requests

Comments

@ZacCranko
Copy link

Hi there,

apologies if this has already been addressed somewhere, but is there a reason that there is no unzip() function in Base?

Ideally this would be a function that would take a Vector{Tuple{ ... }} and return a Tuple{Vector, ..., Vector} for output. E.g.

julia> v = [(1,"a",:meow), (2,"b",:woof), (3,"c",:doh!)]; unzip(v)
([1,2,3],ASCIIString["a","b","c"],[:meow,:woof,:doh!])

A naive implementation might be something like

function unzip(input::Vector)
    n = length(input)
    types  = map(typeof, first(input))
    output = map(T->Vector{T}(n), types)

    for i = 1:n
       @inbounds for (j, x) in enumerate(input[i])
           (output[j])[i] = x
       end
    end
    return (output...)
end
@StefanKarpinski
Copy link
Member

We definitely need an unzip function in Base.

@ivarne
Copy link
Member

ivarne commented Nov 11, 2015

Or even simpler

unzip(a) = zip(a...) 

@yuyichao
Copy link
Contributor

No

@StefanKarpinski
Copy link
Member

Trying the zip(...) thing on any non-toy problem shows that it's not so good pretty fast, unfortunately.

@tlnagy
Copy link
Contributor

tlnagy commented May 6, 2016

I just ran into this issue when I absentmindedly used the Python-esque zip(arr...) where length(arr)=6000. It was not pretty. A dedicated unzip would be nice. Maybe even throw a warning that you shouldn't splat anything nontrivial.

@yuyichao
Copy link
Contributor

yuyichao commented May 6, 2016

Maybe even throw a warning that you shouldn't splat anything nontrivial.

Like almost all other performance tips in julia, doing so is perfectly fine and sometimes wanted for various reasons. (OK, splatting 6000 elements is a bit crazy, but splatting unknown size can be useful....). The only thing that matters is to not put it in performance critical path.

It looks like this is not covered in the performance tip?

@tlnagy
Copy link
Contributor

tlnagy commented May 6, 2016

Like almost all other performance tips in julia, doing so is perfectly fine and sometimes wanted for various reasons. (OK, splatting 6000 elements is a bit crazy, but splatting unknown size can be useful....). The only thing that matters is to not put it in performance critical path.

Yeah, a warning is a bad idea.

It looks like this is not covered in the performance tip?

I think this should also be marked under differences from other languages. People coming from Python are probably used to the zip(*arr) syntax and zip(arr...) might be a tempting target.

@hayd
Copy link
Member

hayd commented May 6, 2016

You can dispatch only on arrays of tuples:

function unzip{T<:Tuple}(A::Array{T})
    res = map(x -> x[], T.parameters)
    res_len = length(res)
    for t in A
        for i in 1:res_len
            push!(res[i], t[i])
        end
    end
    res
end

@StefanKarpinski
Copy link
Member

Bump, we still need and unzip function.

@StefanKarpinski StefanKarpinski added help wanted Indicates that a maintainer wants help on an issue or pull request good first issue Indicates a good issue for first-time contributors to Julia labels Feb 7, 2017
@StefanKarpinski StefanKarpinski added this to the 1.0 milestone Feb 7, 2017
@StefanKarpinski
Copy link
Member

BUMP if anyone is bored...

@JeffreySarnoff
Copy link
Contributor

function unzip(zipped::Base.Iterators.Zip2{Array{T1,1}, Array{T2,1}}) where {T1,T2}
    n = length(zipped)
    vec1 = Vector{T1}(n)
    vec2 = Vector{T2}(n)
    i = 1
    for pair in enumerate(zipped)
        vec1[i], vec2[i] = pair[2]
        i = i+1
    end
    return vec1, vec2
end

@JeffreySarnoff
Copy link
Contributor

maybe <= 1.5x is

function unzip(zipped::Base.Iterators.Zip2{Array{T1,1}, Array{T2,1}}) where {T1,T2}
    return map(first,zipped), map(last,zipped)
 end

@martinholters
Copy link
Member

unzip(zipped::Base.Iterators.Zip2) = (zipped.a, zipped.b)

But all those differ in how they treat length(a) != length(b). I think I'd favor the behavior of the first.

This trivial case apart, the desired semantics are not absolutely clear to me: Given an iterator itr, should unzip(itr) return an iterator unzipped, that generates iterators unzipped[1] to unzipped[N], where N is the length of the iterators generated by itr? What if the first 10000 values generated by itr are Tuples of length N==3, but then comes a Tuple of length 2? Throw when iterating at least one unzipped[n] up to that point? Or eagerly collect itr the moment unzip is called?

Think about

unzip(f(n) for n in Base.Iterators.countfrom(0))

for different fs to get an idea of my open questions.

@spalato
Copy link

spalato commented Mar 19, 2018

There seems to be multiple use cases for an unzip, with different requirements.

A common case is involves Arrays of Tuples, in which the size is known and array shape must be preserved. Some suggestions here do not obey this constraint.

I have come up with a solution for this specific case using Base.Cartesian. It's up to twice as fast as the alternative commonly suggested alternative for NTuples, and more than 10x for heterogeneous tuples.
See: https://github.com/spalato/Destruct.jl
Feel free to use and steal.

@bramtayl
Copy link
Contributor

Can we get this for generalized iterators and not just arrays? Basically just a collect_many modeled after collect?

@bramtayl
Copy link
Contributor

I could give it a stab? The trickiest bit (I think) will be getting an equivalent of default_eltype for zero length eltype unknown iterators

@StefanKarpinski
Copy link
Member

Go for it, @bramtayl!

@bramtayl
Copy link
Contributor

bramtayl commented Nov 16, 2018

Tada, this seems to work. Great excuse not to finish my grading.

struct Unzip{Iterator}
    iterator::Iterator
end

iterate(iterator::Unzip) = iterate(iterator.iterator)
iterate(iterator::Unzip, state) = iterate(iterator.iterator, state)

IteratorEltype(iterator::Unzip) = IteratorEltype(iterator.iterator)
eltype(iterator::Unzip) = eltype(iterator.iterator)

IteratorSize(iterator::Unzip) = IteratorSize(iterator.iterator)
axes(iterator::Unzip) = axes(iterator.iterator)
size(iterator::Unzip) = size(iterator.iterator)
length(iterator::Unzip) = length(iterator.iterator)

struct ModelArray{ElementType, NumberOfDimensions, Model, Rest <: Tuple} <:
    AbstractArray{ElementType, NumberOfDimensions}
    model::Model
    rest::Rest
end

model_array(model, rest...) =
    ModelArray{
        Tuple{eltype(model), eltype.(rest)...},
        ndims(model),
        typeof(model),
        typeof(rest)
    }(model, rest)

axes(array::ModelArray) = axes(array.model)
size(array::ModelArray) = size(array.model)

IndexStyle(array::ModelArray) = IndexLinear()

@propagate_inbounds getindex(array::ModelArray, index::Int) = (
    getindex(array.model, index),
    getindex.(array.rest, index)...
)

@propagate_inbounds setindex!(array::ModelArray, value::Tuple, index::Int) = (
    setindex!(array.model, value[1], index),
    setindex!.(array.rest, tail(value), index)...
)

push!(array::ModelArray, value::Tuple) = (
    push!(array.model, value[1]),
    push!.(array.rest, tail(value))
)

# TODO: use fieldtypes instead of value_field_types in a future with more constant propagation
value_field_types(a_type) =
    ntuple(index -> Val(fieldtype(a_type, index)), fieldcount(a_type))

function similar(
    array::Unzip,
    ::Type{ElementType},
    dims::Dims
) where {ElementType, NumberOfDimensions}

    inner_array(::Val{InnerElementType}) where InnerElementType =
        Array{InnerElementType}(undef, dims)
    model_array(inner_array.(value_field_types(ElementType))...)
end

# Unzip needs to replicate AbstractArray similar machinery
similar(a::Unzip, ::Type{T}) where {T}                     = similar(a, T, to_shape(axes(a)))
similar(a::Unzip{T}, dims::Tuple) where {T}                = similar(a, T, to_shape(dims))
similar(a::Unzip{T}, dims::DimOrInd...) where {T}          = similar(a, T, to_shape(dims))
similar(a::Unzip, ::Type{T}, dims::DimOrInd...) where {T}  = similar(a, T, to_shape(dims))
similar(a::Unzip, ::Type{T}, dims::Tuple{Union{Integer, OneTo}, Vararg{Union{Integer, OneTo}}}) where {T} = similar(a, T, to_shape(dims))

similar(array::ModelArray, ::Type{ElementType}, dims::Dims) where ElementType =
    similar(Unzip(nothing), ElementType, dims)

_similar_for(c::Unzip{Nothing}, T, itr, ::SizeUnknown) =
    similar(c, T, 0)
_similar_for(c::Unzip{Nothing}, T, itr, ::HasLength) =
    similar(c, T, Int(length(itr)::Integer))
_similar_for(c::Unzip{Nothing}, T, itr, ::HasShape) =
    similar(c, T, axes(itr))

collect(iterator::Unzip) = _collect(
    Unzip(nothing),
    iterator,
    IteratorEltype(iterator),
    IteratorSize(iterator)
)

# Examples

Generator(x -> (x, x + 1.0), [1]) |> Unzip |> collect
Generator(x -> (x, x + 1.0), [1, missing]) |> Unzip |> collect
zip([1], [1.0]) |> Unzip |> collect
[(1, 1.0)] |> Unzip |> collect
Filter(x -> x[1] == x[2], [(1, 1.0)]) |> Unzip |> collect
``

@bramtayl
Copy link
Contributor

The above strategy turns out to be extremely sub-optimal for EltypeUnknown iterators, but it still works. I'm not sure I have the programming wherewithall to design a better system.

@bramtayl
Copy link
Contributor

I'm thinking it would probably just need specialized ::ModelArray methods for grow_to! and collect_to! to be efficient, I'll see what I can cook up

@bramtayl
Copy link
Contributor

bramtayl commented May 9, 2020

I took at stab at simplifying the code in my gist. I made a little headway but not much. If it's any comfort, many of the functions (map_unrolled, partial_map, zip_missing) would fit nicely into a unexported mini-module for unrolling tuple operations. I've been kicking around a plan for that for a long time.

@oxinabox
Copy link
Contributor

oxinabox commented Sep 7, 2020

It would be really nice to get this solved for 1.6.

@JeffreySarnoff
Copy link
Contributor

@StefanKarpinski I remember that you had a full grasp of this -- is anything of deep import holding it back?

@StefanKarpinski
Copy link
Member

Only performance, Iirc.

@ghost
Copy link

ghost commented Sep 8, 2020

@JeffreySarnoff, I may be misremembering some things, but I think there were some unresolved questions aside from performance:

  1. What to do if tuples have different length?
    • Fill in missings?
    • Throw error?
  2. What to do if the same field has different types in different tuples?
    • Error?
    • Promote to Any?
    • Promote to smallest possible Union?
    • Promote to parent type?
    • Widen numeric types?
  3. Lazy or strict semantics?
    • In particular, do we want unzip of zip be the identity? (Rather than throwing an error if the input arrays are of different size, as zip currently does).

Maybe some other things too, but it's been a while since I read through the entire thread.

Note that these issues only really arise in TypeUnknown iterators. For homogeneous tuple arrays there is the Destruct package, which implements a simple and efficient unzip operation.

@bramtayl
Copy link
Contributor

bramtayl commented Sep 8, 2020

#33324 seems pretty good to me still. If we want to simplify the code I think the first step would be to solve #31909

@bramtayl
Copy link
Contributor

bramtayl commented Sep 9, 2020

I wrote up https://github.com/bramtayl/Unzip.jl and it should be registered 3 days from now.

@JeffreySarnoff
Copy link
Contributor

JeffreySarnoff commented Sep 9, 2020

afaic a clear candidate after looking at dump(zip(avec, bvec))

  • unzip that which is zipped
	unzip(x::Base.Iterators.Zip) = x.is

@bramtayl
Copy link
Contributor

bramtayl commented Sep 9, 2020

Feel free to open up a PR over there, but I'm not so sure about that method. The behavior in my version unzip is to collect new vectors, not pass along old ones. z.is also pretty easy to type if you already have a zip.

@ghost
Copy link

ghost commented Sep 9, 2020

iknow uknow this is furiously obvious

* `unzip` that which is `zip`ped
	unzip(x::Base.Iterators.Zip) = z.is

How is this obvious? It's one possible way to define unzip of zip, and has the (IMO major) disadvantage that unzip is not guaranteed to return arrays of equal length.

@JeffreySarnoff
Copy link
Contributor

JeffreySarnoff commented Sep 9, 2020

(it works better with z.is replaced by x.is)
It will return the arrays that were given to zip.
How is that a major disadvantage for unzip?
I conceived that its essential purpose is to recover what zip got.

Certainly, the unzip of a collected zip operates on equilength firsts and .. and lasts, irrespective of what zip got. And that is appropriate, as one can only process information that exists when the processing occur. The most performant approaches to unzip(x::Vector{Tuple{_}}) cannot be expected to perform as quickly as does doing x.is.

To forgo that very high throughput advancing some computation because it provides unequal length vectors iff unequal length vectors were given to zip seems unnecessary. Where that is undesirable, one may apply e.g.rtrim_longer_vecs(unzip(zipped)) and obtain equi-sized vectors.

@JeffreySarnoff
Copy link
Contributor

JeffreySarnoff commented Sep 9, 2020

@zzyxyzz I meant "obvious" in the sense that this is what dump(zip(a,b)) shows. I had not known about the inner structure of our zips until today .. so, I revised that phrase: "afaic a clear candidate after looking at dump(zip(avec, bvec))".

@mschauer
Copy link
Contributor

mschauer commented Sep 9, 2020

return arrays of equal length

good point

@bramtayl
Copy link
Contributor

bramtayl commented Sep 10, 2020

Maybe parents would be a good name for what function (x::Base.Iterators.Zip) = x.is end does

@ghost
Copy link

ghost commented Sep 10, 2020

@JeffreySarnoff, I'm not saying that making unzip(zip(...)) a no-op is necessarily a bad idea. It's just that unzip has a surprising number of loose ends and subtle trade-offs for something that is so simple in principle. This is marked as "good first issue", but it really isn't. Over the years several people have taken a stab at it (@bramtayl has come furthest), but it's no accident that there is still no generally accepted solution.

@ghost
Copy link

ghost commented Sep 10, 2020

Maybe this issue should be renamed, following Julia tradition, "Taking unzip seriously". 😛

@bramtayl
Copy link
Contributor

Registered!

@JeffreySarnoff
Copy link
Contributor

For unzip(x::Base.Iterators.Zip) where the lengths of x.is are equal, would you rather return x.is or a copies of the constituents?

@ghost
Copy link

ghost commented Sep 13, 2020

@JeffreySarnoff

For unzip(x::Base.Iterators.Zip) where the lengths of x.is are equal, would you rather return x.is or a copies of the constituents?

My personal take on this:

In any decent implementation, unzip should call its source iterator exactly once for each input tuple. This forces unzip to construct the output arrays in lock-step; it cannot finish the first array before starting on the second. Thus, unzip cannot act as a lazy iterator in any reasonable sense. It is fundamentally a collect-like operation. So the safe, consistent thing to do would be to emulate the behavior of collect and always return freshly allocated arrays.

But I do admit that the temptation to special-case unzip(::Base.Iterators.Zip) is strong. And it wouldn't be the first time that safe, predictable behavior is sacrificed for performance in Julia. 😄

@vtjnash vtjnash removed good first issue Indicates a good issue for first-time contributors to Julia help wanted Indicates that a maintainer wants help on an issue or pull request labels Apr 8, 2021
@o314
Copy link
Contributor

o314 commented Jan 28, 2022

No post in 2021 ? So let's give it a try now! HNY 2022 Unzip !

Multidispatch-friendly zen of unzip in julia

@ https://gist.github.com/o314/214e26c6fb70512b56597d633dd87e6f
see #13942

OK unzip(a) = zip(a...) fails in Julia
But Zen of python is great - ok, it's somewhat a lie, it's also great for julia !
So let's try to bring correct simple things first, and (maybe) complicated ones later or away.

FWIW, my good-enough-poor-man, but usable in prod, isn't it, unzip considering than

  • writing 5 lines > waiting 5 years
  • not optimized for nonexistant use case of unzip (for streaming or tensorflow or whatever)
  • third party pkg for 5 lines is a joke or an industrial hazard (software bom somebody?)
  • con : 5 sloc of src; 50 lines of test. so what?
using Test
import Base.Iterators as _I

unzip(s...) = unzip(collect(s))
unzip(vs::Vector{<:Vector}) =
    let M=length(vs), N=mapfoldl(length, min, vs); # todo remove me when SVector is in Base
        ([vs[i][j] for i in 1:M] for j in 1:N)
    end
unzip(a::Vector{<:Pair}) = [k for (k,_) in a], [v for (_,v) in a]

TEST

import Base.Iterators as _I
using Test
# zipdata(M,N) = let v=collect(1:M), vt=ntuple(N) do _; copy(v) end; vt end
data(M,N) = ntuple(M) do i; fill(i,N) end
data(N) = let ks=_I.take(_I.cycle('a':'z'), N), vs=(1:N...,); (k=>v for (k,v) in zip(ks,vs)) end

VALIDITY TEST

# unzip of vector
@test data(5,3) == ([1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5])
@test unzip(data(5,3)...) |> collect == ([1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]) |> collect

# unzip of pair vector
@test data(5) |> collect == ('a'=>1, 'b'=>2, 'c'=>3, 'd'=>4, 'e'=>5) |> collect
@test unzip(data(5) |> collect) |> collect == (['a','b','c','d','e'], [1,2,3,4,5]) |> collect

SPEED TEST

# unzip of vector
julia> @time unzip(data(1000,3)...);
  0.029086 seconds (42.07 k allocations: 2.766 MiB, 99.28% compilation time)

julia> @time unzip(data(1_000_000,3)...);
  1.507531 seconds (4.04 M allocations: 223.797 MiB, 18.21% gc time, 72.32% compilation time)

julia> @time unzip(data(1_000_000,3)...);
  0.294386 seconds (1.00 M allocations: 152.588 MiB)

julia> @time unzip(data(1000,50)...);
  0.000727 seconds (1.01 k allocations: 531.922 KiB)

julia> @time unzip(data(1_000_000,50)...);
  1.082615 seconds (1.00 M allocations: 518.799 MiB, 48.09% gc time)

julia> @time unzip(data(1_000_000,50)...);
  0.527460 seconds (1.00 M allocations: 518.799 MiB)

# unzip of pair vector
julia> @time unzip(data(1000));
  2.728774 seconds (166.12 k allocations: 10.524 MiB, 99.98% compilation time)

julia> @time unzip(data(1000));
  0.000334 seconds (2.00 k allocations: 116.375 KiB)

julia> @time unzip(data(1_000_000) |> collect);              # BUG wo collect
  0.634841 seconds (3.50 M allocations: 178.888 MiB, 18.39% gc time, 57.41% compilation time)

@LilithHafner
Copy link
Member

Hmm... @o314, that's not working for me:

julia> unzip(zip(1:10, 1:10))
Base.Generator{UnitRange{Int64}, var"#7#9"{Vector{Vector{Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}}}, Int64}}(var"#7#9"{Vector{Vector{Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}}}, Int64}(Vector{Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}}[[zip(1:10, 1:10)]], 1), 1:1)

julia> for i in unzip(zip(1:10, 1:10))
           println(i)
       end
Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}[zip(1:10, 1:10)]

julia> x = collect(zip(1:5, 2:2:10))
5-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (2, 4)
 (3, 6)
 (4, 8)
 (5, 10)

julia> collect(unzip(x))
5-element Vector{Vector{Tuple{Int64, Int64}}}:
 [(1, 2)]
 [(2, 4)]
 [(3, 6)]
 [(4, 8)]
 [(5, 10)]

@complyue
Copy link

complyue commented Mar 4, 2022

Per 1.7.2, the zip(...) trick would not even scale to 1k!

$ julia --version
julia version 1.7.2
$ time julia -e 'zip(collect([3,2,5] for _ in 1:10)...)|>collect' 
julia -e 'zip(collect([3,2,5] for _ in 1:10)...)|>collect'  0.38s user 0.09s system 98% cpu 0.474 total
$ time julia -e 'zip(collect([3,2,5] for _ in 1:1000)...)|>collect'
ERROR: StackOverflowError:
Stacktrace:
 [1] _zip_iterate_interleave(xs1::NTuple{980, Tuple{Int64, Int64}}, xs2::Tuple{}, ds::NTuple{980, Missing})
   @ Base.Iterators ./iterators.jl:368
 [2] _zip_iterate_interleave (repeats 20 times)
   @ ./iterators.jl:369 [inlined]
 [3] _zip_iterate_all(is::NTuple{1000, Vector{Int64}}, ss::NTuple{1000, Tuple{}})
   @ Base.Iterators ./iterators.jl:354
 [4] iterate
   @ ./iterators.jl:340 [inlined]
 [5] copyto!(dest::Vector{NTuple{1000, Int64}}, src::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}})
   @ Base ./abstractarray.jl:890
 [6] _collect(cont::UnitRange{Int64}, itr::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}}, #unused#::Base.HasEltype, isz::Base.HasShape{1})
   @ Base ./array.jl:655
 [7] collect(itr::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}})
   @ Base ./array.jl:649
 [8] |>(x::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}}, f::typeof(collect))
   @ Base ./operators.jl:966
julia -e 'zip(collect([3,2,5] for _ in 1:1000)...)|>collect'  130.60s user 3.81s system 99% cpu 2:14.47 total

Hell I miss an unzip!

@DilumAluthge DilumAluthge removed this from the 1.x milestone Mar 13, 2022
@brenhinkeller brenhinkeller added the feature Indicates new feature / enhancement requests label Nov 19, 2022
@ParadaCarleton
Copy link
Contributor

Someday unzip will work. Eventually 😅

@adienes
Copy link
Contributor

adienes commented Feb 26, 2024

note that since #50435 was merged, there is precedent to throw on follow-up operations when zipped iterators have unequal lengths. so many of the previous concerns upthread about the desire to & controversy of defining unzip(z::Zip) = zip.is are addressed & no longer super relevant

w.r.t. returning the original iterators or copies, I definitely prefer not to copy. all unzip needs to promise is a tuple of iterators, and if the user wants to copy when the method would otherwise not she can always write collect.(unzip(...))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Indicates new feature / enhancement requests
Projects
None yet
Development

Successfully merging a pull request may close this issue.