Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add docstring for keys(::AbstractArray) #36073

Merged
merged 2 commits into from
Apr 21, 2021
Merged

Add docstring for keys(::AbstractArray) #36073

merged 2 commits into from
Apr 21, 2021

Conversation

nalimilan
Copy link
Member

This is helpful for users, and it matters because some Base functions rely on keys(::AbstractArray) returning a collection of integer ranges (e.g. hash).

Cc: @bkamins

This is helpful for users, and it matters because some Base functions
rely on `keys(::AbstractArray)` returning a collection of integer ranges
(e.g. `hash`).
@nalimilan nalimilan added the docs This change adds or pertains to documentation label May 29, 2020
base/abstractarray.jl Outdated Show resolved Hide resolved
@mbauman
Copy link
Member

mbauman commented May 29, 2020

some Base functions rely on keys(::AbstractArray) returning a collection of integer ranges (e.g. hash).

Are you thinking of axes?

@bkamins
Copy link
Member

bkamins commented May 29, 2020

If I understand what @naliman wants to point that hash(A::AbstractArray, h::UInt) actually requires keys to return numeric indices.

This distinction is relevant for arrays that allow for other indexing except numbers (like NamedArrays.jl) - and the question is if such arrays can return those "other indices" though keys. And we came to the conclusion that it is not allowed - but maybe it is not intended?

@nalimilan
Copy link
Member Author

Thanks for the suggestion.That's fine with me too.

So keys is supposed to be "efficient", but maybe not as efficient as eachindex, right? That's how I understood it. In that case, keys for NamedArray and similar type should definitely return integer or cartesian indices rather than names. It's a bit weird that keys(::Dict) doesn't return an efficient index type (holding a reference to the slot), but I guess that's not the end of the world.

hash(a::AbstractArray) (currently) relies on the fact that keys(a) returns something that can be passed to LinearIndices, so it can't return e.g. (["a", "b"], ["A", "B"]) for a NamedMatrix.

@mbauman
Copy link
Member

mbauman commented May 29, 2020

Fascinating. I'm not sure anyone has tried doing this yet — and we don't have a well-defined boundary for what needs to be overridden if keys(::AbstractArray) are something other than an integer or CartesianIndex (or if that's possible at all).

wants to point that hash(A::AbstractArray, h::UInt) actually requires keys to return numeric indices.

That's not quite how hash uses keys; I think it just depends on these two rough requirements:

  • LinearIndices contains integers

  • keys(a) are valid not just for a but also for LinearIndices(a) and CartesianIndices(a). In fact, I think all three of these should be valid keys for each other! For example, a rough relation that I have in my head is:

    a[keys(a)]==a && LinearIndices(a)[keys(a)] == LinearIndices(a) && CartesianIndices(a)[keys(a)] == CartesianIndices(a)

I'd be inclined to help make this work rather than simply documenting that it doesn't. In fact, NamedArrays might be in good stead for this usage if it provides NamedLinearIndices and NamedCartesianIndices and sneakily overrides the LinearIndices(::NamedArray) constructor to pass back its named version.

@mbauman
Copy link
Member

mbauman commented May 29, 2020

At the same time, keys(::AbstractVector)::AbstractVector{<:Integer} is probably too well-entrenched to do anything about it thanks to downstream uses of find.

@mbauman
Copy link
Member

mbauman commented May 29, 2020

So keys is supposed to be "efficient", but maybe not as efficient as eachindex, right?

I only put that in there to assuage the thought that it might return a dense Array of indices. Note that we're only documenting the ::AbstractArray method here; ::AbstractDict and ::Any have their own docstrings.

@bkamins
Copy link
Member

bkamins commented May 29, 2020

Well - not knowing these details we already made DataFrameColumns object to be an AbstractVector and have keys other than integers, but at the same time it seems we are consistent what is required:

julia> df = DataFrame(rand(2,3))
2×3 DataFrame
│ Row │ x1       │ x2       │ x3       │
│     │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.276473 │ 0.961011 │ 0.329211 │
│ 2   │ 0.193074 │ 0.643824 │ 0.292972 │

julia> e = eachcol(df)
2×3 DataFrameColumns
│ Row │ x1       │ x2       │ x3       │
│     │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.276473 │ 0.961011 │ 0.329211 │
│ 2   │ 0.193074 │ 0.643824 │ 0.292972 │

julia> keys(e)
3-element Array{Symbol,1}:
 :x1
 :x2
 :x3

julia> LinearIndices(keys(e))
3-element LinearIndices{1,Tuple{Base.OneTo{Int64}}}:
 1
 2
 3

@mbauman
Copy link
Member

mbauman commented May 29, 2020

Nice, that's great! That's precisely the sort of usage that I think is acceptable (and beyond that, it's great!) and I don't want to document it away.

@nalimilan
Copy link
Member Author

@bkamins AFAICT DataFrameColumns doesn't satisfy the properties that @mbauman mentioned above. With your example, e[keys(e)], LinearIndices(e)[keys(e)] and CartesianIndices(e)[keys(e)] all fail. Even findfirst(x->false, e) fails actually. Not sure what would be the best solution... We could probably fix getindex and findfirst (via nextind), but should we really implement custom LinearIndices and CartesianIndices methods?

@bkamins
Copy link
Member

bkamins commented May 29, 2020

Ah - you are right. I was looking at hash, but not deep enough. Unfortunately now:

julia> df = DataFrame(a=1,b=2)
1×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │

julia> hash(eachcol(df))
ERROR: ArgumentError: invalid index: :b of type Symbol

I will open an issue for this in DataFrames.jl

@nalimilan
Copy link
Member Author

I just noticed that keys should probably also be documented in the Interfaces section for AbstractArray, right?

BTW, @mbauman, how do attempts to improve the handling of axis names in axes like JuliaArrays/AxisArrays.jl#81 fit in this discussion? Should axes and keys differ in that regard?

@bkamins
Copy link
Member

bkamins commented May 30, 2020

I just noticed that keys should probably also be documented in the Interfaces section for AbstractArray, right?

Actually this is what I have also checked today :).

I think that the key thing that would be great to have decided is the rule for keys for "array-like" collections that allow "dual" indexing (with integers and with something else).

The issue is that - at least in DataFrames.jl - we found it more convenient for keys to return names and pairs return name => value mappings for such types. This is exactly what NamedTuple type does now, but it is inconsistent with what Base now assumes for arrays.

@bkamins
Copy link
Member

bkamins commented Jun 8, 2020

So what is the decision here? Just to confirm, so that we can take appropriate steps in DataFrames.jl.

Is the contract that keys on AbstractVector must return a collection of integers? Thank you!

@StefanKarpinski
Copy link
Member

@mbauman, what's your take?

@mbauman
Copy link
Member

mbauman commented Jun 11, 2020

Good question.

Is the contract that keys on AbstractVector must return a collection of integers?

This is definitely the "safe" thing to do, and maybe it's what we need to demand. I was hoping that we weren't too far from making this possible, but I think it's too tricky to try to officially support at the moment.

@vtjnash vtjnash added the merge me PR is reviewed. Merge when all tests are passing label Apr 20, 2021
@vtjnash vtjnash merged commit 9418acc into master Apr 21, 2021
@vtjnash vtjnash deleted the nl/keys branch April 21, 2021 18:47
@simeonschaub simeonschaub removed the merge me PR is reviewed. Merge when all tests are passing label Apr 28, 2021
ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
This is helpful for users, and it matters because some Base functions may
rely on these guarantees for their correct behavior (e.g. `hash`).

Co-authored-by: Matt Bauman <[email protected]>
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request May 9, 2021
This is helpful for users, and it matters because some Base functions may
rely on these guarantees for their correct behavior (e.g. `hash`).

Co-authored-by: Matt Bauman <[email protected]>
johanmon pushed a commit to johanmon/julia that referenced this pull request Jul 5, 2021
This is helpful for users, and it matters because some Base functions may
rely on these guarantees for their correct behavior (e.g. `hash`).

Co-authored-by: Matt Bauman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs This change adds or pertains to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants