Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing DataAPI.defaultarray support #27

Closed
bkamins opened this issue Jun 26, 2023 · 12 comments · Fixed by #28
Closed

Missing DataAPI.defaultarray support #27

bkamins opened this issue Jun 26, 2023 · 12 comments · Fixed by #28

Comments

@bkamins
Copy link

bkamins commented Jun 26, 2023

See JuliaData/DataFrames.jl#3351 (comment).

This is needed for ReasStatTables.jl to interoperate seamlessly with Tables.jl.

@junyuan-chen
Copy link
Owner

Thanks for reporting this. I noticed this issue but haven't got a chance to fix it, although this is definitely something that should be addressed. I will need to figure out a more robust way to deal with the value labels in this case.

@bkamins
Copy link
Author

bkamins commented Jun 26, 2023

Note that the standard way to use this functionality is:

output_vector = Tables.allocatecolumn(T, total_length) # this needs to have a correct type and uses DataAPI.defaultarray internally
offset = 1 # Tables.jl assumes 1-based indexing
for v in iterator_of_source_vectors
    copyto!(output_vector, offset, v) # note that it is important that copyto! is properly implemented for output_vector type so that it correctly "adds labels" as needed across values stored in vectors that constitute iterator_of_source_vectors
    offset += length(v)
end

The point of this approach is to ensure efficiency of concatenation operation (you allocate output_vector once and then store values in it; but it is the responsibility of the output_vector type that this is correctly handled and efficient).

@bkamins
Copy link
Author

bkamins commented Jun 26, 2023

And to add - the general idea of this design is to ensure that proper output_vector type is allocated depending on type T that is defined as:

T = mapreduce(eltype, promote_type, iterator_of_source_vectors)

(the common case is that all vectors have the same element type)

@junyuan-chen
Copy link
Owner

@bkamins The difficulty I encountered earlier was that I didn't find a simple way to inform output_vector the value labels (a Dict) associated with the source vector. I would need to use something like Tables.allocatecolumn(v, total_length) with v being an instance of the vector instead of just the type. The type alone doesn't provide the full information needed for building the new output_vector here.

@bkamins
Copy link
Author

bkamins commented Jun 26, 2023

This is what I expected. The API was designed to be type-based only.
That is why I highlighted that copyto! would need to properly "build" the Dict you mentioned while concrete values are being added to the vector. Is this doable?

@junyuan-chen
Copy link
Owner

Ah, I just saw your comment above with copyto!. The simplest solution seems to be adding new methods with copyto!. That's totally doable!

@bkamins
Copy link
Author

bkamins commented Jun 26, 2023

This is the original intention 😄. Thank you!

@bkamins
Copy link
Author

bkamins commented Jun 26, 2023

In general @nalimilan went through the same process in CategoricalArrays.jl so maybe he would have some more comments.

@nalimilan
Copy link
Contributor

Yes, you can take inspiration from https://github.com/JuliaData/CategoricalArrays.jl/blob/246e8f0c86fbc08a664266c9e0edffbc24727298/src/array.jl#L524-L647

That's relatively straightforward, but there is an ambiguity with SentinelArrays (JuliaData/CategoricalArrays.jl#369), which requires a special method using an extension: https://github.com/JuliaData/CategoricalArrays.jl/blob/246e8f0c86fbc08a664266c9e0edffbc24727298/ext/CategoricalArraysSentinelArraysExt.jl

This also raised a deeper issue in CategoricalArrays: should copyto!(y, x) be equivalent to for i in eachindex(x, y); x[i] = y[i]; end when x is a CategoricalArray and y = similar(x)? The question is how to handle levels, to ensure that setindex!(x::CategoricalArray, i::Integer, v::CategoricalValue) applies levels from v to x. That's not too hard, but it took quite some work to find an efficient approach which caches in x the levels of the last assigned value so that the check is very cheap in a for loop (JuliaData/CategoricalArrays.jl#337). Similar reflections may apply to LabeledArray, even if you may not need an efficient implementation of setindex! right now as in most cases copyto! is enough.

Then there's an additional complication regarding what to do if people combine values from different LabeledArrays (with copyto!, vcat or setindex!). CategoricalArrays tries to identify an ordered superset of levels. LabeledArray could do the same, but that's probably not super common so low priority. :-)

@junyuan-chen
Copy link
Owner

@bkamins @nalimilan Thank you so much for the helpful comments here!

I did something as simple as possible here: #28. The logic is to simply update the value labels associated with the destination LabeledArray in case the source array has a different set of value labels. Unlike a CategoricalArray, a new value in a LabeledArray does not automatically have a label unless the label is explicitly inserted into that label Dict. This behavior mimics Stata. When a value does not have a value label defined, the label displayed is simply the value itself.

@bkamins
Copy link
Author

bkamins commented Jun 28, 2023

@greimel - could you please test if #28 resolves your issues?

@greimel
Copy link

greimel commented Jun 28, 2023

It does! Thanks @junyuan-chen and @bkamins!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants