Flattening of strings #56

Open
sprmnt21 opened this issue Apr 30, 2022 · 11 comments
Labels
bug (Something isn't working), decision, enhancement (New feature or request)

Comments

@sprmnt21

I wonder whether the result of the flatten function in these cases is what most users would expect.
Is there any reason not to treat strings (even empty ones) as scalars in the context of the flatten function, or is the current behaviour known to be preferable?

julia> df = Dataset(A=[1,2,3,4], B=[["a","b"], "", "b", ["a","c"]])
4×2 Dataset
 Row │ A         B
     │ identity  identity
     │ Int64?    Any
─────┼──────────────────────
   1 │        1  ["a", "b"]
   2 │        2
   3 │        3  b
   4 │        4  ["a", "c"]

julia> flatten(df, :B)
5×2 Dataset
 Row │ A         B        
     │ identity  identity
     │ Int64?    Any
─────┼────────────────────
   1 │        1  a
   2 │        1  b
   3 │        3  b
   4 │        4  a
   5 │        4  c

julia> df = Dataset(A=[1,2,3,4], B=[["a","b"], "pippo", "b", ["a","c"]])
4×2 Dataset
 Row │ A         B
     │ identity  identity
     │ Int64?    Any
─────┼──────────────────────
   1 │        1  ["a", "b"]
   2 │        2  pippo
   3 │        3  b
   4 │        4  ["a", "c"]

julia> flatten(df, :B)
10×2 Dataset
 Row │ A         B        
     │ identity  identity
     │ Int64?    Any
─────┼────────────────────
   1 │        1  a
   2 │        1  b
   3 │        2  p
   4 │        2  i
   5 │        2  p
   6 │        2  p
   7 │        2  o
   8 │        3  b
   9 │        4  a
  10 │        4  c
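
The outputs above follow from how Base Julia treats strings: a string is an iterable collection of characters with a `length`, so anything that flattens by iteration expands "pippo" into five rows and drops "" entirely. The sketch below uses only Base Julia, not InMemoryDatasets; `items` and `flatten_values` are hypothetical helper names introduced here to illustrate the "strings as scalars" alternative, not part of any package.

```julia
# In Base Julia a string is a collection of characters.
n_chars = length("pippo")          # 5
chars   = collect("pippo")         # ['p', 'i', 'p', 'p', 'o']
n_empty = length("")               # 0, so an empty string yields no elements

# Hypothetical helper: decide what counts as "one value" when flattening.
items(x::AbstractString) = (x,)    # treat a string as a scalar
items(x::AbstractVector) = x       # expand vectors element by element
items(x) = (x,)                    # any other value is a scalar

# Hypothetical helper: flatten a column using the rule above.
flatten_values(col) = [v for x in col for v in items(x)]

B = Any[["a", "b"], "", "pippo", ["a", "c"]]
flat = flatten_values(B)
```

With this rule the empty string and "pippo" each survive as a single value instead of disappearing or being split into characters.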

sprmnt21 changed the title from "Flatten of strings" to "Flattening of strings" on Apr 30, 2022
@monopolynomial

Is it possible (or useful) to support mapformats in flatten!? It would be very useful for the example from @sprmnt21.

@sl-solution sl-solution added enhancement New feature or request question Further information is requested decision labels May 2, 2022
@sl-solution
Owner

Is there any reason not to treat strings (even empty ones) as scalars in the context of the flatten function, or is the current behaviour known to be preferable?

This is Julia's behaviour; however, I don't like it. Maybe adding a keyword argument would be a good idea?

In the long term, I am interested in having a fixed-width String type (probably a better implementation of Characters) for InMemoryDatasets that treats an empty String as missing, since I think zero-length and/or empty strings should be treated as missing values in a data analysis workflow.

@sl-solution
Owner

Is it possible (or useful) to support mapformats in flatten!? It would be very useful for the example from @sprmnt21.

I moved this post to #57, so it is easier to track.

@sl-solution sl-solution added bug Something isn't working and removed question Further information is requested decision labels May 2, 2022
@sl-solution
Owner

sl-solution commented May 2, 2022

On second thoughts, I classify this as a bug, and a fix is coming soon.

@sprmnt21
Author

sprmnt21 commented May 2, 2022

Suppose that, somehow, you have come to have a dataset like this:

7×3 Dataset
 Row │ id        outcome   sds
     │ identity  identity  identity
     │ Int64?    Bool?     Dataset?
─────┼─────────────────────────────────
   1 │        1     false  3×3 Dataset
   2 │        1      true  1×3 Dataset
   3 │        1     false  1×3 Dataset
   4 │        2     false  1×3 Dataset
   5 │        2      true  1×3 Dataset
   6 │        2     false  1×3 Dataset
   7 │        3      true  3×3 Dataset

Is there any reason the flatten function could not act on the sds column, expanding the rows of the subtables (and possibly renaming columns to avoid conflicts)?

PS

I got the dataset with nested tables in the following way:

using InMemoryDatasets
using Dates  # provides Date

ds = Dataset(id = [1,1,1,1,1,2,2,2,3,3,3],
             date = Date.(["2019-03-05", "2019-03-12", "2019-04-10",
                           "2019-04-29", "2019-05-10", "2019-03-20",
                           "2019-04-22", "2019-05-04", "2019-11-01",
                           "2019-11-10", "2019-12-12"]),
             outcome = [false, false, false, true, false, false,
                        true, false, true, true, true])

# group consecutive rows that share the same id and outcome
gb = gatherby(ds, [1, 3], isgathered = true)

# build one nested Dataset per group from the id, date and outcome columns
cgb1 = combine(gb, ("id", 2, :outcome) => ((x...) -> Dataset(; zip([:a, :b, :c], x)...)) => :sds)

@sl-solution
Owner

Is there any reason the flatten function could not act on the sds column, expanding the rows of the subtables (and possibly renaming columns to avoid conflicts)?

A few remarks:

  • IMD has a new function, eachgroup, which can be used to iterate over grouped data sets; in similar situations, using eachgroup is the recommended approach.
  • flatten/! works based on length, and length is not defined for data sets. Thus, I think such functionality should be placed in a new function. (?)
  • To achieve what you are looking for in this case, you may use append!
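
The append-based idea in the last remark can be sketched without InMemoryDatasets, using plain vectors of NamedTuples as stand-in "tables". This only illustrates the shape of the operation under that assumption; the `nested` data and its field names are made up for the example, and none of this is IMD's API.

```julia
# Each outer row carries an id and a nested "table" (a vector of NamedTuples).
nested = [
    (id = 1, sds = [(a = 1, b = 2), (a = 3, b = 4)]),
    (id = 2, sds = [(a = 5, b = 6)]),
]

# Expand the nested rows by appending them one group at a time,
# repeating the outer id on every expanded row.
out = NamedTuple[]
for row in nested
    append!(out, [merge((id = row.id,), r) for r in row.sds])
end
```

If a nested table had its own `id` field, `merge` would silently keep the nested value, which is why some renaming rule would be needed to avoid name conflicts.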

@sl-solution
Owner

On second thoughts, I classify this as a bug, and a fix is coming soon.

Originally, I thought we could use a separate code path for empty collections; however, this creates other kinds of problems. E.g. if we have an Int[] value, keeping it as Int[] is not consistent, because it has not been flattened properly (?)

I think we should leave it as a quirk of the package (?)

@giantmoa
Contributor

Hi there,
Would flattening Int[] as nothing solve this problem?

@sl-solution
Owner

Would flattening Int[] as nothing solve this problem?

Probably not, since dealing with nothing is not easy. IMD handles missing efficiently in many functions; nothing, however, would be inconvenient.
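
As a rough illustration of why nothing is harder to work with than missing, the sketch below uses only Base Julia (it says nothing about IMD internals): missing propagates through arithmetic and has dedicated helpers, whereas nothing has no arithmetic methods and raises an error.

```julia
# missing propagates through arithmetic instead of raising an error.
propagated = 1 + missing

# Base provides helpers for skipping or replacing missing values.
total  = sum(skipmissing([1, missing, 2]))
filled = coalesce.([1, missing, 2], 0)

# nothing has no arithmetic methods, so the same operation throws.
err = try
    1 + nothing
catch e
    e
end
```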

@sprmnt21
Author

sprmnt21 commented May 11, 2022

I don't know what Int[] is exactly (formally), other than thinking of it as an empty vector.
But could leaving it as it is give rise to problems?

A different hypothesis, perhaps a bit risky, would be to put missing for everything that has a defined length equal to 0.

@sl-solution
Owner

I don't know what Int[] is exactly (formally), other than thinking of it as an empty vector.
But could leaving it as it is give rise to problems?

Leaving Int[] as it is has two problems: a) it is not consistent with the flattening operation, and b) it makes subsequent operations on the output data set inefficient (e.g. if flattening changes everything to Int and just one observation remains as Int[], the type of the whole processed column is affected).

A different hypothesis, perhaps a bit risky, would be to put missing for everything that has a defined length equal to 0.

I am not sure this is the right way to handle it: an empty object is not equivalent to missing (?) BTW, #57 provides a convenient way to do this.
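
The trade-offs discussed here can be seen in Base Julia alone. The sketch below is an illustration under that assumption, not IMD's implementation: it shows that flattening by iteration drops an empty vector, that keeping Int[] next to integers widens the element type to Any, and what the "length 0 becomes missing" hypothesis would produce.

```julia
col = Any[[1, 2], Int[], [3]]

# Flattening by iteration: the empty vector contributes no elements.
flat = collect(Iterators.flatten(col))

# Keeping Int[] as a value beside flattened integers widens the element type.
mixed = [1, 2, Int[], 3]
mixed_eltype = eltype(mixed)       # Any

# Hypothetical rule from the discussion: a zero-length value becomes missing.
filled = [v for x in col for v in (isempty(x) ? (missing,) : x)]
```

Whether an empty collection should really be read as missing remains the open question raised above; the sketch only shows what each choice does to the resulting column.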

4 participants