Flattening of strings #56

sprmnt21 · 2022-04-30T12:18:44Z

I wonder whether the result of the flatten function in these cases is the one most users would expect.
Is there any drawback to treating strings (even empty ones) as scalars in the context of the flatten function, or is the current behaviour known to be preferable?

julia> df = Dataset(A=[1,2,3,4], B=[["a","b"], "", "b", ["a","c"]])
4×2 Dataset
 Row │ A         B
     │ identity  identity
     │ Int64?    Any
─────┼──────────────────────
   1 │        1  ["a", "b"]
   2 │        2
   3 │        3  b
   4 │        4  ["a", "c"]

julia> flatten(df, :B)
5×2 Dataset
 Row │ A         B        
     │ identity  identity
     │ Int64?    Any
─────┼────────────────────
   1 │        1  a
   2 │        1  b
   3 │        3  b
   4 │        4  a
   5 │        4  c

julia> df = Dataset(A=[1,2,3,4], B=[["a","b"], "pippo", "b", ["a","c"]])
4×2 Dataset
 Row │ A         B
     │ identity  identity
     │ Int64?    Any
─────┼──────────────────────
   1 │        1  ["a", "b"]
   2 │        2  pippo
   3 │        3  b
   4 │        4  ["a", "c"]

julia> flatten(df, :B)
10×2 Dataset
 Row │ A         B        
     │ identity  identity
     │ Int64?    Any
─────┼────────────────────
   1 │        1  a
   2 │        1  b
   3 │        2  p
   4 │        2  i
   5 │        2  p
   6 │        2  p
   7 │        2  o
   8 │        3  b
   9 │        4  a
  10 │        4  c


monopolynomial · 2022-05-01T01:19:50Z

Is it possible (or useful) to support mapformats in flatten!? It would be very useful for the example from @sprmnt21.

sl-solution · 2022-05-02T07:36:32Z

> Is there any drawback to treating strings (even empty ones) as scalars in the context of the flatten function, or is the current behaviour known to be preferable?

This is Julia's behaviour; however, I don't like it. Maybe adding a keyword argument would be a good idea?
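For readers following along, here is a minimal sketch in plain Julia (no InMemoryDatasets calls) of why a length-based flatten splits "pippo" into characters and drops the empty string. The helper `wrap_scalar` is hypothetical and not part of any package; it only illustrates one possible pre-processing workaround under the assumption that strings should count as a single element.

```julia
# Base Julia only: a String is an iterable collection of Char with a length.
s = "pippo"
chars = collect(s)        # Vector{Char}, one entry per character
n = length(s)             # 5
n_empty = length("")      # 0, so a length-based flatten emits no row for ""

# Hypothetical helper (not an InMemoryDatasets function):
# wrap every string in a one-element vector so it counts as a single value.
wrap_scalar(x) = x isa AbstractString ? [x] : x

B = Any[["a", "b"], "", "b", ["a", "c"]]
wrapped = map(wrap_scalar, B)
lengths = map(length, wrapped)   # each former string now has length 1
```

With the column pre-processed this way, every string (including the empty one) contributes exactly one element, which is the "strings as scalars" behaviour asked about above.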

In the long term, I am interested in having a fixed-width String type for InMemoryDatasets (probably a better implementation of Characters) that treats an empty string as missing, since I think zero-length and/or empty strings should be treated as missing values in a data analysis workflow.

sl-solution · 2022-05-02T07:39:27Z

> Is it possible (or useful) to support mapformats in flatten!? It would be very useful for the example from @sprmnt21.

I moved this post to #57, so it is easier to track.

sl-solution · 2022-05-02T10:28:09Z

On second thoughts, I classify this as a bug, ~~and a fix is coming soon~~.

sprmnt21 · 2022-05-02T16:52:45Z

Suppose that, somehow, you have come to have a dataset like this:

7×3 Dataset
 Row │ id        outcome   sds
     │ identity  identity  identity
     │ Int64?    Bool?     Dataset?
─────┼─────────────────────────────────
   1 │        1     false  3×3 Dataset
   2 │        1      true  1×3 Dataset
   3 │        1     false  1×3 Dataset
   4 │        2     false  1×3 Dataset
   5 │        2      true  1×3 Dataset
   6 │        2     false  1×3 Dataset
   7 │        3      true  3×3 Dataset

Is there any reason why the flatten function should not act on the sds column, expanding the rows of the sub-tables (and possibly renaming columns to avoid name conflicts)?

PS

I got the dataset with nested tables in the following way:

using InMemoryDatasets
using Dates  # provides Date

ds = Dataset(id = [1,1,1,1,1,2,2,2,3,3,3],
             date = Date.(["2019-03-05", "2019-03-12", "2019-04-10",
                           "2019-04-29", "2019-05-10", "2019-03-20",
                           "2019-04-22", "2019-05-04", "2019-11-01",
                           "2019-11-10", "2019-12-12"]),
             outcome = [false, false, false, true, false, false,
                        true, false, true, true, true])

# group by id (column 1) and outcome (column 3), treating the data as already gathered
gb = gatherby(ds, [1, 3], isgathered = true)

# build one nested Dataset per group from the id, date and outcome columns
cgb1 = combine(gb, ("id", 2, :outcome) => ((x...) -> Dataset(; zip([:a, :b, :c], x)...)) => :sds)

sl-solution · 2022-05-02T20:04:15Z

> Is there any reason why the flatten function should not act on the sds column, expanding the rows of the sub-tables (and possibly renaming columns to avoid name conflicts)?

A few remarks:

- IMD has a new function, eachgroup, which can be used to iterate over grouped data sets; in similar situations, using eachgroup is the recommended approach.
- flatten/flatten! works based on length, and length is not defined for data sets. Thus, I think such functionality should be placed in a new function (?)
- To achieve what you are looking for in this case, you may use append!

sl-solution · 2022-05-02T20:35:53Z

> On second thoughts, I classify this as a bug, ~~and a fix is coming soon~~.

Originally, I thought we could use a separate code path for empty collections; however, this creates other sorts of problems. E.g. if we have an Int[] value, keeping it as Int[] is not consistent, because it has not been flattened properly (?)

I think we should leave it as a quirk of the package (?)

giantmoa · 2022-05-10T23:35:07Z

Hi there,
would flattening Int[] to nothing solve this problem?

sl-solution · 2022-05-11T06:42:11Z

> would flattening Int[] to nothing solve this problem?

Probably not, since dealing with nothing is not easy. IMD handles missing efficiently in many functions; nothing, however, would be inconvenient.
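As a rough illustration of the difference in plain Julia (this says nothing about IMD internals, it only shows the Base behaviour the comment relies on): missing propagates through arithmetic and can be skipped, whereas nothing has no arithmetic methods at all.

```julia
# Base Julia only.
vals = [1, missing, 3]

propagated = missing + 1            # missing propagates through arithmetic
total = sum(skipmissing(vals))      # 4, missing values are skipped

# `nothing` defines no arithmetic, so the same expression throws a MethodError.
threw = try
    nothing + 1
    false
catch err
    err isa MethodError
end
```

Any column containing nothing would therefore need explicit handling in every downstream computation.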

sprmnt21 · 2022-05-11T08:24:07Z

I don't know what Int[] is exactly (formally), other than thinking of it as an empty vector.
But could leaving it as it is give rise to problems?

A different approach, perhaps a bit risky, would be to put missing for everything that has a defined length equal to 0.
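A minimal sketch of that idea in plain Julia, applied to a vector before any flattening. The function name `empty_to_missing` is hypothetical and is not part of InMemoryDatasets; it only assumes that "has a defined length" can be checked with `applicable(length, x)`.

```julia
# Hypothetical helper (Base Julia only): replace any value whose length is
# defined and equal to 0 with `missing`; leave everything else untouched.
empty_to_missing(x) = (applicable(length, x) && length(x) == 0) ? missing : x

B = Any[["a", "b"], "", Int[], "b"]
cleaned = map(empty_to_missing, B)
```

Here the empty string and the empty vector both become missing, while ["a", "b"] and "b" are kept as they are.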

sl-solution · 2022-05-12T07:20:58Z

> I don't know what Int[] is exactly (formally), other than thinking of it as an empty vector.
> But could leaving it as it is give rise to problems?

Leaving Int[] as it is has two problems: a) it is not consistent with the flattening operation, and b) it makes subsequent operations on the output data set inefficient (e.g. if flattening changes every other value to Int and just one observation remains as Int[], the type of the whole processed column is affected).
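Point b) can be seen with ordinary Julia vectors (a sketch using Base only; how IMD stores columns internally is not shown here): a single leftover Int[] widens the element type to Any, while a missing keeps a small, efficient union.

```julia
# Base Julia only.
mixed = [1, 2, Int[]]            # one un-flattened empty vector among integers
mixed_type = eltype(mixed)       # Any

with_missing = [1, 2, missing]
missing_type = eltype(with_missing)   # Union{Missing, Int}
```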

> A different approach, perhaps a bit risky, would be to put missing for everything that has a defined length equal to 0.

I am not sure this is the right way to handle it: an empty object is not equivalent to missing (?) BTW, #57 provides a convenient way to do this.

sprmnt21 changed the title ~~Flatten of strings~~ Flattening of strings Apr 30, 2022

sl-solution added the enhancement, question, and decision labels May 2, 2022

sl-solution mentioned this issue May 2, 2022

is it possible(or useful) to support mapformats in flatten!? #57

Closed

sl-solution added the bug label and removed the question and decision labels May 2, 2022

sl-solution added the decision label May 2, 2022
