feature(baking): add variables.type column that explicitly specifies type information #2039

Marigold · 2023-03-24T06:26:35Z

Add a JSON column variables.type that can contain the explicit type of a variable. Some examples are {"type": "int"}, {"type": "string"} or {"type": "ordinal", "sort": ["Low", "Medium", "High"]}. If the column is null, the type is determined from the data (as int, float, string or mixed), which keeps the change backward compatible.

If you're trying this locally, first apply the latest migration with yarn buildTsc && yarn runDbMigrations, then set the type for income groups:

update variables
set type = '{"type": "ordinal", "sort": ["Not categorized", "Low income", "Lower-middle income", "Upper-middle income", "High income"]}'
where id = 42563;

The baked metadata file for that variable would then contain the following fields:

{
  ...
  "type": "ordinal",
  "sort": ["Not categorized", "Low income", "Lower-middle income", "Upper-middle income", "High income"]
}
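As a rough sketch of the fallback described above: an explicit type wins when present, otherwise the type is derived from the values. The names ExplicitVariableType, ResolvedType, inferTypeFromValues and resolveVariableType are made up for illustration, and allowing "float" and "mixed" as explicit types is an assumption; only the "type" and "sort" keys come from this PR.

```typescript
// Hypothetical shape of the JSON stored in variables.type.
// Only {"type": "int"}, {"type": "string"} and the ordinal form appear in the PR;
// "float" and "mixed" as explicit values are an assumption.
type ExplicitVariableType =
    | { type: "int" | "float" | "string" | "mixed" }
    | { type: "ordinal"; sort: string[] }

type ResolvedType = "int" | "float" | "string" | "mixed" | "ordinal"

// Fallback for a null variables.type: look at the data itself.
function inferTypeFromValues(values: (number | string)[]): ResolvedType {
    const hasString = values.some((v) => typeof v === "string")
    const hasNumber = values.some((v) => typeof v === "number")
    if (hasString && hasNumber) return "mixed"
    if (hasString) return "string"
    // Numbers only (or no values at all): "int" if every value is integral.
    return values.every((v) => Number.isInteger(v)) ? "int" : "float"
}

// An explicit type takes precedence; null means "determine from data".
function resolveVariableType(
    explicit: ExplicitVariableType | null,
    values: (number | string)[]
): ResolvedType {
    return explicit ? explicit.type : inferTypeFromValues(values)
}
```

Note that an empty value list resolves to "int" in this sketch, which is an arbitrary choice.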

Related: #1692, #1324

sophiamersmann · 2023-03-27T12:29:03Z

I made the (minimal) changes necessary to sort the scatterplot legend using the new metadata field (it works!).
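Sorting legend items by the "sort" array could look roughly like the sketch below. This is not the code from the commit; sortBySortOrder is a hypothetical helper, and putting labels that are missing from "sort" at the end is an assumption.

```typescript
// Order legend labels by their position in the metadata "sort" array.
// Labels not listed in `sort` keep their relative order and go last (assumption).
function sortBySortOrder(labels: string[], sort: string[]): string[] {
    const rank = new Map(sort.map((label, index) => [label, index] as const))
    const rankOf = (label: string): number => rank.get(label) ?? sort.length
    // Copy first so the caller's array is not mutated.
    return [...labels].sort((a, b) => rankOf(a) - rankOf(b))
}

const incomeGroupOrder = [
    "Not categorized",
    "Low income",
    "Lower-middle income",
    "Upper-middle income",
    "High income",
]
const sortedLegend = sortBySortOrder(
    ["High income", "Low income", "Unknown", "Upper-middle income"],
    incomeGroupOrder
)
```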

I'm unsure though what's the best way to let Grapher know that a variable is known to be "ordinal". If I am understanding correctly, Grapher seems to ignore the metadata field type at the moment and simply marks every value column as NumberOrStringColumn (which is essentially a "mixed" type):

owid-grapher/packages/@ourworldindata/grapher/src/core/LegacyToOwidTable.ts

Lines 581 to 583 in 77d3bcd

    type: isContinent
        ? ColumnTypeNames.Continent
        : ColumnTypeNames.NumberOrString,

A code comment regarding NumberOrStringColumn:

owid-grapher/packages/@ourworldindata/core-table/src/CoreTableColumns.ts

Lines 585 to 593 in 77d3bcd

    /**
     * We strive to have clearly typed variables in the future, but for now our
     * grapher variables are still untyped. Most are number-only, but we also have some
     * string-only, and even some mixed ones.
     * Hence, NumberOrStringColumn is used to store grapher variables.
     * It extends AbstractColumnWithNumberFormatting, which ensures that we have
     * implementations of formatValueShortWithAbbreviations and the like already.
     * -- @marcelgerber, 2022-07-01
     */
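One possible direction, sketched here with placeholder names rather than Grapher's real ColumnTypeNames, would be to pick the column kind from the metadata type and fall back to the mixed column only when no usable type is given. ColumnKind and pickColumnKind are hypothetical, and so is the mapping itself.

```typescript
// Placeholder for Grapher's column type names; these string values are
// illustrative and not taken from ColumnTypeNames.
type ColumnKind =
    | "Continent"
    | "Integer"
    | "Numeric"
    | "String"
    | "Ordinal"
    | "NumberOrString"

// Choose a column kind from the baked metadata "type" field.
// Anything unrecognised (including "mixed" or a missing type) keeps the
// current behaviour, i.e. the mixed NumberOrString column.
function pickColumnKind(
    metadataType: string | undefined,
    isContinent: boolean
): ColumnKind {
    if (isContinent) return "Continent"
    switch (metadataType) {
        case "int":
            return "Integer"
        case "float":
            return "Numeric"
        case "string":
            return "String"
        case "ordinal":
            return "Ordinal"
        default:
            return "NumberOrString"
    }
}
```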

I'm not confident making changes at the moment. I'll have a chat with Marcel once he's back and then get back to this :)


sophiamersmann · 2023-04-28T08:35:01Z

After discussion with @danyx23 and @samizdatco, how to best represent ordinal variables remains an open question. The options are:

1. Store ordinal variables as string values and attach an ordering to them (current behaviour, inspired by Vega types)
2. Store ordinal variables as numerical values (that define an ordering implicitly) and attach labels to them

I believe (1) would make our lives a bit easier, since turning a categorical variable into an ordinal one would only require adding a "sort" array, with no data migration needed. (Though we do have some ordinal variables that are currently stored as numerical values as well.) The advantage of (2) is that the extra bit of encoded information (negative/neutral/positive) would help us auto-pick appropriate colour palettes (e.g. if we had an ordinal variable with categories "disagree" / "don't care" / "agree" encoded by -1, 0, 1, Grapher could infer that a diverging colour palette is needed).
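To make the palette argument for (2) concrete, here is a minimal sketch of how the sign of the numeric codes could drive the choice. PaletteKind and inferPaletteKind are hypothetical, and treating "codes on both sides of zero" as the trigger for a diverging palette is an assumption.

```typescript
type PaletteKind = "diverging" | "sequential"

// If the numeric codes of an ordinal variable fall on both sides of zero,
// assume the scale has a neutral midpoint and suggest a diverging palette.
function inferPaletteKind(codes: number[]): PaletteKind {
    const hasNegative = codes.some((code) => code < 0)
    const hasPositive = codes.some((code) => code > 0)
    return hasNegative && hasPositive ? "diverging" : "sequential"
}

// "disagree" / "don't care" / "agree" encoded as -1, 0, 1
const agreementPalette = inferPaletteKind([-1, 0, 1])
// Income groups encoded as 1..4
const incomePalette = inferPaletteKind([1, 2, 3, 4])
```

With string values and a "sort" array, as in (1), this information would have to be supplied separately.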

One additional thing to keep in mind, and this might be controversial, is that I think it would be beneficial to attach an ordering to the "continents" variable as well. We could then customise the order of the continents such that "North America" and "South America", for example, would appear next to each other in legends (example chart). If we wanted to do this, approach (1) would be less awkward.

Marigold · 2023-04-28T11:37:20Z

(It doesn't make a difference on the ETL end, so I yield and let you decide.)

marcelgerber · 2023-05-01T12:10:49Z

Ideally, our new handling of ordinal data would be able to cover:

all the cases where we already have ordinal data points, and are currently using (unordered) categorical legends, with underlying string data (example)
all the cases where we already have ordinal data points, and are currently encoding them as numbers, with custom labels applied in the legend (example)
bonus points if it can handle scenarios where we mix ordinal categories with actual numerical observations (example)

In terms of data representation, I would prefer some sort of shorthand for categorical values, which we would then put into the metadata file, similar to how we already handle entities (where there's a mapping entityId -> entityName in metadata.json, and the data.json only contains the entity ids).
The reason here is that the (uncompressed) data files can become very unwieldy very quickly otherwise, and we will always have a restricted domain of possible categorical values.
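A minimal sketch of such a shorthand, assuming ids are handed out in order of first appearance: the id-to-name mapping would go into the metadata file and only the ids into the data file. EncodedCategories and encodeCategories are hypothetical names, not existing Grapher or ETL code.

```typescript
interface EncodedCategories {
    // Would live in the metadata file, similar to the entity mapping.
    categories: { id: number; name: string }[]
    // Would live in the data file: one small integer per data point.
    values: number[]
}

// Replace repeated category strings with integer ids.
// Ids start at 1 and follow first appearance (an assumption).
function encodeCategories(rawValues: string[]): EncodedCategories {
    const idByName = new Map<string, number>()
    const values = rawValues.map((name) => {
        let id = idByName.get(name)
        if (id === undefined) {
            id = idByName.size + 1
            idByName.set(name, id)
        }
        return id
    })
    const categories = Array.from(idByName, ([name, id]) => ({ id, name }))
    return { categories, values }
}

const encoded = encodeCategories(["Low income", "High income", "Low income"])
```

Each category name is then stored once, however many data points use it.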

(See also: quick analysis of categorical variables that are currently in use)

What's unclear to me at this point, though, is how we would differentiate "actual" numeric values from named values, then, in the case of mixed types.

sophiamersmann · 2023-05-02T08:22:06Z

Thanks for having a look at this, Marcel!

Ideally, our new handling of ordinal data would be able to cover:

all the cases where we already have ordinal data points, and are currently using (unordered) categorical legends, with underlying string data (example)

all the cases where we already have ordinal data points, and are currently encoding them as numbers, with custom labels applied in the legend (example)

Am I understanding correctly that you would want a system that handles both cases, underlying string data and underlying numerical data, at the same time? My thinking was that we decide on one strategy to represent ordinal data and then migrate offending ordinal variables to the new system, therefore reducing complexity in the long run.

bonus points if it can handle scenarios where we mix ordinal categories with actual numerical observations (example)

Do you have an estimate of how many variables we have that have a mixed type with numerical and ordinal categories? My guess is that most mixed types contain categorical data rather than ordinal data (e.g. different types of missing data). If this turns out to be a small number of variables, then I'm happy to treat those as "mixed" and wait for a more pressing need to address this.

In terms of data representation, I would prefer some sort of shorthand for categorical values, which we would then put into the metadata file, similar to how we already handle entities (where there's a mapping entityId -> entityName in metadata.json, and the data.json only contains the entity ids). The reason here is that the (uncompressed) data files can become very unwieldy very quickly otherwise, and we will always have a restricted domain of possible categorical values.

Good idea! Maybe we could then for categorical/ordinal variables add a metadata field dimensions.values.values with a list of categories, e.g. [ { id: 1, name: "Low income" }, { id: 2, name: "Lower-middle income" }, ...]. In the case of ordinal variables, the ids would imply an ordering (this would mean we go for the numerical representation).
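On the reading side, that could look roughly like this: the ids in the data are resolved to names, and the legend order follows the ids. CategoryEntry and decodeOrdinal are hypothetical names, and falling back to the raw id for unknown values is an assumption.

```typescript
// One entry of the proposed dimensions.values.values list.
interface CategoryEntry {
    id: number
    name: string
}

// Resolve ids from the data file to labels, and derive the legend order
// from the ids (lower id = earlier in the ordering).
function decodeOrdinal(
    ids: number[],
    categories: CategoryEntry[]
): { labels: string[]; legendOrder: string[] } {
    const nameById = new Map(categories.map((c) => [c.id, c.name] as const))
    // Ids without a metadata entry are shown as their raw number (assumption).
    const labels = ids.map((id) => nameById.get(id) ?? String(id))
    const legendOrder = [...categories]
        .sort((a, b) => a.id - b.id)
        .map((c) => c.name)
    return { labels, legendOrder }
}

const decoded = decodeOrdinal(
    [2, 1, 2],
    [
        { id: 2, name: "Lower-middle income" },
        { id: 1, name: "Low income" },
    ]
)
```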

(See also: quick analysis of categorical variables that are currently in use)

Nice!

marcelgerber · 2023-05-02T08:26:13Z

My thinking was that we decide on one strategy to represent ordinal data and then migrate offending ordinal variables to the new system, therefore reducing complexity in the long run.

Ah, yes, that's exactly what I mean. Definitely pro reducing complexity here.

Do you have an estimate of how many variables we have that have a mixed type with numerical and ordinal categories?

Yeah, as you say this is a tiny minority. Not a pressing need.

sophiamersmann · 2023-05-02T16:32:10Z

For reference: the original proposal has been updated to reflect the outcome of recent discussions

github-actions · 2023-08-04T07:06:16Z

This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.

github-actions · 2023-09-09T07:05:48Z

This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.

Marigold · 2023-09-10T07:51:40Z

We've talked about this recently and have an appetite for finishing it.

github-actions · 2023-09-26T07:06:18Z

This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.

Marigold · 2023-09-26T07:09:58Z

Hold on bot!

sophiamersmann · 2023-09-26T11:15:30Z

I think adding the pinned label keeps the bot away

sophiamersmann · 2024-03-07T08:41:07Z

Hey @Marigold,

This PR died a long time ago (rip), but we're still keen on getting type information into Grapher. If we wanted to resurrect this PR, how much effort/time do you think it would be on your part?

Marigold · 2024-03-11T13:54:18Z

If we wanted to resurrect this PR, how much effort/time do you think it would be on your part?

I'm not sure, but we put it on this cycle's list. I'll prioritize it to unblock you.

Marigold · 2024-03-20T16:07:10Z

Killed in favour of #3362

Marigold marked this pull request as draft March 24, 2023 06:26

Marigold and others added 2 commits April 14, 2023 15:31

feature(baking): add variables.type column that explicitly specifies …

da7da09

…type information

feat(scatter): sort legend items using the new metadata field "sort"

7c8322a

sophiamersmann force-pushed the ordinal-variables branch from 41b0834 to 7c8322a Compare April 14, 2023 13:31

sophiamersmann added 3 commits April 14, 2023 16:22

enhance(grapher): use metadata type to infer the column type

a315bce

fix(grapher): fix type error

c0ea242

enhance(grapher): add OrdinalColumn

3c9a331

sophiamersmann mentioned this pull request May 1, 2023

Float values are detected as type: "int" in metadata #1927

Closed

marcelgerber requested review from danyx23 and marcelgerber May 1, 2023 12:11

github-actions bot added the stale label Aug 4, 2023

github-actions bot closed this Aug 8, 2023

danyx23 reopened this Aug 24, 2023

github-actions bot assigned Marigold Aug 24, 2023

github-actions bot removed the stale label Aug 25, 2023

github-actions bot added the stale label Sep 9, 2023

github-actions bot removed the stale label Sep 11, 2023

github-actions bot added the stale label Sep 26, 2023

sophiamersmann added pinned and removed stale labels Sep 26, 2023

sophiamersmann mentioned this pull request Oct 6, 2023

Admin: Add support for manually ordering legend items in maps #2710

Open

marcelgerber removed their request for review March 7, 2024 13:24

Marigold mentioned this pull request Mar 11, 2024

Batch cycle: ETL and ops improvements for cycle 2024.2 owid/etl#2390

Closed

15 tasks

Marigold mentioned this pull request Mar 20, 2024

✨ add type and sort columns to variables table #3362

Merged

sophiamersmann closed this May 22, 2024
