-
-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature(baking): add variables.type column that explicitly specifies type information #2039
Conversation
I made the (minimal) changes necessary to sort the scatterplot legend using the new metadata field (it works!). I'm unsure though what's the best way to let Grapher know that a variable is known to be "ordinal". If I am understanding correctly, Grapher seems to ignore the metadata field owid-grapher/packages/@ourworldindata/grapher/src/core/LegacyToOwidTable.ts Lines 581 to 583 in 77d3bcd
A code comment regarding owid-grapher/packages/@ourworldindata/core-table/src/CoreTableColumns.ts Lines 585 to 593 in 77d3bcd
I'm not confident making changes at the moment. I'll have a chat with Marcel once he's back and then get back to this :) |
41b0834
to
7c8322a
Compare
After discussion with @danyx23 and @samizdatco, how to best represent ordinal variables remains an open question. The options are:
I believe (1) would make our life a bit easier since turning a categorial variable into an ordinal one would require us to add a "sort" array but no data migration would be needed. (Though we do have some ordinal variables that are currently stored as numerical values as well.) The advantage of (2) is that the extra bit of encoded information (negative/neutral/positive) would help us auto-pick appropriate colour palettes (e.g. if we had an ordinal variable with categories "disagree" / "don't care" / "agree" encoded by -1, 0, 1, Grapher could infer that a diverging colour palette is needed). One additional thing to keep in mind, and this might be controversial, is that I think it would be beneficial to attach an ordering to the "continents" variable as well. We could then customise the order of the continents such that "North America" and "South America", for example, would appear next to each other in legends (example chart). If we wanted to do this, approach (1) would be less awkward. |
(It doesn't make a difference on ETL end, so I yield and let you decide.) |
Ideally, our new handling of ordinal data would be able to cover:
In terms of data representation, I would prefer some sort of shorthand for categorical values, which we would then put into the metadata file, similar to how we already handle entities (where there's a mapping (See also: quick analysis of categorical variables that are currently in use) What's unclear to me at this point, though, is how we would differentiate "actual" numeric values from named values, then, in the case of mixed types. |
Thanks for having a look at this, Marcel!
Am I understanding correctly that you would want a system that handles both cases, underlying string data and underlying numerical data, at the same time? My thinking was that we decide on one strategy to represent ordinal data and then migrate offending ordinal variables to the new system, therefore reducing complexity in the long run.
Do you have an estimate of how many variables we have that have a mixed type with numerical and ordinal categories? My guess is that most mixed types contain categorical data rather than ordinal data (e.g. different types of missing data). If this turns out to be a small number of variables, then I'm happy to treat those as "mixed" and wait for a more pressing need to address this.
Good idea! Maybe we could then for categorical/ordinal variables add a metadata field
Nice! |
Ah, yes, that's exactly what I mean. Definitely pro reducing complexity here.
Yeah, as you say this is a tiny minority. Not a pressing need. |
For reference: the original proposal has been updated to reflect the outcome of recent discussions |
This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected. |
This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected. |
We've talked about this recently and have an appetite for finishing it. |
This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected. |
Hold on bot! |
I think adding the pinned label keeps the bot away |
Hey @Marigold, This PR died a long time ago (rip), but we're still keen on getting type information into Grapher. If we wanted to resurrect this PR, how much effort/time do you think it would be on your part? |
I'm not sure, but we put it on this cycle's list. I'll prioritize it to unblock you. |
Killed in favour of #3362 |
See the original proposal.
Add column JSON column
variables.type
that can contain explicit type of a variable. Some examples are{"type": "int"}
,{"type": "string"}
or{"type": "ordinal", "sort": ["Low", "Medium", "High"]}
. If null, then type is determined from data (as int, float, string or mixed) to be backward compatible.If you're trying this locally, first apply the latest migration with
yarn buildTsc && yarn runDbMigrations
and then settype
for income groupsBaked metadata file for that variable would then contain following fields:
Related: #1692, #1324