UPDATED PROPOSAL: Enforce character string data type for all type_id output_type properties #52
-
This basically seems OK to me. Still not fully clear on how this might impact downstream operations, like plotting. Say, for example, you have a hub that has some quantile forecasts and some categorical CDF forecasts, so the `type_id` column is treated as character. Then to plot those quantile forecasts we would still have to coerce the quantiles in the `type_id` column to numeric, right? And that operation would be specific to the calculated `type_id` data type class for the hub overall?
-
Yes, if the `type_id` column is treated as character overall, quantile levels would need to be coerced back to numeric for operations like plotting.
-
To try to flesh out the example a bit more, I think it might help to think about the table that @annakrystalli included at the top as broken down by target type [1]. The following table determines, given a target type, which output types are valid. [table not reproduced]

Some additional notes/explanations on the above table: in general, the above roughly aligns with the "Valid prediction types by target type" table for Zoltar as well.

[1] I note that in the current version of the documentation we have "categorical" and "ordinal" as target types but not "nominal", but I think this should be changed.
-
Thanks for the useful comment @nickreich! This is all great info for our docs. A quick note that in the upcoming schema version (v1.0.0), possible target types have been amended to include "nominal".
-
Given we seem in agreement, I will go ahead and start implementing this in the package.
-
I've had an idea of how to deal with the issues raised regarding the non-stability of the `type_id` column data type. What if we introduced an argument to the hub connection function that lets users explicitly set the `type_id` data type, overriding the data type determined from the config?

This way, users that want to develop more dependable downstream code can do so by fixing the `type_id` data type (e.g. to character).
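Something along these lines, assuming a `connect_hub()`-style connection function; the argument name and values here are placeholders, not a settled API:

```r
# Hypothetical sketch: fix the type_id data type when connecting,
# overriding the data type otherwise determined from tasks.json
hub_con <- connect_hub(
  hub_path = "path/to/hub",
  type_id_datatype = "character" # always read type_id as character
)
```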
-
Background
The ability to have categorical variable values as entries to `type_id` or `value` in some output types introduces a risk of data type inconsistencies in both the `type_id` and `value` columns in model-output data. This can affect the ability to open an arrow dataset successfully and consistently.

The table below lists the possibility of encountering character/numeric data types in `type_id` or `value` columns for each output type (`x` = possible, `-` = not possible). [table not reproduced]

To open datasets consistently, we need to ensure that:

- The `type_id` column data type is predictable. This is made difficult by mixing output types whose `type_id` data types are different, especially `character` with `numeric`.
- The `value` column always remains numeric. This is not possible for `sample` output types of categorical or epiweek variables.
Options

A few options to handle this were proposed and discussed.
Option A: enforce `type_id` to be a character column and `value` to remain numeric.

PROs

- It's the simplest approach.

CONs

- Character values cannot be stored in `value` for `sample` output types of categorical variables, so such output types are not supported.
- Downstream analysis of numeric output types (e.g. plotting quantile forecasts) requires coercing `type_id` to numeric first.

Example:
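A minimal sketch of what Option A data would look like and the coercion it implies (column values are illustrative):

```r
library(dplyr)

# Illustrative model-output data under Option A: type_id is always
# character, even for numeric quantile levels
tbl <- tibble::tibble(
  output_type = c("quantile", "quantile", "cdf"),
  type_id     = c("0.25", "0.75", "EW202301"),
  value       = c(10.1, 15.3, 0.6)
)

# Plotting/analysing quantile forecasts requires coercing type_id
# back to numeric first
quantiles <- tbl |>
  filter(output_type == "quantile") |>
  mutate(type_id = as.numeric(type_id))
```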
Option B

Another option discussed was to introduce an additional `type_id_label` column for storing character labels of categorical variables, mapping them to integer indices in the `type_id` column.

PROs

- Column data types are stable: `type_id` (numeric) and `type_id_label` (character).
- Including `type_id_label` as a column ensures data files are self-contained, without needing to look up what values `type_id` indices map to.
- `type_id`s are not required to be converted to character to filter/analyse.
- It supports `sample` output types for categorical variables, as `type_id` integers can be sampled and stored in the `value` column.

CONs

- It must be ensured that `type_id` and `type_id_labels` mappings are consistent across rounds.
- To know what values in the `value` column map to in a sample output type of a categorical variable, an additional column (e.g. `value_labels`) is required.
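A sketch of what Option B data might look like (labels and values are illustrative assumptions):

```r
# Option B: integer indices in type_id, human-readable labels
# in type_id_label
tibble::tibble(
  output_type   = c("categorical", "categorical", "categorical"),
  type_id       = c(1L, 2L, 3L),
  type_id_label = c("low", "moderate", "high"),
  value         = c(0.2, 0.5, 0.3)
)
```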
DECISION: OPTION A
It was decided to go with Option A as it is the simplest to implement at this stage. It means that `sample` output types for categorical variables are not currently supported, but that was deemed acceptable for now due to the rarity of that situation.

However, when looking into implementation, some aspects of the approach feel lacking. The most jarring implementation feature would be the changes required in the `tasks.json` config files. By enforcing `type_id` to be a character in all cases, the schema for `type_id` in some output types becomes really clunky. For example, for `quantile` output types, not only does the `"type"` of the `required` and `optional` arrays change to `string`, but it is also no longer possible to enforce/document a minimum and maximum value through the schema, as those keywords have no meaning for string types.

While these checks can of course be carried out in R rather than automatically through validation against the schema, the lack of encoding/documentation of the criteria within the schema feels very jarring, unintuitive and inefficient. It will have to be documented somewhere else, which feels clunky.
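For illustration, a quantile `type_id` definition might end up looking something like this under Option A (a sketch only; the exact config structure is assumed, not copied from the schema):

```json
"type_id": {
  "required": {
    "type": "array",
    "items": { "type": "string" }
  },
  "optional": {
    "type": "array",
    "items": { "type": "string" }
  }
}
```

With numeric items, the same fragment could also declare e.g. `"minimum": 0` and `"maximum": 1`; for string items those keywords carry no meaning.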
Proposal:
While consistency and predictability are indeed important when interacting with model-output data, it may not be the case that the best approach is to fix the `type_id` column to `character`.

Instead, I propose that we use the group of output types specified in a hub's `tasks.json` across all rounds to determine whether a hub should have a character `type_id` column or not.

The proposal follows the principle of type coercion in R:
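That is, when values of different types are combined, R promotes them to the most general type (standard R behaviour):

```r
# R's coercion hierarchy: logical < integer < double < character
typeof(c(1L, 2L))           # "integer"
typeof(c(1L, 2.5))          # "double"
typeof(c(0.25, "EW202301")) # "character"
```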
Principles
- `type_id` could be `integer`, `double` or `character`, dependent on the group of output types a hub uses.
- The `type_id` data type a given hub should have can predictably be determined by examining the collection of output types defined in the `tasks.json`, instead of being hard-coded through the schema.

For example:
- A hub with output types `mean`, `median` and `quantile` would have a double `type_id` column.
- A hub that includes a `cdf` output type of peak week, with `type_id` specified as epiweeks (e.g. `EW202301`), would have a character `type_id` column.
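A minimal sketch of the kind of rule this implies (function name and config handling are assumptions, not the actual implementation):

```r
# Hypothetical helper: determine the overall type_id data type for a hub
# from the type_id values each output type uses (as defined in tasks.json)
determine_type_id_type <- function(output_type_ids) {
  types <- vapply(output_type_ids, typeof, character(1))
  # follow R's coercion hierarchy: the most general type wins
  if ("character" %in% types) {
    "character"
  } else if ("double" %in% types) {
    "double"
  } else {
    "integer"
  }
}

determine_type_id_type(list(quantile = c(0.25, 0.5, 0.75)))
#> [1] "double"
determine_type_id_type(list(quantile = c(0.25, 0.5), cdf = "EW202301"))
#> [1] "character"
```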
This predictability means the `tasks.json` config file can be used to consistently cast the `type_id` column to the appropriate data type when connecting to the hub.

Applying such rules and making the determination of the `type_id` data type predictable makes it easier to document. We would just document transparently the rules the software uses for determining the data type when connecting to the hub. I feel that's much easier to explain than having to explain why numeric output types must be defined as strings in `tasks.json`.

The fact that we use the collection of `type_id`s across all output types means that individual output type `type_id` properties can still be defined and checked individually, according to the criteria that `type_id`s for a given output type should adhere to.

Bonus benefit!
I've drawn up draft functionality in the form of the function `build_arrow_schema()` in branch `schema-from-config` to create a schema for use when opening an arrow dataset connection.

The fact that the determination is made from the collection of output types across all rounds means we don't have to enforce a string type for all `type_id` properties across all output types. We can therefore leave the schema as is and continue to use it to accurately encode the expectations of data for given output types.

The functionality required to use the config file to determine `type_id` can also be used to determine the correct data type for all columns and generate an overarching schema. This in turn can be used to open datasets of multiple file formats and combine them into a single hub connection.

Created on 2023-04-26 with reprex v2.0.2
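As a rough illustration of the kind of functionality described — a simplified sketch, not the actual code on the `schema-from-config` branch:

```r
library(arrow)

# Build an arrow schema with the config-determined type_id data type
# (simplified: a real implementation would derive all columns from tasks.json)
build_arrow_schema <- function(type_id_type = c("character", "double", "integer")) {
  type_id_type <- match.arg(type_id_type)
  arrow::schema(
    output_type = arrow::utf8(),
    type_id = switch(type_id_type,
      character = arrow::utf8(),
      double    = arrow::float64(),
      integer   = arrow::int32()
    ),
    value = arrow::float64()
  )
}

# Use the schema to open the model-output dataset consistently, e.g.:
# ds <- arrow::open_dataset("model-output", schema = build_arrow_schema("double"))
```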
Let me know what you think of this proposal and of course I'm happy to answer questions if anything is unclear.