Introduce a property to fix the `output_type_id` column data type across the hub #87

annakrystalli · 2024-07-01T07:15:56Z

Background

For a hub to be successfully accessed as an arrow dataset, column data types must not change from round to round.
Task IDs covered by our schema generally shouldn't change data type in later rounds, as that is largely fixed by the schema. Custom task IDs, however, which are beyond our control, and the `output_type_id` column can change data type between rounds, and this could cause problems downstream. This is mainly a problem for parquet files (but has a small chance of causing problems in CSVs too).
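To make the failure mode concrete, here is a minimal Python sketch of a column's inferred type drifting between rounds. It is not hubData or arrow code: the `infer_type` helper and the sample values are hypothetical, and it only mimics the idea that a column of numeric-looking values is read as double, while a column containing any non-numeric label has to fall back to character.

```python
def infer_type(values):
    """Toy inference: 'double' if every value parses as a number, else 'character'."""
    for v in values:
        try:
            float(v)
        except ValueError:
            return "character"
    return "double"


# Hypothetical output_type_id values from two rounds of the same hub.
round_1 = ["0.25", "0.5", "0.75"]            # quantile levels only
round_2 = ["0.25", "0.5", "large_increase"]  # a categorical output type is added

types = {"round_1": infer_type(round_1), "round_2": infer_type(round_2)}

# The dataset can only be opened with a single schema if both rounds agree.
schema_is_stable = len(set(types.values())) == 1
print(types, schema_is_stable)
```

In this toy case the two rounds disagree, which is the situation that fixing the column to character up front is meant to avoid.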

This is why, early on, when we had lots of discussions about this, we allowed a hub to override the automatic detection of the `output_type_id` data type in the `hubData::create_hub_schema()` function, which is used to determine the overall hub schema from the `tasks.json` config file. To future-proof the `output_type_id` column, admins could set the value of `output_type_id_datatype` to the safest, most future-proof data type, i.e. character.

Set value of arg `output_type_id_datatype` in config

To let hub admins configure and communicate this setting at the hub level, I propose an optional `output_type_id_datatype` enum property at the top level of the `tasks.json` config (i.e. a sibling of `rounds` and `schema_version`) that would accept the valid values of the `output_type_id_datatype` argument of `hubData::create_hub_schema()`. If this property exists in the config, the default behaviour of `hubData::create_hub_schema()` (unless explicitly overridden in the call) would be to set the `output_type_id` column to the data type specified in the config.
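The proposed precedence (explicit argument, then config property, then automatic detection) can be sketched in Python as below. This is an illustration only, not the hubData implementation: `resolve_output_type_id_datatype` is a hypothetical helper, the set of accepted values is an assumption that should be checked against the `create_hub_schema()` documentation, and the config dict is a stand-in for a parsed `tasks.json`.

```python
# Assumed set of accepted values; verify against the create_hub_schema() docs.
VALID_DATATYPES = {"auto", "character", "double", "integer", "logical", "Date"}


def resolve_output_type_id_datatype(config, arg=None):
    """Pick the datatype: explicit argument > config property > 'auto'."""
    if arg is not None:
        chosen = arg
    else:
        chosen = config.get("output_type_id_datatype", "auto")
    if chosen not in VALID_DATATYPES:
        raise ValueError(f"invalid output_type_id_datatype: {chosen!r}")
    return chosen


# Hypothetical parsed tasks.json, with the proposed property as a
# top-level sibling of schema_version and rounds.
config = {
    "schema_version": "<schema URL>",
    "rounds": [],
    "output_type_id_datatype": "character",
}

print(resolve_output_type_id_datatype(config))            # uses the config value
print(resolve_output_type_id_datatype(config, "double"))  # caller overrides config
print(resolve_output_type_id_datatype({}))                # falls back to detection
```

Keeping the explicit argument as the highest-priority source preserves the existing override behaviour, while the config property only changes what happens when the caller does not specify anything.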

This would give admins the ability to future-proof their hubs by setting the column to character if they are unsure whether they may later start collecting an output type that could affect the schema.

Add output_type_id_datatype property to v3.0.1. Resolves #87

github-project-automation bot added this to hubverse Development overview Jul 1, 2024

github-project-automation bot moved this to Todo in hubverse Development overview Jul 1, 2024

annakrystalli added this to the robust-hub-schema milestone Jul 1, 2024

annakrystalli moved this from Todo to Up Next in hubverse Development overview Jul 17, 2024

annakrystalli mentioned this issue Jul 17, 2024

Validation fails matching output_type_id types hubverse-org/hubValidations#94

Closed

annakrystalli self-assigned this Jul 22, 2024

annakrystalli moved this from Up Next to In Progress in hubverse Development overview Jul 22, 2024

annakrystalli moved this from In Progress to Ready for Review in hubverse Development overview Jul 22, 2024

annakrystalli mentioned this issue Jul 22, 2024

Add output_type_id_datatype property to v3.0.1. Resolves #87 #88

Merged

annakrystalli linked a pull request Jul 24, 2024 that will close this issue

Add output_type_id_datatype property to v3.0.1. Resolves #87 #88

Merged

annakrystalli closed this as completed in 5535b56 Jul 24, 2024

annakrystalli closed this as completed in #88 Jul 24, 2024

annakrystalli added a commit that referenced this issue Jul 24, 2024

Merge pull request #88 from hubverse-org/v3.0.1-branch

ce9efd4

Add output_type_id_datatype property to v3.0.1. Resolves #87

github-project-automation bot moved this from Ready for Review to Done in hubverse Development overview Jul 24, 2024
