-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add output_type_id_datatype property to v3.0.1. Resolves #87 #88
Conversation
Not a review because I'm still swimming in the documentation sea, but here's a reference of the diff as of 0a93c9e, which I found helpful for comparison. diff -u --color=always v3.0.0/tasks-schema.json v3.0.1/tasks-schema.json --- v3.0.0/tasks-schema.json 2024-07-22 08:47:15.851365031 -0700
+++ v3.0.1/tasks-schema.json 2024-07-22 08:47:36.079408585 -0700
@@ -1,6 +1,6 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
- "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.0/tasks-schema.json",
+ "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/tasks-schema.json",
"title": "Schema for Modeling Hub model task definitions",
"description": "This is the schema of the tasks.json configuration file that defines the tasks within a modeling hub.",
"type": "object",
@@ -1439,6 +1439,18 @@
],
"additionalProperties": true
}
+ },
+ "output_type_id_datatype": {
+ "description": "The hub level data type of the output_type_id column.",
+ "examples": ["character"],
+ "type": "string",
+ "enum": [
+ "character",
+ "double",
+ "integer",
+ "logical",
+ "Date"
+ ]
}
},
"required": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little bit confused on that new optional parameters. I understand the idea behind it but I have some questions:
- Are we planning to use this parameter in other hubverse function?
- Also, I see why we want to have it at hub level but what happen if you want to have round with different output type id. I guess the solution here is to not use this parameter
Hey @LucieContamin ! There's a link to additional related issues to this change in the PR description above and a lot more detail in the issue. Also, this must be a hub level setting because all files, especially parquet must share the same schema to be opened as a dataset at a hub level. |
Thanks for the tip @zkamvar ! I reckon we should put that in the README as it's especially useful for these PRs |
Also I should mention that while the feature to override the column data type always existed in functions to create the schema it had not been exposed in the validations. This issue is the impetus for setting this work in action hubverse-org/hubValidations#94 |
Thanks for the quick reply and sorry I was not clear, I got a little bit lost in all the links, that's why I was asking too. But my understanding is that it will be used in the function to create the schema, validate the schema (of course), validate the PR and submission and validate the model output, with that new optional parameter as default. Is that correct?
That was my understanding for the hub level. |
Ah got it! If the parameter is not set, the rest of the functionality defaults to the previous default of autodetection. So no change to prior functionality if it's missing. Hopefully all this will be clearer in the downstream PRs in the related packages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the schema itself looks good, and I understand, based on the comments, why this is necessary. Maybe we could just add some more of the context to the description of the new optional field, like
"output_type_id_datatype": {
"description": "An optional specification at the hub-level of the simplest data type for the output_type_id column that can accommodate all data types for all model tasks. If not specified, downstream tasks such as validations will default to autodetecting the data type.",
Thanks @nickreich & @LucieContamin for comments!
--- v3.0.0/tasks-schema.json 2024-07-22 17:59:17
+++ v3.0.1/tasks-schema.json 2024-07-23 10:56:08
@@ -1,6 +1,6 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
- "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.0/tasks-schema.json",
+ "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/tasks-schema.json",
"title": "Schema for Modeling Hub model task definitions",
"description": "This is the schema of the tasks.json configuration file that defines the tasks within a modeling hub.",
"type": "object",
@@ -1439,6 +1439,20 @@
],
"additionalProperties": true
}
+ },
+ "output_type_id_datatype": {
+ "description": "The hub level data type of the output_type_id column. This data type must be shared across all files in the hub and be able to represent all output type ID values across all hub output types and rounds. If not provided or set to 'auto', hub defaults to autodetecting the simplest hub level data type.",
+ "default": "auto",
+ "examples": ["character"],
+ "type": "string",
+ "enum": [
+ "auto",
+ "character",
+ "double",
+ "integer",
+ "logical",
+ "Date"
+ ]
}
},
"required": [ --- v3.0.0/admin-schema.json 2024-07-22 17:59:17
+++ v3.0.1/admin-schema.json 2024-07-22 18:15:09
@@ -1,6 +1,6 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
- "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.0/admin-schema.json",
+ "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/admin-schema.json",
"title": "Schema for Modeling Hub administrative settings",
"description": "This JSON file provides a schema for modeling hub administrative information.",
"type": "object",
% |
Here's also a related PR in |
Also, in case you are interested in downstream implementation details @LucieContamin, I've now opened the first related PR in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes reflecting my suggestions.
This PR introduces an optional
output_type_id_datatype
property to enable hub administrators to configure and communicate theoutput_type_id
column data type at a hub level. This will allow hubs to override default behaviour of automatically determinining the simplest data type that can accomodate output type IDs across all output types when creating hub schema.The setting is also useful for administrators to future proof the
output_type_id
column from potential issues arising by changes in data type, introduced by new output types after submissions have begun, by settingoutput_type_id_datatype
to the simplest data type from the start, i.e. character (#87).For more info on related work see https://github.com/orgs/hubverse-org/projects/3/views/12