Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add output_type_id_datatype property to v3.0.1. Resolves #87 #88

Merged
merged 14 commits into from
Jul 24, 2024

Conversation

annakrystalli
Copy link
Member

@annakrystalli annakrystalli commented Jul 22, 2024

This PR introduces an optional output_type_id_datatype property to enable hub administrators to configure and communicate the output_type_id column data type at a hub level. This will allow hubs to override default behaviour of automatically determinining the simplest data type that can accomodate output type IDs across all output types when creating hub schema.

The setting is also useful for administrators to future proof the output_type_id column from potential issues arising by changes in data type, introduced by new output types after submissions have begun, by setting output_type_id_datatype to the simplest data type from the start, i.e. character (#87).

For more info on related work see https://github.com/orgs/hubverse-org/projects/3/views/12

@zkamvar
Copy link
Member

zkamvar commented Jul 22, 2024

Not a review because I'm still swimming in the documentation sea, but here's a reference of the diff as of 0a93c9e, which I found helpful for comparison.

diff -u --color=always v3.0.0/tasks-schema.json v3.0.1/tasks-schema.json
--- v3.0.0/tasks-schema.json	2024-07-22 08:47:15.851365031 -0700
+++ v3.0.1/tasks-schema.json	2024-07-22 08:47:36.079408585 -0700
@@ -1,6 +1,6 @@
 {
     "$schema": "https://json-schema.org/draft/2020-12/schema",
-    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.0/tasks-schema.json",
+    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/tasks-schema.json",
     "title": "Schema for Modeling Hub model task definitions",
     "description": "This is the schema of the tasks.json configuration file that defines the tasks within a modeling hub.",
     "type": "object",
@@ -1439,6 +1439,18 @@
                 ],
                 "additionalProperties": true
             }
+        },
+        "output_type_id_datatype": {
+            "description": "The hub level data type of the output_type_id column.",
+            "examples": ["character"],
+            "type": "string",
+            "enum": [
+                "character", 
+                "double", 
+                "integer",
+                "logical", 
+                "Date"
+            ]
         }
     },
     "required": [

Copy link

@LucieContamin LucieContamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little bit confused on that new optional parameters. I understand the idea behind it but I have some questions:

  • Are we planning to use this parameter in other hubverse function?
  • Also, I see why we want to have it at hub level but what happen if you want to have round with different output type id. I guess the solution here is to not use this parameter

@annakrystalli
Copy link
Member Author

annakrystalli commented Jul 22, 2024

Hey @LucieContamin !

There's a link to additional related issues to this change in the PR description above and a lot more detail in the issue.

Also, this must be a hub level setting because all files, especially parquet must share the same schema to be opened as a dataset at a hub level.

@annakrystalli
Copy link
Member Author

Thanks for the tip @zkamvar ! I reckon we should put that in the README as it's especially useful for these PRs

@annakrystalli
Copy link
Member Author

Also I should mention that while the feature to override the column data type always existed in functions to create the schema it had not been exposed in the validations. This issue is the impetus for setting this work in action hubverse-org/hubValidations#94

@LucieContamin
Copy link

There's a link to additional related issues to this change in the PR description above and a lot more detail in the issue.

Thanks for the quick reply and sorry I was not clear, I got a little bit lost in all the links, that's why I was asking too. But my understanding is that it will be used in the function to create the schema, validate the schema (of course), validate the PR and submission and validate the model output, with that new optional parameter as default. Is that correct?
I need to maybe think a little bit more, but I am little confused by this behavior of a optional parameter being the default of other function.

Also, this must be a hub level setting because all files, especially parquet must share the same schema to be opened as a dataset at a hub level.

That was my understanding for the hub level.

@annakrystalli
Copy link
Member Author

Ah got it!

If the parameter is not set, the rest of the functionality defaults to the previous default of autodetection. So no change to prior functionality if it's missing.

Hopefully all this will be clearer in the downstream PRs in the related packages.

@nickreich nickreich self-requested a review July 22, 2024 20:16
Copy link
Contributor

@nickreich nickreich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the schema itself looks good, and I understand, based on the comments, why this is necessary. Maybe we could just add some more of the context to the description of the new optional field, like

       "output_type_id_datatype": {
           "description": "An optional specification at the hub-level of the simplest data type for the output_type_id column that can accommodate all data types for all model tasks. If not specified, downstream tasks such as validations will default to autodetecting the data type.",

@annakrystalli
Copy link
Member Author

annakrystalli commented Jul 23, 2024

Thanks @nickreich & @LucieContamin for comments!
I've gone ahead and made the following changes (diff below):

  • Added more detail to the property description. Note as well that I'm planning a whole section in hubDocs about the topic: Document the requirement for a stable hub schema across rounds hubDocs#143
  • I though about the property being optional too. We definitely want this to be optional because we don't want this change to be a breaking one (which it would be if we made it required). What I have done, though, is added an "auto" option. This is equivalent in behaviour to not including the property at all (i.e. when the property is missing, all downstream functionality defaults to "auto"), but it means we have a way to encode this default behaviour via a property value, which feels right to be possible. It also means we can include the property, set to the default "auto" in our hubTemplate, so new admins can be aware of it.
  • I also added a note in the README with code snippets for producing file diffs between versions for easier reviewing.
--- v3.0.0/tasks-schema.json    2024-07-22 17:59:17
+++ v3.0.1/tasks-schema.json    2024-07-23 10:56:08
@@ -1,6 +1,6 @@
 {
     "$schema": "https://json-schema.org/draft/2020-12/schema",
-    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.0/tasks-schema.json",
+    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/tasks-schema.json",
     "title": "Schema for Modeling Hub model task definitions",
     "description": "This is the schema of the tasks.json configuration file that defines the tasks within a modeling hub.",
     "type": "object",
@@ -1439,6 +1439,20 @@
                 ],
                 "additionalProperties": true
             }
+        },
+        "output_type_id_datatype": {
+            "description": "The hub level data type of the output_type_id column. This data type must be shared across all files in the hub and be able to represent all output type ID values across all hub output types and rounds. If not provided or set to 'auto', hub defaults to autodetecting the simplest hub level data type.",
+            "default": "auto",
+            "examples": ["character"],
+            "type": "string",
+            "enum": [
+                "auto",
+                "character", 
+                "double", 
+                "integer",
+                "logical", 
+                "Date"
+            ]
         }
     },
     "required": [
--- v3.0.0/admin-schema.json    2024-07-22 17:59:17
+++ v3.0.1/admin-schema.json    2024-07-22 18:15:09
@@ -1,6 +1,6 @@
 {
     "$schema": "https://json-schema.org/draft/2020-12/schema",
-    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.0/admin-schema.json",
+    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/admin-schema.json",
     "title": "Schema for Modeling Hub administrative settings",
     "description": "This JSON file provides a schema for modeling hub administrative information.",
     "type": "object",
%   

@annakrystalli
Copy link
Member Author

Here's also a related PR in hubTemplate introducing the property but setting it to the previous "auto" behaviour by default: hubverse-org/hubTemplate#17

@annakrystalli
Copy link
Member Author

Also, in case you are interested in downstream implementation details @LucieContamin, I've now opened the first related PR in hubData hubverse-org/hubData#49

Copy link
Contributor

@nickreich nickreich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes reflecting my suggestions.

…trictions

Remove restrictive epidemic week formatting requirements for CDF output_type_id values. Resolves #80
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
4 participants