Add output_type_id_datatype property to v3.0.1. Resolves #87 #88

annakrystalli · 2024-07-22T15:23:57Z

This PR introduces an optional output_type_id_datatype property to enable hub administrators to configure and communicate the output_type_id column data type at a hub level. This will allow hubs to override default behaviour of automatically determinining the simplest data type that can accomodate output type IDs across all output types when creating hub schema.

The setting is also useful for administrators to future proof the output_type_id column from potential issues arising by changes in data type, introduced by new output types after submissions have begun, by setting output_type_id_datatype to the simplest data type from the start, i.e. character (#87).

For more info on related work see https://github.com/orgs/hubverse-org/projects/3/views/12

zkamvar · 2024-07-22T15:52:01Z

Not a review because I'm still swimming in the documentation sea, but here's a reference of the diff as of 0a93c9e, which I found helpful for comparison.

diff -u --color=always v3.0.0/tasks-schema.json v3.0.1/tasks-schema.json

--- v3.0.0/tasks-schema.json	2024-07-22 08:47:15.851365031 -0700
+++ v3.0.1/tasks-schema.json	2024-07-22 08:47:36.079408585 -0700
@@ -1,6 +1,6 @@
 {
     "$schema": "https://json-schema.org/draft/2020-12/schema",
-    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.0/tasks-schema.json",
+    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/tasks-schema.json",
     "title": "Schema for Modeling Hub model task definitions",
     "description": "This is the schema of the tasks.json configuration file that defines the tasks within a modeling hub.",
     "type": "object",
@@ -1439,6 +1439,18 @@
                 ],
                 "additionalProperties": true
             }
+        },
+        "output_type_id_datatype": {
+            "description": "The hub level data type of the output_type_id column.",
+            "examples": ["character"],
+            "type": "string",
+            "enum": [
+                "character", 
+                "double", 
+                "integer",
+                "logical", 
+                "Date"
+            ]
         }
     },
     "required": [

LucieContamin

I am a little bit confused on that new optional parameters. I understand the idea behind it but I have some questions:

Are we planning to use this parameter in other hubverse function?
Also, I see why we want to have it at hub level but what happen if you want to have round with different output type id. I guess the solution here is to not use this parameter

annakrystalli · 2024-07-22T16:18:22Z

Hey @LucieContamin !

There's a link to additional related issues to this change in the PR description above and a lot more detail in the issue.

Also, this must be a hub level setting because all files, especially parquet must share the same schema to be opened as a dataset at a hub level.

annakrystalli · 2024-07-22T16:20:56Z

Thanks for the tip @zkamvar ! I reckon we should put that in the README as it's especially useful for these PRs

annakrystalli · 2024-07-22T16:24:07Z

Also I should mention that while the feature to override the column data type always existed in functions to create the schema it had not been exposed in the validations. This issue is the impetus for setting this work in action hubverse-org/hubValidations#94

LucieContamin · 2024-07-22T16:29:56Z

There's a link to additional related issues to this change in the PR description above and a lot more detail in the issue.

Thanks for the quick reply and sorry I was not clear, I got a little bit lost in all the links, that's why I was asking too. But my understanding is that it will be used in the function to create the schema, validate the schema (of course), validate the PR and submission and validate the model output, with that new optional parameter as default. Is that correct?
I need to maybe think a little bit more, but I am little confused by this behavior of a optional parameter being the default of other function.

Also, this must be a hub level setting because all files, especially parquet must share the same schema to be opened as a dataset at a hub level.

That was my understanding for the hub level.

annakrystalli · 2024-07-22T16:36:06Z

Ah got it!

If the parameter is not set, the rest of the functionality defaults to the previous default of autodetection. So no change to prior functionality if it's missing.

Hopefully all this will be clearer in the downstream PRs in the related packages.

nickreich

I think the schema itself looks good, and I understand, based on the comments, why this is necessary. Maybe we could just add some more of the context to the description of the new optional field, like

       "output_type_id_datatype": {
           "description": "An optional specification at the hub-level of the simplest data type for the output_type_id column that can accommodate all data types for all model tasks. If not specified, downstream tasks such as validations will default to autodetecting the data type.",

annakrystalli · 2024-07-23T07:25:28Z

Thanks @nickreich & @LucieContamin for comments!
I've gone ahead and made the following changes (diff below):

Added more detail to the property description. Note as well that I'm planning a whole section in hubDocs about the topic: Document the requirement for a stable hub schema across rounds hubDocs#143
I though about the property being optional too. We definitely want this to be optional because we don't want this change to be a breaking one (which it would be if we made it required). What I have done, though, is added an "auto" option. This is equivalent in behaviour to not including the property at all (i.e. when the property is missing, all downstream functionality defaults to "auto"), but it means we have a way to encode this default behaviour via a property value, which feels right to be possible. It also means we can include the property, set to the default "auto" in our hubTemplate, so new admins can be aware of it.
I also added a note in the README with code snippets for producing file diffs between versions for easier reviewing.

--- v3.0.0/tasks-schema.json    2024-07-22 17:59:17
+++ v3.0.1/tasks-schema.json    2024-07-23 10:56:08
@@ -1,6 +1,6 @@
 {
     "$schema": "https://json-schema.org/draft/2020-12/schema",
-    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.0/tasks-schema.json",
+    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/tasks-schema.json",
     "title": "Schema for Modeling Hub model task definitions",
     "description": "This is the schema of the tasks.json configuration file that defines the tasks within a modeling hub.",
     "type": "object",
@@ -1439,6 +1439,20 @@
                 ],
                 "additionalProperties": true
             }
+        },
+        "output_type_id_datatype": {
+            "description": "The hub level data type of the output_type_id column. This data type must be shared across all files in the hub and be able to represent all output type ID values across all hub output types and rounds. If not provided or set to 'auto', hub defaults to autodetecting the simplest hub level data type.",
+            "default": "auto",
+            "examples": ["character"],
+            "type": "string",
+            "enum": [
+                "auto",
+                "character", 
+                "double", 
+                "integer",
+                "logical", 
+                "Date"
+            ]
         }
     },
     "required": [

--- v3.0.0/admin-schema.json    2024-07-22 17:59:17
+++ v3.0.1/admin-schema.json    2024-07-22 18:15:09
@@ -1,6 +1,6 @@
 {
     "$schema": "https://json-schema.org/draft/2020-12/schema",
-    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.0/admin-schema.json",
+    "$id": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/admin-schema.json",
     "title": "Schema for Modeling Hub administrative settings",
     "description": "This JSON file provides a schema for modeling hub administrative information.",
     "type": "object",
%

annakrystalli · 2024-07-23T07:59:36Z

Here's also a related PR in hubTemplate introducing the property but setting it to the previous "auto" behaviour by default: hubverse-org/hubTemplate#17

annakrystalli · 2024-07-23T09:37:22Z

Also, in case you are interested in downstream implementation details @LucieContamin, I've now opened the first related PR in hubData hubverse-org/hubData#49

nickreich

Thanks for the changes reflecting my suggestions.

…ut_type_id values. Resolves #80

…trictions Remove restrictive epidemic week formatting requirements for CDF output_type_id values. Resolves #80

Diff comment workflow

annakrystalli added 2 commits July 22, 2024 18:22

Add output_type_id_datatype property to v3.0.1. Resolves #87

5535b56

Clarify output_type_id_datatype is optional

0a93c9e

LucieContamin reviewed Jul 22, 2024

View reviewed changes

nickreich self-requested a review July 22, 2024 20:16

nickreich reviewed Jul 22, 2024

View reviewed changes

annakrystalli added 3 commits July 23, 2024 09:46

Add note about including diffs in a PR

7762ff1

Add more detail to output_type_id_datatype description

0275ff8

Add "auto" option

2347e1a

Better explain default

6120e95

annakrystalli added 2 commits July 23, 2024 11:23

Add tip about piping diffs to clipboard

7255671

Add tip about diff code blocks

d7cd67f

annakrystalli mentioned this pull request Jul 23, 2024

44/ Configure output_type_id from config hubverse-org/hubData#49

Merged

annakrystalli mentioned this pull request Jul 23, 2024

Add workflow to auto-generate diffs #90

Closed

zkamvar added 3 commits July 23, 2024 07:57

add diff comment workflow

a3442be

update README

71e68b9

typo

aa15450

annakrystalli mentioned this pull request Jul 23, 2024

Diff comment workflow #92

Merged

nickreich approved these changes Jul 24, 2024

View reviewed changes

annakrystalli added 3 commits July 24, 2024 17:09

Remove restrictive epidemic week formatting requirements for CDF outp…

e6550d7

…ut_type_id values. Resolves #80

Merge pull request #93 from hubverse-org/80/remove-epiweek-format-res…

0f5c1ed

…trictions Remove restrictive epidemic week formatting requirements for CDF output_type_id values. Resolves #80

Merge pull request #92 from hubverse-org/diff-comment-workflow

9734436

Diff comment workflow

This was linked to issues Jul 24, 2024

remove epidemic week formatting requirements for CDF category values #80

Closed

Introduce a property to fix the output_type_id column data type across the hub #87

Closed

annakrystalli merged commit ce9efd4 into main Jul 24, 2024
2 checks passed

annakrystalli deleted the v3.0.1-branch branch July 24, 2024 14:28

zkamvar mentioned this pull request Aug 13, 2024

add support for ascending CDF values with numeric output_type_id cast as character hubverse-org/hubValidations#105

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add output_type_id_datatype property to v3.0.1. Resolves #87 #88

Add output_type_id_datatype property to v3.0.1. Resolves #87 #88

annakrystalli commented Jul 22, 2024 •

edited

Loading

zkamvar commented Jul 22, 2024 •

edited

Loading

LucieContamin left a comment

annakrystalli commented Jul 22, 2024 •

edited

Loading

annakrystalli commented Jul 22, 2024

annakrystalli commented Jul 22, 2024

LucieContamin commented Jul 22, 2024

annakrystalli commented Jul 22, 2024

nickreich left a comment

annakrystalli commented Jul 23, 2024 •

edited

Loading

annakrystalli commented Jul 23, 2024

annakrystalli commented Jul 23, 2024

nickreich left a comment

Add output_type_id_datatype property to v3.0.1. Resolves #87 #88

Add output_type_id_datatype property to v3.0.1. Resolves #87 #88

Conversation

annakrystalli commented Jul 22, 2024 • edited Loading

zkamvar commented Jul 22, 2024 • edited Loading

LucieContamin left a comment

Choose a reason for hiding this comment

annakrystalli commented Jul 22, 2024 • edited Loading

annakrystalli commented Jul 22, 2024

annakrystalli commented Jul 22, 2024

LucieContamin commented Jul 22, 2024

annakrystalli commented Jul 22, 2024

nickreich left a comment

Choose a reason for hiding this comment

annakrystalli commented Jul 23, 2024 • edited Loading

annakrystalli commented Jul 23, 2024

annakrystalli commented Jul 23, 2024

nickreich left a comment

Choose a reason for hiding this comment

annakrystalli commented Jul 22, 2024 •

edited

Loading

zkamvar commented Jul 22, 2024 •

edited

Loading

annakrystalli commented Jul 22, 2024 •

edited

Loading

annakrystalli commented Jul 23, 2024 •

edited

Loading