
Update docs to reflect new schema version v3.0.1 #163

Merged: 5 commits merged into main from br-v3.0.1 on Aug 5, 2024

Conversation

@annakrystalli (Member) commented Aug 2, 2024

This PR:

  • Adds section on the importance of a stable model output file schema
  • Introduces the output_type_id_datatype property

Both resolve #143 and bring the documentation up to date with schema version v3.0.1.

@LucieContamin (Contributor) left a comment

I might have misunderstood something, but I don't think a different format is always an issue. It might cause one, but that is not always the case, so I think we don't need to be really "strict" about that. For example, you say:

"While validation of config files will alert hub administrations to discrepancies in task ID value data types across modeling tasks and rounds, any changes to a hub's config which has the potential to change the overall data type of model output columns after submissions have been collected could cause issues downstream and should be avoided. "

I don't think "SHOULD" is the word here, because in a lot of cases it might be totally fine, especially as the schema is generated from all the rounds and will often force the column to character without causing any issue.

The output type ID can be in different formats in multiple files and still work if set to character, for example. So the output type ID can be flexible, and I think it's really important to have that, as some hubs might evolve and need it.


The benefit of this automatic detection is that it provides flexibility to the `output_type_id` column to adapt to the output types a hub is actually collecting. For example, a hub which only collects `mean` and `quantile` output types would, by default, have a `double` `output_type_id` column.

The risk of this automatic detection, however, arises if, in subsequent rounds (after submissions have begun), the hub decides to also start collecting a `pmf` output type. This would change the default `output_type_id` column data type from `double` to `character` and cause a conflict between the `output_type_id` column data types in older and newer files when trying to open the hub as an `arrow` dataset.
@LucieContamin (Contributor) commented Aug 2, 2024

I am not sure it will cause a conflict in this case. Arrow seems to only have issues in certain cases. It needs more testing, but it seems that casting double to character does not cause an issue, while casting character to double does.

For example, I have two files, with output type ID as character or numeric and horizon as logical or numeric:

> str(arrow::read_parquet("test/team2-modelb/2023-11-12-team2-modelb.parquet"))
Classes 'data.table' and 'data.frame':  1885 obs. of  9 variables:
[...]
 $ output_type_id: chr  "EW202422" "EW202421" "EW202420" "EW202419" ...
 $ horizon       : num  NA NA NA NA NA NA NA NA NA NA ...
 
> str(arrow::read_parquet("test/teama-model1/2023-11-12-teama-model1.parquet"))
Classes 'data.table' and 'data.frame':  1495 obs. of  9 variables:
 [...]
 $ output_type_id: num  0.01 0.025 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 ...
 $ horizon       : logi  NA NA NA NA NA NA ...

If I open the files with the output type ID column expected to be "numeric", I get an error (as expected):

> schema
Schema
[...]
horizon: int32
output_type_id: double

> dc <- hubData::connect_model_output("test/", file_format = "parquet", schema = schema)
> dplyr::collect(dc)
Error in `compute.Dataset()`:
! Invalid: Failed to parse string: 'EW202346' as a scalar of type double

but it works well if the output type id is expected to be character:

> schema
Schema
[...]
horizon: int32
output_type_id: string

> dc <- hubData::connect_model_output("test/", file_format = "parquet", schema = schema)
> dplyr::collect(dc)
# A tibble: 3,380 × 10
   origin_date scenario_id  target       horizon location age_group output_type output_type_id value
   <date>      <chr>        <chr>          <int> <chr>    <chr>     <chr>       <chr>          <dbl>
 1 2023-11-12  A-2023-10-27 peak time h…      NA 06       0-130     cdf         EW202422           1
 2 2023-11-12  A-2023-10-27 peak time h…      NA 06       0-130     cdf         EW202421           1
 3 2023-11-12  A-2023-10-27 peak time h…      NA 06       0-130     cdf         EW202420           1
 4 2023-11-12  A-2023-10-27 peak time h…      NA 06       0-130     cdf         EW202419           1
 5 2023-11-12  A-2023-10-27 peak time h…      NA 06       0-130     cdf         EW202418           1
 6 2023-11-12  A-2023-10-27 peak time h…      NA 06       0-130     cdf         EW202417           1
 7 2023-11-12  A-2023-10-27 peak time h…      NA 06       0-130     cdf         EW202416           1
 8 2023-11-12  A-2023-10-27 peak time h…      NA 06       0-130     cdf         EW202415           1
 9 2023-11-12  A-2023-10-27 peak time h…      NA 06       0-130     cdf         EW202414           1
10 2023-11-12  A-2023-10-27 peak time h…      NA 06       0-130     cdf         EW202413           1
# ℹ 3,370 more rows
# ℹ 1 more variable: model_id <chr>
# ℹ Use `print(n = ...)` to see more rows

And it seems that the horizon column does not encounter any issues either. So some casting works but not all, and my guess is that forcing the columns to character will always work and can be used to bypass some issues.
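To make the failure mode concrete, here is a minimal self-contained sketch (hypothetical file names and toy data) of the conflict described in the quoted passage below, i.e. what happens when no explicit schema is supplied and arrow infers one from the first file:

```r
# Minimal sketch, hypothetical file names: two parquet files whose
# output_type_id columns were written with different data types.
library(arrow)

dir.create("model-output/demo-model", recursive = TRUE)
write_parquet(
  data.frame(output_type_id = 0.5, value = 10.2),              # double
  "model-output/demo-model/2024-01-06-demo-model.parquet"
)
write_parquet(
  data.frame(output_type_id = "large_increase", value = 0.2),  # character
  "model-output/demo-model/2024-01-13-demo-model.parquet"
)

# With no schema supplied, the dataset schema is inferred from the first
# file (double), so scanning the second file's character values fails:
ds <- open_dataset("model-output", format = "parquet")
dplyr::collect(ds)
#> Error in `compute.Dataset()`:
#> ! Invalid: Failed to parse string: 'large_increase' as a scalar of type double
```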

If not supplied or if `"auto"` is set, the default behaviour of automatically detecting the data type from `output_type_id` values is used.

This gives hub administrators the ability to future proof the `output_type_id` column in their model output files if they are unsure whether they may start collecting an output type that could affect the schema, by setting the column to `"character"` (the safest data type that all other values can be encoded as) at the start of data collection.
Contributor commented:

I don't think the "at the start" is necessary here. Setting it later on will also work, I think.

Contributor commented:

I think I agree with Lucie.

unrelated: "future proof" should be "future-proof".

Contributor commented:

I guess the issue with changing the specification after the start is that analysis/data-retrieval/munging code that worked before might need to be changed. So maybe we should specify that it could be changed later, but that any changes might impact other code already in place. So maybe this fits in the "soft recommendation" category (maybe "should" is OK, with an explanation of why that is the recommendation).
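For reference, a sketch of what setting the `output_type_id` data type to `"character"` can look like in practice; the `output_type_id_datatype` argument to `hubData::connect_hub()` and the hub path here are assumptions for illustration:

```r
# Sketch: pin output_type_id to character when connecting to a hub,
# regardless of the output types collected so far. The same choice can be
# made hub-wide via the output_type_id_datatype property in tasks.json.
library(hubData)
hub_con <- connect_hub("path/to/hub", output_type_id_datatype = "character")
```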


> Note the difference in the following discussion between [hubverse schema](https://github.com/hubverse-org/schemas) (the schema against which hub config files are validated) and [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) (the mapping of model output columns to data types).

Because we store model output data as separate files but open them as a single `arrow` dataset using the `hubData` package, for a hub to be successfully accessed as an `arrow dataset`, it's necessary to ensure that all files conform to the same [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) (i.e. share the same column data types) across the lifetime of the hub. This means that additions of new rounds should not change the overall hub schema at a later date (i.e. after submissions have already started being collected).
Contributor commented:

minor suggestion: it's --> it is
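A sketch of what pinning a consistent `arrow` schema can look like in practice, reusing the `connect_model_output()` call from the examples above (column set abridged and assumed for illustration):

```r
# Sketch: declare the expected column data types once and reuse them on
# every read, so all files are interpreted identically over the hub's
# lifetime.
library(arrow)
schema <- schema(
  origin_date    = date32(),
  horizon        = int32(),
  location       = string(),
  output_type    = string(),
  output_type_id = string(),  # character: the safest common data type
  value          = float64()
)
dc <- hubData::connect_model_output("test/", file_format = "parquet",
                                    schema = schema)
```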

@nickreich (Contributor) left a comment

I think overall the documentation here is very clear and well written. I added a few small grammatical suggestions (ignore them at the risk of incurring the grumpiness of a recovering English major!). And see the specific comments/replies to @LucieContamin's comments about including some clarifications about why we make some specific recommendations.

@annakrystalli (Member, Author) commented Aug 5, 2024

Thanks for the comments @LucieContamin and @nickreich!

Regarding your request to relax the language: note, firstly, that I am careful to use language like should and could. I have also gone ahead and added a clarification about the potential implications. However, I feel very strongly about being firm with requirements for model output data, for the following reasons:

  1. We have collectively already spent far more time than I would have wanted banging our heads against arrow data types (e.g. hubverse-transform#14: supply a parquet schema; hubData#9: issues with filtering data from UnionDatasets/FileSystemDatasets when actual type_id data types in files are mixed; etc.). There is good reason to direct our users away from such experiences.
  2. Even in your example, @LucieContamin, you are using collect() straight after opening the dataset. While this will work if an appropriate schema is supplied, trying to run a query on columns of parquet files with mixed data types before collecting does not work. This is a major reduction in intended functionality, especially when dealing with large and/or cloud hubs, and I would go so far as to say that having to collect all data before being able to do any filtering is a hack, as would be forcing all columns to character to get around data type issues.
  3. This means that a change in data type once collection has begun has more important implications than just the potential to affect downstream analysis code: it can effectively render columns unqueryable. Because of all this, I feel the strong language is justifiable.
  4. Finally, the strong language is also meant to act as a disclaimer. Trying to understand / communicate / support the full extent of arrow's idiosyncrasies is not something I feel the maintainers of our packages should have to deal with. The advice is therefore also meant to warn users of the conditions under which we can guarantee our tools will work, and that if they don't follow our advice we can't be held responsible for reduced functionality or for having to introduce hacks to work with their data.

Having said that, once we have created functionality to inspect a hub for integrity, we are planning to add functionality that could repair any data type discrepancies and update files to conform to a changed schema (hubverse-org/hubValidations#92). This could help admins in a situation where a breaking schema change needs to be introduced. However, in general, and certainly until such functionality exists, I feel very strongly about being firm in our advice to steer users away from hub schema changes.

@LucieContamin (Contributor) commented

Thanks @annakrystalli for your reply; to follow up on that:

  1. I agree, but in the meantime I think it's quite an important recommendation we are making here, so I think it deserves some time spent on it.

  2. So, I tested again with the same example I used in my comments, and I can filter:

> dc <- hubData::connect_model_output("test/", file_format = "parquet", schema = schema)
> df <- dplyr::filter(dc, location == "06", output_type == "cdf")
> hubData::collect_hub(df)
# A tibble: 145 × 10
   model_id     origin_date scenario_id target horizon location age_group output_type output_type_id
 * <chr>        <date>      <chr>       <chr>    <int> <chr>    <chr>     <chr>       <chr>         
 1 team2-modelb 2023-11-12  A-2023-10-… peak…      NA 06       0-130     cdf         EW202422
 2 team2-modelb 2023-11-12  A-2023-10-… peak…      NA 06       0-130     cdf         EW202421
 3 team2-modelb 2023-11-12  A-2023-10-… peak…      NA 06       0-130     cdf         EW202420
 4 team2-modelb 2023-11-12  A-2023-10-… peak…      NA 06       0-130     cdf         EW202419
 5 team2-modelb 2023-11-12  A-2023-10-… peak…      NA 06       0-130     cdf         EW202418
 6 team2-modelb 2023-11-12  A-2023-10-… peak…      NA 06       0-130     cdf         EW202417
 7 team2-modelb 2023-11-12  A-2023-10-… peak…      NA 06       0-130     cdf         EW202416
 8 team2-modelb 2023-11-12  A-2023-10-… peak…      NA 06       0-130     cdf         EW202415
 9 team2-modelb 2023-11-12  A-2023-10-… peak…      NA 06       0-130     cdf         EW202414
10 team2-modelb 2023-11-12  A-2023-10-… peak…      NA 06       0-130     cdf         EW202413
# ℹ 135 more rows
# ℹ 1 more variable: value <dbl>
# ℹ Use `print(n = ...)` to see more rows

As a reminder, these are my input files:

> str(arrow::read_parquet("test/team2-modelb/2023-11-12-team2-modelb.parquet"))
Classes 'data.table' and 'data.frame':  1885 obs. of  9 variables:
 [...]
 $ output_type_id: chr  "EW202422" "EW202421" "EW202420" "EW202419" ...
 $ horizon       : num  NA NA NA NA NA NA NA NA NA NA ...

> str(arrow::read_parquet("test/teama-model1/2023-11-12-teama-model1.parquet"))
Classes 'data.table' and 'data.frame':  1495 obs. of  9 variables:
[...]
 $ output_type_id: num  0.01 0.025 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 ...
 $ horizon       : logi  NA NA NA NA NA NA ...

So I don't know if something changed in Arrow or if they fixed something, but it might be good to re-test to be sure the issue is still there and deserves a "firm" recommendation. I am thinking especially of the scenario hubs, where it can be really difficult to guess what you are going to do in future rounds.

@annakrystalli (Member, Author) commented

Yes, but you did not filter on the output_type_id column, which is likely the column that will have the problem. I also re-ran all the previous examples in the linked exploration above, and none of them work, as before.

@LucieContamin (Contributor) commented

I see, yes, filtering on output_type_id returns an error; sorry I missed that. But now that I think about it, it does make sense, as they probably cast the column on collect.
However, filtering on the other columns works fine, so you could technically bypass it with certain filters, to a certain degree. But I see your point.

However, I am still not sure about being "strict", for example:

Because we store model output data as separate files but open them as a single `arrow` dataset using the `hubData` package, for a hub to be successfully accessed as an `arrow dataset`, it is necessary to ensure that all files conform to the same [`arrow schema`](https://arrow.apache.org/docs/11.0/r/reference/Schema.html) (i.e. share the same column data types) across the lifetime of the hub. This means that additions of new rounds should not change the overall hub schema at a later date (i.e. after submissions have already started being collected).

I think this is misleading. As we just saw, you can load your whole dataset and/or filter on other columns, and it will be fine. For a Scenario Hub, my guess is that this rule is going to be difficult to follow. And if a hub has a round with a different output type, users can still load the full hub and filter just that particular round and output type (as only the ID column will be different).

What I am trying to say is that I don't think it should be on the hubverse to be "strict"; we can just quickly explain the behavior: filtering on a column with different formats will not work, but other filtering and loading will work (you might just need to provide a schema). And it can be on the hub maintainer's side to deal with it.

The output_type_id format is variable in SMH, for example, and I think it's on us to adapt if we want to use the hubData tools. Does that make sense?

@annakrystalli (Member, Author) commented

As I said, the language I use is should and could. If you can introduce the potential for errors in a scenario hub and get away with not filtering on those particular columns, then that is lucky for you, but it should not affect our general advice for the best experience. To me that is still a lucky hack. So I still stand by my original comment, especially as a disclaimer for potential problems that might arise if the advice is not followed.

@LucieContamin (Contributor) commented

I totally understand, and I am not saying we should not advise that; sorry if I am not being clear. Also, as English is not my first language, maybe I am interpreting the documentation a little wrong; if so, I apologize.

I am going to try to be clearer about what I am trying to say:

  • I think it's good to recommend that a hub keep columns in the same format for all rounds, and to say that this will ensure all the hub tools work as intended on the data.

  • However, for certain hubs it might be difficult to ensure that; I am thinking especially of the scenario hubs, which depend on scenarios, where you might not be able to prepare for the targets you are going to ask for in the future. So I am afraid that such hubs might understand this as meaning they cannot use the hubverse tools with columns of different formats, which is not totally true (not wrong either, I agree); it might just require a workaround and mean not being able to use the tools at full capacity. For example, when we say "it is necessary to ensure that all files conform to the same [arrow schema]", I understand it as meaning we cannot have targets with different output type formats in the same schema if we add the new-format target later on. I understand why we don't want people doing that, as it will potentially cause issues; however, it can technically be bypassed by supplying a schema. And I totally understand that loading will work but that they might encounter issues at a later step, and that filtering around it can be a lucky hack. What I am thinking is just that we could maybe either rephrase or add something saying that if you have columns of different formats, you can still use some of our tools, but we cannot guarantee everything will work as expected (for example, filtering before collect on a column with different formats will not work) and you might get errors. I just think adding this kind of information might help some hubs.

Again, if I have misinterpreted the documentation update, I apologize. I am just thinking of the new hubs that might worry about setting up their first rounds with a lot of uncertainty about future rounds, if that makes sense.

@annakrystalli (Member, Author) commented

@LucieContamin to your comment about

> for a hub to be successfully accessed as an arrow dataset, it is necessary to ensure that all files conform to the same arrow schema (i.e. share the same column data types) across the lifetime of the hub. This means that additions of new rounds should not change the overall hub schema

being misleading: I can see your point, so I clarified the meaning of what a successful interaction with a hub would be, i.e. for a hub to be successfully accessed and fully queryable across all columns as an arrow dataset...

I feel that if a hub breaks these expectations, it should be up to the hub to communicate to its users what functionality will cause errors. Does that seem fair?

@annakrystalli (Member, Author) commented

BTW, re detailing the different kinds of potential errors, I did add a bit more detail here:

While validation of config files will alert hub administrators to discrepancies in task ID value data types across modeling tasks and rounds, any changes to a hub's config which have the potential to change the overall data type of model output columns after submissions have been collected could cause issues downstream and should be avoided. These issues can range from data type casting being required in downstream analysis code that used to work, to not being able to filter on columns with data type discrepancies between files before collecting, to an inability to open hub model output data as an `arrow` dataset. They are primarily a problem for parquet files, which encapsulate a schema within the file, but have a small chance of causing parsing errors in CSVs too.
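To illustrate the first kind of issue, a hypothetical sketch (assuming a collected tibble `tbl` of model output) of analysis code that used to rely on a numeric `output_type_id` and now needs an explicit cast:

```r
# Hypothetical sketch: once output_type_id has become character, numeric
# comparisons in previously working analysis code need an explicit cast
# (values that cannot be parsed as numbers become NA with a warning).
library(dplyr)
quantile_50 <- tbl |>
  filter(output_type == "quantile") |>
  mutate(output_type_id = as.numeric(output_type_id)) |>
  filter(output_type_id == 0.5)
```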

I think that should be enough, as I don't think hubDocs is the place for detailed discussions of all the ways arrow can go wrong.

Regarding adding new output types, that's why we've introduced the output_type_id_datatype property: for hubs that are not quite sure what might come, it allows them to future-proof the column from the start as character.

It sounds like the scenario hub might make good use of the hub repair functionality in future, once it is developed. And really, I think that should be our best-practice advice going forward, i.e. if you do get yourself into a corner, use the hub repair functionality to correct your data rather than put up with limited or hacky ways to access it.

@LucieContamin (Contributor) commented

I agree, it should be enough. Thank you very much for the updates!

I might be wrong, but I agree: I think the scenario hubs will make good use of the hub repair functionality (US SMH will use it for past rounds), and yes, good practice should be to try to avoid columns of mixed formats.

@LucieContamin (Contributor) left a comment

Thank you again for the update and additional information!

@annakrystalli (Member, Author) commented

Thanks @LucieContamin ! Always appreciate your willingness to engage and think things through!

@annakrystalli merged commit a273815 into main on Aug 5, 2024. 1 check passed.
@annakrystalli deleted the br-v3.0.1 branch on August 6, 2024.

Merging this pull request closed: Document the requirement for a stable hub schema across rounds (#143)