Use pyarrow dtype_backend #1781

visr · 2024-08-30T21:56:42Z

The dtype_backend part of this is not breaking. The only part that is technically breaking is that we specify a unit of milliseconds for the Arrow time type. Previously we used the default nanosecond precision, which was then truncated to milliseconds in the core. I think it is better to disallow precision higher than milliseconds if we cannot distinguish them in the core.

evetion · 2024-09-24T07:11:54Z

I'd like to see some extra tests that add or set a default pandas DataFrame without the correct dtype and time. Edit: basically trying to break it ;)

DanielTollenaar · 2024-09-24T09:01:43Z

Tested it on the De Dommel workflow and got a model that is written with the Python API, then read and simulated without ValidationErrors or Exceptions. I didn't check any results, but the model and results are here: https://we.tl/t-Uxk9xra8i1

However, I do repeatedly get this FutureWarning, which is cluttering my logging quite a lot:

(file:///D:/repositories/Ribasim/python/ribasim/ribasim/model.py:239): FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  pd.concat(df_chunks)
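One way to keep this warning out of the logs is to filter it locally around the concat call. The sketch below is only an illustration under assumptions: `concat_quietly` and the sample chunks are hypothetical, and the actual change made in model.py is not shown in this thread.

```python
import warnings

import pandas as pd


def concat_quietly(chunks):
    """Concatenate frames while ignoring the empty/all-NA FutureWarning.

    Hypothetical helper; the filter only applies inside this function.
    """
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="The behavior of DataFrame concatenation",
            category=FutureWarning,
        )
        return pd.concat(chunks, ignore_index=True)


# Illustrative chunks, one of which has an all-NA column.
chunks = [
    pd.DataFrame({"node_id": [1, 2], "level": [0.5, 1.0]}),
    pd.DataFrame({"node_id": [3], "level": [float("nan")]}),
]
result = concat_quietly(chunks)
```

Scoping the filter with `warnings.catch_warnings()` leaves other FutureWarnings visible elsewhere in the program.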

core/src/util.jl

core/src/validation.jl

core/test/validation_test.jl

docs/guide/examples.ipynb

pixi.toml

python/ribasim/ribasim/delwaq/generate.py

visr · 2024-09-24T12:56:24Z

python/ribasim/ribasim/input_base.py

-                df = gpd.read_file(path, layer=table, fid_as_index=True)
+                df = pyogrio.read_dataframe(
+                    path,
+                    layer=table,
+                    fid_as_index=True,
+                    use_arrow=True,
+                    arrow_to_pandas_kwargs={"types_mapper": pd.ArrowDtype},
+                )


Only pyogrio supported this directly, geopandas not yet.

Would it be best to place an inline comment on the specific kwarg that's not supported yet?

Actually I misunderstood that all extra kwargs were passed from gpd.read_file to pyogrio.read_dataframe if engine="pyogrio", so I went back to gpd.read_file and added a comment on arrow_to_pandas_kwargs, which ends up in pyarrow, not pyogrio.

visr · 2024-09-24T12:59:03Z

python/ribasim_testmodels/ribasim_testmodels/continuous_control.py

-                time=pd.date_range(start="2020-01-01", end="2021-01-01", periods=100),
+                time=pd.date_range(
+                    start="2020-01-01", end="2021-01-01", periods=100, unit="ms"
+                ),


This needed unit="ms"; otherwise the nanosecond-precision times now get a ValidationError.

Can't we automatically convert such timestamps? I fear that users might hit this and have to find this fix manually.

The only point where we got nanosecond precision was with the date_range with periods pattern. I feel like this pattern will rarely be used for real models, maybe only synthetic ones like our docs. I don't think I've ever come across hydrological data with nanosecond precision.

Nanosecond units will automatically convert to millisecond as long as it doesn't lose data. The ValidationError is also quite good. I prefer not to throw a warning on this as it usually doesn't matter. If you know an easy way to round or truncate to the nearest millisecond I'd be ok with that as well. But I doubt many users will run into this.

evetion

Looking good, but I have my doubts about how we handle the pandas defaults (both for time and for the Arrow dtype).

In terms of dtype, I think it's automatically converted, so we should be fine there. For time, however, I'd prefer that we convert (maybe with a warning) instead of erroring out. That still seems an improvement over silently dropping precision. That said, it's mostly a usability concern for me here.

core/src/validation.jl

docs/guide/examples.ipynb

python/ribasim/ribasim/config.py

python/ribasim/ribasim/input_base.py

evetion · 2024-09-25T07:46:08Z

python/ribasim/ribasim/schemas.py

-    node_id: Series[Int32] = pa.Field(nullable=False, default=0)
-    area: Series[float] = pa.Field(nullable=False)
-    level: Series[float] = pa.Field(nullable=False)
+    node_id: Series[Annotated[pd.ArrowDtype, pyarrow.int32()]] = pa.Field(nullable=True)


You have lost the defaults here, was that on purpose?

Yes, 0 was a dummy default since np.int32 doesn't support missing values, but pyarrow.int32 does.

Well, isn't it also because node_id is a required field and can't be missing schema-wise?

python/ribasim/ribasim/utils.py

visr added 3 commits August 30, 2024 23:55

Use pyarrow dtype_backend

8cbb1ed

Merge branch 'main' into pyarrow

bbe517f

Run codegen

ee64728

visr added the breaking A change that breaks existing models label Sep 23, 2024

visr added 5 commits September 23, 2024 17:06

Coalesce priority to 0

1f8e09a

Ensure timestamps are equal

5ea17fb

Fixes

1a6c087

Fix ms unit in example

ba469e5

Ensure data is available at the start, not 1 ms too late

9d0d0c7

visr added 2 commits September 24, 2024 10:04

Merge branch 'main' into pyarrow

a9fb8d1

try

d49d634

visr added 4 commits September 24, 2024 12:57

Put in fill_value in pivot_table

990f528

Try to address the futurewarning

e6fcaf3

Failing to break it

b519132

Update remaining read_feather

9d7790f

visr marked this pull request as ready for review September 24, 2024 12:46

visr commented Sep 24, 2024

Revert fix that didn't concat FutureWarning

320fb49

visr requested a review from evetion September 24, 2024 13:01

visr added 3 commits September 24, 2024 15:43

Silence pd.concat warning that doesn't affect the result

ada1f83

Remove global geopandas engine setting

71c13bd

Set CRS of Basin / area

71d9fbc

evetion requested changes Sep 25, 2024

visr added 4 commits September 25, 2024 20:57

Address review comments

fdf25d9

restore nullable

baa8756

Add default 0 for node_id

fd6908d

Priority is optional

cbc936b

visr requested a review from evetion September 26, 2024 11:05

evetion approved these changes Sep 26, 2024

visr merged commit f5bfb50 into main Sep 26, 2024
27 checks passed

visr deleted the pyarrow branch September 26, 2024 12:12
