Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pyarrow dtype_backend #1781

Merged
merged 22 commits into from
Sep 26, 2024
Merged

Use pyarrow dtype_backend #1781

merged 22 commits into from
Sep 26, 2024

Conversation

visr
Copy link
Member

@visr visr commented Aug 30, 2024

Fixes #1721
Fixes #1394

The dtype_backend part of this is not breaking. The only part that is technically breaking is that we specify a unit of milliseconds for the Arrow time type. Previously we used the default nanosecond precision, which was then truncated to milliseconds in the core. I think it is better to disallow precision higher than milliseconds if we cannot distinguish them in the core.

@visr visr added the breaking A change that breaks existing models label Sep 23, 2024
@evetion
Copy link
Member

evetion commented Sep 24, 2024

I'd like to see some extra tests with adding/setting a default pandas dataframe, without the correct dtypte and time. edit: Basically trying to break it ;)

@DanielTollenaar
Copy link
Contributor

Tested it on De Dommel workflow and get a model that gets written with the Python-API and read and simulated without ValidationErrors or Exceptions. Didn't test/check any results, but the model + results are here: https://we.tl/t-Uxk9xra8i1

However, I do repeatedly get this FutureWarning which is messing up my logging quite alot:

(file:///D:/repositories/Ribasim/python/ribasim/ribasim/model.py:239): FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  pd.concat(df_chunks)

@visr visr marked this pull request as ready for review September 24, 2024 12:46
core/src/util.jl Outdated Show resolved Hide resolved
core/src/validation.jl Show resolved Hide resolved
core/test/validation_test.jl Outdated Show resolved Hide resolved
docs/guide/examples.ipynb Outdated Show resolved Hide resolved
pixi.toml Outdated Show resolved Hide resolved
python/ribasim/ribasim/delwaq/generate.py Show resolved Hide resolved
Comment on lines 377 to 387
df = gpd.read_file(path, layer=table, fid_as_index=True)
df = pyogrio.read_dataframe(
path,
layer=table,
fid_as_index=True,
use_arrow=True,
arrow_to_pandas_kwargs={"types_mapper": pd.ArrowDtype},
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only pyogrio supported this directly, geopandas not yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best to place an inline comment to the specific kwarg that's not supported yet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I misunderstood that all extra kwargs were passed from gpd.read_file to pyogrio.read_dataframe if engine="pyogrio", so I went back to gpd.read_file and added a comment on arrow_to_pandas_kwargs, which ends up in pyarrow, not pyogrio.

Comment on lines -28 to +30
time=pd.date_range(start="2020-01-01", end="2021-01-01", periods=100),
time=pd.date_range(
start="2020-01-01", end="2021-01-01", periods=100, unit="ms"
),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needed unit="ms" otherwise the nanosecond precision times get a ValidationError now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we automatically convert such timestamps? I fear that users might hit this and have to find this fix manually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only point where we got nanosecond precision was with the data_range with periods pattern. I feel like this pattern will rarely be used for real models, maybe only synthetic ones like our docs. I don't think I've ever come across hydrological data with nanosecond precision.

Nanosecond units will automatically convert to millisecond as long as it doesn't lose data. The ValidationError is also quite good. I prefer not to throw a warning on this as it usually doesn't matter. If you know an easy way to round or truncate to the nearest millisecond I'd be ok with that as well. But I doubt many users will run into this.

@visr visr requested a review from evetion September 24, 2024 13:01
Copy link
Member

@evetion evetion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, but I have my doubts on how we handle the defaults of pandas (both for time as the arrow dtype).

In terms of dtype, I think it's automatically converted, so we should be fine there. However, in terms of time I'd rather we convert (maybe with a warning), rather than erroring out. That still seems an improvement over silently dropping precision. That said it's mostly a usability concern for me here.

core/src/validation.jl Show resolved Hide resolved
docs/guide/examples.ipynb Outdated Show resolved Hide resolved
python/ribasim/ribasim/config.py Outdated Show resolved Hide resolved
python/ribasim/ribasim/input_base.py Show resolved Hide resolved
node_id: Series[Int32] = pa.Field(nullable=False, default=0)
area: Series[float] = pa.Field(nullable=False)
level: Series[float] = pa.Field(nullable=False)
node_id: Series[Annotated[pd.ArrowDtype, pyarrow.int32()]] = pa.Field(nullable=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have lost the defaults here, was that on purpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 0 was a dummy-default since np.int32 doesn't support missings, but pyarrow.int32 does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well its also because the node_id is a required field and can't be missing schema wise?

python/ribasim/ribasim/utils.py Show resolved Hide resolved
@visr visr requested a review from evetion September 26, 2024 11:05
@visr visr merged commit f5bfb50 into main Sep 26, 2024
27 checks passed
@visr visr deleted the pyarrow branch September 26, 2024 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking A change that breaks existing models
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use Pandas Arrow backend Mark table columns as optional as described in the docs
3 participants