Better schema validation errors in Ribasim Python #904

Fati-Mon · 2023-12-15T16:50:38Z

What

When adding the node_id, time, precipitation and potential_evaporation for basins in transient, the validation error is shown.
Makes sense, it needs also the columns 'drainage' etc. However, the validation error was not clear. Which columns or which types were needed? Especially, if you don't need them for your current model simulation.

Why and how

Making the error more specific. The user will, most likely, understand the error and what needs to be adjusted. In this case adding empty 'drainage', 'urban_runoff', and 'infiltration' columns, even though they are not used. Also, the value should then be 0.0 not 0, since it needs float type.

Additional context

model.basin.time.df = forcing
Traceback (most recent call last):

  Cell In[115], line 1
    model.basin.time.df = forcing

  File C:\Ribasim9\.pixi\env\Lib\site-packages\pydantic\main.py:786 in __setattr__
    self.__pydantic_validator__.validate_assignment(self, name, value)

ValidationError: 1 validation error for TableModel[BasinTimeSchema]
df
  Value error, Error while coercing 'BasinTimeSchema' to type <class 'ribasim.models.BasinTime'>: Could not coerce <class 'pandas.core.frame.DataFrame'> data_container into type <class 'ribasim.models.BasinTime'>
             column  index         failure_case
0              time      0  2016-01-01 00:00:00
1              time      1  2016-01-02 00:00:00
2              time      2  2016-01-03 00:00:00
3              time      3  2016-01-04 00:00:00
4              time      4  2016-01-05 00:00:00
            ...    ...                  ...
10229  infiltration   1457                  0.0
10230  infiltration   1458                  0.0
10231  infiltration   1459                  0.0
10232  infiltration   1460                  0.0
10233  infiltration   1461                  0.0

[10234 rows x 3 columns] [type=value_error, input_value=           time  node_id ...
[1462 rows x 7 columns], input_type=DataFrame]
    For further information visit https://errors.pydantic.dev/2.5/v/value_error

The text was updated successfully, but these errors were encountered:

Huite · 2024-01-22T14:29:11Z

I just ran across exactly the same issue and found it extremely confusing. I had to look at another example to figure out that I was missing a column. I'm somewhat surprised that pandera doesn't take care of this?

Secondly, the current validation code is not trivial.

We have:

our own ChildModel, NodeModel, and TableModel classes
use of pydantic
use of pandera
the data kept in (Geo)DataFrames

Which means there's a lot of concepts flying around.

At a minimum, we need some docstrings on the ChildModel, NodeModel, and TableModel classes. But I think it might also be good to sketch an overview in the Python developer docs.

Huite · 2024-01-22T14:58:29Z

For what it's worth, this seems to be due to the pydantic-pandera interaction? If I follow the pandera DataFrame docs and I run this example:

from typing import Optional

import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    a: Series[str]
    b: Optional[Series[int]]


df = pd.DataFrame({"a": ["2001", "2002", "2003"]})
Schema.validate(df)

I get a very reasonable error: SchemaError: column 'b' not in dataframe

Huite · 2024-01-22T15:29:10Z

So my gut feeling currently is: I think the main issue that we're using pydantic inside of pandera inside of pydantic.

If I understand correctly, we have the Julia generated JSON files, which can be parsed by pydantic. Then pandera can make use of those, but pandera supports this in a limited manner.
If we could generate the pandera schemata directly, or atleast not in the row-wise pydantic manner, we probably fix issues of performance and the opaque error messages.

Huite · 2024-01-22T15:47:25Z

To substantiate my gut feeling a bit, if we indeed skip the dtype = PydanticModel(Record) functionality, and define an "ordinary" pandera schema:

from typing import Optional

import pandera as pa


class PanderaPumpStaticSchema(pa.DataFrameModel):
    node_id: pa.typing.Series[int]
    active: Optional[pa.typing.Series[bool]]
    flow_rate: pa.typing.Series[float]
    min_flow_rate: Optional[pa.typing.Series[float]]
    max_flow_rate: Optional[pa.typing.Series[float]]
    control_state: Optional[pa.typing.Series[str]]


class Pump(NodeModel):
    #    static: TableModel[PumpStaticSchema] = Field(
    #        default_factory=TableModel[PumpStaticSchema]
    #    )
    static: pa.typing.DataFrame[PanderaPumpStaticSchema]
    _sort_keys: dict[str, list[str]] = {"static": ["node_id", "control_state"]}

And instantiate something incorrectly:

def test_missing_columns():
    static_data = pd.DataFrame(data={"node_id": [1, 2, 3], "id": [-1, -2, -3]})
    pump_1 = Pump(static=static_data)

We get a perfectly good validation error:

py::test_missing_columns failed: def test_missing_columns():
        static_data = pd.DataFrame(data={"node_id": [1, 2, 3], "id": [-1, -2, -3]})
>       pump_1 = Pump(static=static_data)
E       pydantic_core._pydantic_core.ValidationError: 1 validation error for Pump
E       static
E         Value error, column 'flow_rate' not in dataframe
E          node_id  id
E       0        1  -1
E       1        2  -2
E       2        3  -3 [type=value_error, input_value=   node_id  id
E       0        1...    2  -2
E       2        3  -3, input_type=DataFrame]
E           For further information visit https://errors.pydantic.dev/2.5/v/value_error

So if we can create "ordinary" pandera schemata programmatically, we should be good.

EDIT: the example above isn't perfectly accurate, since in reality static is a TableModel (wrapper) around the DataFrame.

visr · 2024-01-22T20:38:10Z

So if we can create "ordinary" pandera schemata programmatically, we should be good.

And if that is too hard, we should probably consider if we should retain the codegen at all, given we only generate models.py, not config.py or QGIS schemas.

Huite · 2024-01-23T09:35:56Z

It's not that hard to create the schemas, we can set the annotations:

def gen_schema(name, cls):
    cname = f"{name}Schema"

    keywords = {}
    annotations = {}
    for name, info in cls.__fields__.items():
        dtype = info.annotation
        if info.is_required():
            annotations[name] = pa.typing.Series[dtype]
        else:
            annotations[name] = Optional[pa.typing.Series[dtype]]
        keywords[name] = pa.Field()

    pandera_schema_type = type(
        cname,
        (pa.DataFrameModel,),
        keywords,
    )

    pandera_schema_type.__annotations__ = annotations
    setattr(sys.modules[__name__], cname, pandera_schema_type)


for name, cls in inspect.getmembers(models, inspect.isclass):
    if issubclass(cls, BaseModel):
        gen_schema(name, cls)

Although it's obviously ad hoc, and I just noticed there's UnionTypes in the discrete control as well.

However, despite this also resulting in a DataFrameModel, it's not accepted by the rest of the code yet...

Reproduce by trying:

import pandas as pd
import ribasim

static_data = pd.DataFrame(data={"node_id": [1, 2, 3], "id": [-1, -2, -3]})
pump = ribasim.Pump(static=static_data)

evetion · 2024-02-05T17:04:30Z

I looked at where this comes from, and it indeed stems from the "dtype": PydanticModel(cls), which makes Pandera just fail on the Pydantic error, without telling us why. Moving to automatically generate the Pandera Schema ourselves would be good here. I will make a (yet another) bug report to pandera as well.

deltamarnix · 2024-02-09T07:51:08Z

I have regenerated the schemas.py via OteraEngine, which is a Jinja version for Julia. See #1096. This should resolve the issues with validation.

Remove intermittent json generation and dependency on the data model code generator Fixes #904 Fixes #1077 Fixes #887

github-project-automation bot added this to Ribasim Dec 15, 2023

github-project-automation bot moved this to To do in Ribasim Dec 15, 2023

visr added validation Related to model validation python Relates to one of the Ribasim python packages labels Dec 15, 2023

Hofer-Julian moved this from To do to What's next in Ribasim Jan 25, 2024

visr changed the title ~~Specifying validation errors 'BasinTimeSchema'~~ Better schema validation errors in Ribasim Python Feb 1, 2024

Jingru923 moved this from What's next to Sprint backlog in Ribasim Feb 1, 2024

visr assigned Hofer-Julian Feb 1, 2024

Hofer-Julian assigned evetion and unassigned Hofer-Julian Feb 5, 2024

Hofer-Julian mentioned this issue Feb 5, 2024

Re-evaluate code-generation in Ribasim Python #1077

Closed

evetion assigned Hofer-Julian and unassigned evetion Feb 5, 2024

deltamarnix mentioned this issue Feb 7, 2024

Generate python code for schemas directly from Julia #1096

Merged

deltamarnix moved this from Sprint backlog to 👀 In review in Ribasim Feb 8, 2024

deltamarnix unassigned Hofer-Julian Feb 9, 2024

Hofer-Julian closed this as completed in #1096 Feb 9, 2024

Hofer-Julian pushed a commit that referenced this issue Feb 9, 2024

Generate python code for schemas directly from Julia (#1096)

936ec87

Remove intermittent json generation and dependency on the data model code generator Fixes #904 Fixes #1077 Fixes #887

github-project-automation bot moved this from 👀 In review to ✅ Done in Ribasim Feb 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better schema validation errors in Ribasim Python #904

Better schema validation errors in Ribasim Python #904

Fati-Mon commented Dec 15, 2023 •

edited by SouthEndMusic

Loading

Huite commented Jan 22, 2024

Huite commented Jan 22, 2024

Huite commented Jan 22, 2024 •

edited

Loading

Huite commented Jan 22, 2024 •

edited

Loading

visr commented Jan 22, 2024

Huite commented Jan 23, 2024 •

edited by Hofer-Julian

Loading

evetion commented Feb 5, 2024

deltamarnix commented Feb 9, 2024

Better schema validation errors in Ribasim Python #904

Better schema validation errors in Ribasim Python #904

Comments

Fati-Mon commented Dec 15, 2023 • edited by SouthEndMusic Loading

Huite commented Jan 22, 2024

Huite commented Jan 22, 2024

Huite commented Jan 22, 2024 • edited Loading

Huite commented Jan 22, 2024 • edited Loading

visr commented Jan 22, 2024

Huite commented Jan 23, 2024 • edited by Hofer-Julian Loading

evetion commented Feb 5, 2024

deltamarnix commented Feb 9, 2024

Fati-Mon commented Dec 15, 2023 •

edited by SouthEndMusic

Loading

Huite commented Jan 22, 2024 •

edited

Loading

Huite commented Jan 22, 2024 •

edited

Loading

Huite commented Jan 23, 2024 •

edited by Hofer-Julian

Loading