Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better schema validation errors in Ribasim Python #904

Closed
Fati-Mon opened this issue Dec 15, 2023 · 8 comments · Fixed by #1096
Closed

Better schema validation errors in Ribasim Python #904

Fati-Mon opened this issue Dec 15, 2023 · 8 comments · Fixed by #1096
Labels
python Relates to one of the Ribasim python packages validation Related to model validation

Comments

@Fati-Mon
Copy link
Collaborator

Fati-Mon commented Dec 15, 2023

What

When adding the node_id, time, precipitation and potential_evaporation for basins in transient, the validation error is shown.
Makes sense, it needs also the columns 'drainage' etc. However, the validation error was not clear. Which columns or which types were needed? Especially, if you don't need them for your current model simulation.

Why and how

Making the error more specific. The user will, most likely, understand the error and what needs to be adjusted. In this case adding empty 'drainage', 'urban_runoff', and 'infiltration' columns, even though they are not used. Also, the value should then be 0.0 not 0, since it needs float type.

Additional context

model.basin.time.df = forcing
Traceback (most recent call last):

  Cell In[115], line 1
    model.basin.time.df = forcing

  File C:\Ribasim9\.pixi\env\Lib\site-packages\pydantic\main.py:786 in __setattr__
    self.__pydantic_validator__.validate_assignment(self, name, value)

ValidationError: 1 validation error for TableModel[BasinTimeSchema]
df
  Value error, Error while coercing 'BasinTimeSchema' to type <class 'ribasim.models.BasinTime'>: Could not coerce <class 'pandas.core.frame.DataFrame'> data_container into type <class 'ribasim.models.BasinTime'>
             column  index         failure_case
0              time      0  2016-01-01 00:00:00
1              time      1  2016-01-02 00:00:00
2              time      2  2016-01-03 00:00:00
3              time      3  2016-01-04 00:00:00
4              time      4  2016-01-05 00:00:00
            ...    ...                  ...
10229  infiltration   1457                  0.0
10230  infiltration   1458                  0.0
10231  infiltration   1459                  0.0
10232  infiltration   1460                  0.0
10233  infiltration   1461                  0.0

[10234 rows x 3 columns] [type=value_error, input_value=           time  node_id ...
[1462 rows x 7 columns], input_type=DataFrame]
    For further information visit https://errors.pydantic.dev/2.5/v/value_error
@github-project-automation github-project-automation bot moved this to To do in Ribasim Dec 15, 2023
@visr visr added validation Related to model validation python Relates to one of the Ribasim python packages labels Dec 15, 2023
@Huite
Copy link
Contributor

Huite commented Jan 22, 2024

I just ran across exactly the same issue and found it extremely confusing. I had to look at another example to figure out that I was missing a column. I'm somewhat surprised that pandera doesn't take care of this?

Secondly, the current validation code is not trivial.

We have:

  • our own ChildModel, NodeModel, and TableModel classes
  • use of pydantic
  • use of pandera
  • the data kept in (Geo)DataFrames

Which means there's a lot of concepts flying around.

At a minimum, we need some docstrings on the ChildModel, NodeModel, and TableModel classes. But I think it might also be good to sketch an overview in the Python developer docs.

@Huite
Copy link
Contributor

Huite commented Jan 22, 2024

For what it's worth, this seems to be due to the pydantic-pandera interaction? If I follow the pandera DataFrame docs and I run this example:

from typing import Optional

import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    a: Series[str]
    b: Optional[Series[int]]


df = pd.DataFrame({"a": ["2001", "2002", "2003"]})
Schema.validate(df)

I get a very reasonable error: SchemaError: column 'b' not in dataframe

@Huite
Copy link
Contributor

Huite commented Jan 22, 2024

So my gut feeling currently is: I think the main issue that we're using pydantic inside of pandera inside of pydantic.

If I understand correctly, we have the Julia generated JSON files, which can be parsed by pydantic. Then pandera can make use of those, but pandera supports this in a limited manner.
If we could generate the pandera schemata directly, or atleast not in the row-wise pydantic manner, we probably fix issues of performance and the opaque error messages.

@Huite
Copy link
Contributor

Huite commented Jan 22, 2024

To substantiate my gut feeling a bit, if we indeed skip the dtype = PydanticModel(Record) functionality, and define an "ordinary" pandera schema:

from typing import Optional

import pandera as pa


class PanderaPumpStaticSchema(pa.DataFrameModel):
    node_id: pa.typing.Series[int]
    active: Optional[pa.typing.Series[bool]]
    flow_rate: pa.typing.Series[float]
    min_flow_rate: Optional[pa.typing.Series[float]]
    max_flow_rate: Optional[pa.typing.Series[float]]
    control_state: Optional[pa.typing.Series[str]]


class Pump(NodeModel):
    #    static: TableModel[PumpStaticSchema] = Field(
    #        default_factory=TableModel[PumpStaticSchema]
    #    )
    static: pa.typing.DataFrame[PanderaPumpStaticSchema]
    _sort_keys: dict[str, list[str]] = {"static": ["node_id", "control_state"]}

And instantiate something incorrectly:

def test_missing_columns():
    static_data = pd.DataFrame(data={"node_id": [1, 2, 3], "id": [-1, -2, -3]})
    pump_1 = Pump(static=static_data)

We get a perfectly good validation error:

py::test_missing_columns failed: def test_missing_columns():
        static_data = pd.DataFrame(data={"node_id": [1, 2, 3], "id": [-1, -2, -3]})
>       pump_1 = Pump(static=static_data)
E       pydantic_core._pydantic_core.ValidationError: 1 validation error for Pump
E       static
E         Value error, column 'flow_rate' not in dataframe
E          node_id  id
E       0        1  -1
E       1        2  -2
E       2        3  -3 [type=value_error, input_value=   node_id  id
E       0        1...    2  -2
E       2        3  -3, input_type=DataFrame]
E           For further information visit https://errors.pydantic.dev/2.5/v/value_error

So if we can create "ordinary" pandera schemata programmatically, we should be good.

EDIT: the example above isn't perfectly accurate, since in reality static is a TableModel (wrapper) around the DataFrame.

@visr
Copy link
Member

visr commented Jan 22, 2024

So if we can create "ordinary" pandera schemata programmatically, we should be good.

And if that is too hard, we should probably consider if we should retain the codegen at all, given we only generate models.py, not config.py or QGIS schemas.

@Huite
Copy link
Contributor

Huite commented Jan 23, 2024

It's not that hard to create the schemas, we can set the annotations:

def gen_schema(name, cls):
    cname = f"{name}Schema"

    keywords = {}
    annotations = {}
    for name, info in cls.__fields__.items():
        dtype = info.annotation
        if info.is_required():
            annotations[name] = pa.typing.Series[dtype]
        else:
            annotations[name] = Optional[pa.typing.Series[dtype]]
        keywords[name] = pa.Field()

    pandera_schema_type = type(
        cname,
        (pa.DataFrameModel,),
        keywords,
    )

    pandera_schema_type.__annotations__ = annotations
    setattr(sys.modules[__name__], cname, pandera_schema_type)


for name, cls in inspect.getmembers(models, inspect.isclass):
    if issubclass(cls, BaseModel):
        gen_schema(name, cls)

Although it's obviously ad hoc, and I just noticed there's UnionTypes in the discrete control as well.

However, despite this also resulting in a DataFrameModel, it's not accepted by the rest of the code yet...

Reproduce by trying:

import pandas as pd
import ribasim

static_data = pd.DataFrame(data={"node_id": [1, 2, 3], "id": [-1, -2, -3]})
pump = ribasim.Pump(static=static_data)

@Hofer-Julian Hofer-Julian moved this from To do to What's next in Ribasim Jan 25, 2024
@visr visr changed the title Specifying validation errors 'BasinTimeSchema' Better schema validation errors in Ribasim Python Feb 1, 2024
@Jingru923 Jingru923 moved this from What's next to Sprint backlog in Ribasim Feb 1, 2024
@Hofer-Julian Hofer-Julian assigned evetion and unassigned Hofer-Julian Feb 5, 2024
@evetion
Copy link
Member

evetion commented Feb 5, 2024

I looked at where this comes from, and it indeed stems from the "dtype": PydanticModel(cls), which makes Pandera just fail on the Pydantic error, without telling us why. Moving to automatically generate the Pandera Schema ourselves would be good here. I will make a (yet another) bug report to pandera as well.

@evetion evetion assigned Hofer-Julian and unassigned evetion Feb 5, 2024
@deltamarnix deltamarnix moved this from Sprint backlog to 👀 In review in Ribasim Feb 8, 2024
@deltamarnix
Copy link
Contributor

I have regenerated the schemas.py via OteraEngine, which is a Jinja version for Julia. See #1096. This should resolve the issues with validation.

Hofer-Julian pushed a commit that referenced this issue Feb 9, 2024
Remove intermittent json generation and dependency on the data model
code generator

Fixes #904
Fixes #1077 
Fixes #887
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Ribasim Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Relates to one of the Ribasim python packages validation Related to model validation
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants