-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better schema validation errors in Ribasim Python #904
Comments
I just ran across exactly the same issue and found it extremely confusing. I had to look at another example to figure out that I was missing a column. I'm somewhat surprised that pandera doesn't take care of this? Secondly, the current validation code is not trivial. We have:
Which means there's a lot of concepts flying around. At a minimum, we need some docstrings on the ChildModel, NodeModel, and TableModel classes. But I think it might also be good to sketch an overview in the Python developer docs. |
For what it's worth, this seems to be due to the pydantic-pandera interaction? If I follow the pandera DataFrame docs and I run this example: from typing import Optional
import pandas as pd
import pandera as pa
from pandera.typing import Series
class Schema(pa.DataFrameModel):
a: Series[str]
b: Optional[Series[int]]
df = pd.DataFrame({"a": ["2001", "2002", "2003"]})
Schema.validate(df) I get a very reasonable error: |
So my gut feeling currently is: I think the main issue that we're using pydantic inside of pandera inside of pydantic. If I understand correctly, we have the Julia generated JSON files, which can be parsed by pydantic. Then pandera can make use of those, but pandera supports this in a limited manner. |
To substantiate my gut feeling a bit, if we indeed skip the from typing import Optional
import pandera as pa
class PanderaPumpStaticSchema(pa.DataFrameModel):
node_id: pa.typing.Series[int]
active: Optional[pa.typing.Series[bool]]
flow_rate: pa.typing.Series[float]
min_flow_rate: Optional[pa.typing.Series[float]]
max_flow_rate: Optional[pa.typing.Series[float]]
control_state: Optional[pa.typing.Series[str]]
class Pump(NodeModel):
# static: TableModel[PumpStaticSchema] = Field(
# default_factory=TableModel[PumpStaticSchema]
# )
static: pa.typing.DataFrame[PanderaPumpStaticSchema]
_sort_keys: dict[str, list[str]] = {"static": ["node_id", "control_state"]} And instantiate something incorrectly: def test_missing_columns():
static_data = pd.DataFrame(data={"node_id": [1, 2, 3], "id": [-1, -2, -3]})
pump_1 = Pump(static=static_data) We get a perfectly good validation error:
So if we can create "ordinary" pandera schemata programmatically, we should be good. EDIT: the example above isn't perfectly accurate, since in reality |
And if that is too hard, we should probably consider if we should retain the codegen at all, given we only generate |
It's not that hard to create the schemas, we can set the annotations: def gen_schema(name, cls):
cname = f"{name}Schema"
keywords = {}
annotations = {}
for name, info in cls.__fields__.items():
dtype = info.annotation
if info.is_required():
annotations[name] = pa.typing.Series[dtype]
else:
annotations[name] = Optional[pa.typing.Series[dtype]]
keywords[name] = pa.Field()
pandera_schema_type = type(
cname,
(pa.DataFrameModel,),
keywords,
)
pandera_schema_type.__annotations__ = annotations
setattr(sys.modules[__name__], cname, pandera_schema_type)
for name, cls in inspect.getmembers(models, inspect.isclass):
if issubclass(cls, BaseModel):
gen_schema(name, cls) Although it's obviously ad hoc, and I just noticed there's UnionTypes in the discrete control as well. However, despite this also resulting in a DataFrameModel, it's not accepted by the rest of the code yet... Reproduce by trying: import pandas as pd
import ribasim
static_data = pd.DataFrame(data={"node_id": [1, 2, 3], "id": [-1, -2, -3]})
pump = ribasim.Pump(static=static_data) |
I looked at where this comes from, and it indeed stems from the |
I have regenerated the schemas.py via OteraEngine, which is a Jinja version for Julia. See #1096. This should resolve the issues with validation. |
What
When adding the node_id, time, precipitation and potential_evaporation for basins in transient, the validation error is shown.
Makes sense, it needs also the columns 'drainage' etc. However, the validation error was not clear. Which columns or which types were needed? Especially, if you don't need them for your current model simulation.
Why and how
Making the error more specific. The user will, most likely, understand the error and what needs to be adjusted. In this case adding empty 'drainage', 'urban_runoff', and 'infiltration' columns, even though they are not used. Also, the value should then be 0.0 not 0, since it needs float type.
Additional context
The text was updated successfully, but these errors were encountered: