File schema via pydantic #109

Open
3 tasks
gessulat opened this issue Sep 21, 2023 · 3 comments

Comments

@gessulat
Contributor

gessulat commented Sep 21, 2023

The PsmSchema definition is currently implemented via dataclasses.
It's great to have the ability to validate a dataframe with a schema!

Pydantic is a library for defining schemas and validation that might offer additional useful functionality.
It might be overkill for the use case of Mokapot but I think it's worth evaluating.
I found these two articles showcasing how it could be done.

In case this is useful:
Tasks

  • define PsmDataset schema via pydantic
  • remove SchemaValidatorMixin
  • Identify standard files (e.g. PIN with tab separated Proteins) that should be supported. Support could be implemented as converters.
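For illustration only, here is a minimal sketch of what the first task could look like, assuming pydantic v2. The class name `PsmSchemaSketch`, its fields (`target`, `spectrum`, `peptide`, `features`), and the `missing_columns` helper are hypothetical and are not taken from mokapot's current `PsmSchema`. The model validates column roles, not the dataframe contents.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError, model_validator


class PsmSchemaSketch(BaseModel):
    """Hypothetical column-role schema; all field names are illustrative."""

    target: str                           # column holding the target/decoy label
    spectrum: List[str]                   # columns that together identify a spectrum
    peptide: str                          # column holding the peptide sequence
    features: Optional[List[str]] = None  # optional explicit list of feature columns

    def _all_columns(self) -> List[str]:
        # Every column name referenced by the schema, in a fixed order.
        return [self.target, self.peptide, *self.spectrum, *(self.features or [])]

    @model_validator(mode="after")
    def roles_do_not_overlap(self):
        # Reject schemas in which one column is assigned to several roles.
        cols = self._all_columns()
        duplicates = sorted({c for c in cols if cols.count(c) > 1})
        if duplicates:
            raise ValueError(f"columns assigned to more than one role: {duplicates}")
        return self

    def missing_columns(self, columns) -> List[str]:
        """Return the schema columns that are absent from `columns`."""
        return [c for c in self._all_columns() if c not in columns]


schema = PsmSchemaSketch(target="Label", spectrum=["ScanNr"], peptide="Peptide")
print(schema.missing_columns(["Label", "ScanNr"]))

try:
    # "Label" is used both as target and as a spectrum column.
    PsmSchemaSketch(target="Label", spectrum=["Label"], peptide="Peptide")
except ValidationError as err:
    print(err)
```

With a model like this, the checks that currently live in a mixin could move into validators on the schema itself, which is roughly what the second task would imply.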
@jspaezp
Collaborator

jspaezp commented Oct 17, 2023

[Full disclosure, I love pydantic and use it for json validation all the time]

In principle I really like this idea! Although I am not sure exactly where and how much validation would need to happen within mokapot in a way that would require extensibility via pydantic. Would you mind elaborating on the use case/API you have in mind for it?

Just FYI, for data frame validation I have been using this project https://docs.dagster.io/integrations/pandas, and I really like the syntax they use for validation. (https://github.com/unionai-oss/pandera and https://github.com/JakobGM/patito are alternatives I have evaluated as well.)

@gessulat
Contributor Author

Sorry that the context was missing! This idea came up in a discussion with @wfondrie. Internally, we use schemas and validators a lot for various things, mostly for API definitions and complex configuration files (e.g. validating Sage configs).

I just noticed that validation of what defines a PsmDataset is currently implemented via dataclasses, and pydantic would be one option for generalized, schema-based validation that might also be useful for validating other things. One could imagine, for example, that instead of specifying flags via the command line (which might be cumbersome with a large set of flags that have dependencies and interactions), parameters could be specified in a configuration file, since configuration files offer more flexibility to express parameter dependencies. In that case pydantic could be used in a similar way for both: validating internal data structures and exposed APIs.
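To make the configuration-file idea concrete, here is a minimal sketch, assuming pydantic v2 and a JSON config. The `RunConfig` model and all of its option names (`train_fdr`, `test_fdr`, `proteins_fasta`, `enzyme_regex`) are invented for the example and are not actual mokapot flags. It shows per-field range checks plus one dependency between parameters.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, model_validator


class RunConfig(BaseModel):
    """Hypothetical run configuration; option names are illustrative only."""

    train_fdr: float = Field(0.01, gt=0, le=1)  # must lie in (0, 1]
    test_fdr: float = Field(0.01, gt=0, le=1)   # must lie in (0, 1]
    proteins_fasta: Optional[str] = None        # path to a FASTA file, if any
    enzyme_regex: Optional[str] = None          # only used together with a FASTA

    @model_validator(mode="after")
    def enzyme_requires_fasta(self):
        # A dependency between two parameters, expressed once in the schema.
        if self.enzyme_regex is not None and self.proteins_fasta is None:
            raise ValueError("enzyme_regex requires proteins_fasta to be set")
        return self


def load_config(text: str) -> RunConfig:
    """Parse and validate a JSON configuration string."""
    return RunConfig.model_validate_json(text)


config = load_config('{"test_fdr": 0.05}')
print(config.train_fdr, config.test_fdr)

try:
    load_config('{"enzyme_regex": "[KR]"}')
except ValidationError as err:
    print(err)
```

The same model class could in principle back both a config file and a programmatic API, which is the "used in a similar way for both" point above.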

Dagster's type definitions also look good to me, but I only skimmed the documentation. I assume they are specific to dataframes and don't generalize to more generic data structures, correct? If you intend to use dagster as a dependency in Mokapot anyway, this could be a great fit, but if it's only for validation, it feels like an out-of-place dependency to me.

Both dagster and pydantic seem to be good choices to me. It basically depends on whether a) pydantic schemas might be valuable in other places in the future, or b) other dagster functionality would be valuable in the future. You definitely have a better feeling for that ;)

@jspaezp
Collaborator

jspaezp commented Oct 31, 2023

Thanks for the context!

Just for the record, the dagster reference was more about the interface than the actual implementing package. I would love to go into more detail on the implementation once we get the "mega-merge" done on the current development version of the project!

Best!


2 participants