Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pydantic validator #1121

Merged
merged 17 commits into from
Sep 17, 2024
Merged

Conversation

cswartzvi
Copy link
Contributor

This PR adds pydantic integration via a new data validator and plugin. Resolves #473.

Changes

I added a file called data_quality/pydantic_validators.py with a new default validator PydanticModelValidator. This validator is dynamically added to the list of available default validators if pydantic is available - very similar, some would say identical 😄, to how the pandera validators are added. The pydantic validator is passed a model parameter that is then used to validated the output of the decorated function:

class MyModel(BaseModel):
    name: str

@check_output(model=MyModel)
def foo() -> dict:
    return {"name": "hamilton"}

I also added a plugin file plugins/h_pydantic.py with a variant of the check_output decorator that uses the return type annotation to establish the pydantic model to validate:

class MyModel(BaseModel):
    name: str

@h_pydantic.check_output()
def foo() -> MyModel:
    return MyModel(name="hamilton")

My implementation uses the pydantic TypeAdapter - mainly because pydantic.validate_call does not have an option to only check the return value. TypeAdapter allows you to specify a strict mode where type coercion is turned off, since modifying the outputs is not currently allowed (that is correct, right?). I have enabled strict mode for both the validator and the plugin.

Although I think that it is idiomatic to pydantic, I should point out that strictmode does not stop you from doing this (i.e. this passes validation):

class MyModel(BaseModel):
    name: str

@h_pydantic.check_output()
def foo() -> MyModel:
    return {name: "hamilton"}

One last thing to mention is that h_pydantic.check_output currently checks that the return type annotation is a subclass of pydantic.BaseModel. In theory, you could use pydantic to check all kinds of things (builtins, Annotated types, ...), however I was having trouble getting it to play nicely with validator resolution so I scraped it.

How I tested this

I added a file to the testing suite test_pydantic_data_quality.py that tests the validator and the check_out plugin decorator for both basic and complex cases.

Notes

There are a couple of points I wanted to bring up:

  • TypeAdapter is a pydantic 2.0 feature and in conflict with the [vaex] extra dependency 😞. I didn't notice this until I was updating pyproject.toml, let me know if this is a deal breaker and I will come up with an alternative implementation
  • I deviated from the spec in Pydantic datatype validation for hamilton nodes #473 and used model (vice schema) and pydantic.check_output (vice pydantic.check_output_schema) - I can change them back if desired, I just thought they fit better with the terminology of pydantic and the ergonomics of the pandera plugin, respectively.
  • I will update the documentation and the plugin docstring if you are good with the above notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Thanks for the opportunity to dig into this!

@skrawcz
Copy link
Collaborator

skrawcz commented Sep 4, 2024

@cswartzvi would you mind creating an example under data_quality showing it in action?

Otherwise just need to make sure the decorator turns up in our documentation.

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is really cool what you did. Nice work! Really appreciate it!

None of those strike me as blockers or even problems. Vaex is just testing, if it works then I'm not worried. Happy to support pydantic>=2.0 only. So yeah! Docs will be great, let us know if you need help getting them set up!

hamilton/data_quality/pydantic_validators.py Show resolved Hide resolved
hamilton/data_quality/pydantic_validators.py Outdated Show resolved Hide resolved
hamilton/data_quality/pydantic_validators.py Outdated Show resolved Hide resolved
hamilton/data_quality/pydantic_validators.py Outdated Show resolved Hide resolved
hamilton/data_quality/pydantic_validators.py Show resolved Hide resolved
@elijahbenizzy
Copy link
Collaborator

Also, I don't think the tests are getting run. Can dig in tomorrow(this is a bit confusing), but it looks like:

  1. We check for changes (this is not getting hit)
  2. We launch these 5 jobs: https://github.com/DAGWorks-Inc/hamilton/blob/main/.circleci/config.yml#L137
  3. Calls this...
    if [[ ${TASK} == "integrations" ]]; then
  4. Note we only have 8 tests run on main: https://app.circleci.com/pipelines/github/DAGWorks-Inc/hamilton/4033/workflows/dfad0448-70b9-4a89-bd80-e6e6c8265640/jobs/68530. And this one doesn't actually detect a change..

@cswartzvi
Copy link
Contributor Author

@elijahbenizzy sorry, I got overloaded with work the past few days. I am going to add the documentation either today or tomorrow - I will take a crack at it and let you know if I run into issues. Did you ever figure out why my test were not running? If not, I can take a look into that as well.

@elijahbenizzy
Copy link
Collaborator

@elijahbenizzy sorry, I got overloaded with work the past few days. I am going to add the documentation either today or tomorrow - I will take a crack at it and let you know if I run into issues. Did you ever figure out why my test were not running? If not, I can take a look into that as well.

Thank you! I have not yet -- if you don't mind that would be much appreciated, otherwise happy to carve out some time to figure out what's happening (lots of bash scripts...).

@cswartzvi
Copy link
Contributor Author

I was looking into why my tests didn't run when I pushed the initial set of commits: https://app.circleci.com/pipelines/github/DAGWorks-Inc/hamilton/4035/workflows/c7dbcc07-b6df-4fd1-9a38-a76e81b9cb53

I don't have a lot of experience with CircleCI, so please forgive me if I am missing something basic, but I noticed that the check_for_changes job in .circleci/config.yml uses the following form of the git diff command:

git diff --name-only HEAD^ HEAD | grep '^.ci\|^.circleci\|^graph_adapter_tests\|^hamilton\|^plugin_tests\|^tests\|^requirements\|setup' > /dev/null

This command compares the current commit (HEAD) to the previous commit (HEAD^) before piping it to grep. However, I pushed multiple commits at once, so if my very last commit didn't change one of the grepped directories wouldn't this test fail? It is my understanding that CircleCI runs on a push and not for each commit, is that correct?

If my hunch is right, I see three solutions

  1. Do nothing and encourage people to make single commit pushes - might miss some tests if people forget
  2. Change the git diff command to something like git diff --name-only origin/main...HEAD and check for changes against the main branch - surely this will result in more tests being run within a PR, but it seems safer
  3. Beef up the check for changes with something like https://github.com/emmeowzing/dynamic-continuation-orb

Thoughts? Still working on those docs BTW

@elijahbenizzy
Copy link
Collaborator

I was looking into why my tests didn't run when I pushed the initial set of commits: https://app.circleci.com/pipelines/github/DAGWorks-Inc/hamilton/4035/workflows/c7dbcc07-b6df-4fd1-9a38-a76e81b9cb53

I don't have a lot of experience with CircleCI, so please forgive me if I am missing something basic, but I noticed that the check_for_changes job in .circleci/config.yml uses the following form of the git diff command:

git diff --name-only HEAD^ HEAD | grep '^.ci\|^.circleci\|^graph_adapter_tests\|^hamilton\|^plugin_tests\|^tests\|^requirements\|setup' > /dev/null

This command compares the current commit (HEAD) to the previous commit (HEAD^) before piping it to grep. However, I pushed multiple commits at once, so if my very last commit didn't change one of the grepped directories wouldn't this test fail? It is my understanding that CircleCI runs on a push and not for each commit, is that correct?

If my hunch is right, I see three solutions

  1. Do nothing and encourage people to make single commit pushes - might miss some tests if people forget
  2. Change the git diff command to something like git diff --name-only origin/main...HEAD and check for changes against the main branch - surely this will result in more tests being run within a PR, but it seems safer
  3. Beef up the check for changes with something like https://github.com/emmeowzing/dynamic-continuation-orb

Thoughts? Still working on those docs BTW

I think the second one is probably the cleanest -- better to run more tests than needed than undertest... I can make the change, or if you're in the mood to, go for it! These are tests we'll catch when merging to main (it should be smart enough to diff against main), but it's nice to catch earlier.

As an edge-case, we can probably get the comparison branch rather than main. That said, at some point we'll probably switch to github actions as our CI system, so I think the simplest code-only change is easy and we shouldn't solve too much.

@cswartzvi
Copy link
Contributor Author

@elijahbenizzy I would be happy to make the change. I will start with option 2 and then see how hard it is to incorporate that edge case.

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically there, just a few notes about docs (which I know is a WIP). To get shipped:

  • Add some more docstrings + add to the right docs reference (ping if you want help, but start with this slack comment)
  • Rebase against the check_for_changes PR after we merge to main ensure that it does as intended on this

hamilton/data_quality/default_validators.py Outdated Show resolved Hide resolved
hamilton/plugins/h_pydantic.py Show resolved Hide resolved
@cswartzvi
Copy link
Contributor Author

cswartzvi commented Sep 14, 2024

@elijahbenizzy I rebased and added that missing docstring - looks like changes were detected correctly in CI. How deep do you think I should go with the documentation? I see three potential places it could be added:

Does that sound reasonable?

@elijahbenizzy
Copy link
Collaborator

@elijahbenizzy I rebased and added that missing docstring - looks like changes were detected correctly in CI. How deep do you think I should go with the documentation? I see three potential places it could be added:

Does that sound reasonable?

Sorry for the back and forth, this is ready! Only thing, re: docs:

I think what you did is good enough out of those three. If you want to add an example it would obviously be appreciated but don't want to slow you down. The place that I'd add it is here: https://hamilton.dagworks.io/en/latest/reference/decorators/. This way we have a nice reference with the docstring.

To do so you can:

  1. Add a file here with a brief description (call it pydantic.rst or something
  2. Make it reference the relevant decorator (h_pydantic.check_output). mention that you can get it with check_output to make that clear.
  3. Add a newline between this and the next line (.rst is an epic PITA): https://github.com/DAGWorks-Inc/hamilton/pull/1121/files#diff-c0dc4b429a8dd227d562cd69242be34a5b19fc49a48f98b7267dc722767d34a3R30 -- otherwise it won't compile

If this is too much for you/you don't have time, let me know and I can add it in easily after merging. You've already done a ton! Otherwise I want to get this out in the next release (tuesday if we can!).

Really appreciate it 🫡

@cswartzvi
Copy link
Contributor Author

cswartzvi commented Sep 17, 2024

Sorry for the back and forth, this is ready! Only thing, re: docs:

I think what you did is good enough out of those three. If you want to add an example it would obviously be appreciated but don't want to slow you down. The place that I'd add it is here: https://hamilton.dagworks.io/en/latest/reference/decorators/. This way we have a nice reference with the docstring.

To do so you can:

  1. Add a file here with a brief description (call it pydantic.rst or something
  2. Make it reference the relevant decorator (h_pydantic.check_output). mention that you can get it with check_output to make that clear.
  3. Add a newline between this and the next line (.rst is an epic PITA): https://github.com/DAGWorks-Inc/hamilton/pull/1121/files#diff-c0dc4b429a8dd227d562cd69242be34a5b19fc49a48f98b7267dc722767d34a3R30 -- otherwise it won't compile

If this is too much for you/you don't have time, let me know and I can add it in easily after merging. You've already done a ton! Otherwise I want to get this out in the next release (tuesday if we can!).

Really appreciate it 🫡

No problem, I enjoy helping! I would like to add at least one more doc - I have something queued up, just one quick question. You mentioned adding pydantic.rst to https://hamilton.dagworks.io/en/latest/reference/decorators/ do you think adding it directly to https://hamilton.dagworks.io/en/latest/reference/decorators/check_output/ might be better? This was my original idea because it doesn't break the meaning of the listed decoroator and it fits in nicely with check_output.rst (specifically I would up have to update this paragraph and add a reference). Either way works for me, just let me know.

Edit: I pushed an example of what I am talking about. If that's not what you're looking for just let me know 😄

@elijahbenizzy
Copy link
Collaborator

Sorry for the back and forth, this is ready! Only thing, re: docs:
I think what you did is good enough out of those three. If you want to add an example it would obviously be appreciated but don't want to slow you down. The place that I'd add it is here: https://hamilton.dagworks.io/en/latest/reference/decorators/. This way we have a nice reference with the docstring.
To do so you can:

  1. Add a file here with a brief description (call it pydantic.rst or something
  2. Make it reference the relevant decorator (h_pydantic.check_output). mention that you can get it with check_output to make that clear.
  3. Add a newline between this and the next line (.rst is an epic PITA): https://github.com/DAGWorks-Inc/hamilton/pull/1121/files#diff-c0dc4b429a8dd227d562cd69242be34a5b19fc49a48f98b7267dc722767d34a3R30 -- otherwise it won't compile

If this is too much for you/you don't have time, let me know and I can add it in easily after merging. You've already done a ton! Otherwise I want to get this out in the next release (tuesday if we can!).
Really appreciate it 🫡

No problem, I enjoy helping! I would like to add at least one more doc - I have something queued up, just one quick question. You mentioned adding pydantic.rst to https://hamilton.dagworks.io/en/latest/reference/decorators/ do you think adding it directly to https://hamilton.dagworks.io/en/latest/reference/decorators/check_output/ might be better? This was my original idea because it doesn't break the meaning of the listed decoroator and it fits in nicely with check_output.rst (specifically I would up have to update this paragraph and add a reference). Either way works for me, just let me know.

Edit: I pushed an example of what I am talking about. If that's not what you're looking for just let me know 😄

directly to https://hamilton.dagworks.io/en/latest/reference/decorators/check_output/ might be better? This was my original idea because it doesn't break the meaning of the listed decoroator and it fits in nicely with check_output.rst (specifically I would

Yes, I think that's pretty reasonable, perhaps a better place to put it! Only thing is to add the autoclass or whatever there (and just make it clear which maps to which). All nits though, this is pretty much good to go as far as I'm concerned!

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, let's do the last docs stuff then merge! Appoving so all I have to do is click merge next.

@cswartzvi
Copy link
Contributor Author

I added the autoclass to the bottom of the page and tightened up a few of the examples. Let me know if it needs anything else!

@elijahbenizzy
Copy link
Collaborator

I added the autoclass to the bottom of the page and tightened up a few of the examples. Let me know if it needs anything else!

Looks great, thank you! Merging!

@elijahbenizzy elijahbenizzy merged commit ee9e4ae into DAGWorks-Inc:main Sep 17, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pydantic datatype validation for hamilton nodes
3 participants