Versioning docs: DeltaLake #4483
Conversation
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
cc @pascalwhoop
We do have support for Polars actually! There are a few unaddressed issues, that's true (https://github.com/kedro-org/kedro-plugins/issues?q=is%3Aissue%20state%3Aopen%20polars), but it's very much an officially supported dataset. For Dask there's a dataset too, but I don't see it mentioned very often: https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-6.0.0/api/kedro_datasets.dask.CSVDataset.html
Indeed... unfortunately kedro-org/kedro-plugins#542 is still unaddressed, so I think this doc page should limit itself to describing what the user can do today, and suggest creating custom datasets when appropriate. We might want to tackle kedro-org/kedro-plugins#542 later in the year and go back to these docs.
Gave this a first quick pass, thanks a lot @ankatiyar! Dropped a few comments.
Thanks, @ankatiyar! I like the description - just left a small question.
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Ankita Katiyar <[email protected]>
Kedro offers various connectors in the `kedro-datasets` package to interact with Delta tables: [`pandas.DeltaTableDataset`](https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/kedro_datasets/pandas/deltatable_dataset.py), `spark.DeltaTableDataset`, `spark.SparkDataset`, `databricks.ManagedTableDataset`, and `ibis.FileDataset` support the Delta table format. In this tutorial, we will use the `pandas.DeltaTableDataset` connector to interact with Delta tables using pandas DataFrames. To install `kedro-datasets` along with the dependencies required for Delta Lake, add the following line to your `requirements.txt`:
```bash
kedro-datasets[pandas-deltatabledataset]
```
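Once installed, the dataset can also be tried out directly in Python, which mirrors what a `catalog.yml` entry would do. A minimal sketch, assuming the `pandas.DeltaTableDataset` API from `kedro-datasets`; the filepath and save arguments are illustrative:

```python
import pandas as pd
from kedro_datasets.pandas import DeltaTableDataset

# Illustrative local path; in a project this would normally live in catalog.yml.
dataset = DeltaTableDataset(
    filepath="data/03_primary/model_input_table",
    save_args={"mode": "overwrite"},  # each save commits a new Delta version
)

dataset.save(pd.DataFrame({"id": [1, 2], "value": [0.1, 0.2]}))
df = dataset.load()  # loads the latest version as a pandas DataFrame
```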
No extra jar dependencies needed for Spark? E.g. when running Delta on something that isn't a Databricks runtime.
I've limited the scope of this documentation page to delta-rs and `pandas.DeltaTableDataset`, which deals with the Delta table and converts it into a pandas DataFrame. There's actually already a section for Spark in the docs: https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#spark-and-delta-lake-interaction
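On the jar question above: when running Delta with PySpark outside a Databricks runtime, the Delta Lake package and session extensions do need to be configured, as the linked Spark docs describe. A hedged sketch; the artifact coordinates are illustrative and must match your Spark and Delta versions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-example")
    # Pull in the Delta Lake jars (coordinates are version-dependent).
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    # Enable Delta's SQL extensions and catalog.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

df = spark.read.format("delta").load("data/03_primary/model_input_table")
```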
I meant support for versioning with Delta Tables specifically
Oh, right: kedro-org/kedro-plugins#444
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Actually, turns out there's already a section for Delta Lake + Spark (https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#spark-and-delta-lake-interaction), shall I move it to this page? cc @astrojuanlu
First time I see that 😂 Yeah! Maybe bring that content to the page you're writing, and replace it with "if you want to work on Delta with PySpark, check out these docs [link]".
Made a few style comments, will hold off until you consolidate the Spark Delta stuff in here too
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
```
├── part-00001-0d522679-916c-4283-ad06-466c27025bcf-c000.snappy.parquet
└── part-00001-42733095-97f4-46ef-bdfd-3afef70ee9d8-c000.snappy.parquet
```
### Load a specific version of the dataset
Markdown not rendering correctly here?
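For context on what that section covers: loading a pinned version with delta-rs (the library underneath `pandas.DeltaTableDataset`) might look like the following sketch; the path and version number are illustrative:

```python
from deltalake import DeltaTable

path = "data/03_primary/model_input_table"  # illustrative path

dt = DeltaTable(path)
print(dt.version())  # current version number
print(dt.history())  # commit log: timestamps, operations, parameters

old_df = DeltaTable(path, version=1).to_pandas()  # table as of version 1
```

In the catalog, the equivalent would be pinning the version through the `load_args` of `pandas.DeltaTableDataset`, assuming the dataset forwards a `version` load argument to delta-rs.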
Looks good, easy enough to follow! Pointed out a couple minor details.
Looks good, nice job!
Answering the questions for reviewers: I think for the first version of the docs it is deep enough. What might be useful is to mention other use cases, with a note that they can be covered with custom datasets.
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Added some minor suggestions, but this is good to go already! Thanks @ankatiyar 🙏🏼
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
…into docs/versioning
Signed-off-by: Ankita Katiyar <[email protected]>
Description

Partial solution to #4468

Development notes

- Adds documentation on using `pandas.DeltaTableDataset` to interact with Delta tables.

Questions for reviewers:

- How deep should these docs go into what `delta-rs` offers, e.g. loading with a datetime, restoring to a previous version? Not sure how far I should go. For example, should we suggest creating a custom dataset on top of `pandas.DeltaTableDataset` (see the sketch after this list), or update `pandas.DeltaTableDataset` itself?
- `spark.SparkDataset` can save Delta format tables; `spark.DeltaTableDataset` can simply read the datasets, but not specific versions etc. We could also suggest building custom datasets to interact with Spark, but would that be overkill?
- `dask` and `polars`, for which we currently have no/limited support in `kedro-datasets`.
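On the custom-dataset question above, a minimal sketch of what such a dataset could look like, built directly on delta-rs rather than on `pandas.DeltaTableDataset`; the class name and constructor arguments are hypothetical:

```python
from typing import Optional

import pandas as pd
from deltalake import DeltaTable, write_deltalake
from kedro.io import AbstractDataset


class TimeTravelDeltaDataset(AbstractDataset[pd.DataFrame, pd.DataFrame]):
    """Hypothetical custom dataset that can read a pinned Delta table version."""

    def __init__(self, filepath: str, version: Optional[int] = None):
        self._filepath = filepath
        self._version = version  # None means "latest"

    def _load(self) -> pd.DataFrame:
        return DeltaTable(self._filepath, version=self._version).to_pandas()

    def _save(self, data: pd.DataFrame) -> None:
        write_deltalake(self._filepath, data, mode="overwrite")

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "version": self._version}
```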
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist

- Added a description of this change in the `RELEASE.md` file