
Versioning docs: DeltaLake #4483

Merged
merged 15 commits into main from docs/versioning
Feb 24, 2025

Conversation


@ankatiyar commented Feb 14, 2025

Description

Partial solution to #4468

Development notes

  • Move DVC versioning docs under "integrations" header
  • Documented the use of pandas.DeltaTableDataset to interact with Delta tables

Questions for reviewers:

  • There's slightly more you can do with Delta Lake using delta-rs, e.g. load by datetime or restore to a previous version. I'm not sure how far I should go: for example, should we suggest creating a custom dataset on top of pandas.DeltaTableDataset, or update pandas.DeltaTableDataset itself?
  • How much should we document using Delta tables with other datasets? E.g. spark.SparkDataset can save Delta-format tables, while spark.DeltaTableDataset can read the tables but not specific versions. We could also suggest building custom datasets to interact with Spark, but would that be overkill?
  • Similarly for other frameworks, e.g. Dask and Polars, for which we currently have no or limited support in kedro-datasets.
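On the second question, for reference: saving a table in Delta format with spark.SparkDataset appears to be just a catalog configuration along these lines (a sketch; the dataset name and filepath are illustrative, not from this PR):

```yaml
weather@spark:
  type: spark.SparkDataset
  filepath: data/02_intermediate/weather
  file_format: delta
  save_args:
    mode: overwrite
```

Reading a specific version back would still need a custom dataset, as the question notes.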

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar changed the title [WIP] Versioning docs: DeltaLake + Iceberg Versioning docs: DeltaLake Feb 20, 2025
@ankatiyar ankatiyar marked this pull request as ready for review February 20, 2025 12:06
@ankatiyar ankatiyar self-assigned this Feb 20, 2025
@ankatiyar (Contributor Author)

cc @pascalwhoop

@astrojuanlu (Member)

 sphinx.errors.SphinxWarning: /home/docs/checkouts/readthedocs.org/user_builds/kedro/checkouts/4483/docs/source/integrations/deltalake_versioning.md:15:'myst' cross-reference target not found: '' [myst.xref_missing] 

@astrojuanlu (Member)

Similarly, for other frameworks eg dask, polars for which we have no/limited support currently in kedro-datasets

We do have support for Polars actually! There are a few unaddressed issues, that's true... https://github.com/kedro-org/kedro-plugins/issues?q=is%3Aissue%20state%3Aopen%20polars but it's very much an officially supported dataset

For Dask there's a dataset too but I don't see it mentioned very often https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-6.0.0/api/kedro_datasets.dask.CSVDataset.html

@astrojuanlu (Member)

Not sure how far I should go, for example, suggesting creating a custom Dataset on top of pandas.DeltaTableDataset? Update the pandas.DeltaTableDataset itself?
How much should we document using delta tables with other datasets eg. spark.SparkDataset can save delta format tables. spark.DeltaTableDataset can simply read the datasets but not specific versions etc. We can also suggest building custom datasets to interact with spark but would that be overkill?

Indeed... unfortunately kedro-org/kedro-plugins#542 is still unaddressed, so I think this doc page should limit itself to describing what the user can do today, and suggest creating custom datasets when appropriate. We might want to tackle kedro-org/kedro-plugins#542 later in the year and go back to these docs.

@astrojuanlu (Member) left a comment

Gave this a first quick pass, thanks a lot @ankatiyar! Dropped a few comments.

@DimedS (Member) left a comment

Thanks, @ankatiyar! I like the description - just left a small question.

Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>

Kedro offers various connectors in the `kedro-datasets` package to interact with Delta tables: [`pandas.DeltaTableDataset`](https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/kedro_datasets/pandas/deltatable_dataset.py), `spark.DeltaTableDataset`, `spark.SparkDataset`, `databricks.ManagedTableDataset`, and `ibis.FileDataset` all support the Delta table format. In this tutorial, we will use the `pandas.DeltaTableDataset` connector to interact with Delta tables using pandas DataFrames. To install `kedro-datasets` along with the dependencies required for Delta Lake, add the following line to your `requirements.txt`:
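For illustration, a minimal catalog entry for `pandas.DeltaTableDataset` could look like the following (a sketch; the dataset name, filepath, and save mode are assumptions, not taken from the docs page itself):

```yaml
temperature:
  type: pandas.DeltaTableDataset
  filepath: data/01_raw/temperature-delta
  save_args:
    mode: overwrite
```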

Review comment from a contributor:

No extra JAR dependencies needed for Spark? E.g. when running Delta on something that isn't a Databricks runtime.

@ankatiyar (Contributor Author) replied:

I've limited the scope of this documentation page to delta-rs and pandas.DeltaTableDataset, which reads Delta tables and converts them into pandas DataFrames. There's actually already a section for Spark in the docs: https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#spark-and-delta-lake-interaction

@ankatiyar (Contributor Author)

Similarly, for other frameworks eg dask, polars for which we have no/limited support currently in kedro-datasets

We do have support for Polars actually! There are a few unaddressed issues, that's true... https://github.com/kedro-org/kedro-plugins/issues?q=is%3Aissue%20state%3Aopen%20polars but it's very much an officially supported dataset

For Dask there's a dataset too but I don't see it mentioned very often https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-6.0.0/api/kedro_datasets.dask.CSVDataset.html

I meant support for versioning with Delta Tables specifically

@astrojuanlu (Member)

Oh, right kedro-org/kedro-plugins#444

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar (Contributor Author)

Actually, it turns out there's already a section for Delta Lake + Spark: https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#spark-and-delta-lake-interaction. Shall I move it to this page? cc @astrojuanlu

@astrojuanlu (Member)

First time I see that 😂 Yeah! Maybe bring that content into the page you're writing, and replace it with "if you want to work on Delta with PySpark, check out these docs [link]".

@astrojuanlu (Member) left a comment

Made a few style comments, will hold off until you consolidate the Spark Delta stuff in here too

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
├── part-00001-0d522679-916c-4283-ad06-466c27025bcf-c000.snappy.parquet
└── part-00001-42733095-97f4-46ef-bdfd-3afef70ee9d8-c000.snappy.parquet
```
### Load a specific version of the dataset
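A catalog entry for this section could plausibly look like the following (a sketch; `pandas.DeltaTableDataset` forwards `load_args` to delta-rs, and the dataset name, filepath, and version number shown are illustrative):

```yaml
temperature:
  type: pandas.DeltaTableDataset
  filepath: data/01_raw/temperature-delta
  load_args:
    version: 1
```

Without `version` in `load_args`, the latest version of the table is loaded.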
Review comment from a contributor:
Markdown not rendering correctly here?

@lrcouto (Contributor) commented Feb 21, 2025

Looks good, easy enough to follow! Pointed out a couple of minor details.

@ElenaKhaustova (Contributor) left a comment

Looks good, nice job!

Answering the questions for reviewers: I think this level of depth is enough for the first version of the docs. It might be useful to mention the other use cases, with a note that they can be achieved with custom datasets.

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@astrojuanlu (Member) left a comment

Added some minor suggestions, but this is good to go already! Thanks @ankatiyar 🙏🏼

ankatiyar and others added 4 commits February 24, 2025 11:47
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar merged commit 4134f81 into main Feb 24, 2025
10 checks passed
@ankatiyar ankatiyar deleted the docs/versioning branch February 24, 2025 13:33
6 participants