
Versioning docs: DeltaLake #4483

Merged
merged 15 commits into main from docs/versioning
Feb 24, 2025

Conversation


@ankatiyar commented Feb 14, 2025

Description

Partial solution to #4468

Development notes

  • Move DVC versioning docs under "integrations" header
  • Documented the use of pandas.DeltaTableDataset to interact with Delta tables

Questions for reviewers:

  • There's slightly more you can do with Delta Lake using delta-rs, e.g. load by datetime or restore to a previous version. I'm not sure how far I should go: for example, should we suggest creating a custom dataset on top of pandas.DeltaTableDataset, or update pandas.DeltaTableDataset itself?
  • How much should we document using Delta tables with other datasets? E.g. spark.SparkDataset can save Delta-format tables, while spark.DeltaTableDataset can read the tables but not specific versions. We could also suggest building custom datasets to interact with Spark, but would that be overkill?
  • Similarly for other frameworks, e.g. Dask and Polars, for which we currently have no or limited support in kedro-datasets.
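On the second question, for reference: saving a table in Delta format with spark.SparkDataset appears to be just a catalog configuration along these lines (a sketch; the dataset name and filepath are illustrative, not from this PR):

```yaml
weather@spark:
  type: spark.SparkDataset
  filepath: data/02_intermediate/weather
  file_format: delta
  save_args:
    mode: overwrite
```

Reading a specific version back would still need a custom dataset, as the question notes.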

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar changed the title [WIP] Versioning docs: DeltaLake + Iceberg Versioning docs: DeltaLake Feb 20, 2025
@ankatiyar ankatiyar marked this pull request as ready for review February 20, 2025 12:06
@ankatiyar ankatiyar self-assigned this Feb 20, 2025
@ankatiyar (Contributor Author)

cc @pascalwhoop

@astrojuanlu (Member)

 sphinx.errors.SphinxWarning: /home/docs/checkouts/readthedocs.org/user_builds/kedro/checkouts/4483/docs/source/integrations/deltalake_versioning.md:15:'myst' cross-reference target not found: '' [myst.xref_missing] 

@astrojuanlu (Member)

Similarly, for other frameworks eg dask, polars for which we have no/limited support currently in kedro-datasets

We do have support for Polars actually! There are a few unaddressed issues, that's true... https://github.com/kedro-org/kedro-plugins/issues?q=is%3Aissue%20state%3Aopen%20polars but it's very much an officially supported dataset

For Dask there's a dataset too but I don't see it mentioned very often https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-6.0.0/api/kedro_datasets.dask.CSVDataset.html

@astrojuanlu (Member)

Not sure how far I should go, for example, suggesting creating a custom Dataset on top of pandas.DeltaTableDataset? Update the pandas.DeltaTableDataset itself?
How much should we document using delta tables with other datasets eg. spark.SparkDataset can save delta format tables. spark.DeltaTableDataset can simply read the datasets but not specific versions etc. We can also suggest building custom datasets to interact with spark but would that be overkill?

Indeed... unfortunately kedro-org/kedro-plugins#542 is still unaddressed, so I think this doc page should limit itself to describing what the user can do today, and suggest creating custom datasets when appropriate. We might want to tackle kedro-org/kedro-plugins#542 later in the year and go back to these docs.

@astrojuanlu (Member) left a comment

Gave this a first quick pass, thanks a lot @ankatiyar! Dropped a few comments.

@DimedS (Member) left a comment

Thanks, @ankatiyar! I like the description - just left a small question.

Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>

Kedro offers various connectors in the `kedro-datasets` package to interact with Delta tables: [`pandas.DeltaTableDataset`](https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/kedro_datasets/pandas/deltatable_dataset.py), `spark.DeltaTableDataset`, `spark.SparkDataset`, `databricks.ManagedTableDataset`, and `ibis.FileDataset` all support the Delta table format. In this tutorial, we will use the `pandas.DeltaTableDataset` connector to interact with Delta tables using pandas DataFrames. To install `kedro-datasets` along with the dependencies required for Delta Lake, add the following line to your `requirements.txt`:
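For illustration, a minimal catalog entry for `pandas.DeltaTableDataset` could look like the following (a sketch; the dataset name, filepath, and save mode are assumptions, not taken from the docs page itself):

```yaml
temperature:
  type: pandas.DeltaTableDataset
  filepath: data/01_raw/temperature-delta
  save_args:
    mode: overwrite
```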

Review comment from a contributor:

No extra JAR dependencies needed for Spark? E.g. when running Delta on something that isn't a Databricks runtime.

@ankatiyar (Contributor Author) replied:

I've limited the scope of this documentation page to delta-rs and pandas.DeltaTableDataset, which reads Delta tables and converts them into pandas DataFrames. There's actually already a section for Spark in the docs: https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#spark-and-delta-lake-interaction

@ankatiyar (Contributor Author)

Similarly, for other frameworks eg dask, polars for which we have no/limited support currently in kedro-datasets

We do have support for Polars actually! There are a few unaddressed issues, that's true... https://github.com/kedro-org/kedro-plugins/issues?q=is%3Aissue%20state%3Aopen%20polars but it's very much an officially supported dataset

For Dask there's a dataset too but I don't see it mentioned very often https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-6.0.0/api/kedro_datasets.dask.CSVDataset.html

I meant support for versioning with Delta Tables specifically

@astrojuanlu (Member)

Oh, right kedro-org/kedro-plugins#444

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar (Contributor Author)

Actually, it turns out there's already a section for Delta Lake + Spark: https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#spark-and-delta-lake-interaction. Shall I move it to this page? cc @astrojuanlu

@astrojuanlu (Member)

First time I see that 😂 Yeah! Maybe bring that content into the page you're writing, and replace it with "if you want to work on Delta with PySpark, check out these docs [link]".

@astrojuanlu (Member) left a comment

Made a few style comments, will hold off until you consolidate the Spark Delta stuff in here too

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
├── part-00001-0d522679-916c-4283-ad06-466c27025bcf-c000.snappy.parquet
└── part-00001-42733095-97f4-46ef-bdfd-3afef70ee9d8-c000.snappy.parquet
```
### Load a specific version of the dataset
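A catalog entry for this section could plausibly look like the following (a sketch; `pandas.DeltaTableDataset` forwards `load_args` to delta-rs, and the dataset name, filepath, and version number shown are illustrative):

```yaml
temperature:
  type: pandas.DeltaTableDataset
  filepath: data/01_raw/temperature-delta
  load_args:
    version: 1
```

Without `version` in `load_args`, the latest version of the table is loaded.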
Review comment from a contributor:
Markdown not rendering correctly here?

@lrcouto (Contributor) commented Feb 21, 2025

Looks good, easy enough to follow! Pointed out a couple of minor details.

@ElenaKhaustova (Contributor) left a comment

Looks good, nice job!

Answering the questions for reviewers: I think this level of depth is enough for the first version of the docs. It might be useful to mention the other use cases, with a note that they can be achieved with custom datasets.

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@astrojuanlu (Member) left a comment

Added some minor suggestions, but this is good to go already! Thanks @ankatiyar 🙏🏼

ankatiyar and others added 4 commits February 24, 2025 11:47
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar merged commit 4134f81 into main Feb 24, 2025
10 checks passed
@ankatiyar ankatiyar deleted the docs/versioning branch February 24, 2025 13:33
6 participants