Feature/v0.0.9 #109

Merged
merged 65 commits on Oct 24, 2024
Commits
1fdb82c
Added uc volume for fanout demo
ravi-databricks Sep 25, 2024
9ca140c
Added uc volumes for fanout demo
ravi-databricks Sep 25, 2024
0db98c8
Added:
ravi-databricks Sep 25, 2024
cbf8b29
Removed linting errors
ravi-databricks Sep 25, 2024
3bd552a
Added:
ravi-databricks Sep 26, 2024
edd2bf1
Added uc_volume_path for eventhub demo
ravi-databricks Sep 26, 2024
8529333
1.Fixed uc volume path for eventhub
ravi-databricks Sep 26, 2024
d1aa961
Fixed int tests with uc volume paths and use serverless
ravi-databricks Sep 26, 2024
514a974
Added:
ravi-databricks Sep 27, 2024
6cd66ad
Added basic implementation for apply_changes_from_snapshot
ravi-databricks Sep 27, 2024
81a8b1b
Added:
ravi-databricks Sep 27, 2024
9b7b406
Updated snapshot initializing exception handling
ravi-databricks Sep 30, 2024
437ed56
Corrected:
ravi-databricks Oct 1, 2024
7a6b28f
added integration test outputs to gitignore
dvanderwood Oct 1, 2024
a75ccba
Corrected dataflowpipeline apply_from_snapshot api
ravi-databricks Oct 1, 2024
36ed8ad
Merge pull request #99 from databrickslabs/issue_97
ravi-databricks Oct 1, 2024
8fc401d
uc catalog name consistency in sample code
dvanderwood Oct 1, 2024
04adc91
updated runner notebook upload to .py files from .dbc so updates can …
dvanderwood Oct 2, 2024
4b9bffc
lower the source during the arg parse instead of at each time it is used
dvanderwood Oct 2, 2024
5a38c85
Complete rewrite of the arg parser for the integration tests
dvanderwood Oct 2, 2024
72fe9dc
Update args parsing
dvanderwood Oct 2, 2024
c337f59
begin moving to uc volumes only and getting rid of dbfs as well as si…
dvanderwood Oct 2, 2024
3e1e486
uc resource generation update
dvanderwood Oct 2, 2024
6a8eb47
continued unused file, and dbfs code removal along with code clean-up…
dvanderwood Oct 3, 2024
6e1d831
Added:
ravi-databricks Oct 3, 2024
f360160
update upload to databricks
dvanderwood Oct 4, 2024
c678ac8
only upload wheel to workspace location
dvanderwood Oct 4, 2024
d6020b5
added uc volume upload of the wheel if provided
dvanderwood Oct 4, 2024
acee79b
wheel upload added, all data uploads reworked
dvanderwood Oct 4, 2024
d316554
initial upload and setup for integration testing for cloud files is done
dvanderwood Oct 4, 2024
5ed9632
Added:
ravi-databricks Oct 4, 2024
9287e65
Added image inside demo readme
ravi-databricks Oct 4, 2024
9710e57
Updated demo readme with description matching docs site
ravi-databricks Oct 4, 2024
2db4a0d
Fixed CLI Unit tests
ravi-databricks Oct 5, 2024
f4bd237
Added unit test coverage
ravi-databricks Oct 6, 2024
7b71fc9
job workflow clean-up
dvanderwood Oct 7, 2024
5ee1843
cloud files testing, added back accidentally removed json
dvanderwood Oct 7, 2024
e70f85d
cloud files integration test works
dvanderwood Oct 7, 2024
9e77b1f
formatting
dvanderwood Oct 7, 2024
e78c751
continuing code simplification between 3 integration run types
dvanderwood Oct 8, 2024
1cdb503
redundant doc strings removal
dvanderwood Oct 8, 2024
37652f9
remove early exit
dvanderwood Oct 8, 2024
a6249fe
eventhub_accesskey_name updates
dvanderwood Oct 8, 2024
c740093
fixed job create
dvanderwood Oct 8, 2024
b769973
fixed clean up
dvanderwood Oct 8, 2024
f6bb5fe
redundant message removed
dvanderwood Oct 8, 2024
381531c
formatting
dvanderwood Oct 8, 2024
eb44fa5
non cloud files testing, neither can be confirmed right now
dvanderwood Oct 8, 2024
f6954fb
af cloud demo update
dvanderwood Oct 9, 2024
ec9a317
formatting
dvanderwood Oct 9, 2024
ae27548
formatting
dvanderwood Oct 9, 2024
a60d0e2
linting fixes
dvanderwood Oct 10, 2024
7b17edf
formatting and linting
dvanderwood Oct 14, 2024
0d9975a
removed match syntax since testing on python 3.9
dvanderwood Oct 14, 2024
386db9d
removed other match statement
dvanderwood Oct 14, 2024
05e8b3a
uc volume path for onboarding file test
dvanderwood Oct 14, 2024
b87620a
Merge pull request #101 from dvanderwood/issue_97_98
ravi-databricks Oct 14, 2024
27dadc8
fixed cli related tests and added new unit test for onboard and deploy
ravi-databricks Oct 15, 2024
0da22ce
Added unit test coverage for cli.py
ravi-databricks Oct 15, 2024
93b6fa4
Merge branch 'issue_86' into issue_97
ravi-databricks Oct 16, 2024
f134a71
Merge pull request #105 from databrickslabs/issue_97
ravi-databricks Oct 22, 2024
4b4bc91
fixed:
ravi-databricks Oct 23, 2024
024ef5a
Merge branch 'issue_94' into feature/v0.0.9
ravi-databricks Oct 24, 2024
d0e389d
Revert "Issue 94"
ravi-databricks Oct 24, 2024
4307fdf
Merge pull request #110 from databrickslabs/revert-108-issue_94
ravi-databricks Oct 24, 2024
1 change: 0 additions & 1 deletion .coveragerc
@@ -8,7 +8,6 @@ omit =
src/install.py
src/uninstall.py
src/config.py
src/cli.py

[report]
exclude_lines =
2 changes: 1 addition & 1 deletion .flake8
@@ -1,6 +1,6 @@
[flake8]
ignore = BLK100,E402,W503
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist,dist,.eggs
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist,dist,.eggs,integration_tests/notebooks/*/*.py,demo/notebooks/*/*.py,.venv
builtins = dlt,dbutils,spark,display,log_integration_test,pyspark.dbutils
max-line-length = 120
per-file-ignores =
8 changes: 7 additions & 1 deletion .gitignore
@@ -154,4 +154,10 @@ deployment-merged.yaml
.databricks
.databricks-login.json
demo/conf/onboarding.json
integration_tests/conf/onboarding.json
integration_tests/conf/onboarding*.json
demo/conf/onboarding*.json
databricks.yaml
integration_test_output*.csv

.databricks
databricks.yaml
142 changes: 70 additions & 72 deletions demo/README.md
@@ -1,26 +1,27 @@
# [DLT-META](https://github.com/databrickslabs/dlt-meta) DEMO's
# [DLT-META](https://github.com/databrickslabs/dlt-meta) DEMO's
1. [DAIS 2023 DEMO](#dais-2023-demo): Showcases DLT-META's capabilities of creating Bronze and Silver DLT pipelines with initial and incremental mode automatically.
2. [Databricks Techsummit Demo](#databricks-tech-summit-fy2024-demo): 100s of data sources ingestion in bronze and silver DLT pipelines automatically.
3. [Append FLOW Autoloader Demo](#append-flow-autoloader-file-metadata-demo): Write to same target from multiple sources using [dlt.append_flow](https://docs.databricks.com/en/delta-live-tables/flows.html#append-flows) and adding [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
4. [Append FLOW Eventhub Demo](#append-flow-eventhub-demo): Write to same target from multiple sources using [dlt.append_flow](https://docs.databricks.com/en/delta-live-tables/flows.html#append-flows) and adding [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
5. [Silver Fanout Demo](#silver-fanout-demo): This demo showcases the implementation of fanout architecture in the silver layer.
6. [Apply Changes From Snapshot Demo](#Apply-changes-from-snapshot-demo): This demo showcases the implementation of ingesting from snapshots in bronze layer



# DAIS 2023 DEMO
# DAIS 2023 DEMO
## [DAIS 2023 Session Recording](https://www.youtube.com/watch?v=WYv5haxLlfA)
This Demo launches Bronze and Silver DLT pipelines with following activities:
- Customer and Transactions feeds for initial load
- Adds new feeds Product and Stores to existing Bronze and Silver DLT pipelines with metadata changes.
- Runs Bronze and Silver DLT for incremental load for CDC events

### Steps:
1. Launch Terminal/Command prompt
1. Launch Command Prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
@@ -36,40 +37,26 @@ This Demo launches Bronze and Silver DLT pipelines with following activities:
export PYTHONPATH=$dlt_meta_home
```

6. Run the command ```python demo/launch_dais_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-demo-automated```
- cloud_provider_name : aws or azure or gcp
- db_version : Databricks Runtime Version
- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
6. ```commandline
python demo/launch_dais_demo.py --uc_catalog_name=<<uc catalog name>>
```
- uc_catalog_name : Unity catalog name
- you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token.

- - 6a. Databricks Workspace URL:
- - Enter your workspace URL, with the format https://<instance-name>.cloud.databricks.com. To get your workspace URL, see Workspace instance names, URLs, and IDs.

- - 6b. Token:
- In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop down.

- On the Access tokens tab, click Generate new token.

- (Optional) Enter a comment that helps you to identify this token in the future, and change the token’s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).

- Click Generate.

- Copy the displayed token

- Paste to command prompt
![dais_demo.png](../docs/static/images/dais_demo.png)

# Databricks Tech Summit FY2024 DEMO:
This demo will launch auto generated tables(100s) inside single bronze and silver DLT pipeline using dlt-meta.

1. Launch Terminal/Command promt
1. Launch Command Prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
4. ```commandline
cd dlt-meta
```

@@ -82,30 +69,13 @@ This demo will launch auto generated tables(100s) inside single bronze and silver DLT pipeline using dlt-meta.
export PYTHONPATH=$dlt_meta_home
```

6. Run the command
```commandline
python demo/launch_techsummit_demo.py --source=cloudfiles --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/techsummit-dlt-meta-demo-automated
6. ```commandline
python demo/launch_techsummit_demo.py --uc_catalog_name=<<uc catalog name>>
```
- cloud_provider_name : aws or azure
- db_version : Databricks Runtime Version
- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
- uc_catalog_name : Unity catalog name
- you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token

- - 6a. Databricks Workspace URL:
- Enter your workspace URL, with the format https://<instance-name>.cloud.databricks.com. To get your workspace URL, see Workspace instance names, URLs, and IDs.

- - 6b. Token:
- In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop down.

- On the Access tokens tab, click Generate new token.

- (Optional) Enter a comment that helps you to identify this token in the future, and change the token’s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).

- Click Generate.

- Copy the displayed token

- Paste to command prompt
![tech_summit_demo.png](../docs/static/images/tech_summit_demo.png)


# Append Flow Autoloader file metadata demo:
@@ -114,12 +84,12 @@ This demo will perform following tasks:
- Read from different delta tables and write to same silver table using append_flow API
- Add file_name and file_path to target bronze table for autoloader source using [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
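
The `dlt.append_flow` API and file metadata column referenced above can be sketched in plain DLT Python. This is an illustrative sketch only, not the code dlt-meta generates; the table names, volume path, and column aliases are hypothetical:

```python
import dlt
from pyspark.sql.functions import col

# Single target table that multiple flows append into.
dlt.create_streaming_table("silver_combined")

@dlt.append_flow(target="silver_combined", name="feed_a")
def feed_a():
    # Autoloader source; the _metadata column exposes file_path / file_name.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/<catalog>/<schema>/<volume>/feed_a/")  # hypothetical path
        .select(
            "*",
            col("_metadata.file_path").alias("source_file_path"),
            col("_metadata.file_name").alias("source_file_name"),
        )
    )

@dlt.append_flow(target="silver_combined", name="feed_b")
def feed_b():
    # A second source (here a Delta table) appending into the same target.
    return spark.readStream.table("bronze_feed_b")
```

Both decorated functions run inside a DLT pipeline, where `dlt` and `spark` are provided by the runtime.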

1. Launch Terminal/Command prompt
1. Launch Command Prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
@@ -136,27 +106,24 @@ This demo will perform following tasks:
```

6. ```commandline
python demo/launch_af_cloudfiles_demo.py --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/tmp/DLT-META/demo/ --uc_catalog_name=ravi_dlt_meta_uc
python demo/launch_af_cloudfiles_demo.py --uc_catalog_name=<<uc catalog name>> --source=cloudfiles --cloud_provider_name=aws --profile=<<DEFAULT>>
```

- cloud_provider_name : aws or azure or gcp
- db_version : Databricks Runtime Version
- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
- uc_catalog_name: Unity catalog name
- you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token
- uc_catalog_name : Unity Catalog name
- cloud_provider_name : Which cloud you are using, either AWS, Azure, or GCP
- you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token

![af_am_demo.png](../docs/static/images/af_am_demo.png)

# Append Flow Eventhub demo:
- Read from different eventhub topics and write to same target tables using append_flow API

### Steps:
1. Launch Terminal/Command prompt
1. Launch Command Prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
@@ -176,20 +143,17 @@ This demo will perform following tasks:
- ```commandline
  databricks secrets create-scope eventhubs_dltmeta_creds
  ```
- ```commandline
- ```commandline
databricks secrets put-secret --json '{
"scope": "eventhubs_dltmeta_creds",
"key": "RootManageSharedAccessKey",
"string_value": "<<value>>"
}'
}'
```
- Create databricks secrets to store producer and consumer keys using the scope created in step 2
- Create databricks secrets to store producer and consumer keys using the scope created in step 2

- Following are the mandatory arguments for running EventHubs demo
- cloud_provider_name: Cloud provider name e.g. aws or azure
- dbr_version: Databricks Runtime Version e.g. 15.3.x-scala2.12
- uc_catalog_name : unity catalog name e.g. ravi_dlt_meta_uc
- dbfs_path: Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines e.g. dbfs:/tmp/DLT-META/demo/
- eventhub_namespace: Eventhub namespace e.g. dltmeta
- eventhub_name : Primary Eventhubname e.g. dltmeta_demo
- eventhub_name_append_flow: Secondary eventhub name for appendflow feed e.g. dltmeta_demo_af
@@ -198,8 +162,8 @@ This demo will perform following tasks:
- eventhub_secrets_scope_name: Databricks secret scope name e.g. eventhubs_dltmeta_creds
- eventhub_port: Eventhub port

7. ```commandline
python3 demo/launch_af_eventhub_demo.py --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/tmp/DLT-META/demo/ --uc_catalog_name=ravi_dlt_meta_uc --eventhub_name=dltmeta_demo --eventhub_name_append_flow=dltmeta_demo_af --eventhub_secrets_scope_name=dltmeta_eventhub_creds --eventhub_namespace=dltmeta --eventhub_port=9093 --eventhub_producer_accesskey_name=RootManageSharedAccessKey --eventhub_consumer_accesskey_name=RootManageSharedAccessKey --eventhub_accesskey_secret_name=RootManageSharedAccessKey --uc_catalog_name=ravi_dlt_meta_uc
7. ```commandline
python3 demo/launch_af_eventhub_demo.py --uc_catalog_name=<<uc catalog name>> --eventhub_name=dltmeta_demo --eventhub_name_append_flow=dltmeta_demo_af --eventhub_secrets_scope_name=dltmeta_eventhub_creds --eventhub_namespace=dltmeta --eventhub_port=9093 --eventhub_producer_accesskey_name=RootManageSharedAccessKey --eventhub_consumer_accesskey_name=RootManageSharedAccessKey --eventhub_accesskey_secret_name=RootManageSharedAccessKey
```

![af_eh_demo.png](../docs/static/images/af_eh_demo.png)
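
For context on how the secret scope and access-key arguments above are typically consumed, the sketch below shows the standard Kafka-protocol read for Azure Event Hubs on Databricks. It is an illustration under assumptions (the values mirror the example arguments above; it is not necessarily the exact code dlt-meta generates):

```python
# Values mirror the demo arguments above; adjust to your environment.
eventhub_namespace = "dltmeta"
eventhub_name = "dltmeta_demo"
secret_scope = "eventhubs_dltmeta_creds"
accesskey_name = "RootManageSharedAccessKey"

# The access key is read from the Databricks secret scope created earlier.
accesskey = dbutils.secrets.get(scope=secret_scope, key=accesskey_name)

connection_string = (
    f"Endpoint=sb://{eventhub_namespace}.servicebus.windows.net/;"
    f"SharedAccessKeyName={accesskey_name};SharedAccessKey={accesskey}"
)
jaas_config = (
    "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
    f'username="$ConnectionString" password="{connection_string}";'
)

raw_events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", f"{eventhub_namespace}.servicebus.windows.net:9093")
    .option("subscribe", eventhub_name)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", jaas_config)
    .load()
)
```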
@@ -210,15 +174,15 @@ This demo will perform following tasks:
- Run the onboarding process for the bronze cars table, which contains data from various countries.
- Run the onboarding process for the silver tables, which have a `where_clause` based on the country condition specified in [silver_transformations_cars.json](https://github.com/databrickslabs/dlt-meta/blob/main/demo/conf/silver_transformations_cars.json).
- Run the Bronze DLT pipeline which will produce cars table.
- Run Silver DLT pipeline, fanning out from the bronze cars table to country-specific tables such as cars_usa, cars_uk, cars_germany, and cars_japan.
- Run Silver DLT pipeline, fanning out from the bronze cars table to country-specific tables such as cars_usa, cars_uk, cars_germany, and cars_japan.
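
Conceptually, the fanout described above amounts to one filtered silver table per `where_clause`. dlt-meta drives this declaratively from `silver_transformations_cars.json`; the hand-written sketch below only illustrates the pattern, and the table names and filter values are assumptions:

```python
import dlt

# Hypothetical mapping of target table -> where_clause, mirroring what
# silver_transformations_cars.json expresses declaratively.
fanout_targets = {
    "cars_usa": "country = 'USA'",
    "cars_uk": "country = 'UK'",
    "cars_germany": "country = 'Germany'",
    "cars_japan": "country = 'Japan'",
}

for table_name, where_clause in fanout_targets.items():
    @dlt.table(name=table_name)
    def build_table(where_clause=where_clause):
        # Each silver table is a filtered read of the bronze cars table.
        return dlt.read("cars").where(where_clause)
```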

### Steps:
1. Launch Terminal/Command prompt
1. Launch Command Prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
@@ -231,8 +195,7 @@ This demo will perform following tasks:
```commandline
export PYTHONPATH=$dlt_meta_home
```

6. Run the command ```python demo/launch_silver_fanout_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-silver-fanout```
- cloud_provider_name : aws or azure
6. Run the command ```python demo/launch_silver_fanout_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-silver-fanout```
- db_version : Databricks Runtime Version
- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
- you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token.
@@ -255,4 +218,39 @@ This demo will perform following tasks:

![silver_fanout_workflow.png](../docs/static/images/silver_fanout_workflow.png)

![silver_fanout_dlt.png](../docs/static/images/silver_fanout_dlt.png)
![silver_fanout_dlt.png](../docs/static/images/silver_fanout_dlt.png)


# Apply Changes From Snapshot Demo
- This demo will perform the following steps
- Showcase the onboarding process for the apply changes from snapshot pattern
- Run onboarding for the bronze stores and products tables, which contain snapshot data in csv files.
- Run Bronze DLT to load the initial snapshot (LOAD_1.csv)
- Upload incremental snapshot LOAD_2.csv version=2 for stores and products
- Run Bronze DLT to load the incremental snapshot (LOAD_2.csv). Stores uses scd_type=2, so updated records are expired and new records are added with version_number. Products uses scd_type=1, so records missing from the snapshot are deleted.
- Upload incremental snapshot LOAD_3.csv version=3 for stores and products
- Run Bronze DLT to load the incremental snapshot (LOAD_3.csv). Stores uses scd_type=2, so updated records are expired and new records are added with version_number. Products uses scd_type=1, so records missing from the snapshot are deleted.
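
Under the hood this pattern maps onto the DLT `dlt.apply_changes_from_snapshot` API, which accepts either a source table or a function returning the next snapshot and its version. A minimal sketch for the stores table, assuming a `store_id` key and the LOAD_n.csv naming used above (the volume path and key column are hypothetical):

```python
import dlt

dlt.create_streaming_table("stores")

def next_snapshot_and_version(latest_snapshot_version):
    # Return (DataFrame, version) for the next snapshot, or None when no
    # snapshot newer than latest_snapshot_version exists.
    version = 1 if latest_snapshot_version is None else latest_snapshot_version + 1
    path = f"/Volumes/<catalog>/<schema>/<volume>/stores/LOAD_{version}.csv"  # hypothetical
    try:
        df = spark.read.option("header", True).csv(path)
    except Exception:
        return None
    return df, version

dlt.apply_changes_from_snapshot(
    target="stores",
    source=next_snapshot_and_version,
    keys=["store_id"],        # hypothetical primary key
    stored_as_scd_type=2,     # stores uses SCD2 in this demo; products would use 1
)
```
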
### Steps:
1. Launch Command Prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
cd dlt-meta
```
5. Set the python environment variable in your terminal
```commandline
dlt_meta_home=$(pwd)
```
```commandline
export PYTHONPATH=$dlt_meta_home
```

6. Run the command
```commandline
python demo/launch_acfs_demo.py --uc_catalog_name=<<uc catalog name>>
```
![acfs.png](../docs/static/images/acfs.png)