Merge code restructuring change to apply changes from snapshot feature #106

Merged: 39 commits, Oct 16, 2024

Commits (39)
7a6b28f
added integration test outputs to .gitignore
dvanderwood Oct 1, 2024
8fc401d
uc catalog name consistency in sample code
dvanderwood Oct 1, 2024
04adc91
updated runner notebook upload to .py files from .dbc so updates can …
dvanderwood Oct 2, 2024
4b9bffc
lower the source during the arg parse instead of at each time it is used
dvanderwood Oct 2, 2024
5a38c85
Complete rewrite of the arg parser for the integration tests
dvanderwood Oct 2, 2024
72fe9dc
Update args parsing
dvanderwood Oct 2, 2024
c337f59
begin moving to uc volumes only and getting rid of dbfs as well as si…
dvanderwood Oct 2, 2024
3e1e486
uc resource generation update
dvanderwood Oct 2, 2024
6a8eb47
continued unused file, and dbfs code removal along with code clean-up…
dvanderwood Oct 3, 2024
f360160
update upload to databricks
dvanderwood Oct 4, 2024
c678ac8
only upload wheel to workspace location
dvanderwood Oct 4, 2024
d6020b5
added uc volume upload of the wheel if provided
dvanderwood Oct 4, 2024
acee79b
wheel upload added, all data uploads reworked
dvanderwood Oct 4, 2024
d316554
initial upload and setup for integration testing for cloud files is done
dvanderwood Oct 4, 2024
7b71fc9
job workflow clean-up
dvanderwood Oct 7, 2024
5ee1843
cloud files testing, added back accidentally removed json
dvanderwood Oct 7, 2024
e70f85d
cloud files integration test works
dvanderwood Oct 7, 2024
9e77b1f
formatting
dvanderwood Oct 7, 2024
e78c751
continuing code simplification between 3 integration run types
dvanderwood Oct 8, 2024
1cdb503
redundant doc strings removal
dvanderwood Oct 8, 2024
37652f9
remove early exit
dvanderwood Oct 8, 2024
a6249fe
eventhub_accesskey_name updates
dvanderwood Oct 8, 2024
c740093
fixed job create
dvanderwood Oct 8, 2024
b769973
fixed clean up
dvanderwood Oct 8, 2024
f6bb5fe
redundant message removed
dvanderwood Oct 8, 2024
381531c
formatting
dvanderwood Oct 8, 2024
eb44fa5
non cloud files testing, neither can be confirmed right now
dvanderwood Oct 8, 2024
f6954fb
af cloud demo update
dvanderwood Oct 9, 2024
ec9a317
formatting
dvanderwood Oct 9, 2024
ae27548
formatting
dvanderwood Oct 9, 2024
a60d0e2
linting fixes
dvanderwood Oct 10, 2024
7b17edf
formatting and linting
dvanderwood Oct 14, 2024
0d9975a
removed match syntax since testing on python 3.9
dvanderwood Oct 14, 2024
386db9d
removed other match statement
dvanderwood Oct 14, 2024
05e8b3a
uc volume path for onboarding file test
dvanderwood Oct 14, 2024
b87620a
Merge pull request #101 from dvanderwood/issue_97_98
ravi-databricks Oct 14, 2024
27dadc8
fixed cli related tests and added new unit test for onboard and deploy
ravi-databricks Oct 15, 2024
0da22ce
Added unit test coverage for cli.py
ravi-databricks Oct 15, 2024
93b6fa4
Merge branch 'issue_86' into issue_97
ravi-databricks Oct 16, 2024
1 change: 0 additions & 1 deletion .coveragerc
@@ -8,7 +8,6 @@ omit =
src/install.py
src/uninstall.py
src/config.py
src/cli.py

[report]
exclude_lines =
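Dropping `src/cli.py` from the `omit` list means the CLI module now counts toward coverage, which lines up with the new unit tests for `cli.py` in this PR. As a hedged local check (assuming a standard `coverage`/`pytest` setup; the test directory name is a placeholder):

```commandline
pip install coverage pytest
coverage run -m pytest tests/
coverage report --include="src/cli.py"
```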
2 changes: 1 addition & 1 deletion .flake8
@@ -1,6 +1,6 @@
[flake8]
ignore = BLK100,E402,W503
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist,dist,.eggs
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist,dist,.eggs,integration_tests/notebooks/*/*.py,demo/notebooks/*/*.py,.venv
builtins = dlt,dbutils,spark,display,log_integration_test,pyspark.dbutils
max-line-length = 120
per-file-ignores =
6 changes: 5 additions & 1 deletion .gitignore
@@ -154,6 +154,10 @@ deployment-merged.yaml
.databricks
.databricks-login.json
demo/conf/onboarding.json
integration_tests/conf/onboarding.json
integration_tests/conf/onboarding*.json
demo/conf/onboarding*.json
databricks.yaml
integration_test_output*.csv

.databricks
databricks.yaml
37 changes: 19 additions & 18 deletions demo/README.md
@@ -1,4 +1,4 @@
# [DLT-META](https://github.com/databrickslabs/dlt-meta) DEMO's
# [DLT-META](https://github.com/databrickslabs/dlt-meta) DEMO's
1. [DAIS 2023 DEMO](#dais-2023-demo): Showcases DLT-META's capabilities of creating Bronze and Silver DLT pipelines with initial and incremental mode automatically.
2. [Databricks Techsummit Demo](#databricks-tech-summit-fy2024-demo): 100s of data sources ingestion in bronze and silver DLT pipelines automatically.
3. [Append FLOW Autoloader Demo](#append-flow-autoloader-file-metadata-demo): Write to same target from multiple sources using [dlt.append_flow](https://docs.databricks.com/en/delta-live-tables/flows.html#append-flows) and adding [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
@@ -8,7 +8,7 @@



# DAIS 2023 DEMO
# DAIS 2023 DEMO
## [DAIS 2023 Session Recording](https://www.youtube.com/watch?v=WYv5haxLlfA)
This Demo launches Bronze and Silver DLT pipelines with following activities:
- Customer and Transactions feeds for initial load
@@ -21,7 +21,7 @@ This Demo launches Bronze and Silver DLT pipelines with following activities:
2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
@@ -53,10 +53,10 @@ This demo will launch auto generated tables(100s) inside single bronze and silve
2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
4. ```commandline
cd dlt-meta
```

@@ -69,8 +69,8 @@ This demo will launch auto generated tables(100s) inside single bronze and silve
export PYTHONPATH=$dlt_meta_home
```

6. ```commandline
python demo/launch_techsummit_demo.py --uc_catalog_name=<<>>
6. ```commandline
python demo/launch_techsummit_demo.py --uc_catalog_name=<<uc catalog name>>
```
- uc_catalog_name : Unity catalog name
- you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token
@@ -89,7 +89,7 @@ This demo will perform following tasks:
2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
@@ -106,9 +106,10 @@ This demo will perform following tasks:
```

6. ```commandline
python demo/launch_af_cloudfiles_demo.py --uc_catalog_name=<<>>
python demo/launch_af_cloudfiles_demo.py --uc_catalog_name=<<uc catalog name>> --source=cloudfiles --cloud_provider_name=aws --profile=<<DEFAULT>>
```
- uc_catalog_name : Unity Catalog name
- cloud_provider_name : Which cloud you are using, either AWS, Azure, or GCP
- you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token

![af_am_demo.png](../docs/static/images/af_am_demo.png)
@@ -122,7 +123,7 @@ This demo will perform following tasks:
2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
@@ -142,14 +143,14 @@ This demo will perform following tasks:
- ```commandline
  databricks secrets create-scope eventhubs_dltmeta_creds
  ```
- ```commandline
- ```commandline
databricks secrets put-secret --json '{
"scope": "eventhubs_dltmeta_creds",
"key": "RootManageSharedAccessKey",
"string_value": "<<value>>"
}'
}'
```
- Create databricks secrets to store producer and consumer keys using the scope created in step 2
- Create databricks secrets to store producer and consumer keys using the scope created in step 2

- Following are the mandatory arguments for running EventHubs demo
- uc_catalog_name : unity catalog name e.g. ravi_dlt_meta_uc
@@ -161,8 +162,8 @@ This demo will perform following tasks:
- eventhub_secrets_scope_name: Databricks secret scope name e.g. eventhubs_dltmeta_creds
- eventhub_port: Eventhub port

7. ```commandline
python3 demo/launch_af_eventhub_demo.py --uc_catalog_name=<<>> --eventhub_name=dltmeta_demo --eventhub_name_append_flow=dltmeta_demo_af --eventhub_secrets_scope_name=dltmeta_eventhub_creds --eventhub_namespace=dltmeta --eventhub_port=9093 --eventhub_producer_accesskey_name=RootManageSharedAccessKey --eventhub_consumer_accesskey_name=RootManageSharedAccessKey --eventhub_accesskey_secret_name=RootManageSharedAccessKey
7. ```commandline
python3 demo/launch_af_eventhub_demo.py --uc_catalog_name=<<uc catalog name>> --eventhub_name=dltmeta_demo --eventhub_name_append_flow=dltmeta_demo_af --eventhub_secrets_scope_name=dltmeta_eventhub_creds --eventhub_namespace=dltmeta --eventhub_port=9093 --eventhub_producer_accesskey_name=RootManageSharedAccessKey --eventhub_consumer_accesskey_name=RootManageSharedAccessKey --eventhub_accesskey_secret_name=RootManageSharedAccessKey
```

![af_eh_demo.png](../docs/static/images/af_eh_demo.png)
@@ -173,15 +174,15 @@ This demo will perform following tasks:
- Run the onboarding process for the bronze cars table, which contains data from various countries.
- Run the onboarding process for the silver tables, which have a `where_clause` based on the country condition specified in [silver_transformations_cars.json](https://github.com/databrickslabs/dlt-meta/blob/main/demo/conf/silver_transformations_cars.json).
- Run the Bronze DLT pipeline which will produce cars table.
- Run Silver DLT pipeline, fanning out from the bronze cars table to country-specific tables such as cars_usa, cars_uk, cars_germany, and cars_japan.
- Run Silver DLT pipeline, fanning out from the bronze cars table to country-specific tables such as cars_usa, cars_uk, cars_germany, and cars_japan.

### Steps:
1. Launch Command Prompt

2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)

3. ```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
git clone https://github.com/databrickslabs/dlt-meta.git
```

4. ```commandline
@@ -252,4 +253,4 @@ This demo will perform following tasks:
```commandline
python demo/launch_acfs_demo.py --uc_catalog_name=<<>>
```
![acfs.png](../docs/static/images/acfs.png)
![acfs.png](../docs/static/images/acfs.png)
30 changes: 14 additions & 16 deletions demo/launch_af_cloudfiles_demo.py
@@ -7,6 +7,7 @@
get_workspace_api_client,
process_arguments
)
import traceback


class DLTMETAFCFDemo(DLTMETARunner):
@@ -30,6 +31,7 @@ def run(self, runner_conf: DLTMetaRunnerConf):
self.launch_workflow(runner_conf)
except Exception as e:
print(e)
traceback.print_exc()

def init_runner_conf(self) -> DLTMetaRunnerConf:
"""
@@ -44,7 +46,8 @@ def init_runner_conf(self) -> DLTMetaRunnerConf:
runner_conf = DLTMetaRunnerConf(
run_id=run_id,
username=self.wsi._my_username,
int_tests_dir="file:./demo",
uc_catalog_name=self.args["uc_catalog_name"],
int_tests_dir="demo",
dlt_meta_schema=f"dlt_meta_dataflowspecs_demo_{run_id}",
bronze_schema=f"dlt_meta_bronze_demo_{run_id}",
silver_schema=f"dlt_meta_silver_demo_{run_id}",
@@ -54,29 +57,24 @@ def init_runner_conf(self) -> DLTMetaRunnerConf:
cloudfiles_A2_template="demo/conf/cloudfiles-onboarding_A2.template",
onboarding_file_path="demo/conf/onboarding.json",
onboarding_A2_file_path="demo/conf/onboarding_A2.json",
env="demo"
env="demo",
runners_full_local_path='./demo/notebooks/afam_cloudfiles_runners/',
test_output_file_path=(
f"/Users/{self.wsi._my_username}/dlt_meta_demo/"
f"{run_id}/demo-output.csv"
),
)
runner_conf.uc_catalog_name = self.args.__dict__['uc_catalog_name']
runner_conf.runners_full_local_path = './demo/dbc/afam_cloud_files_runners.dbc'

return runner_conf

def launch_workflow(self, runner_conf: DLTMetaRunnerConf):
created_job = self.create_snapshot_workflow_spec(runner_conf)
created_job = self.create_workflow_spec(runner_conf)
self.open_job_url(runner_conf, created_job)


afam_args_map = {
"--profile": "provide databricks cli profile name, if not provide databricks_host and token",
"--uc_catalog_name": "provide databricks uc_catalog name, this is required to create volume, schema, table"
}

afam_mandatory_args = [
"uc_catalog_name"]


def main():
args = process_arguments(afam_args_map, afam_mandatory_args)
workspace_client = get_workspace_api_client(args.profile)
args = process_arguments()
workspace_client = get_workspace_api_client(args["profile"])
dltmeta_afam_demo_runner = DLTMETAFCFDemo(args, workspace_client, "demo")
print("initializing complete")
runner_conf = dltmeta_afam_demo_runner.init_runner_conf()
3 changes: 2 additions & 1 deletion demo/launch_dais_demo.py
@@ -6,7 +6,8 @@
DLTMETARunner,
DLTMetaRunnerConf,
get_workspace_api_client,
process_arguments
process_arguments,
cloud_node_type_id_dict
)


10 changes: 10 additions & 0 deletions demo/notebooks/afam_cloudfiles_runners/init_dlt_meta_pipeline.py
@@ -0,0 +1,10 @@
# Databricks notebook source
dlt_meta_whl = spark.conf.get("dlt_meta_whl")
%pip install $dlt_meta_whl # noqa : E999

# COMMAND ----------

layer = spark.conf.get("layer", None)

from src.dataflow_pipeline import DataflowPipeline
DataflowPipeline.invoke_dlt_pipeline(spark, layer)
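This runner notebook reads `dlt_meta_whl` and `layer` from the pipeline's Spark configuration, so whichever DLT pipeline invokes it must set both keys. A minimal sketch of creating such a pipeline with the Databricks CLI (the name, notebook path, and wheel path below are placeholders, not values from this PR):

```commandline
databricks pipelines create --json '{
  "name": "dlt_meta_bronze_demo",
  "libraries": [
    {"notebook": {"path": "/Workspace/Users/<user>/afam_cloudfiles_runners/init_dlt_meta_pipeline"}}
  ],
  "configuration": {
    "layer": "bronze",
    "dlt_meta_whl": "/Volumes/<catalog>/<schema>/<volume>/dlt_meta-<version>-py3-none-any.whl"
  }
}'
```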
39 changes: 39 additions & 0 deletions demo/notebooks/afam_cloudfiles_runners/validate.py
@@ -0,0 +1,39 @@
# Databricks notebook source
import pandas as pd

run_id = dbutils.widgets.get("run_id")
uc_enabled = eval(dbutils.widgets.get("uc_enabled"))
uc_catalog_name = dbutils.widgets.get("uc_catalog_name")
bronze_schema = dbutils.widgets.get("bronze_schema")
silver_schema = dbutils.widgets.get("silver_schema")
output_file_path = dbutils.widgets.get("output_file_path")
log_list = []

# Assumption is that to get to this notebook Bronze and Silver completed successfully
log_list.append("Completed Bronze DLT Pipeline.")
log_list.append("Completed Silver DLT Pipeline.")

UC_TABLES = {
f"{uc_catalog_name}.{bronze_schema}.transactions": 10002,
f"{uc_catalog_name}.{bronze_schema}.transactions_quarantine": 6,
f"{uc_catalog_name}.{bronze_schema}.customers": 51453,
f"{uc_catalog_name}.{bronze_schema}.customers_quarantine": 256,
f"{uc_catalog_name}.{silver_schema}.transactions": 8759,
f"{uc_catalog_name}.{silver_schema}.customers": 73212,
}


log_list.append("Validating DLT Bronze and Silver Table Counts...")
for table, counts in UC_TABLES.items():
query = spark.sql(f"SELECT count(*) as cnt FROM {table}")
cnt = query.collect()[0].cnt

log_list.append(f"Validating Counts for Table {table}.")
try:
assert int(cnt) == counts
log_list.append(f"Expected: {counts} Actual: {cnt}. Passed!")
except AssertionError:
log_list.append(f"Expected: {counts} Actual: {cnt}. Failed!")

pd_df = pd.DataFrame(log_list)
pd_df.to_csv(output_file_path)
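The notebook writes its log list as a one-column CSV (pandas index plus message) to `output_file_path`, which the runner later copies back locally as the output file referenced in step 10 of the integration README further down in this diff. A hedged post-run sanity check, with the file name as a placeholder:

```commandline
grep -c "Failed!" integration-test-output_<run_id>.txt   # expect 0 failed count checks
grep -c "Passed!" integration-test-output_<run_id>.txt   # expect one line per validated table
```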
29 changes: 14 additions & 15 deletions integration_tests/README.md
@@ -1,7 +1,7 @@
#### Run Integration Tests
1. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
- Once you install Databricks CLI, authenticate your current machine to a Databricks Workspace:

```commandline
databricks auth login --host WORKSPACE_HOST
```
@@ -29,26 +29,26 @@
7. ```commandline
dlt_meta_home=$(pwd)
```

8. ```commandline
export PYTHONPATH=$dlt_meta_home
```

9. Run integration test against cloudfile or eventhub or kafka using below options: If databricks profile configured using CLI then pass ```--profile <profile-name>``` to below command otherwise provide workspace url and token in command line
- 9a. Run the command for **cloudfiles**
```commandline
python integration_tests/run_integration_tests.py --uc_catalog_name= --source=cloudfiles
9. Run integration tests against cloudfiles, eventhub, or kafka using the options below. If a Databricks profile is configured via the CLI, pass ```--profile <profile-name>``` to the command below; otherwise provide the workspace url and token on the command line. You will also need to provide a Unity Catalog catalog in which the schemas, tables, and files will be created.

- 9a. Run the command for **cloudfiles**
```commandline
python integration_tests/run_integration_tests.py --uc_catalog_name=<<uc catalog name>> --source=cloudfiles --cloud_provider_name=aws --profile=<<DEFAULT>>
```

- 9b. Run the command for **eventhub**
```commandline
python integration_tests/run_integration_tests.py --uc_catalog_name=<<>> --source=eventhub --eventhub_name=iot --eventhub_secrets_scope_name=eventhubs_creds --eventhub_namespace=int_test-standard --eventhub_port=9093 --eventhub_producer_accesskey_name=producer --eventhub_consumer_accesskey_name=consumer
```commandline
python integration_tests/run_integration_tests.py --uc_catalog_name=<<uc catalog name>> --source=eventhub --cloud_provider_name=aws --eventhub_name=iot --eventhub_secrets_scope_name=eventhubs_creds --eventhub_namespace=int_test-standard --eventhub_port=9093 --eventhub_producer_accesskey_name=producer --eventhub_consumer_accesskey_name=consumer --eventhub_name_append_flow=test_append_flow --eventhub_accesskey_secret_name=test_secret_name --profile=<<DEFAULT>>
```

- - For eventhub integration tests, the following are the prerequisites:
1. Needs eventhub instance running
2. Use Databricks CLI, Create databricks secrets scope for eventhub keys (```databricks secrets create-scope eventhubs_creds```)
3. Use Databricks CLI, Create databricks secrets to store producer and consumer keys using the scope created in step
3. Use Databricks CLI, Create databricks secrets to store producer and consumer keys using the scope created in step

- - Following are the mandatory arguments for running EventHubs integration test
1. Provide your eventhub topic : --eventhub_name
@@ -61,14 +61,13 @@

- 9c. Run the command for **kafka**
```commandline
python integration_tests/run_integration_tests.py --uc_catalog_name=<<>> --source=kafka --kafka_topic_name=dlt-meta-integration-test --kafka_broker=host:9092
```
python integration_tests/run_integration_tests.py --uc_catalog_name=<<uc catalog name>> --source=kafka --kafka_topic=dlt-meta-integration-test --kafka_broker=host:9092 --cloud_provider_name=aws --profile=DEFAULT

- - For kafka integration tests, the following are the prerequisites:
1. Needs kafka instance running

- - Following are the mandatory arguments for running the kafka integration test
1. Provide your kafka topic name : --kafka_topic_name
1. Provide your kafka topic name : --kafka_topic
2. Provide kafka_broker : --kafka_broker

- 9d. Run the command for **snapshot**
@@ -77,10 +76,10 @@
```


10. Once finished integration output file will be copied locally to
10. Once finished integration output file will be copied locally to
```integration-test-output_<run_id>.txt```

11. Output of a successful run should have the following in the file
11. Output of a successful run should have the following in the file
```
,0
0,Completed Bronze DLT Pipeline.