This section of the docs provides a step-by-step walkthrough of how to test CloudDQ using the default configurations.
Note that the following assumes you have already met the project dependencies outlined in the main README.md.
First clone the project:
```bash
git clone https://github.com/GoogleCloudPlatform/cloud-data-quality.git
cd cloud-data-quality
```
Ensure you have created a GCP project for running the Data Quality Validation jobs.
Then set the project ID as an environment variable and as the main project used by `gcloud`:
```bash
export PROJECT_ID=<replace_with_your_gcp_project_id>
gcloud config set project ${PROJECT_ID}
```
If you encounter the error `No service account scopes specified` when running the above command, run `gcloud auth login` to obtain new credentials and try again.
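As an optional sanity check, you can confirm which account and project `gcloud` is currently using (both are standard `gcloud` commands):

```bash
# List credentialed accounts; the active one is marked with an asterisk.
gcloud auth list
# Print the project that subsequent gcloud/bq commands will target.
gcloud config get-value project
```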
Ensure the project has the BigQuery API enabled:
```bash
gcloud services enable bigquery.googleapis.com
```
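If you want to verify that the API is now enabled, you can check the project's enabled services (an optional check using the standard `gcloud services list` command):

```bash
# Prints the BigQuery API entry only if it is enabled for the project.
gcloud services list --enabled | grep bigquery.googleapis.com
```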
Create the `profiles.yml` (details here) config to connect to BigQuery:
```bash
cp dbt/profiles.yml.template dbt/profiles.yml
sed -i s/\<your_gcp_project_id\>/${PROJECT_ID}/g dbt/profiles.yml
```
You can set the environment variable `CLOUDDQ_BIGQUERY_DATASET` to customize the name of the BigQuery dataset that will contain the BigQuery views corresponding to each rule binding, as well as the `dq_summary` validation outcome table:
```bash
export CLOUDDQ_BIGQUERY_DATASET=cloud_data_quality
```
This environment variable will be automatically picked up by `dbt` from the `profiles.yml` config file.
You can also set the environment variable `CLOUDDQ_BIGQUERY_REGION` to customize the BigQuery region where the BigQuery dataset and BigQuery data validation jobs will be created:
```bash
export CLOUDDQ_BIGQUERY_REGION=EU
```
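For orientation, a dbt BigQuery profile of the kind produced from `profiles.yml.template` typically has the following shape (a hedged sketch using standard dbt profile keys; the template in the repo is authoritative, and the `env_var` defaults shown here are assumptions):

```yaml
default:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth                   # or 'service-account' with a keyfile path
      project: your-gcp-project-id    # substituted by the sed command above
      dataset: "{{ env_var('CLOUDDQ_BIGQUERY_DATASET', 'clouddq') }}"
      location: "{{ env_var('CLOUDDQ_BIGQUERY_REGION', 'EU') }}"
      threads: 1
      timeout_seconds: 600
```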
If you are using OAuth in `profiles.yml` to authenticate to GCP, ensure you are logged in to `gcloud` with Application Default Credentials (ADC):
```bash
gcloud auth application-default login
```
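To confirm that ADC is in place, you can ask `gcloud` to mint a token from the application default credentials (a standard `gcloud` command; it fails if ADC is not configured):

```bash
# Prints "ADC OK" only if Application Default Credentials are configured.
gcloud auth application-default print-access-token >/dev/null && echo "ADC OK"
```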
If you are explicitly providing a service account JSON key in `profiles.yml` for authentication, you can skip the above step.
Edit the `entities` config to use your GCP project ID and your custom `CLOUDDQ_BIGQUERY_DATASET`:
```bash
sed -i s/\<your_gcp_project_id\>/${PROJECT_ID}/g configs/entities/test-data.yml
sed -i s/dq_test/${CLOUDDQ_BIGQUERY_DATASET}/g configs/entities/test-data.yml
```
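For context, an entity record in `configs/entities/test-data.yml` has roughly the following shape (an illustrative sketch only; the field values and column list here are assumptions, so refer to the actual file in the repo):

```yaml
entities:
  TEST_TABLE:
    source_database: BIGQUERY
    project_name: <your_gcp_project_id>  # replaced by the first sed command
    dataset_name: dq_test                # replaced by the second sed command
    table_name: contact_details
    columns:
      VALUE:
        name: value
        data_type: STRING
```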
Install `CloudDQ` in a virtualenv using the instructions in Installing from source. Then test whether you can run the CLI by running:
```bash
python3 clouddq --help
```
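If you have not yet done the from-source install, a minimal sketch of a typical flow looks like the following (the authoritative steps are in the Installing from source instructions; the virtualenv name is arbitrary and the `pip3 install .` step is an assumption about the project layout):

```bash
# Create and activate an isolated Python environment, then install the project.
python3 -m venv clouddq_venv
source clouddq_venv/bin/activate
pip3 install .
```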
Alternatively, you can download a pre-built zip artifact for `CloudDQ` by running:
```bash
wget -O clouddq_executable_v0.2.1.zip https://github.com/GoogleCloudPlatform/cloud-data-quality/releases/download/v0.2.1/clouddq_executable_v0.2.1_linux-amd64.zip
```
Currently, we only provide the self-contained executable zip artifact for Debian/Ubuntu Linux systems; the artifact will not work on macOS or Windows.
Once downloaded, you can use the CLI by passing the zip executable to any Python interpreter:
```bash
python3 clouddq_executable_v0.2.1.zip --help
```
Create the corresponding test table `contact_details` mentioned in the entities config `configs/entities/test-data.yml` by using `bq load`:
```bash
bq mk --location=${CLOUDDQ_BIGQUERY_REGION} ${CLOUDDQ_BIGQUERY_DATASET}
bq load --source_format=CSV --autodetect ${CLOUDDQ_BIGQUERY_DATASET}.contact_details dbt/data/contact_details.csv
```
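To sanity-check the load, you can preview a few rows with the standard `bq head` command:

```bash
# Print the first 5 rows of the newly loaded table.
bq head --max_rows=5 ${CLOUDDQ_BIGQUERY_DATASET}.contact_details
```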
Ensure you have sufficient IAM privileges to create BigQuery datasets and tables in your project.
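For context before running the next step, a rule binding such as `T2_DQ_1_EMAIL` in `configs/rule_bindings/team-2-rule-bindings.yml` has roughly the following shape (an illustrative sketch; the entity, column, filter, and rule IDs shown are assumptions, so see the actual file in the repo):

```yaml
rule_bindings:
  T2_DQ_1_EMAIL:
    entity_id: TEST_TABLE        # references an entity in configs/entities/
    column_id: VALUE             # the column the rules are applied to
    row_filter_id: NONE          # references a row filter config
    rule_ids:
      - NOT_NULL_SIMPLE          # references a rule config
    metadata:
      team: team-2
```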
Run the following command to execute the rule binding `T2_DQ_1_EMAIL` in `configs/rule_bindings/team-2-rule-bindings.yml`:
```bash
python3 clouddq \
  T2_DQ_1_EMAIL \
  configs \
  --metadata='{"test":"test"}' \
  --dbt_profiles_dir=dbt \
  --dbt_path=dbt \
  --environment_target=dev
```
Or, if you are using the pre-built zip file (which only works on Linux systems such as Debian/Ubuntu):
```bash
python3 clouddq_executable_v0.2.1.zip \
  T2_DQ_1_EMAIL \
  configs \
  --metadata='{"test":"test"}' \
  --dbt_profiles_dir=dbt \
  --dbt_path=dbt \
  --environment_target=dev
```
By running this CLI command, `CloudDQ` will:
- convert the YAML configs in `T2_DQ_1_EMAIL` into a SQL file located at `dbt/models/rule_binding_views/T2_DQ_1_EMAIL.sql`
- validate that the SQL is valid using the BigQuery dry-run feature
- create a BigQuery view using this SQL file in the BigQuery dataset specified in `profiles.yml`
- create a BigQuery job to execute the SQL in this view; the BigQuery job will be created in the GCP project specified in `profiles.yml`
- aggregate the validation outcomes using the logic in `dbt/models/data_quality_engine/main.sql`
- write the Data Quality validation results into a table called `dq_summary`
The `dq_summary` table will be automatically created by `CloudDQ` in the GCP project, BigQuery dataset, and BigQuery region specified in `profiles.yml`.
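You can list the rule binding views and the `dq_summary` table that were created in the dataset with the standard `bq ls` command:

```bash
# Lists the tables and views in the configured dataset.
bq ls ${CLOUDDQ_BIGQUERY_DATASET}
```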
To see the resulting DQ validation outcomes in the BigQuery table `dq_summary`, run:
```bash
echo "select * from \`${PROJECT_ID}\`.${CLOUDDQ_BIGQUERY_DATASET}.dq_summary" | bq query --location=${CLOUDDQ_BIGQUERY_REGION} --nouse_legacy_sql --format=json
```
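To narrow the output to per-rule outcomes, you can select specific columns instead of `*` (the column names below are assumptions about the `dq_summary` schema; adjust them to match the schema you see in the full output above):

```bash
# Summarize outcomes per rule binding and rule (column names are illustrative).
echo "select rule_binding_id, rule_id, failed_count
      from \`${PROJECT_ID}\`.${CLOUDDQ_BIGQUERY_DATASET}.dq_summary" \
  | bq query --location=${CLOUDDQ_BIGQUERY_REGION} --nouse_legacy_sql
```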
If you encounter an issue with any of the above steps or have any feedback, please feel free to create a GitHub issue or contact [email protected].