DOT configuration

How the tool connects to the database

A single config file, dot_config.yml, controls how the tool connects to the database; follow the link to see an example file.

This file contains connection parameters for the DOT database and also for any of the project databases (e.g. Muso) for which you want to run the DOT tests. See more details in the paragraph below on how to adapt this and other config files to your needs.

In addition to the database connections handled in dot_config.yml, the different objects generated by the DOT can be stored in different schemas. Read below about the file dbt_project.yml to learn how to define these output schemas.

Running the tool per project

The DOT can be run per project, where configuration and output files for each project are found in the following directories:

  1. mandatory configuration
|____config
| |____dot_config.yml
  2. optional per-project configuration
|____config
| |____<project_name>
| | |____dbt
| | | |____profiles.yml
| | | |____dbt_project.yml
| | |____ge
| | | |____great_expectations.yml
| | | |____config_variables.yml
| | | |____batch_config.json
  3. generated files per project
|____generated_files
| |____<project_name>
| | |____all_tests_summary.xlsx
| | |____ge_clean_results.csv
| | |____dbt_test_coverage_report.txt
| | |____all_tests_rows.xlsx
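
To make the layout above concrete, here is a minimal Python sketch that builds the expected paths for one project. The function name `project_paths` and the idea that both `config` and `generated_files` sit under a common root (here `dot`) are assumptions made for illustration; the sketch is not part of the DOT itself.

```python
from pathlib import Path


def project_paths(root: Path, project_name: str) -> dict[str, list[Path]]:
    """Return the expected config and output paths for one project.

    Hypothetical helper: it only mirrors the directory trees listed above.
    """
    config = root / "config"
    project_config = config / project_name
    generated = root / "generated_files" / project_name
    return {
        # 1. mandatory configuration
        "mandatory": [config / "dot_config.yml"],
        # 2. optional per-project configuration
        "optional": [
            project_config / "dbt" / "profiles.yml",
            project_config / "dbt" / "dbt_project.yml",
            project_config / "ge" / "great_expectations.yml",
            project_config / "ge" / "config_variables.yml",
            project_config / "ge" / "batch_config.json",
        ],
        # 3. generated files per project
        "generated": [
            generated / "all_tests_summary.xlsx",
            generated / "ge_clean_results.csv",
            generated / "dbt_test_coverage_report.txt",
            generated / "all_tests_rows.xlsx",
        ],
    }


if __name__ == "__main__":
    # Report which optional files exist for a project named "Muso".
    for path in project_paths(Path("dot"), "Muso")["optional"]:
        print(path, "found" if path.exists() else "not found")
```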

Configuration per project

All config files are grouped under the config dir. The DOT DB connection details are propagated through Jinja templates to other config files that belong to DBT and Great Expectations. Please follow the guidelines below if you need to customize other configurations.

For the database connection

A single file controls the connections to the DOT database and to any project for which you want to run DOT tests:

  • copy the default dot_config into the top config folder (i.e. as dot/config/dot_config.yml)
  • note that the copied file will be ignored for version control
  • change the necessary parameters for the dot_db connection, e.g. host, dbname
  • add connection parameters for each of the projects you would like to run, with the same structure as the Muso_db entry in the dot_config example, i.e.
<project_name>_db:
  type: connection type e.g. postgres
  host: host
  user: username
  pass: password
  port: port number, e.g. 5432
  dbname: database name
  schema: schema name, e.g. public
  threads: number of threads for DBT, e.g. 4
  • note that the DOT and the project connections must at least use different schemas; they can also live in different databases on the same host, or on different servers
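
The rules in the list above can be checked mechanically. The sketch below assumes the file has already been parsed into a Python dict by a YAML parser, that every connection entry has a name ending in `_db`, and that all eight keys shown above are required. The helper names (`missing_keys`, `shares_schema`, `check_config`) are hypothetical and not part of the DOT.

```python
REQUIRED_KEYS = ("type", "host", "user", "pass", "port", "dbname", "schema", "threads")


def missing_keys(entry: dict) -> list[str]:
    """List the connection keys that are absent from one entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


def shares_schema(a: dict, b: dict) -> bool:
    """True when both entries point at the same host, port, database and schema."""
    return all(a.get(f) == b.get(f) for f in ("host", "port", "dbname", "schema"))


def check_config(config: dict) -> list[str]:
    """Return human-readable problems found in a parsed dot_config."""
    # Assumption: only top-level keys ending in "_db" are connection entries.
    connections = {k: v for k, v in config.items() if k.endswith("_db")}
    problems = []
    if "dot_db" not in connections:
        problems.append("dot_db: entry not found")
    for name, entry in connections.items():
        problems.extend(f"{name}: missing '{key}'" for key in missing_keys(entry))
        if name != "dot_db" and "dot_db" in connections:
            if shares_schema(connections["dot_db"], entry):
                problems.append(f"{name}: uses the same schema as dot_db")
    return problems
```

An empty list means the entries are structurally complete and the DOT and project data are kept apart; it says nothing about whether the credentials actually work.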

Other project-dependent configurations

If you need to edit configurations for DBT and Great Expectations, you will need to change the Jinja templates. In general these customizations are not needed; they only matter in scenarios with particular requirements, and they require a deeper knowledge of the DOT and of DBT and/or Great Expectations.

dbt_project.yml (DBT)

This file goes into the dbt main folder. If you don't need to customise it, DOT uses this Jinja template, after a few project-dependent adjustments:

  • model-paths is set to a subdirectory for the project, i.e. ["models_<project_name>"]
  • test-paths is also set to a subdirectory for the project, i.e. ["tests_<project_name>"]

and the modified version is copied by the DOT into the destination dbt main folder.
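
As an illustration of the two adjustments above, the following sketch applies them to a dbt_project.yml that has already been parsed into a dict. The function name `adjust_dbt_project` is hypothetical, and the DOT itself performs this step through its Jinja template rather than through this code.

```python
import copy


def adjust_dbt_project(template: dict, project_name: str) -> dict:
    """Return a copy of the parsed template with per-project paths set."""
    adjusted = copy.deepcopy(template)  # leave the shared template untouched
    adjusted["model-paths"] = [f"models_{project_name}"]
    adjusted["test-paths"] = [f"tests_{project_name}"]
    return adjusted


if __name__ == "__main__":
    # Hypothetical template content, for demonstration only.
    template = {"name": "dbt_model_1", "model-paths": ["models"], "test-paths": ["tests"]}
    print(adjust_dbt_project(template, "Muso"))
```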

The tool also copies the content of the models folder into the model path for the project, dot/dbt/models/<project_name>/core, and creates the custom SQL tests at dot/dbt/tests/<project_name>.

A common customization is changing the schema in which the objects generated by the DOT are written. See the paragraph just below.

Writing the output objects of the DOT to different schemas

The DOT generates 2 kinds of database objects:

  • entities of the models that are being tested, e.g. assessments, follow ups, patients
  • results of the failing tests

If nothing is done, these objects are created in the same schema as the original data for the project (thus polluting the db).

The following lines added to dbt_project.yml will modify where those objects are stored:

models:
  dbt_model_1:
    core:
      +schema: <schema_suffix>
    test:
      +schema: <schema_suffix>

The value is added as a suffix: if the project data is stored in a certain schema, the output objects will go to <project_schema>_<schema_suffix> (e.g. to public_tests if the project schema is public and the suffix is set to tests in the lines above).

Note that this mechanism uses a DBT feature, and that the same applies to the GE tests.
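
The naming rule described above can be written as a one-line function. This is a sketch of the rule as stated in this README, not code taken from DBT or the DOT; the name `output_schema` is hypothetical.

```python
from typing import Optional


def output_schema(project_schema: str, schema_suffix: Optional[str]) -> str:
    """Compute the schema that receives the DOT output objects."""
    if schema_suffix is None:
        # No +schema configured: objects land next to the project data.
        return project_schema
    return f"{project_schema}_{schema_suffix.strip()}"


if __name__ == "__main__":
    print(output_schema("public", "tests"))
    print(output_schema("public", None))
```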

Finally, although this is not really recommended, you can send the 2 different types of outputs to 2 different schemas:

  • core in the lines above corresponds to the models
  • test corresponds to the failing test results

profiles.yml (DBT)

This DBT configuration file goes into ~/.dbt/profiles.yml. If you don't need to customise it, the tool uses the Jinja template to generate the final config file, filling in the connection parameters for the DOT DB from the dot_config file.

At first sight, there is no good reason to customise this config file.
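
To show how connection parameters can flow from dot_config into a generated profile, here is a small sketch. It uses Python's standard `string.Template` as a stand-in for Jinja, and the profile name `dot`, the target name `dev`, and the function `render_profile` are assumptions made for illustration. The real template shipped with the DOT may be laid out differently.

```python
from string import Template

# Stand-in template; the DOT uses a Jinja template for this step instead.
PROFILE_TEMPLATE = Template(
    """\
dot:
  target: dev
  outputs:
    dev:
      type: $type
      host: $host
      user: $user
      pass: $pass
      port: $port
      dbname: $dbname
      schema: $schema
      threads: $threads
"""
)


def render_profile(connection: dict) -> str:
    """Fill the profile text with the values of one connection entry."""
    # substitute() raises KeyError if a required parameter is missing.
    return PROFILE_TEMPLATE.substitute(connection)


if __name__ == "__main__":
    # Hypothetical values, for demonstration only.
    dot_db = {
        "type": "postgres", "host": "localhost", "user": "dot_user",
        "pass": "change_me", "port": 5432, "dbname": "dot_db",
        "schema": "dot", "threads": 4,
    }
    print(render_profile(dot_db))
```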

great_expectations.yml (GE)

This file goes into the Great Expectations main folder. Starting from this Jinja template, a config file is generated into the destination Great Expectations main folder.

batch_config.json (GE)

This file goes into the Great Expectations main folder. The file is generated from the Jinja template and copied into the destination Great Expectations main folder.

There are no obvious reasons why you may want to customize this file.

config_variables.yml (GE)

Starting from this Jinja template, the GE configuration file is generated into dot/great_expectations/uncommitted/config_variables.yml

At first sight, there is no good reason to customise this config file.