diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index ca16cffd..40ee8a0d 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -2,7 +2,7 @@
White Rabbit and Rabbit in a Hat are structured as a Maven package. Contributions are welcome.
-While the software in the project can be executed with Java 8 (1.8), for development Java 17 is needed.
+While the software in the project can be executed with Java 8 (1.8), for development Java 17 (or higher, currently tested up to version 21) is needed.
This has to do with test and verification dependencies that are not available in a version compatible with Java 8 .
Please note that when using an IDE for development, source and target release must still be Java 8 (1.8) . This is enforced
@@ -14,11 +14,13 @@ To generate the files ready for distribution, run `mvn install`.
When creating a pull request, please make sure the verification phase (`mvn verify`) does not fail.
+When contributing code, please make sure that `mvn verify` runs without errors before you submit your changes.
+
### Testing
A number of unit and integration tests exist. The integration tests run only in the maven verification phase,
-(`mvn verify`) and depend on docker being available to the user running the verification. If docker is not available, the
-integration tests will fail.
+(`mvn verify`) and depend on Docker being available to the user running the verification. If Docker is not available, the
+integration tests will fail.
When adding test, please follow these conventions:
@@ -47,6 +49,12 @@ These are used for testing of the main White Rabbit and Rabbit in a Hat features
| `riah_input` | An example mapping file used to create the Rabbit in a Hat outputs. |
| `riah_output` | All export formats created by Rabbit in a Hat: as word, html, markdown, sql skeleton and the R TestFramework.
These are all generated from `riah_input/riah_mapping_example.gz`. |
+### Database support
+
+If you are considering adding support for a new type of database, it is recommended to follow the pattern used
+by the SnowflakeHandler class, which implements the StorageHandler interface. This way, the brand/database-specific code
+is isolated in one class, instead of being spread through the code paths that implement support for the
+databases that were added earlier. This leads to clearer code that is also easier to test and debug.
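+
+As a rough sketch of this pattern (illustrative only: the stand-in interface below is simplified, so consult the actual StorageHandler interface and the SnowflakeHandler class for the methods that really need to be implemented):
+
+```java
+import java.util.Collections;
+import java.util.List;
+
+// Toy stand-in for the real StorageHandler interface; the real one defines more methods.
+interface StorageHandler {
+    List<String> getTableNames();
+}
+
+// All MyDb-specific code (SQL dialect, JDBC quirks, metadata queries) lives in this
+// one class, mirroring the approach taken by SnowflakeHandler.
+class MyDbHandler implements StorageHandler {
+    @Override
+    public List<String> getTableNames() {
+        // MyDb-specific query for listing tables would go here
+        return Collections.emptyList();
+    }
+}
+```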
### Snowflake
@@ -68,4 +76,4 @@ and do not relate in any way to any production environment.
The schema should not contain any tables when the test is started.
It is possible to skip the Snowflake tests without failing the build by passing
-`-Dohdsi.org.whiterabbit.skip_snowflake_tests=1` to maven.
\ No newline at end of file
+`-Dohdsi.org.whiterabbit.skip_snowflake_tests=1` to maven.
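+
+For example: `mvn verify -Dohdsi.org.whiterabbit.skip_snowflake_tests=1` runs the verification while skipping only the Snowflake tests.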
diff --git a/README.md b/README.md
index 240f7db1..01074c2f 100644
--- a/README.md
+++ b/README.md
@@ -44,7 +44,7 @@ Requires Java 1.8 or higher for running, and read access to the database to be s
Dependencies
============
-For the distributable packages, the only requirement is Java 8. For building the package, Java 17 and Maven are needed.
+For the distributable packages, the only requirement is Java 8. For building the package, Java 17+ and Maven are needed.
Getting Started
===============
diff --git a/docs/RabbitInAHat.html b/docs/RabbitInAHat.html
index 6e1f7975..4397a9fb 100644
--- a/docs/RabbitInAHat.html
+++ b/docs/RabbitInAHat.html
@@ -380,243 +380,105 @@
Rabbit-In-a-Hat comes with WhiteRabbit and is designed to read and -display a WhiteRabbit scan document. WhiteRabbit generates information -about the source data while Rabbit-In-a-Hat uses that information and -through a graphical user interface to allow a user to connect source -data to tables and columns within the CDM. Rabbit-In-a-Hat generates -documentation for the ETL process it does not generate code to create an -ETL.
+Rabbit-In-a-Hat comes with WhiteRabbit and is designed to read and display a WhiteRabbit scan document. WhiteRabbit generates information about the source data, while Rabbit-In-a-Hat uses that information, through a graphical user interface, to allow a user to connect source data to tables and columns within the CDM. Rabbit-In-a-Hat generates documentation for the ETL process; it does not generate code to create an ETL.
The typical sequence for using this software to generate -documentation of an ETL:
+The typical sequence for using this software to generate documentation of an ETL:
Rabbit-In-a-Hat comes with WhiteRabbit, refer to step 1 and 2 of WhiteRabbit’s installation -section.
+Rabbit-In-a-Hat comes with WhiteRabbit; refer to steps 1 and 2 of WhiteRabbit’s installation section.
To create a new document, navigate to File –> Open Scan -Report. Use the “Open” window to browse for the scan document -created by WhiteRabbit. When a scan report is opened, the tables scanned -will appear in orange boxes on the “Source” side of the Tables.
-Save the Rabbit-In-a-Hat document by going File –> Save -as.
-Side note: the minimal requirement for a valid scan report is to have -the following named columns in the first sheet: ‘Table’, -‘Field’, ‘N rows’ (numeric) and ‘N rows checked’ (numeric).
+To create a new document, navigate to File –> Open Scan Report. Use the “Open” window to browse for the scan document created by WhiteRabbit. When a scan report is opened, the tables scanned will appear in orange boxes on the “Source” side of the Tables.
+Save the Rabbit-In-a-Hat document by going to File –> Save as.
+Side note: the minimal requirement for a valid scan report is to have the following named columns in the first sheet: ‘Table’, ‘Field’, ‘N rows’ (numeric) and ‘N rows checked’ (numeric).
To open an existing Rabbit-In-a-Hat document use File –> Open -ETL specs.
+To open an existing Rabbit-In-a-Hat document use File –> Open ETL specs.
Rabbit-In-a-Hat allows you to select which CDM version (v4, v5 or v6) -you’d like to built your ETL specification against.
-See the graphic below for how to select your desired CDM:
-The CDM version can be changed at any time, but beware that you may -lose some of your existing mappings in the process. By default, -Rabbit-In-a-Hat will attempt to preserve as many mappings between the -source data and the newly selected CDM as possible. When a new CDM is -selected, Rabbit-In-a-Hat will drop any mappings without -warning if the mapping’s CDM table or CDM column name no longer -exists.
-For instance, switching from CDMv4 to CDMv5, a mapping to
-person.person_source_value
will be kept because the person
-table has person_source_value
in both CDMv4 and CDMv5.
-However, person.associated_provider_id
exists only in CDMv4
-(it was renamed to provider_id in CDMv5) and will
-not be kept when switching between these two CDMs.
+Rabbit-In-a-Hat allows you to select which CDM version (v4, v5 or v6) you’d like to build your ETL specification against.
+See the graphic below for how to select your desired CDM:
+The CDM version can be changed at any time, but beware that you may lose some of your existing mappings in the process. By default, Rabbit-In-a-Hat will attempt to preserve as many mappings between the source data and the newly selected CDM as possible. When a new CDM is selected, Rabbit-In-a-Hat will drop any mappings without warning if the mapping’s CDM table or CDM column name no longer exists.
+For instance, switching from CDMv4 to CDMv5, a mapping to person.person_source_value
will be kept because the person table has person_source_value
in both CDMv4 and CDMv5. However, person.associated_provider_id
exists only in CDMv4 (it was renamed to provider_id in CDMv5) and will not be kept when switching between these two CDMs.
There are times when users might need to load in a customized version -of the CDM, for instance if they are sandboxing new features. To load in -a custom CDM schema, first you must create a CSV file that uses the same -format as given -for the CDM versions.
-Once you have created the CSV file, load it into RiaH as shown -below:
+There are times when users might need to load in a customized version of the CDM, for instance if they are sandboxing new features. To load in a custom CDM schema, first you must create a CSV file that uses the same format as given for the CDM versions.
+Once you have created the CSV file, load it into RiaH as shown below:
+Please note that the name of the file you load in becomes the label -that appears above the target tables, so “My Super File.csv” will create -the label “My Super File” above the target tables, so name your CSV -accordingly.
+Please note that the name of the file you load in becomes the label that appears above the target tables: “My Super File.csv” will create the label “My Super File” above the target tables, so name your CSV accordingly.
In some cases a source domains maps to multiple OMOP CDM target -domains. For example lab values that map to both the measurement and -observation domain. Using the stem table will remove some overhead of -repeating the mapping for every target and will also ease implementation -(see below).
-The idea of the stem table is that it contains all the types of -columns that you need regardless of the CDM table the data ultimately -ends up in. There is a pre-specified map from stem to all CDM clinical -event tables, linking every stem field to one or multiple fields in the -CDM. When implementing the ETL, the vocabulary decides where a -particular row mapped to stem table ultimately goes. The OMOP -CDM Data Model Conventions mentions:
+In some cases a source domain maps to multiple OMOP CDM target domains, for example lab values that map to both the measurement and observation domain. Using the stem table will remove some overhead of repeating the mapping for every target and will also ease implementation (see below).
+The idea of the stem table is that it contains all the types of columns that you need regardless of the CDM table the data ultimately ends up in. There is a pre-specified map from stem to all CDM clinical event tables, linking every stem field to one or multiple fields in the CDM. When implementing the ETL, the vocabulary decides where a particular row mapped to the stem table ultimately goes. The OMOP CDM Data Model Conventions mentions:
--Write the data record into the table(s) corresponding to the domain -of the Standard CONCEPT_ID(s).
+Write the data record into the table(s) corresponding to the domain of the Standard CONCEPT_ID(s).
Adding the stem table can be done through Edit –> Add stem -table and removing through Edit –> Remove stem table. -Note that removing the stem table will remove any mappings already made -to/from this table.
+Adding the stem table can be done through Edit –> Add stem table and removing through Edit –> Remove stem table. Note that removing the stem table will remove any mappings already made to/from this table.
-This will add the stem table to the source and target tables and -mappings from source stem table to CDM domains.
+This will add the stem table to the source and target tables and mappings from source stem table to CDM domains.
A number of CDM fields have a limited number of standard
-concept_id(s) that can be used. Examples are:
-gender_concept_id
, _type_concept_id
‘s,
-route_concept_id
and visit_concept_id
. To help
-choose the right concept_id during ETL design, Rabbit-In-a-Hat shows the
-list of possible concept ids of a CDM field when clicking on a target
-field. Note that all standard and non-standard target concepts with the
-right domain are shown, but the OMOP conventions only allow for standard
-concepts (flagged with an ’S’ in the panel).
A number of CDM fields have a limited number of standard concept_id(s) that can be used. Examples are: gender_concept_id
, _type_concept_id
‘s, route_concept_id
and visit_concept_id
. To help choose the right concept_id during ETL design, Rabbit-In-a-Hat shows the list of possible concept ids of a CDM field when clicking on a target field. Note that all standard and non-standard target concepts with the right domain are shown, but the OMOP conventions only allow for standard concepts (flagged with an ’S’ in the panel).
The concept id hints are stored statically in a -csv file and are not automatically updated. The code -used to create the aforementioned csv file is also included in the -repo.
+The concept id hints are stored statically in a csv file and are not automatically updated. The code used to create the aforementioned csv file is also included in the repo.
It is assumed that the owners of the source data should be able to -provide detail of what the data table contains, Rabbit-In-a-Hat will -describe the columns within the table but will not provide the context a -data owner should provide. For the CDM tables, if more information is -needed navigate to the OMOP CDM -documentation and review the current OMOP specification.
-To connect a source table to a CDM table, simply hover over the -source table until an arrow head appears.
+It is assumed that the owners of the source data should be able to provide detail of what the data table contains; Rabbit-In-a-Hat will describe the columns within the table but will not provide the context a data owner should provide. For the CDM tables, if more information is needed navigate to the OMOP CDM documentation and review the current OMOP specification.
+To connect a source table to a CDM table, simply hover over the source table until an arrow head appears.
-Use your mouse to grab the arrow head and drag it to the -corresponding CDM table. In the example below, the drug_claims -data will provide information for the drug_exposure table.
+Use your mouse to grab the arrow head and drag it to the corresponding CDM table. In the example below, the drug_claims data will provide information for the drug_exposure table.
-If you click on the arrow once it will highlight and a -Details window will appear in the right pane. Use this to -describe Logic or Comments that someone developing the ETL code -should know about this source table to CDM table mapping.
+If you click on the arrow once it will highlight and a Details window will appear in the right pane. Use this to describe Logic or Comments that someone developing the ETL code should know about this source table to CDM table mapping.
-Continue this process until all tables that are needed to build a CDM -are mapped to their corresponding CDM tables. One source table can map -to multiple CDM tables and one CDM table can receive multiple mappings. -There may be tables in the source data that should not be map into the -CDM and there may be tables in the CDM that cannot be populated from the -source data.
+Continue this process until all tables that are needed to build a CDM are mapped to their corresponding CDM tables. One source table can map to multiple CDM tables and one CDM table can receive multiple mappings. There may be tables in the source data that should not be mapped into the CDM and there may be tables in the CDM that cannot be populated from the source data.
By double clicking on an arrow connecting a source and CDM table, it -will open a Fields pane below the arrow selected. The -Fields pane will have all the source table and CDM fields and -is meant to make the specific column mappings between tables. Hovering -over a source table will generate an arrow head that can then be -selected and dragged to its corresponding CDM field. For example, in the -drug_claims to drug_exposure table mapping example, -the source data owners know that patient_id is the patient -identifier and corresponds to the CDM.person_id. Also, just as -before, the arrow can be selected and Logic and -Comments can be added.
+Double clicking on an arrow connecting a source and CDM table will open a Fields pane below the selected arrow. The Fields pane will have all the source table and CDM fields and is meant to make the specific column mappings between tables. Hovering over a source table will generate an arrow head that can then be selected and dragged to its corresponding CDM field. For example, in the drug_claims to drug_exposure table mapping example, the source data owners know that patient_id is the patient identifier and corresponds to the CDM.person_id. Also, just as before, the arrow can be selected and Logic and Comments can be added.
-If you select the source table orange box, Rabbit-In-a-Hat will -expose values the source data has for that table. This is meant to help -in the process in understanding the source data and what logic may be -required to handle the data in the ETL. In the example below -ndcnum is selected and raw NDC codes are displayed starting -with most frequent (note that in the WhiteRabbit scan a “Min cell count” -could have been selected and values smaller than that count will not -show).
+If you select the source table orange box, Rabbit-In-a-Hat will expose values the source data has for that table. This is meant to help in the process of understanding the source data and what logic may be required to handle the data in the ETL. In the example below ndcnum is selected and raw NDC codes are displayed starting with the most frequent (note that in the WhiteRabbit scan a “Min cell count” could have been selected, and values smaller than that count will not show).
-Continue this process until all source columns necessary in all -mapped tables have been mapped to the corresponding CDM column. Not all -columns must be mapped into a CDM column and not all CDM columns require -a mapping. One source column may supply information to multiple CDM -columns and one CDM column can receive information from multiple -columns.
+Continue this process until all source columns necessary in all mapped tables have been mapped to the corresponding CDM column. Not all columns must be mapped into a CDM column and not all CDM columns require a mapping. One source column may supply information to multiple CDM columns and one CDM column can receive information from multiple columns.
To generate an ETL MS Word document use File –> Generate ETL -document –> Generate ETL Word document and select a location to -save. The ETL document can also be exported to markdown or html. In this -case, a file per target table is created and you will be prompted to -select a folder. Regardless of the format, the generated document will -contain all mappings and notes from Rabbit-In-a-Hat.
-Once the information is in the document, if an update is needed you -must either update the information in Rabbit-In-a-Hat and regenerate the -document or update the document. If you make changes in the document, -Rabbit-In-a-Hat will not read those changes and update the information -in the tool. However, it is common to generate the document with the -core mapping information and fill in more detail within the -document.
-Once the document is completed, this should be shared with the -individuals who plan to implement the code to execute the ETL. The -markdown and html format enable easy publishing as a web page on -e.g. Github. A good example is the Synthea ETL -documentation.
+To generate an ETL MS Word document use File –> Generate ETL document –> Generate ETL Word document and select a location to save. The ETL document can also be exported to markdown or html. In this case, a file per target table is created and you will be prompted to select a folder. Regardless of the format, the generated document will contain all mappings and notes from Rabbit-In-a-Hat.
+Once the information is in the document, if an update is needed you must either update the information in Rabbit-In-a-Hat and regenerate the document or update the document. If you make changes in the document, Rabbit-In-a-Hat will not read those changes and update the information in the tool. However, it is common to generate the document with the core mapping information and fill in more detail within the document.
+Once the document is completed, it should be shared with the individuals who plan to implement the code to execute the ETL. The markdown and html formats enable easy publishing as a web page on e.g. Github. A good example is the Synthea ETL documentation.
To make sure the ETL process is working as specified, it is highly -recommended creating unit tests that -evaluate the behavior of the ETL process. To efficiently create a set of -unit tests Rabbit-in-a-Hat can generate a testing framework.
+To make sure the ETL process is working as specified, it is highly recommended to create unit tests that evaluate the behavior of the ETL process. To efficiently create a set of unit tests Rabbit-in-a-Hat can generate a testing framework.
The step after documenting your ETL process is to implement it in an -ETL framework of your choice. As many implementations involve SQL, -Rabbit-In-a-Hat provides a convenience function to export your design to -an SQL skeleton. This contains all field to field mappings, with -logic/descriptions as comments, as non-functional pseudo-code. This -saves you copying names into your SQL code, but still requires you to -implement the actual logic. The general format of the skeleton is:
+The step after documenting your ETL process is to implement it in an ETL framework of your choice. As many implementations involve SQL, Rabbit-In-a-Hat provides a convenience function to export your design to an SQL skeleton. This contains all field to field mappings, with logic/descriptions as comments, as non-functional pseudo-code. This saves you copying names into your SQL code, but still requires you to implement the actual logic. The general format of the skeleton is:
INSERT INTO <target_table> (
<target_fields>
)
diff --git a/docs/ReadMe.html b/docs/ReadMe.html
index 1cca2264..cefcfc18 100644
--- a/docs/ReadMe.html
+++ b/docs/ReadMe.html
@@ -377,24 +377,14 @@
White Rabbit Documentation Readme
-This folder contains the raw (.md
) and rendered
-(.html
) documentation of WhiteRabbit. The documentation is
-renderd with the R package rmarkdown
and used for the github.io page.
+This folder contains the raw (.md
) and rendered (.html
) documentation of WhiteRabbit. The documentation is rendered with the R package rmarkdown
and used for the github.io page.
Contribute
-Contributions to the documentation are very welcome and even a must
-when new features are implemented. To update the documentation, edit one
-of the following markdown files or create a new markdown file: - WhiteRabbit.md - RabbitInAHat.md - riah_test_framework.md - best_practices.md
+Contributions to the documentation are very welcome and even a must when new features are implemented. To update the documentation, edit one of the following markdown files or create a new markdown file: WhiteRabbit.md, RabbitInAHat.md, riah_test_framework.md, best_practices.md
Render html
-To generate the site from markdown files, run the following R code
-with the ./docs
folder as working directory.
+To generate the site from markdown files, run the following R code with the ./docs
folder as working directory.
#devtools::install_github("ropenscilabs/icon")
library(rmarkdown)
rmarkdown::render_site()
diff --git a/docs/WhiteRabbit.html b/docs/WhiteRabbit.html
index 9349899c..c7e8bb20 100644
--- a/docs/WhiteRabbit.html
+++ b/docs/WhiteRabbit.html
@@ -380,41 +380,19 @@
Introduction
Scope and purpose
-WhiteRabbit is a software tool to help prepare for ETLs (Extraction,
-Transformation, Loading) of longitudinal health care databases into the
-Observational Medical
-Outcomes Partnership (OMOP) Common Data Model (CDM). The source data
-can be in delimited text files, SAS files, or in a database (MySQL, SQL
-Server, Oracle, PostgreSQL, Microsoft Access, Amazon RedShift, PDW,
-Teradata, Google BigQuery, Azure). Note that for support of the OHDSI
-analytical tooling, the OMOP CDM will need to be in one of a limited set
-of database platforms (SQL Server, Oracle, PostgreSQL, Amazon RedShift,
-Google BigQuery, Impala).
-WhiteRabbit’s main function is to perform a scan of the source data,
-providing detailed information on the tables, fields, and values that
-appear in a field. This scan will generate a report that can be used as
-a reference when designing the ETL, for instance by using the
-Rabbit-In-A-Hat tool. WhiteRabbit differs from standard data profiling
-tools in that it attempts to prevent the display of personally
-identifiable information (PII) data values in the generated output data
-file.
+WhiteRabbit is a software tool to help prepare for ETLs (Extraction, Transformation, Loading) of longitudinal health care databases into the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). The source data can be in delimited text files, SAS files, or in a database (MySQL, SQL Server, Oracle, PostgreSQL, Microsoft Access, Amazon RedShift, PDW, Teradata, Google BigQuery, Azure). Note that for support of the OHDSI analytical tooling, the OMOP CDM will need to be in one of a limited set of database platforms (SQL Server, Oracle, PostgreSQL, Amazon RedShift, Google BigQuery, Impala).
+WhiteRabbit’s main function is to perform a scan of the source data, providing detailed information on the tables, fields, and values that appear in a field. This scan will generate a report that can be used as a reference when designing the ETL, for instance by using the Rabbit-In-A-Hat tool. WhiteRabbit differs from standard data profiling tools in that it attempts to prevent the display of personally identifiable information (PII) data values in the generated output data file.
Process Overview
-The typical sequence for using this software to scan source data in
-preparation of developing an ETL into an OMOP CDM:
+The typical sequence for using this software to scan source data in preparation of developing an ETL into an OMOP CDM:
-- Set working folder, the location on the local desktop computer where
-results will be exported.
-- Connect to the source database or delimited text file and test
-connection.
-- Select the tables to be scanned and execute the WhiteRabbit
-scan.
-- WhiteRabbit creates a ‘ScanReport’ with information about the source
-data.
+- Set working folder, the location on the local desktop computer where results will be exported.
+- Connect to the source database or delimited text file and test connection.
+- Select the tables to be scanned and execute the WhiteRabbit scan.
+- WhiteRabbit creates a ‘ScanReport’ with information about the source data.
-Once the scan report is created, this report can then be used in the
-Rabbit-In-A-Hat tool or as a stand-alone data profiling document.
+Once the scan report is created, this report can then be used in the Rabbit-In-A-Hat tool or as a stand-alone data profiling document.
@@ -422,76 +400,29 @@ Installation and support
Installation
-- Download the latest version of WhiteRabbit from Github: https://github.com/OHDSI/WhiteRabbit/releases/latest.
-The packaged application can be found at the bottom of the page under
-assets, in a file called WhiteRabbit_vX.X.X.zip (where
-X.X.X
is the latest version).
+- Download the latest version of WhiteRabbit from Github: https://github.com/OHDSI/WhiteRabbit/releases/latest.
+The packaged application can be found at the bottom of the page under assets, in a file called WhiteRabbit_vX.X.X.zip (where X.X.X
is the latest version).
- Unzip the download
-- Double-click on
bin/whiteRabbit.bat
on Windows to start
-WhiteRabbit, and bin/whiteRabbit
on macOS and Linux.
-See Running from the command
-line for details on how to run from the command line instead.
-- Go to Using the
-Application Functions for detailed instructions on how to make a
-scan of your data.
+- Double-click on
bin/whiteRabbit.bat
on Windows to start WhiteRabbit, and bin/whiteRabbit
on macOS and Linux.
+See Running from the command line for details on how to run from the command line instead.
+- Go to Using the Application Functions for detailed instructions on how to make a scan of your data.
-Note: on releases earlier than version 0.8.0, open the respective
-WhiteRabbit.jar or RabbitInAHat.jar files instead. Note: WhiteRabbit and
-RabbitInaHat only work from a path with only ascii characters.
+Note: on releases earlier than version 0.8.0, open the respective WhiteRabbit.jar or RabbitInAHat.jar files instead. Note: WhiteRabbit and RabbitInAHat only work from a path containing only ASCII characters.
Memory
-WhiteRabbit possibly does not start when the memory allocated by the
-JVM is too big or too small. By default this is set to 1200m. To
-increase the memory (in this example to 2400m), either set the
-environment variable EXTRA_JVM_ARGUMENTS=-Xmx2400m
before
-starting or edit in bin/WhiteRabbit.bat
the line
-%JAVACMD% %JAVA_OPTS% -Xmx2400m...
. To lower the memory,
-set one of these variables to e.g. -Xmx600m
. If you have a
-32-bit Java VM installed and problems persist, consider installing
-64-bit Java.
+WhiteRabbit may not start if the memory allocated by the JVM is too large or too small. By default this is set to 1200m. To increase the memory (in this example to 2400m), either set the environment variable EXTRA_JVM_ARGUMENTS=-Xmx2400m
before starting, or edit in bin/WhiteRabbit.bat
the line %JAVACMD% %JAVA_OPTS% -Xmx2400m...
. To lower the memory, set one of these variables to e.g. -Xmx600m
. If you have a 32-bit Java VM installed and problems persist, consider installing 64-bit Java.
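+For example (illustrative; adjust to your shell and installation), on macOS or Linux you can start WhiteRabbit with a larger heap by running EXTRA_JVM_ARGUMENTS=-Xmx2400m bin/whiteRabbit from the unzipped WhiteRabbit folder.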
Temporary Directory for Apache POI
-(This addresses issue
-293)
-The Apache POI library is used for generating the scan report in
-Excel format. This library creates its own directory for temporary files
-in the system temporary directory. In issue 293 it
-has been reported that this can cause problems in a multi-user
-environment, when multiple user attempt to create this directory with
-too restrictive permissions (read-only for other users). WhiteRabbit
-from version 0.10.9 attempts to circumvent this automatically, but this
-workaround can fail due to concurrency problems. If you want to prevent
-this from happening entirely , you can set either the environment
-variable ORG_OHDSI_WHITERABBIT_POI_TMPDIR
or the Java
-system property org.ohdsi.whiterabbit.poi.tmpdir
to a
-temporary directory of your choice when starting WhiteRabbit (best would
-be to add this to the whiteRabbit
or
-whiteRabbit.bat
script). Please note that this directory
-should exist before your start WhiteRabbit, and that it should be
-writable by any user that may want to run WhiteRabbit. For each user a
-separate subdirectory will be created, so that permission related
-conflicts should be avoided. Also, WhiteRabbit now attempts to detect
-this situation before the scan starts. If this is detected, the scan is
-not started, and the problem identified before the scan, instead of
-afterwards.
+(This addresses issue 293)
+The Apache POI library is used for generating the scan report in Excel format. This library creates its own directory for temporary files in the system temporary directory. In issue 293 it has been reported that this can cause problems in a multi-user environment, when multiple users attempt to create this directory with too restrictive permissions (read-only for other users). WhiteRabbit from version 0.10.9 attempts to circumvent this automatically, but this workaround can fail due to concurrency problems. If you want to prevent this from happening entirely, you can set either the environment variable ORG_OHDSI_WHITERABBIT_POI_TMPDIR
or the Java system property org.ohdsi.whiterabbit.poi.tmpdir
to a temporary directory of your choice when starting WhiteRabbit (best would be to add this to the whiteRabbit
or whiteRabbit.bat
script). Please note that this directory should exist before you start WhiteRabbit, and that it should be writable by any user that may want to run WhiteRabbit. For each user a separate subdirectory will be created, so that permission-related conflicts should be avoided. Also, WhiteRabbit now attempts to detect this situation before the scan starts. If this is detected, the scan is not started, and the problem is identified before the scan instead of afterwards.
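+For example (the path shown is only illustrative; pick one that suits your environment), you could add -Dorg.ohdsi.whiterabbit.poi.tmpdir=/path/to/shared/whiterabbit-tmp to the java invocation in the whiteRabbit or whiteRabbit.bat script, or set ORG_OHDSI_WHITERABBIT_POI_TMPDIR=/path/to/shared/whiterabbit-tmp in the environment before starting WhiteRabbit.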
Support
-All source code, descriptions and input/output examples are available
-on GitHub: https://github.com/OHDSI/WhiteRabbit
-Any bugs/issues/enhancements should be posted to the GitHub
-repository: https://github.com/OHDSI/WhiteRabbit/issues
-Any questions/comments/feedback/discussion can be posted on the OHDSI
-Developer Forum: http://forums.ohdsi.org/c/developers
+All source code, descriptions and input/output examples are available on GitHub: https://github.com/OHDSI/WhiteRabbit
+Any bugs/issues/enhancements should be posted to the GitHub repository: https://github.com/OHDSI/WhiteRabbit/issues
+Any questions/comments/feedback/discussion can be posted on the OHDSI Developer Forum: http://forums.ohdsi.org/c/developers
@@ -501,160 +432,108 @@ Specifying the Location of Source Data
Working Folder
-Any files that WhiteRabbit creates will be exported to this local
-folder. Use the “Pick Folder” button to navigate in your local
-environment where you would like the scan document to go.
+Any files that WhiteRabbit creates will be exported to this local folder. Use the “Pick Folder” button to navigate in your local environment where you would like the scan document to go.
Source Data
-Here you can specify the location of the source data. The following
-source types are supported: delimited text files, SAS files, MySQL, SQL
-Server, Oracle, PostgreSQL, Microsoft Access, Amazon RedShift, PDW,
-Teradata, Google BigQuery, Azure. Below are connection instructions for
-each data type of data source. Once you have entered the necessary
-information, the “Test connection” button can ensure a connection can be
-made.
+Here you can specify the location of the source data. The following source types are supported: delimited text files, SAS files, MySQL, SQL Server, Oracle, PostgreSQL, Microsoft Access, Amazon RedShift, PDW, Teradata, Google BigQuery, Azure, Snowflake. Below are connection instructions for each type of data source. Once you have entered the necessary information, the “Test connection” button can ensure a connection can be made.
Delimited text files
-- Delimiter: specifies the delimiter that
-separates columns. Enter
tab
for a tab delimited file.
+- Delimiter: specifies the delimiter that separates columns. Enter
tab
for a tab delimited file.
-WhiteRabbit will look for the files to scan in the same folder you
-set up as a working directory.
+WhiteRabbit will look for the files to scan in the same folder you set up as a working directory.
SAS
- No parameters have to be provided for SAS files.
-WhiteRabbit will look for .sas7bdat
files to scan in the
-same folder you set up as a working directory.
-Note that it is currently not possible to produce fake data for SAS
-files from a scan report.
+WhiteRabbit will look for .sas7bdat
files to scan in the same folder you set up as a working directory.
+Note that it is currently not possible to produce fake data for SAS files from a scan report.
MySQL
-- Server location: the name or IP address of
-the server running MySQL. You can also specify the port (ex:
-
<host>:<port>
), which defaults to 3306.
-- User name: name of the user used to log
-into the server
-- Password: password for the supplied user
-name
-- Database name: name of the database
-containing the tables
+- Server location: the name or IP address of the server running MySQL. You can also specify the port (ex:
<host>:<port>
), which defaults to 3306.
+- User name: name of the user used to log into the server
+- Password: password for the supplied user name
+- Database name: name of the database containing the tables
Oracle
-- Server location: this field contains the
-SID, service name, and optionally the port:
-
<host>/<sid>
,
-<host>:<port>/<sid>
,
-<host>/<service name>
, or
-<host>:<port>/<service name>
-- User name: name of the user used to log
-into the server
-- Password: password for the supplied user
-name
-- Database name: this field contains the
-schema (i.e. ‘user’ in Oracle terms) containing the tables
+- Server location: this field contains the SID, service name, and optionally the port:
<host>/<sid>
, <host>:<port>/<sid>
, <host>/<service name>
, or <host>:<port>/<service name>
+- User name: name of the user used to log into the server
+- Password: password for the supplied user name
+- Database name: this field contains the schema (i.e. ‘user’ in Oracle terms) containing the tables
SQL Server
-- Server location: the name or IP address of
-the server running SQL Server. You can also specify the port (ex:
-
<host>:<port>
), which defaults to 1433.
-- User name: name of the user used to log
-into the server. Optionally, the domain can be specified as
-
<domain>/<user>
(e.g. ‘MyDomain/Joe’)
-- Password: password for the supplied user
-name
-- Database name: name of the database
-containing the tables
+- Server location: the name or IP address of the server running SQL Server. You can also specify the port (ex:
<host>:<port>
), which defaults to 1433.
+- User name: name of the user used to log into the server. Optionally, the domain can be specified as
<domain>/<user>
(e.g. ‘MyDomain/Joe’)
+- Password: password for the supplied user name
+- Database name: name of the database containing the tables
-When the SQL Server JDBC drivers are installed, you can also use
-Windows authentication. In this case, user name and password should be
-empty.
+When the SQL Server JDBC drivers are installed, you can also use Windows authentication. In this case, user name and password should be empty.
-- Download the .exe from http://msdn.microsoft.com/en-us/sqlserver/aa937724.aspx.
+- Download the .exe from http://msdn.microsoft.com/en-us/sqlserver/aa937724.aspx.
- Run it, thereby extracting its contents to a folder.
-- In the extracted folder you will find the file
-
_sqljdbc_4.0/enu/auth/x64/sqljdbc_auth.dll_
(64-bits) or
-_sqljdbc_4.0/enu/auth/x86/sqljdbc_auth.dll_
(32-bits),
-which needs to be moved to a location on the system path, for example to
-c:/windows/system32
.
+- In the extracted folder you will find the file
_sqljdbc_4.0/enu/auth/x64/sqljdbc_auth.dll_
(64-bits) or _sqljdbc_4.0/enu/auth/x86/sqljdbc_auth.dll_
(32-bits), which needs to be moved to a location on the system path, for example to c:/windows/system32
.
PostgreSQL
-- Server location: this field contains the
-host name and database name
-(
<host>/<database>
). You can also specify the
-port (ex: <host>:<port>/<database>
),
-which defaults to 5432.
-- User name: name of the user used to log
-into the server
-- Password: password for the supplied user
-name
-- Database name: this field contains the
-schema containing the tables
+- Server location: this field contains the host name and database name (
<host>/<database>
). You can also specify the port (ex: <host>:<port>/<database>
), which defaults to 5432.
+- User name: name of the user used to log into the server
+- Password: password for the supplied user name
+- Database name: this field contains the schema containing the tables
Google BigQuery
-Google BigQuery (GBQ) supports two different
-connection/authentication methods: application default credentials and
-service account authentication. The former method is considered more
-secure because it writes auditing events to stackdriver. The specific
-method used is determined by the arguments provided to the configuration
-panel as described below.
+Google BigQuery (GBQ) supports two different connection/authentication methods: application default credentials and service account authentication. The former method is considered more secure because it writes auditing events to stackdriver. The specific method used is determined by the arguments provided to the configuration panel as described below.
Authentication via application default credentials:
-When using application default credentials authentication, you must
-run the following gcloud command in the user account only once:
-gcloud auth application-default login
(do not include the
-single quote characters). An application key is written to
-~/.config/gcloud/application_default_credentails.json
.
+When using application default credentials authentication, you must run the following gcloud command in the user account only once: gcloud auth application-default login
(do not include the single quote characters). An application key is written to ~/.config/gcloud/application_default_credentails.json
.
-- Server location: name of the GBQ
-ProjectID
+- Server location: name of the GBQ ProjectID
- User name: not used
- Password: not used
-- Database name: data set name within
-ProjectID named in Server location field
+- Database name: data set name within ProjectID named in Server location field
Authentication via service account credentials:
-- Server location: name of GBQ
-ProjectID
-- User name: OAuth service account email
-address
-- Password: OAuth private key path (full
-path to the private key JSON file)
-- Database name: data set name within
-ProjectID named in Server location field
+- Server location: name of GBQ ProjectID
+- User name: OAuth service account email address
+- Password: OAuth private key path (full path to the private key JSON file)
+- Database name: data set name within ProjectID named in Server location field
Azure
-- Server location: server address string
-including database name
-(e.g.
<project>.database.windows.net:1433;database=<database_name>
)
-- User name: name of the user used to log
-into the server
-- Password: password for the supplied user
-name
+- Server location: server address string including database name (e.g.
<project>.database.windows.net:1433;database=<database_name>
)
+- User name: name of the user used to log into the server
+- Password: password for the supplied user name
+
+Snowflake
+
+- Account: the account name for your Snowflake instance
+- User: user name to be used for the Snowflake instance
+- Password: password for the above user
+- Warehouse: warehouse within the Snowflake instance
+- Database: database to be used within the warehouse
+- Schema: schema to be used within the database
+- Authentication method: authentication method to be used. Currently only the value ‘externalbrowser’ is supported
+
+Please note that the fields Password and Authentication method are mutually exclusive: a value should be supplied for only one of them. A warning will be given when a value is supplied for both fields.
+
@@ -662,50 +541,23 @@ Scanning a Database
Performing the Scan
-A scan generates a report containing information on the source data
-that can be used to help design the ETL. Using the Scan tab in
-WhiteRabbit you can either select individual tables in the selected
-source database by clicking on ‘Add’ (Ctrl + mouse click), or
-automatically select all tables in the database by clicking on ‘Add all
-in DB’.
+A scan generates a report containing information on the source data that can be used to help design the ETL. Using the Scan tab in WhiteRabbit you can either select individual tables in the selected source database by clicking on ‘Add’ (Ctrl + mouse click), or automatically select all tables in the database by clicking on ‘Add all in DB’.
There are a few setting options as well with the scan:
-- Checking the “Scan field values” box tells WhiteRabbit that you
-would like to investigate raw data items within tables selected for a
-scan (i.e. if you select Table A, WhiteRabbit will review the contents
-in each column in Table A).
+
- Checking the “Scan field values” box tells WhiteRabbit that you would like to investigate raw data items within tables selected for a scan (i.e. if you select Table A, WhiteRabbit will review the contents in each column in Table A).
-- “Min cell count” is an option when scanning field values. By
-default, this is set to 5, meaning values in the source data that appear
-less than 5 times will not appear in the report.
-- “Rows per table” is an option when scanning field values. By
-default, WhiteRabbit will random 100,000 rows in the table. There are
-other options to review 500,000, 1 million or all rows within the
-table.
-- “Max distinct values” is an option when scanning field values. By
-default, this is set to 1,000, meaning a maximum of 1,000 distinct
-values per field will appear in the scan report. This option can be set
-to 100, 1,000 or 10,000 distinct values.
+- “Min cell count” is an option when scanning field values. By default, this is set to 5, meaning values in the source data that appear less than 5 times will not appear in the report.
+- “Rows per table” is an option when scanning field values. By default, WhiteRabbit will randomly sample 100,000 rows from the table. There are other options to review 500,000, 1 million or all rows within the table.
+- “Max distinct values” is an option when scanning field values. By default, this is set to 1,000, meaning a maximum of 1,000 distinct values per field will appear in the scan report. This option can be set to 100, 1,000 or 10,000 distinct values.
-- Unchecking the “Scan field values” tells WhiteRabbit to not review
-or report on any of the raw data items.
-- Checking the “Numeric stats” box will include numeric statistics.
-See the section on Numerical
-Statistics.
+- Unchecking the “Scan field values” box tells WhiteRabbit not to review or report on any of the raw data items.
+- Checking the “Numeric stats” box will include numeric statistics. See the section on Numerical Statistics.
-Once all settings are completed, press the ‘Scan tables’ button.
-After the scan is completed the report will be written to the working
-folder.
+Once all settings are completed, press the ‘Scan tables’ button. After the scan is completed the report will be written to the working folder.
Running from the command line
-For various reasons one could prefer to run WhiteRabbit from the
-command line. This is possible by specifying all the options one would
-normally select in the user interface in an .ini file. An example ini
-file can be found in the iniFileExamples
-folder. Then, we can reference the ini file when calling WhiteRabbit
-from the command line:
+For various reasons one could prefer to run WhiteRabbit from the command line. This is possible by specifying all the options one would normally select in the user interface in an .ini file. Example ini files can be found in the iniFileExamples folder. WhiteRabbit.ini
is a generic example, and there are also one or more database-specific examples (e.g. Snowflake.ini
). Then, we can reference the ini file when calling WhiteRabbit from the command line, e.g.:
Windows
bin/whiteRabbit.bat -ini WhiteRabbit.ini
Mac/Unix
@@ -713,159 +565,76 @@ Running from the command line
Reading the Scan
-After the scan is completed, a “ScanReport” Excel document will be
-created in the working folder location selected earlier. The document
-will have multiple tabs. The first two tabs are a “Field Overview” tab
-and a “Table Overview” tab. The subsequent tabs contain field and value
-overviews for each database table or delimited text files selected for
-the scan. The last tab (indicated by "_"
) contains metadata
-on the WhiteRabbit settings used to create the scan report. The “Table
-Overview” and "_"
tab are not present in releases earlier
-than v0.10.0.
+After the scan is completed, a “ScanReport” Excel document will be created in the working folder location selected earlier. The document will have multiple tabs. The first two tabs are a “Field Overview” tab and a “Table Overview” tab. The subsequent tabs contain field and value overviews for each database table or delimited text files selected for the scan. The last tab (indicated by "_"
) contains metadata on the WhiteRabbit settings used to create the scan report. The “Table Overview” and "_"
tab are not present in releases earlier than v0.10.0.
Field Overview
-The “Field Overview” tab will show for each table scanned, the
-details for each field. For example the data type, the number of empty
-rows and other statistics.
+The “Field Overview” tab will show for each table scanned, the details for each field. For example the data type, the number of empty rows and other statistics.
- Column A: will list which table the information is about
- Column B: the column name
- Column C: a column description
- Column D: the data type
-- Column E: the maximum length of the values (number of
-characters/digits)
-- Column F: the number of rows (with text files it will return -
-1)
-- Column G: will tell you how many rows of the N rows were
-scanned
+- Column E: the maximum length of the values (number of characters/digits)
+- Column F: the number of rows (with text files it will return - 1)
+- Column G: will tell you how many rows of the N rows were scanned
- Column H: shows how many of the checked rows are empty
-- Column I: shows a count of the unique values within the checked
-rows. This number is sometimes an upper limit of the unique values,
-indicated by a
<=
sign (This column is not present in
-releases earlier than v0.9.0)
-- Column J: shows the percentage of unique values among all (0% =
-constant value, 100% = unique column. This column is not present in
-releases earlier than v0.9.0)
+- Column I: shows a count of the unique values within the checked rows. This number is sometimes an upper limit of the unique values, indicated by a
<=
sign (This column is not present in releases earlier than v0.9.0)
+- Column J: shows the percentage of unique values among all (0% = constant value, 100% = unique column. This column is not present in releases earlier than v0.9.0)
Table Overview
-The “Table Overview” tab gives information about each of the tables
-in the data source. Below is an example image of the “Table Overview”
-tab.
+The “Table Overview” tab gives information about each of the tables in the data source. Below is an example image of the “Table Overview” tab.
- Column A: will list which table the information is about
- Column B: a table description
-- Column C: the number of rows in a table(with text files it will
-return - 1)
-- Column D: will tell you how many rows of the N rows were
-scanned
+- Column C: the number of rows in a table(with text files it will return - 1)
+- Column D: will tell you how many rows of the N rows were scanned
- Column E: the number of fields in the table
- Column F: the number of empty fields
-The “Description” column for both the field and table overview was
-added in v0.10.0. These cells are not populated by WhiteRabbit (with the
-exception when scanning sas7bdat files that contain labels). Rather,
-this field provides a way for the data holder to add descriptions to the
-fields and tables. These descriptions are displayed in Rabbit-In-A-Hat
-when loading the scan report. This is especially useful when the
-fieldnames are abbreviations or in a foreign language.
+The “Description” column for both the field and table overview was added in v0.10.0. These cells are not populated by WhiteRabbit (with the exception when scanning sas7bdat files that contain labels). Rather, this field provides a way for the data holder to add descriptions to the fields and tables. These descriptions are displayed in Rabbit-In-A-Hat when loading the scan report. This is especially useful when the fieldnames are abbreviations or in a foreign language.
Value scans
-If the values of the table have been scanned (described in Performing the Scan), the scan report
-will contain a tab for each scanned table. An example for one field is
-shown below.
+If the values of the table have been scanned (described in Performing the Scan), the scan report will contain a tab for each scanned table. An example for one field is shown below.
-The field names from the source table will be across the columns of
-the Excel tab. Each source field will generate two columns in the Excel.
-One column will list all distinct values that have a “Min cell count”
-greater than what was set at time of the scan. Next to each distinct
-value will be a second column that contains the frequency, or the number
-of times that value occurs in the data. These two columns(distinct
-values and frequency) will repeat for all the source columns in the
-profiled table.
-If a list of unique values was truncated, the last value in the list
-will be "List truncated..."
; this indicates that there are
-one or more additional unique source values that have a frequency lower
-than the “Min cell count”.
-The scan report is powerful in understanding your source data by
-highlighting what exists. For example, the above example was retrieved
-for the “GENDER” column within one of the tables scanned, we can see
-that there were two common values (1 & 2) that appeared 104 and 96
-times respectively. WhiteRabbit will not define “1” as male and “2” as
-female; the data holder will typically need to define source codes
-unique to the source system. However, these two values (1 & 2) are
-not the only values present in the data because we see this list was
-truncated. These other values appear with very low frequency (defined by
-“Min cell count”) and often represent incorrect or highly suspicious
-values. When generating an ETL we should not only plan to handle the
-high-frequency gender concepts “1” and “2” but also the other
-low-frequency values that exist within this column.
+The field names from the source table will be across the columns of the Excel tab. Each source field will generate two columns in the Excel. One column will list all distinct values that have a “Min cell count” greater than what was set at the time of the scan. Next to each distinct value will be a second column that contains the frequency, or the number of times that value occurs in the data. These two columns (distinct values and frequency) will repeat for all the source columns in the profiled table.
+If a list of unique values was truncated, the last value in the list will be "List truncated..."
; this indicates that there are one or more additional unique source values that have a frequency lower than the “Min cell count”.
+The scan report is powerful in understanding your source data by highlighting what exists. For example, the above example was retrieved for the “GENDER” column within one of the tables scanned; we can see that there were two common values (1 & 2) that appeared 104 and 96 times respectively. WhiteRabbit will not define “1” as male and “2” as female; the data holder will typically need to define source codes unique to the source system. However, these two values (1 & 2) are not the only values present in the data because we see this list was truncated. These other values appear with very low frequency (defined by “Min cell count”) and often represent incorrect or highly suspicious values. When generating an ETL we should not only plan to handle the high-frequency gender concepts “1” and “2” but also the other low-frequency values that exist within this column.
Numerical Statistics
-If the option for numerical statistics is checked, then a set of
-statistics is calculated for all integer, real and date data types. The
-following statistics are added to the Field Overview sheet (Columns
-K-Q):
+If the option for numerical statistics is checked, then a set of statistics is calculated for all integer, real and date data types. The following statistics are added to the Field Overview sheet (Columns K-Q):
-- Columns E-J are not shown, see section above for a
-description
+- Columns E-J are not shown, see section above for a description
- Column K: Average
- Column L: Standard Deviation (sampled)
- Column M: Minimum
- Columns N/O/P: Quartiles (sampled)
- Column Q: Maximum
-When selecting the option for scanning numerical statistics, the
-parameter “Numeric stats reservoir size” can be set. This defines the
-number of values that will be stored for calculation of the numeric
-statistics. These values will be randomly sampled from the field values
-in the scan report. If the number of values is smaller than the set
-reservoir size, then the standard deviation and three quartile
-boundaries are the exact population statistics. Otherwise, the
-statistics are approximated based on a representative sample. The
-average, minimum and maximum are always true population statistics. For
-dates, the standard deviation of dates is given in days. The other date
-statistics are converted to a date representation.
+When selecting the option for scanning numerical statistics, the parameter “Numeric stats reservoir size” can be set. This defines the number of values that will be stored for calculation of the numeric statistics. These values will be randomly sampled from the field values in the scan report. If the number of values is smaller than the set reservoir size, then the standard deviation and three quartile boundaries are the exact population statistics. Otherwise, the statistics are approximated based on a representative sample. The average, minimum and maximum are always true population statistics. For dates, the standard deviation of dates is given in days. The other date statistics are converted to a date representation.
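+To illustrate the reservoir idea described above, here is a minimal sketch of reservoir sampling in general (not WhiteRabbit’s actual implementation):
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+
+// Keep at most reservoirSize values; once full, each new value replaces a random
+// slot with decreasing probability, so the retained values form a uniform sample.
+class Reservoir {
+    private final List<Double> values = new ArrayList<>();
+    private final int reservoirSize;
+    private final Random random = new Random();
+    private long seen = 0;
+
+    Reservoir(int reservoirSize) { this.reservoirSize = reservoirSize; }
+
+    void add(double value) {
+        seen++;
+        if (values.size() < reservoirSize) {
+            values.add(value);                 // reservoir not yet full: keep everything (statistics stay exact)
+        } else {
+            long slot = (long) (random.nextDouble() * seen);
+            if (slot < reservoirSize) {
+                values.set((int) slot, value); // replace a random slot (statistics become approximate)
+            }
+        }
+    }
+}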
Generating Fake Data
-This feature allows one to create a fake dataset based on a
-WhiteRabbit scan report. The generated fake data can be outputted
-directly to database tables (MySQL, Oracle, SQL Server, PostgreSQL) or
-as delimited text file. The resulting dataset could be used to develop
-ETL code when direct access to the data is not available.
+This feature allows one to create a fake dataset based on a WhiteRabbit scan report. The generated fake data can be output directly to database tables (MySQL, Oracle, SQL Server, PostgreSQL) or as a delimited text file. The resulting dataset could be used to develop ETL code when direct access to the data is not available.
WhiteRabbit has three modes to generate fake data:
-- If no values have been scanned (i.e. the column in the scan report
-doesn’t contain values), WhiteRabbit will generate random strings or
-numbers for that column.
-- If there are values scanned, WhiteRabbit will generate the data by
-choosing from the scan values. Values are sampled either based on the
-frequencies of the values, or sampled uniformly (if this option
-selected).
-- If the column only contains unique values (each value has a
-frequency of 1, e.g. for primary keys), the generated column will be
-kept unique.
+- If no values have been scanned (i.e. the column in the scan report doesn’t contain values), WhiteRabbit will generate random strings or numbers for that column.
+- If there are values scanned, WhiteRabbit will generate the data by choosing from the scan values. Values are sampled either based on the frequencies of the values, or sampled uniformly (if this option is selected); see the sketch after this list.
+- If the column only contains unique values (each value has a frequency of 1, e.g. for primary keys), the generated column will be kept unique.
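As a rough sketch of these three modes (illustration only; this is not WhiteRabbit’s actual implementation, and the function below is invented), a single column could be faked from its scanned values and frequencies like this:
fake_column <- function(values, freqs, n_rows, uniform_sampling = FALSE) {
  if (length(values) == 0) {
    # Mode 1: nothing was scanned, so generate random 8-character strings
    return(replicate(n_rows, paste0(sample(letters, 8, replace = TRUE), collapse = "")))
  }
  if (all(freqs == 1)) {
    # Mode 3: all scanned values are unique (e.g. a primary key), so keep the output unique
    return(sprintf("ID%07d", seq_len(n_rows)))
  }
  # Mode 2: sample from the scanned values, weighted by frequency or uniformly
  weights <- if (uniform_sampling) rep(1, length(values)) else freqs
  sample(values, n_rows, replace = TRUE, prob = weights)
}
fake_column(values = c("1", "2"), freqs = c(104, 96), n_rows = 10)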
The following options are available for generating fake data:
-- “Max rows per table” sets the number of rows of each output table.
-By default, it is set to 10,000.
-- By checking the “Uniform Sampling” box will generate the fake data
-uniformly. The frequency of each of the values will be treated as being
-1, but the value sampling will still be random. This increases the
-chance that each of the values in the scan report is at least once
-represented in the output data.
+- “Max rows per table” sets the number of rows of each output table. By default, it is set to 10,000.
+- Checking the “Uniform Sampling” box will generate the fake data uniformly. The frequency of each of the values will be treated as being 1, but the value sampling will still be random. This increases the chance that each of the values in the scan report is represented at least once in the output data.
diff --git a/docs/WhiteRabbit.md b/docs/WhiteRabbit.md
index 8790f4f0..f120fe66 100644
--- a/docs/WhiteRabbit.md
+++ b/docs/WhiteRabbit.md
@@ -84,7 +84,7 @@ Use the “Pick Folder” button to navigate in your local environment where you
Here you can specify the location of the source data.
The following source types are supported:
- delimited text files, SAS files, MySQL, SQL Server, Oracle, PostgreSQL, Microsoft Access, Amazon RedShift, PDW, Teradata, Google BigQuery, Azure.
+ delimited text files, SAS files, MySQL, SQL Server, Oracle, PostgreSQL, Microsoft Access, Amazon RedShift, PDW, Teradata, Google BigQuery, Azure, Snowflake.
Below are connection instructions for each type of data source.
Once you have entered the necessary information, the “Test connection” button can ensure a connection can be made.
@@ -160,6 +160,19 @@ Authentication via service account credentials:
* _**User name:**_ name of the user used to log into the server
* _**Password:**_ password for the supplied user name
+#### Snowflake
+
+ * _**Account:**_ the account name for your Snowflake instance
+ * _**User:**_ user name to be used for the Snowflake instance
+ * _**Password:**_ password for the above user
+ * _**Warehouse:**_ warehouse within the Snowflake instance
+ * _**Database:**_ database to be used within the warehouse
+ * _**Schema:**_ schema to be used within the database
+ * _**Authentication method:**_ authentication method to be used. Currently only the value 'externalbrowser' is supported
+
+Please note that the fields _**Password**_ and _**Authentication method**_ are mutually exclusive: a value should be supplied
+for only one of these fields. A warning will be given when a value is supplied for both fields.
+
## Scanning a Database
### Performing the Scan
@@ -185,8 +198,9 @@ Once all settings are completed, press the ‘Scan tables’ button. After the s
For various reasons one could prefer to run WhiteRabbit from the command line.
This is possible by specifying all the options one would normally select in the user interface in an .ini file.
-An example ini file can be found in the [iniFileExamples folder](https://github.com/OHDSI/WhiteRabbit/blob/master/iniFileExamples/WhiteRabbit.ini).
-Then, we can reference the ini file when calling WhiteRabbit from the command line:
+Example ini files can be found in the [iniFileExamples folder](https://github.com/OHDSI/WhiteRabbit/blob/master/iniFileExamples/WhiteRabbit.ini). `WhiteRabbit.ini` is a generic example, and there
+are also one or more database-specific examples (e.g. `Snowflake.ini`).
+Then, we can reference the ini file when calling WhiteRabbit from the command line, e.g.:
**Windows**
diff --git a/docs/best_practices.html b/docs/best_practices.html
index 5df26713..52ac5779 100644
--- a/docs/best_practices.html
+++ b/docs/best_practices.html
@@ -300,49 +300,28 @@ Best Practices
The following lists best practices in using WhiteRabbit and -Rabbit-In-a-Hat to manage your ETL documentation process:
+The following lists best practices in using WhiteRabbit and Rabbit-In-a-Hat to manage your ETL documentation process:
References:
@@ -318,28 +314,16 @@ |
---|
WhiteRabbit is a small application that can be used -to analyse the structure and contents of a database as preparation for -designing an ETL. It comes with RabbitInAHat, an -application for interactive design of an ETL to the OMOP Common Data -Model with the help of the the scan report generated by White -Rabbit.
+WhiteRabbit is a small application that can be used to analyse the structure and contents of a database as preparation for designing an ETL. It comes with RabbitInAHat, an application for interactive design of an ETL to the OMOP Common Data Model with the help of the scan report generated by White Rabbit.
Rabbit in a Hat can generate a framework for creating a set of unit tests. The -framework consists of a set of R functions tailored to the source and -target schema in your ETL. These functions can then be used to define -the unit tests.
-Unit testing assumes that you have your data in source format -somewhere in a database. You should already have created an ETL process -that will extract from the source database, transform it into CDM -format, and load it into a CDM schema. The unit test framework can be -used to make sure that your ETL process is doing what it is supposed to -do. For this you will need to create a new, empty database with exactly -the same structure as your source database, and a new empty database -where a test CDM database will live. The testing framework can be used -to insert test data into the empty source schema. Next, you can run your -ETL process on the test data to populate the test CDM database. Finally, -you can use the framework to verify that the output of the ETL in the -test CDM database is what you’d expect given the test source data.
+Rabbit in a Hat can generate a framework for creating a set of unit tests. The framework consists of a set of R functions tailored to the source and target schema in your ETL. These functions can then be used to define the unit tests.
+Unit testing assumes that you have your data in source format somewhere in a database. You should already have created an ETL process that will extract from the source database, transform it into CDM format, and load it into a CDM schema. The unit test framework can be used to make sure that your ETL process is doing what it is supposed to do. For this you will need to create a new, empty database with exactly the same structure as your source database, and a new empty database where a test CDM database will live. The testing framework can be used to insert test data into the empty source schema. Next, you can run your ETL process on the test data to populate the test CDM database. Finally, you can use the framework to verify that the output of the ETL in the test CDM database is what you’d expect given the test source data.
These are the steps to perform unit testing:
It is advised to use R-Studio -for defining your unit tests. One reason is that RStudio will -automatically prompt you with possible function and argument names after -you’ve only typed the first few characters.
+It is advised to use RStudio for defining your unit tests. One reason is that RStudio will automatically prompt you with possible function and argument names after you’ve only typed the first few characters.
In Rabbit in a Hat, have your ETL specifications open. The source
-data schema should be loaded from the White-Rabbit scan report, and the
-target data schema should be selected (usually the OMOP CDM v5). Go to
-File → Generate ETL Test Framework, and use a file name with
-the .R extension, for example MyTestFrameWork.R
.
In Rabbit in a Hat, have your ETL specifications open. The source data schema should be loaded from the White-Rabbit scan report, and the target data schema should be selected (usually the OMOP CDM v5). Go to File → Generate ETL Test Framework, and use a file name with the .R extension, for example MyTestFrameWork.R.
Next, create an empty R script, and start by sourcing the R file that -was just created:
+Next, create an empty R script, and start by sourcing the R file that was just created:
source("MyTestFrameWork.R")
-Be sure to run this command immediately to make the function -definitions available to R-Studio.
+Be sure to run this command immediately to make the function definitions available to R-Studio.
The test framework defines the following functions for each -table in the source schema:
+The test framework defines the following functions for each table in the source schema:
get_defaults_<table name>
shows the default field
-values that will be used when creating a record in the table. At the
-start, these default values have been taken from the White-Rabbit scan
-report, using the most frequent value.set_defaults_<table name>
can be used to change
-the default values of one or more fields in the table. For example
-set_defaults_enrollment(enrollment_date = "2000-01-01")
.add_<table name>
can be used to specify that a
-record should be created in the table. The arguments can be used to
-specify field values. For fields where the user doesn’t specify a value,
-the default value is used. For example
-add_enrollment(member_id = "M00000001")
- get_defaults_<table name> shows the default field values that will be used when creating a record in the table. At the start, these default values have been taken from the White-Rabbit scan report, using the most frequent value.
- set_defaults_<table name> can be used to change the default values of one or more fields in the table. For example set_defaults_enrollment(enrollment_date = "2000-01-01").
- add_<table name> can be used to specify that a record should be created in the table. The arguments can be used to specify field values. For fields where the user doesn’t specify a value, the default value is used. For example add_enrollment(member_id = "M00000001").
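For example, assuming a source table named enrollment (the table used in the examples above), these functions could be combined as follows:
get_defaults_enrollment()                                # inspect the current default field values
set_defaults_enrollment(enrollment_date = "2000-01-01")  # change the default for one field
add_enrollment(member_id = "M00000001")                  # create a test record; unspecified fields get their defaults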
+The following functions are defined for each table in the CDM schema:
expect_<table name>
can be used to state the
-expectation that at least one record with the defined properties should
-exist in the table. For example
-expect_person(person_id = 1, person_source_value = "M00000001")
.expect_no_<table name>
can be used to state the
-expectation that no record with the defined properties should exist in
-the table. For example
-expect_no_condition_occurrence(person_id = 1)
.expect_count_<table name>
can be used to state
-the expectation that a specific number of records with the defined
-properties should exist in the table. For example
-expect_count_condition_occurrence(person_id = 1, rowCount = 3)
.lookup_<table name>
can be used to get a specific
-value from another table. For example to get the person_id
-by person_source_value
.expect_<table name>
can be used to state the expectation that at least one record with the defined properties should exist in the table. For example expect_person(person_id = 1, person_source_value = "M00000001")
.expect_no_<table name>
can be used to state the expectation that no record with the defined properties should exist in the table. For example expect_no_condition_occurrence(person_id = 1)
.expect_count_<table name>
can be used to state the expectation that a specific number of records with the defined properties should exist in the table. For example expect_count_condition_occurrence(person_id = 1, rowCount = 3)
.lookup_<table name>
can be used to get a specific value from another table. For example to get the person_id
by person_source_value
.One further function is available:
declareTest
is used to group multiple statements under
-a single identifier. For example
-declareTest(id = 1, description = "Test person ID")
.declareTest
is used to group multiple statements under a single identifier. For example declareTest(id = 1, description = "Test person ID")
.Using these functions, we can define tests. Here is an example unit -test:
+Using these functions, we can define tests. Here is an example unit test:
declareTest(101, "Person gender mappings")
add_enrollment(member_id = "M000000101", gender_of_member = "male")
add_enrollment(member_id = "M000000102", gender_of_member = "female")
expect_person(person_id = 101, gender_concept_id = 8507, gender_source_value = "male")
expect_person(person_id = 102, gender_concept_id = 8532, gender_source_value = "female")
-In this example, we define a test for gender mappings. We specify
-that two records should be created in the enrollment
table
-in the source schema, and we specify different values for the
-member_id
field and gender_of_member
field.
-Note that the enrollment
table might have many other
-fields, for example defining the start and end of enrollment, but that
-we don’t have to specify these in this example because these fields will
-take their default values, typically taken from the White-Rabbit scan
-report.
In this example we furthermore describe what we expect to see in the
-CDM data schema. In this case we formulate expectations for the
-person
table.
We can add many such tests to our R script. For examples of a full -set of test definitions, see:
+In this example, we define a test for gender mappings. We specify that two records should be created in the enrollment table in the source schema, and we specify different values for the member_id field and gender_of_member field. Note that the enrollment table might have many other fields, for example defining the start and end of enrollment, but that we don’t have to specify these in this example because these fields will take their default values, typically taken from the White-Rabbit scan report.
In this example we furthermore describe what we expect to see in the CDM data schema. In this case we formulate expectations for the person table.
We can add many such tests to our R script. For examples of a full set of test definitions, see:
For some tests you need unknown values from other cdm tables. In this -case you can use the lookup function for the required table. This -creates where conditions on the other cdm table in the test sql. In the -example below, we do not know which person_id got assigned to this test -person, so we lookup the id by source value:
+For some tests you need unknown values from other CDM tables. In this case you can use the lookup function for the required table. This creates WHERE conditions on the other CDM table in the test SQL. In the example below, we do not know which person_id got assigned to this test person, so we look up the id by source value:
declareTest(101, "Person gender mappings")
add_enrollment(member_id = "M000000103")
add_diagnosis(member_id = "M000000103", code="I10")
@@ -523,10 +450,7 @@ Lookup functions
The framework also contains a function to show statistics on how well -your tests cover your mappings. Note that this does not show information -on whether the tests passed. Only how many source and target tables are -covered by your defined tests.
+The framework also contains a function to show statistics on how well your tests cover your mappings. Note that this does not show information on whether the tests passed, only how many source and target tables are covered by your defined tests.
A summary can be printed by running:
summaryTestFramework()
which displays the following statistics
@@ -541,52 +465,29 @@Statistics:
n_tests
: total number of expects, expect_no’s or
-expect_counts are definedn_cases
: total number of cases defined with
-declareTest
function.n_source_fields_tested
: number of source fields for
-which a test data is definedn_source_fields_mapped_from
: number of source fields
-for which a mapping was defined in Rabbit in a Hatsource_coverage
: percentage of mapped source fields for
-which a test has been definedn_target_fields_tested
: number of target fields for
-which one or more expects, expect_no’s or expect_counts have been
-definedn_target_fields_mapped_to
: number of target fields for
-which a mapping was defined in Rabbit in a Hattarget_coverage
: percentage of mapped target fields for
-which a test has been definedn_tests
: total number of expects, expect_no’s or expect_counts are definedn_cases
: total number of cases defined with declareTest
function.n_source_fields_tested
: number of source fields for which a test data is definedn_source_fields_mapped_from
: number of source fields for which a mapping was defined in Rabbit in a Hatsource_coverage
: percentage of mapped source fields for which a test has been definedn_target_fields_tested
: number of target fields for which one or more expects, expect_no’s or expect_counts have been definedn_target_fields_mapped_to
: number of target fields for which a mapping was defined in Rabbit in a Hattarget_coverage
: percentage of mapped target fields for which a test has been definedNote that the mapping coverages depends on the mappings defined in
-Rabbit in a Hat. If this mapping is incomplete or adjusted in the
-meantime, the target_coverage
is possibly incorrect. In
-this case, please update the mappings in Rabbit in a Hat and regenerate
-the testing framework.
You can get all source and target field for which no test has been -defined with the following functions:
+Note that the mapping coverage depends on the mappings defined in Rabbit in a Hat. If this mapping is incomplete or was adjusted in the meantime, the target_coverage is possibly incorrect. In this case, please update the mappings in Rabbit in a Hat and regenerate the testing framework.
You can get all source and target fields for which no test has been defined with the following functions:
getUntestedSourceFields()
getUntestedTargetFields()
There are two ways to generate test data, either as SQL insert -statements or as csv files. Please choose the format that is appropriate -for your ETL application.
+There are two ways to generate test data, either as SQL insert statements or as csv files. Please choose the format that is appropriate for your ETL application.
After we have defined all our tests we need to run
insertSql <- generateInsertSql(databaseSchema = "nativeTestSchema")
testSql <- generateTestSql(databaseSchema = "cdmTestSchema")
-to generate the SQL for inserting the test data in the database
-(insertSql), and for running the tests on the ETL-ed data (testSql). The
-insertion SQL assumes that the data schema already exists in
-nativeTestSchema
, and will first remove any records that
-might be in the tables. We can execute the SQL in any SQL client, or we
-can use OHDSI’s DatabaseConnector
-package. For example:
to generate the SQL for inserting the test data in the database (insertSql), and for running the tests on the ETL-ed data (testSql). The insertion SQL assumes that the data schema already exists in nativeTestSchema, and will first remove any records that might be in the tables. We can execute the SQL in any SQL client, or we can use OHDSI’s DatabaseConnector package. For example:
library(DatabaseConnector)
connectionDetails <- createConnectionDetails(user = "joe",
password = "secret",
@@ -598,29 +499,20 @@ SQL
In case the source data are csv files rather than database tables, we -use this function:
+In case the source data are csv files rather than database tables, we use this function:
generateSourceCsv(directory = "test_data", separator = ",")
And point the ETL to the given directory with test data.
Now that the test source data is populated, you can run the ETL
-process you would like to test. The ETL should transform the data in
-nativeTestSchema
, or in the csv directory, to CDM data in
-cdmTestSchema
.
Now that the test source data is populated, you can run the ETL process you would like to test. The ETL should transform the data in nativeTestSchema, or in the csv directory, to CDM data in cdmTestSchema.
The test SQL will create a table called test_results
in
-cdmTestSchema
, and populate it with the results of the
-tests. If the table already exists it will first be dropped. Again, we
-could use any SQL client to run this SQL, or we could use
-DatabaseConnector:
The test SQL will create a table called test_results in cdmTestSchema, and populate it with the results of the tests. If the table already exists it will first be dropped. Again, we could use any SQL client to run this SQL, or we could use DatabaseConnector:
executeSql(connection, paste(testSql, collapse = "\n"))
-Afterwards, we can query the results table to see the results for -each test:
+Afterwards, we can query the results table to see the results for each test:
querySql(connection, "SELECT * FROM test_results")
Which could return this table:
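For example (illustrative values only; they merely reflect the two passing expectations described below):
ID   DESCRIPTION              TEST            STATUS
101  Person gender mappings   Expect person   PASS
101  Person gender mappings   Expect person   PASS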
In this case we see there were two expect statements under test 101 -(Person gender mappings), and both expectations were met so the test -passed.
-The testing framework also contains a convenience function to display -your test results:
+In this case we see there were two expect statements under test 101 (Person gender mappings), and both expectations were met so the test passed.
+The testing framework also contains a convenience function to display your test results:
outputTestResultsSummary(connection, 'cdmTestSchema')
-which either displays a success message
-All 178 tests PASSED
or the failed unit tests:
which either displays a success message (All 178 tests PASSED) or the failed unit tests:
FAILED unit tests: 1/178 (0.6%)
ID DESCRIPTION TEST STATUS
2 1 RMG-PD1 is assigned person_id Expect person FAIL
We can create an overview of defined tests and export it, for example -if you want to list the tests separately.
+We can create an overview of defined tests and export it, for example if you want to list the tests separately.
getTestsOverview()
exportTestsOverviewToFile(filename = "all_test_cases.csv")
-The first function produces a table like below and the second writes -it to a csv file. The output contains the following columns:
+The first function produces a table like below and the second writes it to a csv file. The output contains the following columns: