Skip to content

Commit

Permalink
more sql snippets
Browse files Browse the repository at this point in the history
ref #7
  • Loading branch information
wibeasley committed Jul 4, 2020
1 parent 9494630 commit 775d409
Showing 1 changed file with 122 additions and 0 deletions.
122 changes: 122 additions & 0 deletions ch-file-prototype-sql.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,130 @@ In some scenarios, it is desirable to use the `INSERT` SQL command to transfer d

In both cases, we try to write the SQL files to conform to similar standards and conventions. As stated in [Consistency across Files](#consistency-files) (and in the [previous chapter](#file-prototype-r)), using a consistent file structure can (a) improve the quality of the code because the structure has been proven over time to facilitate good practices and (b) allow your intentions to be more clear to teammates because they are familiar with the order and intentions of the chunks.

Ferry {#sql-ferry}
------------------------------------

This basic sql file moves data within a database to create a table named `dx`, which is contained in the `ley_covid_1` schema of the `cdw_staging` database.

```sql
--use cdw_staging
declare @start_date date = '2020-02-01'; -- sync with config.yml
declare @stop_date date = dateadd(day, -1, cast(getdate() as date)); -- sync with config.yml

DROP TABLE if exists ley_covid_1.dx;
CREATE TABLE ley_covid_1.dx(
dx_id int identity(1, 1) primary key,
patient_id int not null,
covid_confirmed bit not null,
problem_date date null,
icd10_code varchar(20) not null
);
-- TRUNCATE TABLE ley_covid_1.dx;

INSERT INTO ley_covid_1.dx
SELECT
pr.patient_id
,ss.covid_confirmed
,pr.invoice_date as problem_date
,pr.code as icd10_code
-- into ley_covid_1.dx
FROM cdw.star_1.fact_problem as pr
inner join beasley_covid_1.ss_dx as ss on pr.code = ss.icd10_code
WHERE
pr.problem_date_start between @start_date and @stop_date
and
pr.patient_id is not null
ORDER BY pr.problem_date_start

CREATE INDEX ley_covid_1_dx_patient_id on ley_covid_1.dx (patient_id);
CREATE INDEX ley_covid_1_dx_icd10_code on ley_covid_1.dx (icd10_code);
```

Default Databases {#sql-default-database}
------------------------------------

We prefer not to specify the database of each table, and instead control it through the connection (such as the DSN's "default database" value). Nevertheless, it's helpful to include the default database behind a comment for two reasons. First, it communicates to the default database to the human reader. Second, during debugging, the code can be highlighted in [ADS](#workstation-ads)/[SSMS](#workstation-ssms) and executed with "F5"; this will mimic what happens when the file is run via automation with a DSN.

```sql
--use cdw_staging
```

Declare Values Databases {#sql-default-database}
------------------------------------

Similar to the [Declare Globals](#chunk-declare) chunk in a [prototypical R file](file-prototype-r), values set at the top of the file are easy to read and modify.

```sql
declare @start_date date = '2020-02-01'; -- sync with config.yml
declare @stop_date date = dateadd(day, -1, cast(getdate() as date)); -- sync with config.yml
```

Recreate Table {#sql-recreate}
------------------------------------

When batch-loading data, it is typically easiest drop and recreate a database table. In the snippet below, any table with the specific name is dropped/deleted from the database and replaced with a (possibly new) definition. We like to dedicate a line to each table column, with at least three elements per line: the name, the [data type](https://docs.microsoft.com/en-us/sql/t-sql/data-types/data-types-transact-sql), and if nulls are allowed.

Many other features and keywords are available when designing tables. The ones we occasionally use are:

1. [`primary key`](https://www.w3schools.com/sql/sql_primarykey.ASP) helps database optimization when later querying the table, and enforces uniqueness, such as a patient table should not have any two rows with the same `patient_id` value. Primary keys must be nonmissing, so the `not null` keyword is redundant.
1. [`unique`](https://www.w3schools.com/sql/sql_unique.asp) is helpful when a table has additional columns that need to be unique (such as `patient_ssn` and `patient_id`). A more advanced scenario using a [clustered columnar table](https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview#when-should-i-use-a-columnstore-index), which is incompatible with the `primary key` designation.
1. `identity(1, 1)` creates a 1, 2, 3, ... sequence, which relieves the client of creating the sequence with something like [`row_number()`](https://docs.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql). Note that when identity column exists, the number columns in the `SELECT` clause will be one fewer than the columns defined in `CREATE TABLE`.

```sql
DROP TABLE if exists ley_covid_1.dx;
CREATE TABLE ley_covid_1.dx(
dx_id int identity(1, 1) primary key,
patient_id int not null,
covid_confirmed bit not null,
problem_date date null,
icd10_code varchar(20) not null
);
```

To jump-start the creation of the table definition, we frequently use the [`INTO`](https://docs.microsoft.com/en-us/sql/t-sql/queries/select-into-clause-transact-sql) clause. This operation creates a new table, informed the column properties of the source tables. Within [ADS](#workstation-ads) and [SSMS](#workstation-ssms), refresh the list of tables and select the new table; there will be an option to copy the `CREATE TABLE` statement (similar to the snippet above) and paste it into the sql file. The definition can then be modified, such as tightening from `null` to `not null`.

```sql
-- into ley_covid_1.dx
```

Truncate Table {#sql-truncate}
------------------------------------

In scenarios where the table definition is stable and the data is refreshed frequently (say, daily), consider [TRUNCATE](https://docs.microsoft.com/en-us/sql/t-sql/statements/truncate-table-transact-sql)-ing the table. When taking this approach, we prefer to keep the `DROP` and `CREATE` code in the file, but commented out. This saves development time in the future if the table definition needs to be modified.

```sql
-- TRUNCATE TABLE ley_covid_1.dx;
```

```sql
INSERT INTO ley_covid_1.dx
```

```sql
SELECT
pr.patient_id
,ss.covid_confirmed
,pr.invoice_date as problem_date
,pr.code as icd10_code
```

```sql
FROM cdw.star_1.fact_problem as pr
inner join beasley_covid_1.ss_dx as ss on pr.code = ss.icd10_code
```

```sql
WHERE
pr.problem_date_start between @start_date and @stop_date
and
pr.patient_id is not null
```

```sql
ORDER BY pr.problem_date_start
```

```sql
CREATE INDEX ley_covid_1_dx_patient_id on ley_covid_1.dx (patient_id);
CREATE INDEX ley_covid_1_dx_icd10_code on ley_covid_1.dx (icd10_code);
```

0 comments on commit 775d409

Please sign in to comment.