starting database chapter
ref #7
wibeasley committed Jul 3, 2020
1 parent 5a2f9c9 commit b0912e2
Showing 4 changed files with 26 additions and 5 deletions.
3 changes: 2 additions & 1 deletion _bookdown.yml
@@ -10,7 +10,8 @@ rmd_files:
"index.Rmd",
"ch-coding.md",
"ch-architecture.md",
"ch-file-prototype.md",
"ch-file-prototype-r.md",
"ch-file-prototype-sql.md",
"ch-repo-prototype.md",
"ch-rest.md",
"ch-patterns.md",
4 changes: 2 additions & 2 deletions ch-file-prototype.md → ch-file-prototype-r.md
@@ -1,7 +1,7 @@
Prototypical File {#file-prototype}
Prototypical R File {#file-prototype-r}
====================================

As stated before, in [Consistency across Files](#consistency-files), using a consistent file structure can (a) improve the quality of the code because the structure has been proven over time to facilitate good practices and (b) allow your intentions to be more clear to teammates because they are familiar with the order and intentions of the chunks.
As stated in [Consistency across Files](#consistency-files), using a consistent file structure can (a) improve the quality of the code because the structure has been proven over time to facilitate good practices and (b) allow your intentions to be more clear to teammates because they are familiar with the order and intentions of the chunks.

We use the term "chunk" for a section of code because it corresponds with knitr terminology [@xie2015], and in many analysis files (as opposed to manipulation files), the chunk of our R file connects to a knitr Rmd file.

20 changes: 20 additions & 0 deletions ch-file-prototype-sql.md
@@ -0,0 +1,20 @@
Prototypical SQL File {#file-prototype-sql}
====================================

New data scientists typically import entire tables from a database into R, and then merge, filter, and groom the data.frames. A more efficient approach is to submit [SQL](https://en.wikipedia.org/wiki/SQL) that executes on the database and returns a more specialized dataset.

This provides several advantages:

1. A database will be much more efficient when filtering and [joining](https://www.w3schools.com/sql/sql_join.asp) tables than any programming language, such as R or Python. A well-designed database will have [indexed columns](https://en.wikipedia.org/wiki/Database_index#:~:text=A%20database%20index%20is%20a,maintain%20the%20index%20data%20structure.) and other optimizations that surpass R and Python capabilities.
1. A database can handle datasets that are thousands of times larger than what R and Python can accommodate in RAM. For large datasets, database engines persist the data on a hard drive (instead of just RAM) and are optimized to read the necessary information into RAM the moment before it is needed, and then return the processed data to disk before progressing to the next block of data.
1. Frequently, only a portion of the table's rows and columns are ultimately needed by the analysis. Reducing the size of the dataset leaving the database has two benefits: less information travels across the network, and R's and Python's limited memory space is conserved.
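
The contrast above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module with an in-memory database standing in for a real server, and the `patient` and `visit` tables, their columns, and the index name are hypothetical. The first approach imports both tables and joins them in the client; the second submits SQL so that only the specialized rows leave the database.

```python
import sqlite3

# Hypothetical in-memory database standing in for a real database server.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, county TEXT);
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,
        patient_id INTEGER REFERENCES patient (patient_id),
        visit_year INTEGER
    );
    -- An indexed column helps the engine join and filter efficiently.
    CREATE INDEX ix_visit_patient ON visit (patient_id);
    INSERT INTO patient VALUES (1, 'Tulsa'), (2, 'Cleveland'), (3, 'Tulsa');
    INSERT INTO visit VALUES (10, 1, 2019), (11, 1, 2020), (12, 2, 2020), (13, 3, 2018);
""")

# Approach 1: import entire tables, then join and filter in the client.
patients = con.execute("SELECT * FROM patient").fetchall()
visits = con.execute("SELECT * FROM visit").fetchall()
county_by_id = {patient_id: county for patient_id, county in patients}
client_side = sorted(
    (visit_id, patient_id)
    for visit_id, patient_id, visit_year in visits
    if visit_year == 2020 and county_by_id[patient_id] == "Tulsa"
)

# Approach 2: the database joins and filters; only the needed rows
# and columns are returned to the client.
sql = """
    SELECT v.visit_id, v.patient_id
    FROM visit v
      INNER JOIN patient p ON v.patient_id = p.patient_id
    WHERE v.visit_year = ? AND p.county = ?
    ORDER BY v.visit_id
"""
server_side = con.execute(sql, (2020, "Tulsa")).fetchall()

# Rows that crossed the database boundary under each approach.
rows_transferred_client = len(patients) + len(visits)
rows_transferred_server = len(server_side)
```

Both approaches produce the same result, but the second one moves far fewer rows out of the database. On a toy table the difference is trivial; the advantages listed above matter as the tables grow.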

In some scenarios, it is desirable to use the SQL `INSERT` command to transfer data within the database, so the data never travels across the network and never touches R or your local machine. For our large and complicated projects, the majority of data movement uses `INSERT` commands within SQL files. Among these scenarios, the analysis-focused projects use R to call the sequence of SQL files (see [`flow.R`](#repo-flow)), while the database-focused projects use [SSIS](https://en.wikipedia.org/wiki/SQL_Server_Integration_Services).
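
A minimal sketch of this within-database transfer follows. It assumes Python's built-in `sqlite3` module as a stand-in for the real database, and the `visit_raw` and `visit_2020` tables are hypothetical. The calling program only submits the `INSERT ... SELECT` statement; the rows themselves move from one table to the other inside the database.

```python
import sqlite3

# Hypothetical in-memory database; table names are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE visit_raw  (visit_id INTEGER PRIMARY KEY, patient_id INTEGER, visit_year INTEGER);
    CREATE TABLE visit_2020 (visit_id INTEGER PRIMARY KEY, patient_id INTEGER);
    INSERT INTO visit_raw VALUES (10, 1, 2019), (11, 1, 2020), (12, 2, 2020);
""")

# The client sends one statement; no rows are fetched into Python.
# Using the connection as a context manager commits on success.
with con:
    con.execute("""
        INSERT INTO visit_2020 (visit_id, patient_id)
        SELECT visit_id, patient_id
        FROM visit_raw
        WHERE visit_year = 2020
    """)

# Only a single summary value is pulled back, to confirm the transfer.
row_count = con.execute("SELECT COUNT(*) FROM visit_2020").fetchone()[0]
```

In a real project the statement would more likely live in its own SQL file that a driver script submits in sequence; it is inlined here only to keep the sketch self-contained.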

In both cases, we try to write the SQL files to conform to similar standards and conventions. As stated in [Consistency across Files](#consistency-files) (and in the [previous chapter](#file-prototype-r)), using a consistent file structure can (a) improve the quality of the code because the structure has been proven over time to facilitate good practices and (b) allow your intentions to be more clear to teammates because they are familiar with the order and intentions of the chunks.

Default Databases {#sql-default-database}
------------------------------------

```sql
```
4 changes: 2 additions & 2 deletions ch-scratch-pad.md
@@ -14,7 +14,7 @@ Chapters & Sections to Form
1. automation on a remote server or VDI

There's always a chance that my machine is configured a little differently than yours, which may affect results. Will you glance at those results too? I forgot what this project is about, and I wouldn't be able to spot problems like you can. The S drive file and the tables don't seem to have any obvious problems

1. public reports (and dashboards)
1. when developing a report for an external audience (i.e., people outside your immediate research team), choose one or two pals who are unfamiliar with your aims/methods as an impromptu focus group. Ask them what things need to be redesigned/reframed/reformatted/further-explained. ([genevamarshall](https://github.com/genevamarshall))
1. plots
@@ -46,7 +46,7 @@ Chapters & Sections to Form

1. Cargo cult programming "is a style of computer programming characterized by the ritual inclusion of code or program structures that serve no real purpose." ([Wikipedia](https://en.wikipedia.org/wiki/Cargo_cult_programming))

Your team should decide which elements of [a file prototype](https://ouhscbbmc.github.io/data-science-practices-1/file-prototype.html) and [repo prototype](https://ouhscbbmc.github.io/data-science-practices-1/repo-prototype.html) are best for you.
Your team should decide which elements of [a file prototype](https://ouhscbbmc.github.io/data-science-practices-1/file-prototype-r.html) and [repo prototype](https://ouhscbbmc.github.io/data-science-practices-1/repo-prototype.html) are best for you.

Practices
------------------------------------
