This summer, you will use a database to store and analyze data. Databases have several advantages over using text files such as CSVs:
- We're collecting more and more data -- often too much to fit in memory. Most databases can handle this.
- Databases can provide integrity checks and guarantees. If you have a column of numbers in a spreadsheet, Excel will let you change a random cell to text. In contrast, you can tell your database to only accept input that meets your conditions (e.g. type, uniqueness). This is especially important for ongoing projects, where you have new data coming in.
- Databases allow you to store data in one place. That makes updates easy and reliable.
- Databases are more secure. You can more carefully control who has which types of access to what data better in a database than with a CSV.
- Databases can handle multiple users. Concurrent edits to a CSV can get messy. Some file systems won't even let multiple users access a CSV at the same time.
- Databases are designed to help you do analysis. SQl will probably become your best friend.
This session builds on what you learned last week in the pipeline and command line sessions. We will focus on ETL.
- psql (command line)
- dBeaver
- csvkit
- Database server or host: the computer on which the database is running. We will use Amazon RDS.
- Database: a self-contained set of tables and schema. A server can run many databases. This summer, we will operate databases for almost all projects from the same Amazon server.
- Schema: similar to a folder. A database can contain many schema, each containing many tables.
- Tables: tables are like spreadsheets. They have rows and columns and values in each cell.
- Views: views are virtual tables created by a query but only instantiated when the query is run. They can be used as tables but are generated "on-demand" when they're used. An advantage is that they always contain the most current data but take time to compute.
- Queries: Queries are analysis that you run on a database, often in SQL.
Some unique aspects of the setup at DSSG: You cannot access the database server directly; you have to (tunnel) go through one of the EC2 instances. The data are far safer that way: you have to use your private key (better than a password) to access the EC2 and then a username and password to access the database.
There are two ways to connect to the database:
- Connect from your laptop: You use an SSH tunnel to pass data between your laptop and the database. You have a database program running locally. If you're using dBeaver, you're connecting from your laptop.
- Connect from the EC2: You SSH into the EC2 and run everything from there. Your laptop only sends your commands to the EC2; the EC2 does the work. You don't use an SSH tunnel because everything stays on the EC2.
You can use option 1 (especially dBeaver) to explore the data, but you should use option 2 to load the data. First, downloading the datasets to your laptop may violate our contracts. Second, the internet connections will be better. The connections within Amazon are pretty fast; the connections from our office to Amazon might not be. Option 2 keeps the heavy transfers on Amazon's systems.
There are three steps to get data into a database. Let's assume for now that you have a (collection of) CSV(s) that you want to load into a database. Let's also assume that the database exists (if not, you should create it).
The three steps are:
- Create table: This involves figuring out the structure of the table (how many fields, what should they be called, and what data types they are). Once you figure out the structure, you can create a sql "CREATE TABLE" statement and run that to generate an empty table*
- Copy CSV to the table: Every database has a "bulk" copy command that is much more efficient than using pandas. Please do not use pandas to copy large csvs to a database. Postgres has a COPY command that can now copy your csv to the table you just created.
- Check if it copied successfully: You want to check if your table now has the same number of rows and columns as the CSV (as well as other consistency checks). If it did not copy successfully, you may need to modify the table structure, clean the csv to remove characters, replace nulls, and try steps 1 and 2 again.
In this session, we will put the weather data from last week's command-line session into the DSSG training database.
Start by SSHing into the training server:
ssh -i ~/.ssh/your_private_key your_username@the_training_EC2_address
If you haven't already, download Matt's weather data to your training folder directory on the training server and unzip, e.g.:
cd /mnt/data/training/jwalsh/
curl -O ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2016.csv.gz
gunzip 2016.csv.gz
This gives you a file called 2016.csv
. You can explore the data using head
, tail
, csvlook
, and other tools that Matt taught you.
Here's the output from csvlook
:
The first weird thing I see when I look at the data: no headers. csvlook -H
makes it easier to read:
The README tells us that
- Column 1 is the station identifier
- Column 2 is the date (yyyymmdd)
- Column 3 is the value type (e.g. "PRCP" is precipitation)
- Column 4 is the value
- Column 5 is the "character Measurement Flag," whatever that means
- Column 6 is the "1 character Quality Flag," whatever that means
- Column 7 is the "1 character Source Flag," whatever that means
- Column 8 is the time (hhmm)
Most of the files you work with have a header, so I'll add one here:
Station,Date,Value Type,Value,Measurement Flag,Quality Flag,Source Flag,Time
csvsql
generates create table
statements for you. Because it uses Python, it will load all the data and then do its thing. To limit the resources it needs, I'll only use the first 1000 rows. We're using a PostgreSQL ("Postgres") database:
head -n 1000 2016.csv | csvsql -i postgresql
Here's the output:
CREATE TABLE stdin (
"Station" VARCHAR(11) NOT NULL,
"Date" INTEGER NOT NULL,
"Value Type" VARCHAR(4) NOT NULL,
"Value" INTEGER NOT NULL,
"Measurement Flag" VARCHAR(4),
"Quality Flag" VARCHAR(4),
"Source Flag" VARCHAR(1) NOT NULL,
"Time" VARCHAR(4)
);
A few things to note:
- Station, Date, etc. are column names.
- VARCHAR and INTEGER are column types. VARCHAR(11) means variable character length column up to 11 characters. If you try to give a character column a number, an integer column a decimal, and so on, Postgres will prevent the entire transfer.
- NOT NULL means you have to provide a value for that column.
- Postgres hates uppercase and spaces in column names. If you have either, you need to wrap the column name in quotation marks. Yuck.
- We need to replace
stdin
with the table name (jwalsh_schema.jwalsh_table
). - A common problem: funky (non-unicode) characters often appear in the source files. While that's not true here, you can fix many of them using
iconv
.
Let's give it another shot:
head -n 1000 2016.csv | iconv -t ascii | tr [:upper:] [:lower:] | tr ' ' '_' | csvsql -i postgresql
iconv -t ascii
attempts to output ascii. It can also help to use the-f
option, which givesiconv
the format of the input.tr [:upper:] [:lower:]
converts all uppercase to all lowercase.tr ' ' '_'
converts all spaces to underscores.csvsql -i postgresql
generates the postgrescreate table
statement.
Here's the output:
CREATE TABLE stdin (
station VARCHAR(11) NOT NULL,
date INTEGER NOT NULL,
value_type VARCHAR(4) NOT NULL,
value INTEGER NOT NULL,
measurement_flag VARCHAR(4),
quality_flag VARCHAR(4),
source_flag VARCHAR(1) NOT NULL,
time VARCHAR(4)
);
csvsql
ain't perfect. We still need to make some changes:
- Replace
stdin
with the table name:jwalsh_schema.jwalsh_table
. date
is listed as an integer. It should bedate
.
Remember, the schema is like a folder. You can use schema to categorize your tables. In dBeaver:
CREATE SCHEMA jwalsh_schema;
CREATE TABLE jwalsh_schema.jwalsh_table (
station VARCHAR(11) NOT NULL,
date DATE NOT NULL,
value_type VARCHAR(4) NOT NULL,
value INTEGER NOT NULL,
measurement_flag VARCHAR(4),
quality_flag VARCHAR(4),
source_flag VARCHAR(1) NOT NULL,
time VARCHAR(4)
);
We ready to copy the data! We strongly recommend using psql
. You can do it through Python scripts and other methods, but psql
is optimized for this task. It will likely save you a lot of time.
We should follow Jen's guidelines by storing the database credentials in a file. Postgres looks for four environment variables: PGHOST, PGUSER, PGPASSWORD, and PGDATABASE. To set the environment variables using default_profile.example
:
eval $(cat default_profile.example)
cat 2016.csv | psql -c "\copy jwalsh_schema.jwalsh_table from stdin with csv header;"
Note: You want to pipe the data from cat
to psql
. You'll get a permissions error if you don't.
Use dBeaver!
select * from jwalsh_schema.jwalsh_table limit 25;
select count(*) from jwalsh_schema.jwalsh_table;
select * from jwalsh_schema.jwalsh_table where station = 'USW00094846';
select * from jwalsh_schema.jwalsh_table where station = 'USW00094846' and value_type = 'PRCP';
- Software Carpentry: Databases and SQL
- Computation for Public Policy, Harris School of Public Policy