The homework is due Wednesday November 15 at 1:30am. You can accept it, here.
The American Time Use Survey is a stupendous compendium of how Americans actually spend their time. You can find documentation for the 2003-2015 files here. Every activity a respondent performs on a "diary day," is recorded in very narrow categories (activity "lexicon" here). In addition, basic information is recorded for a roster of everyone in the houshold, and the data are linked to the Current Population Survey (CPS), another Census/BLS product.
Taking all years into account, it gets pretty big. Just the CSV files are around 1.5GB.
But the structure of the files is basically a database, and so I have loaded them together into (somewhat smaller, 200 MB) sqlite database.
Since the codes can be somewhat hard to interpret,
I have edited the variable names (but not values) to make them more readable.
Have a look at .schema
to see all the variables.
There are four tables, which may be merged on the case_id
and, where appropriate, line_no = 1
(see #1).
-
roster
contains a list of attributes of the people in a file.- You will use this file to get the
edited_sex
andedited_age
of respondents and family members. - It is very important to note that the respondent her/himself is ALWAYS
line_no = 1
. This means that if you want to ask about theactivities
people of a given age are doing, you must requireline_no = 1
for your join.
- You will use this file to get the
-
respondents
contains descriptive data on time spent in various activities, as well as household characteristics and some labor force participation stats. -
activities
lists the minutes spent in each activity, as enumerated in the lexicon.Relevant Activity Codes.
Code Activity 120307 Playing Games (incl. computers) 02XXXX Housework 140101 Relgious Services 0301XX Direct Engagement of HH Children: Playing or Ed. 180381 Travel for HH Children -
cps
(Current Population Survey) supplements the activity inrespondents
androster
, with data from this additional survey. In particular, you might useeducational_attainment
andfamily_income
.Educational Attainment Details
Alternatively, consult
PEEDUCA
in the codebook.Code Educational Level 31 Less than 1st grade 32 1st, 2nd, 3rd, or 4th grade 33 5th or 6th grade 34 7th or 8th grade 35 9th grade 36 10th grade 37 11th grade 38 12th grade - no diploma 39 High school graduate - diploma or equivalent (GED) 40 Some college but no degree 41 Associate degree - occupational/vocational 42 Associate degree - academic program 43 Bachelor's degree (BA, AB, BS, etc.) 44 Master's degree (MA, MS, MEng, MEd, MSW, etc.) 45 Professional school degree (MD, DDS, DVM, etc.) 46 Doctoral degree (PhD, EdD, etc.)
You can download the database from my UC Box account.
Each of your scripts must be written in a file named atus1.sql
, atus2.sql
, etc.
You will read answers off of these tables and place them in SOLUTIONS
, before completing a quiz on Canvas.
Please remember throughout, that when using the tables roster
or cps
, you must require that the line_no
value be 1.
This has nothing to do with SQL generically, and just represents the fact that there are many people in a household, and the respondent and activities correspond to person "1."
Always exclude negative values (non-responses!!).
- Gaming Average (video) game playing (code 120307) for men and women (1 and 2), in three ten-year age groups from 20-29, 30-39, 40-49. Use
edited_age
andedited_sex
so you can keep it to just two tables.- Hint: you can do multiple
group by
variables. - Hint: to group by age, use the fact that the default integer division in SQL is floor division.
- Hint: you can do multiple
- Employment. Tabulate whether or not the respondent worked last week as a fraction (1 = yes, 2 = no; exclude retired/disabled/unable 3-5 and no answer; see TUFWK here).
Group on the presence of a partner (
spouse_or_partner_present
: spouse 1 or partner 2, v. none 3; see TRSPPRES in the codebook). - Education. State with the lowest fraction of high school graduates. Refer to the table in the "twisty" above -- HS is 39. If you haven't memorized the FIPS codes, you can find them here.
- Marriage and Education. Average marital status (
spouse_or_partner_present
; again 1 or 2, v. 3), byeducational_attainment
.- Use respondent
spouse_or_partner_present
and cpseducational_attainment
. Require that the educational attainment be non-negative.
- Use respondent
- Human Capital Investments. Average daily time spent directly engaging children in hours (activity codes 0301XX or 180381), by respondent sex.
You should require that there be children in the household, and that the respondent is in the labor force (
edited_labor_force_status
= 1 or 2). Use rosteredited_sex
variable, 1 for men, 2 for women. (Read, understand, and adapt this example; N.B. that the python example at p. 34 of the lecture relied on pandas for the second group by.) - [BONUS] Marriage and Religion. Respondent married average (
spouse_or_partner_present
=TRSPPRES
= 1) grouped by attended religions services (activity code 1401XX; data dictionary). Consider only respondents reporting Sunday (dow_of_diary_day
, 1), Friday (6) or Saturday (7), and households with kids. (Note that the most-naïve interpretation already dramatically understates the difference, due to contamination over which is the religious day. Many of the people in the no religious services category, actually did go...) - [BONUS] Housework. Average daily housework in hours (activity code 02XXXX, i.e., all those starting by 02, see the lexicon -- use floor division, which is the default). Group by
edited_sex
andeducational_attainment
.
The Chicago salaries file, it turns out, is not fit to be loaded into a database from the get-go. First get it by
curl https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.csv -o salaries.csv
This is a great moment for some practice with sed
(you could even try -i
for 'in place'.)
- The empty lines and headers themselves are a nuisance. Remove lines without dollar signs.
- The
sqlite3
importer doesn't recognize quotation marks, so we need to get rid of these. - The dollar signs would stop us from manipulating the salary values as numbers.
- There are weird spaces before first names. Get rid of them.
Once you've done this, import the table using the shortcut from class:
CREATE TABLE chicago (Last TEXT, First TEXT, Position TEXT, Department TEXT,
FP TEXT, SH TEXT, Hours INTEGER, Salary REAL, Wage REAL);
.mode csv
.import salaries.csv chicago
Query this table to find:
- The average salary of firefighters in the city.
- Total expenditures on salaries in the Mayor's office.
- Number of people working in the three largest departments.
Save your queries as s.sql
, and mark the responses in SOLUTIONS.
You can find a nicely-formatted copy of Romeo and Juliet here. You can download it to your directory with
curl http://shakespeare.mit.edu/romeo_juliet/full.html -o rj.html
If you cat
the file, you'll see that new speakers are marked by "A NAME=speech"
and that individual lines also start by "A NAME=" but without "speech".
Begin by making a dictionary of all of the characters in the play. Then count their spoken lines. Who has the most lines in the play? How many lines do they have?
Commit your script as rj.py
.