Skip to content

Latest commit

 

History

History
350 lines (291 loc) · 10.1 KB

OBSERVATIONS.md

File metadata and controls

350 lines (291 loc) · 10.1 KB

Observations

Download the data from data.gov and import it into a new relational database table called indicators.

SELECT count(*) FROM indicators; -- 53,469

There is no existing primary key 😞.

SELECT i.year, count(*) AS row_count
FROM indicators i
GROUP BY i.year
ORDER BY 1

=>

year row_count
2001 104
2006-2010 1323
2007 55
2007-2011 1704
2008 110
2009 279
2009-2011 275
2010 21293
2011 538
2011-2012 312
2012 3282
2013 24194

Years are not integers. Some values represent ranges. This is a categorical field, not numeric.

SELECT
  i.category
  ,count(DISTINCT i.indicatorid) AS indicator_count
  ,count(*) AS row_count
FROM indicators i
GROUP BY 1

=>

category indicator_count row_count
Alcohol 16 2744
Arthritis 8 2640
Asthma 9 4162
Cancer 20 3411
Cardiovascular Disease 18 9302
Chronic Kidney Disease 4 817
Chronic Obstructive Pulmonary Disease 16 9192
Diabetes 20 7728
Disability 1 55
Immunization 1 330
Mental Health 3 440
Nutrition, physical activity, and weight status 37 4504
Older Adults 4 1186
Oral Health 9 1797
Overarching Conditions 16 2741
Reproductive Health 3 165
Tobacco 16 2255
SELECT
  i.category
  ,count(DISTINCT i.indicatorid) AS indicator_count
  ,group_concat(DISTINCT i.indicatorid  ORDER BY i.indicatorid SEPARATOR ", ") AS indicators
FROM indicators i
GROUP BY 1

=>

There is no data dictionary 😞. Trying to make sense of the data...

What's the deal with datavalue vs datavaluealt and do I need to care about both?

SELECT * FROM indicators i WHERE i.datavalue <> i.datavaluealt

It looks like alt values either have different decimal formatting, or different non-null values altogether.

What's up with DataValueFootnoteSymbol and other columns?

SELECT group_concat(DISTINCT DataValueFootnoteSymbol) FROM indicators
-- ,¯,^^^,*,****,**,§§,^^,***,~,~~,~~~,~~~~,†††,^^^^,#,*, #,§§§§,‡‡
SELECT group_concat(DISTINCT DataValueUnit) FROM indicators
-- %,,cases per 100,000,gallons,$,cases per 10,000,cases per million,per 100,000,Number,Average Annual Number,cases per 1,000,000,cases per 1,000,per 100...
SELECT
  DataValueFootnote
  ,count(*) AS row_count
FROM indicators
GROUP BY DataValueFootnote
ORDER BY 2 desc

=>

DataValueFootnote row_count
|	43743

No data available | 6446 Data not shown because of too few respondents or cases | 1000 CI was calculated using Poisson variable | 592 Sample size of denominator and/or age group for age-standardization is less than 50 or relative standard error is more than 30% | 561 50 States + DC: US Median | 521 NSCH Data provided by Data Resource Center for Child and Adolescent Health. Http://www.childhealthdata.org | 306 US Median based on the number of participating states | 94 A confidence interval for this location is not available because a census was used instead of a sample for their surveys | 55 Data are not available because cancer registry did not meet the data quality criteria for all the years from 2007-2011 for publication in United States Cancer statistics (USCS | 48 State directly controlled the sale of (beer/wine/distilled spirits) at the retail and/or wholesale levels. State prices for (beer/wine/distilled spirits) combined both markups | 30 Data are not available because of state legislation and regulations which prohibit the release of county level data to outside entities. | 24 Data are from cancer registries that meet the data quality criteria for all invasive cancer sites combined, for all years from 2007-2011, covering approximately 99% of the U.S | 24 Population served by CWS exceeded the US Census state population estimate; number of people was reduced by the ratio of the population estimate to the CWS population estimate | 9 50 states + DC: Median Percent; NSCH Data provided by Data Resource Center for Child and Adolescent Health. http://www.childhealthdata.org | 6 Percentage of states with applicable childcare regulations | 5 Number of states + DC with applicable tobacco control policies | 3 DC was not rated for this measure because the measure addresses state versus local authority and does not apply to the District of Columbia | 1 US per capita alcohol consumption was calculated using (1) alcohol beverage sales data, either collected directly from states or provided by beverage industry sources (numerat | 1

SELECT group_concat(DISTINCT DataValueType) FROM indicators

=>

DataValueType row_count
Crude Prevalence 14242
Age-adjusted Prevalence 11312
Number 7564
Crude rate 7098
Age-adjusted rate 6929
Average Annual Age-adjusted Rate 1009
Average Annual Crude Rate 1009
Average Annual Number 1009
Age-adjusted Mean 825
Percent 660
Crude Mean 495
Yes/No 440
Mean 330
US Dollars 165
Median 110
Local control of the regulation of alcohol outlet density 55
Mean Score 55
Per capita alcohol consumption 55
Commercial host (dram shop) liability status for alcohol service 55
Prevalence 52

Ok so there are a bunch of metric formatting and footnote concerns that will likely need to translate into the final visualization.

What's the deal with the row structure of this dataset?

SELECT
  indicatorid
  ,Gender
  ,stratificationid1
  ,count(DISTINCT locationid) AS location_count
  ,count(*) AS row_count
FROM indicators
GROUP BY indicatorid, Gender, stratificationid1
ORDER BY indicatorid, Gender, stratificationid1
-- 406 rows

Gender describes stratificationid1 and vice-versa.

Each combination of indicator and gender has between 55 and 100 ish rows.

SELECT
  i.locationid
  ,i.locationabbr
  ,i.category
  ,i.indicatorid
  ,i.indicator
  ,i.datasource
  ,i.datavalueunit
  ,i.datavaluetype
  ,i.datavalue
  ,i.datavaluealt
  ,i.gender
  ,i.stratificationid1
  ,i.lowconfidenceinterval
  ,i.highconfidenceinterval
FROM indicators i
WHERE i.year = "2013" AND i.locationabbr = "CT"

locationabbr and locationdesc describe locationid and vice-versa.

indicator describes indicatorid and vice-versa.

datavaluetype and datavalueunit describe indicator.

SELECT
  i.year
  ,i.locationid
  ,i.indicatorid
  ,i.stratificationid1
 ,count(DISTINCT i.datasource) AS datasource_count -- none greater than 1
 ,count(DISTINCT i.datavaluetype) AS data_value_type_count -- several greater than 1
 ,count(DISTINCT i.datavalueunit) AS data_value_unit_count -- several greater than 1
FROM indicators i
GROUP BY 1,2,3,4

Year, locationid, indicatorid, and stratificationid1 are insufficient to form a composite primary key.

SELECT
  i.year
  ,i.locationid
  ,i.indicatorid
  ,i.stratificationid1
  ,i.datavalueunit
  ,i.datavaluetype
 ,count(DISTINCT i.datasource) AS datasource_count -- none greater than 1
 ,count(DISTINCT i.datavalue) AS data_value_count -- none greater than 1
FROM indicators i
GROUP BY 1,2,3,4,5,6
-- HAVING data_value_count > 1

Year, locationid, indicatorid, stratificationid1, datavalueunit, and datavaluetype may be sufficient to form a composite primary key.

SELECT
  i.year
  ,i.locationid
  ,i.indicatorid
  -- ,i.datasource
  ,i.datavalueunit
  ,i.datavaluetype
  ,i.stratificationid1
  ,i.datavalue
  ,i.lowconfidenceinterval
  ,i.highconfidenceinterval
FROM indicators i
WHERE i.year = "2013" AND i.locationabbr = "CT" AND i.category = "Mental Health"
ORDER BY 1,2,3,6

=>

year locationid indicatorid datavalueunit datavaluetype stratificationid1 datavalue lowconfidenceinterval highconfidenceinterval
2013 09 MTH1_0 Number Age-adjusted Mean GENF 4.1 3.7 4.5
2013 09 MTH1_0 Number Crude Mean GENF 4 3.6 4.3
2013 09 MTH1_0 Number Age-adjusted Mean GENM 2.9 2.5 3.2
2013 09 MTH1_0 Number Crude Mean GENM 2.9 2.6 3.3
2013 09 MTH1_0 Number Age-adjusted Mean GENT 3.5 3.3 3.8
2013 09 MTH1_0 Number Crude Mean GENT 3.5 3.2 3.7
2013 09 MTH2_0 % Crude Prevalence GENF 13.5 11.2 16.3

There can be many datavaluetype per indicatorid but datavalueunit looks like it depends on/describes indicatorid. Right?

SELECT
  i.year
  ,i.locationid
  ,i.indicatorid
  ,i.stratificationid1
  ,i.datavaluetype
  ,count(DISTINCT i.datavalueunit) AS data_value_unit_count
 ,count(DISTINCT i.datasource) AS datasource_count
 ,count(DISTINCT i.datavalue) AS data_value_count
FROM indicators i
GROUP BY 1,2,3,4,5
HAVING data_value_count > 1 OR datasource_count > 1 OR data_value_count > 1 -- zero rows

Indeed, this dataset row structure is a row per:

  • year
  • location
  • indicator
  • gender/strat
  • data value type

If a user chooses one value for each, comparisons can be made between values.

For indicators of the "Mental Health" category, what kind of consistency exists across metrics across years?

SELECT
 i.year
 ,i.indicatorid
 ,count(DISTINCT i.locationid) AS location_count
FROM indicators i
WHERE i.category = "Mental Health"
GROUP BY 1,2

=>

year indicatorid location_count
2009-2011 MTH3_0 55
2013 MTH1_0 55
2013 MTH2_0 55

There is no consistency of the same metric being measured across multiple years.

This means user selections of indicator and year are not independent, and should be limited/restricted to avoid confusion.

The first iteration of this visualization should be based off the following data filters:

SELECT
  i.year
  ,i.locationdesc
  ,i.category
  ,i.indicator
 ,i.datavalueunit
 ,i.datavalue
 ,i.datavaluefootnote
 ,i.gender
 ,i.stratificationid1
 ,i.indicatorid
 ,i.lowconfidenceinterval
 ,i.highconfidenceinterval
FROM indicators i
WHERE i.category = "Mental Health"
  AND i.indicatorid = "MTH1_0"
  AND i.datavaluetype = "Age-adjusted Mean"
  AND i.year = "2013"
ORDER BY i.indicatorid, i.gender

Methodology note: @NedMccague recommends using the "Age-adjusted Mean" when comparing values across states.