Observations

Download the data from data.gov and import it into a new relational database table called indicators.

SELECT count(*) FROM indicators; -- 53,469

There is no existing primary key 😞.

SELECT i.year, count(*) AS row_count
FROM indicators i
GROUP BY i.year
ORDER BY 1

=>

year	row_count
2001	104
2006-2010	1323
2007	55
2007-2011	1704
2008	110
2009	279
2009-2011	275
2010	21293
2011	538
2011-2012	312
2012	3282
2013	24194

Years are not integers. Some values represent ranges. This is a categorical field, not numeric.

SELECT
  i.category
  ,count(DISTINCT i.indicatorid) AS indicator_count
  ,count(*) AS row_count
FROM indicators i
GROUP BY 1

=>

category	indicator_count	row_count
Alcohol	16	2744
Arthritis	8	2640
Asthma	9	4162
Cancer	20	3411
Cardiovascular Disease	18	9302
Chronic Kidney Disease	4	817
Chronic Obstructive Pulmonary Disease	16	9192
Diabetes	20	7728
Disability	1	55
Immunization	1	330
Mental Health	3	440
Nutrition, physical activity, and weight status	37	4504
Older Adults	4	1186
Oral Health	9	1797
Overarching Conditions	16	2741
Reproductive Health	3	165
Tobacco	16	2255

SELECT
  i.category
  ,count(DISTINCT i.indicatorid) AS indicator_count
  ,group_concat(DISTINCT i.indicatorid  ORDER BY i.indicatorid SEPARATOR ", ") AS indicators
FROM indicators i
GROUP BY 1

=>

There is no data dictionary 😞. Trying to make sense of the data...

What's the deal with datavalue vs datavaluealt and do I need to care about both?

SELECT * FROM indicators i WHERE i.datavalue <> i.datavaluealt

It looks like alt values either have different decimal formatting, or different non-null values altogether.

What's up with DataValueFootnoteSymbol and other columns?

SELECT group_concat(DISTINCT DataValueFootnoteSymbol) FROM indicators
-- ,¯,^^^,*,****,**,§§,^^,***,~,~~,~~~,~~~~,†††,^^^^,#,*, #,§§§§,‡‡

SELECT group_concat(DISTINCT DataValueUnit) FROM indicators
-- %,,cases per 100,000,gallons,$,cases per 10,000,cases per million,per 100,000,Number,Average Annual Number,cases per 1,000,000,cases per 1,000,per 100...

SELECT
  DataValueFootnote
  ,count(*) AS row_count
FROM indicators
GROUP BY DataValueFootnote
ORDER BY 2 desc

=>

DataValueFootnote	row_count

|	43743

No data available | 6446 Data not shown because of too few respondents or cases | 1000 CI was calculated using Poisson variable | 592 Sample size of denominator and/or age group for age-standardization is less than 50 or relative standard error is more than 30% | 561 50 States + DC: US Median | 521 NSCH Data provided by Data Resource Center for Child and Adolescent Health. Http://www.childhealthdata.org | 306 US Median based on the number of participating states | 94 A confidence interval for this location is not available because a census was used instead of a sample for their surveys | 55 Data are not available because cancer registry did not meet the data quality criteria for all the years from 2007-2011 for publication in United States Cancer statistics (USCS | 48 State directly controlled the sale of (beer/wine/distilled spirits) at the retail and/or wholesale levels. State prices for (beer/wine/distilled spirits) combined both markups | 30 Data are not available because of state legislation and regulations which prohibit the release of county level data to outside entities. | 24 Data are from cancer registries that meet the data quality criteria for all invasive cancer sites combined, for all years from 2007-2011, covering approximately 99% of the U.S | 24 Population served by CWS exceeded the US Census state population estimate; number of people was reduced by the ratio of the population estimate to the CWS population estimate | 9 50 states + DC: Median Percent; NSCH Data provided by Data Resource Center for Child and Adolescent Health. http://www.childhealthdata.org | 6 Percentage of states with applicable childcare regulations | 5 Number of states + DC with applicable tobacco control policies | 3 DC was not rated for this measure because the measure addresses state versus local authority and does not apply to the District of Columbia | 1 US per capita alcohol consumption was calculated using (1) alcohol beverage sales data, either collected directly from states or provided by beverage industry sources (numerat | 1

SELECT group_concat(DISTINCT DataValueType) FROM indicators

=>

DataValueType	row_count
Crude Prevalence	14242
Age-adjusted Prevalence	11312
Number	7564
Crude rate	7098
Age-adjusted rate	6929
Average Annual Age-adjusted Rate	1009
Average Annual Crude Rate	1009
Average Annual Number	1009
Age-adjusted Mean	825
Percent	660
Crude Mean	495
Yes/No	440
Mean	330
US Dollars	165
Median	110
Local control of the regulation of alcohol outlet density	55
Mean Score	55
Per capita alcohol consumption	55
Commercial host (dram shop) liability status for alcohol service	55
Prevalence	52

Ok so there are a bunch of metric formatting and footnote concerns that will likely need to translate into the final visualization.

What's the deal with the row structure of this dataset?

SELECT
  indicatorid
  ,Gender
  ,stratificationid1
  ,count(DISTINCT locationid) AS location_count
  ,count(*) AS row_count
FROM indicators
GROUP BY indicatorid, Gender, stratificationid1
ORDER BY indicatorid, Gender, stratificationid1
-- 406 rows

Gender describes stratificationid1 and vice-versa.

Each combination of indicator and gender has between 55 and 100 ish rows.

SELECT
  i.locationid
  ,i.locationabbr
  ,i.category
  ,i.indicatorid
  ,i.indicator
  ,i.datasource
  ,i.datavalueunit
  ,i.datavaluetype
  ,i.datavalue
  ,i.datavaluealt
  ,i.gender
  ,i.stratificationid1
  ,i.lowconfidenceinterval
  ,i.highconfidenceinterval
FROM indicators i
WHERE i.year = "2013" AND i.locationabbr = "CT"

locationabbr and locationdesc describe locationid and vice-versa.

indicator describes indicatorid and vice-versa.

datavaluetype and datavalueunit describe indicator.

SELECT
  i.year
  ,i.locationid
  ,i.indicatorid
  ,i.stratificationid1
 ,count(DISTINCT i.datasource) AS datasource_count -- none greater than 1
 ,count(DISTINCT i.datavaluetype) AS data_value_type_count -- several greater than 1
 ,count(DISTINCT i.datavalueunit) AS data_value_unit_count -- several greater than 1
FROM indicators i
GROUP BY 1,2,3,4

Year, locationid, indicatorid, and stratificationid1 are insufficient to form a composite primary key.

SELECT
  i.year
  ,i.locationid
  ,i.indicatorid
  ,i.stratificationid1
  ,i.datavalueunit
  ,i.datavaluetype
 ,count(DISTINCT i.datasource) AS datasource_count -- none greater than 1
 ,count(DISTINCT i.datavalue) AS data_value_count -- none greater than 1
FROM indicators i
GROUP BY 1,2,3,4,5,6
-- HAVING data_value_count > 1

Year, locationid, indicatorid, stratificationid1, datavalueunit, and datavaluetype may be sufficient to form a composite primary key.

SELECT
  i.year
  ,i.locationid
  ,i.indicatorid
  -- ,i.datasource
  ,i.datavalueunit
  ,i.datavaluetype
  ,i.stratificationid1
  ,i.datavalue
  ,i.lowconfidenceinterval
  ,i.highconfidenceinterval
FROM indicators i
WHERE i.year = "2013" AND i.locationabbr = "CT" AND i.category = "Mental Health"
ORDER BY 1,2,3,6

=>

year	locationid	indicatorid	datavalueunit	datavaluetype	stratificationid1	datavalue	lowconfidenceinterval	highconfidenceinterval
2013	09	MTH1_0	Number	Age-adjusted Mean	GENF	4.1	3.7	4.5
2013	09	MTH1_0	Number	Crude Mean	GENF	4	3.6	4.3
2013	09	MTH1_0	Number	Age-adjusted Mean	GENM	2.9	2.5	3.2
2013	09	MTH1_0	Number	Crude Mean	GENM	2.9	2.6	3.3
2013	09	MTH1_0	Number	Age-adjusted Mean	GENT	3.5	3.3	3.8
2013	09	MTH1_0	Number	Crude Mean	GENT	3.5	3.2	3.7
2013	09	MTH2_0	%	Crude Prevalence	GENF	13.5	11.2	16.3

There can be many datavaluetype per indicatorid but datavalueunit looks like it depends on/describes indicatorid. Right?

SELECT
  i.year
  ,i.locationid
  ,i.indicatorid
  ,i.stratificationid1
  ,i.datavaluetype
  ,count(DISTINCT i.datavalueunit) AS data_value_unit_count
 ,count(DISTINCT i.datasource) AS datasource_count
 ,count(DISTINCT i.datavalue) AS data_value_count
FROM indicators i
GROUP BY 1,2,3,4,5
HAVING data_value_count > 1 OR datasource_count > 1 OR data_value_count > 1 -- zero rows

Indeed, this dataset row structure is a row per:

year
location
indicator
gender/strat
data value type

If a user chooses one value for each, comparisons can be made between values.

For indicators of the "Mental Health" category, what kind of consistency exists across metrics across years?

SELECT
 i.year
 ,i.indicatorid
 ,count(DISTINCT i.locationid) AS location_count
FROM indicators i
WHERE i.category = "Mental Health"
GROUP BY 1,2

=>

year	indicatorid	location_count
2009-2011	MTH3_0	55
2013	MTH1_0	55
2013	MTH2_0	55

There is no consistency of the same metric being measured across multiple years.

This means user selections of indicator and year are not independent, and should be limited/restricted to avoid confusion.

The first iteration of this visualization should be based off the following data filters:

SELECT
  i.year
  ,i.locationdesc
  ,i.category
  ,i.indicator
 ,i.datavalueunit
 ,i.datavalue
 ,i.datavaluefootnote
 ,i.gender
 ,i.stratificationid1
 ,i.indicatorid
 ,i.lowconfidenceinterval
 ,i.highconfidenceinterval
FROM indicators i
WHERE i.category = "Mental Health"
  AND i.indicatorid = "MTH1_0"
  AND i.datavaluetype = "Age-adjusted Mean"
  AND i.year = "2013"
ORDER BY i.indicatorid, i.gender

Methodology note: @NedMccague recommends using the "Age-adjusted Mean" when comparing values across states.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OBSERVATIONS.md

OBSERVATIONS.md

Observations

Files

OBSERVATIONS.md

Latest commit

History

OBSERVATIONS.md

File metadata and controls

Observations