generated from INFO-511-F24/project-final
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathproposal.qmd
82 lines (57 loc) · 5.68 KB
/
proposal.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
---
title: "Data, Data, Data"
subtitle: "Oh My"
author:
- name: "The Outliers"
affiliations:
- name: "School of Information, University of Arizona"
description: "Final Project: Milestone 2"
format:
html:
code-tools: true
code-overflow: wrap
code-line-numbers: true
embed-resources: true
editor: visual
code-annotations: hover
execute:
warning: false
jupyter: python3
---
```{python}
#| label: load-pkgs
#| message: false
import numpy as np
import seaborn as sns
import pandas as pd
```
## Dataset 1: Nutrition, physical activity, and obesity
### Introduction and data
This dataset was provided by the Centers for Disease Control and Provention (CDC), National Center for Chronic Disease prevention and Health Promotion, Division of Nutrition, Physical Activity, and Obesity. This data was collected through health-related telephone surveys that gather state data about U.S. Residents. This dataset has been used for the Data, Trends, and Maps database that the Division of Nutrition, Physical Activity, and Obesity (DNPAO) section of the CDC has, which is responsible for providing both state and national data for these topics.
#### Description of contents
This dataset includes over 104 thousand rows, and has 33 columns. Each row represents a combination of a year, state, survey question, and percent of individuals who are positively identified for that question, along with stratification. The categories for stratification are Age Range, Education, Gender, Income, Race/Ethnicity, and Total.
Survey questions fall into the categories of "Fruits and Vegetables - Behavior", "Obesity/Weight Status", and "Physical Activity - Behavior". Examples of survey questions include "Percent of adults who engage in muscle-strengthening activities on 2 or more days a week" and "Percent of adults aged 18 years and older who have obesity".
This dataset includes observations for the years 2011-2023. Percentages and data are not included for groups with insufficient sample sizes.
#### Ethical Concerns
There are no particular ethical concerns regarding working with this data. This dataset is aggregated, and numbers are excluded in instances where the sample size is too small. This removes concerns surrounding the personal identification of individuals within this dataset. This dataset is publicly available for anyone to download, and the licensing agreement states that it is free to be shared, created, and adapted, as long as it is attributed as the data source when publicly displayed or published. This removes concerns surrounding unfair or illegal acquisition and use of the data.
### Research Question
- Research Question: Do higher-income populations consistently have more time for physical activity than lower income populations?
- Additional research questions include: How has the relationship between amount of physical activity and income changed over time? How does this vary between groups? And how does the amount of physical activity that lower-income populations have the time to do changed over the years?
- The target population for this research question is U.S. Residents 18 and over, represented by the dataset
- This question is important because it may highlight areas that correlate with differences in physical health across the population. If there are groupings that are identified that are tied to physical health and activity more than others, then more research can be done to identify ways in which these groups can receive more assistance with nutrition and adopting healthier lifestyles.
- The research topic of interest here is whether or not there is a relationship between the amount of income that an individual makes and the amount of physical activity they are able to make time for. This falls in a larger category of interest surrounding differences between physical activity and nutrition for different groups and subsets of the population.
- We hypothesize that individuals who have lower income levels will have less time for physical activity, showing that larger percentages of the low income population will fall into the “Percent of adults who engage in no leisure-time physical activity” group compared to individuals with higher incomes.
- The variables in this research question are mostly categorical, the questions themselves and the groupings (income range) are both categorical. The percentage of respondents who fall into each category is a quantitative variable, and there is a time element (years) which can be used as well.
### Glimpse of data
```{python}
nutrition = pd.read_csv("data/nutrition.csv")
```
```{python}
nutrition.head()
```
```{python}
nutrition.info()
```
### Analysis plan
Initial detailed exploration of the data (focusing on the main variables involved) will be followed by data cleaning and some wrangling. Identification of missing observations, data type conversions, etc. will be addressed in these initial steps. The variables involved to answer the largest research question include the "Question" and "Income" and "StratificationCategoryId1" columns, as well as the "Data_Value" column and the "YearStart" column. New columns that group some of these variables may also be created, such as grouping the "Income" values into "Low Income" and "High Income". At this point there is no plan to bring in and merge any external data.
Once the data has been wrangled, it can be visualized in an assortment of plots for analysis, and summary statistics by group can be compiled as well. These visualizations and statistics will allow for insight as to how the relationship between income and physical activity compares between groups, as well as over time. From here, additional metrics can be run, plots created, etc. to explore and dive in further.