# Data sources
Shiyu Wang chose the data sets used in this analysis.
To make sure the data are reliable, all data sets come from the U.S. Bureau of Labor Statistics [(link to dataset)](https://www.bls.gov/cps/tables.htm#empstat), and every table was downloaded as an Excel file for further analysis.
First, we decided which topic to work on and drafted several questions from our own point of view.
Second, because correctness and accuracy matter to us, our group chose the U.S. Bureau of Labor Statistics as the source. Its website offers a variety of employment data sets from different perspectives, such as employment status, characteristics of the employed, and characteristics of the unemployed.
Then, after looking through the data sets in light of the questions we wanted to explore, our current knowledge, and our own interests, our group settled on the questions to explore and chose the data sets we would potentially use.
More specifically, the sections below give the details of each data set we chose:
## General View: "Employment status of the civilian non-institutional population, 1950s to date":
[link_to_this_file](data/overall_workforce_change.xlsx)
We chose this data set because it gives an overview of employment status, which benefits our whole analysis.
Variables: It has 11 variables: year; civilian non-institutional population; civilian labor force (total number, employed people split into agriculture and non-agriculture, number unemployed, and these factors as percentages); and the number not in the labor force. This gives us plenty of information to choose from in further analysis. It has around 70 records.
Issue: There are no issues in this data set.
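
The percentage variables in this table can be recomputed from the count variables, which gives a quick sanity check after loading the file. The sketch below assumes the standard BLS definitions: the unemployment rate is the unemployed as a percent of the civilian labor force, and the participation rate and employment-population ratio are percents of the civilian non-institutional population. The function name, the dictionary keys, and the sample numbers are made up for illustration and are not taken from the file.

```python
def derived_rates(population, labor_force, employed, unemployed):
    """Recompute the percentage columns from the count columns.

    All arguments are counts in the same unit (for example, thousands).
    """
    return {
        "participation_rate": 100 * labor_force / population,
        "employment_population_ratio": 100 * employed / population,
        "unemployment_rate": 100 * unemployed / labor_force,
    }


# Hypothetical counts, in thousands (not real BLS figures).
rates = derived_rates(population=200_000, labor_force=120_000,
                      employed=114_000, unemployed=6_000)
print({name: round(value, 1) for name, value in rates.items()})
```

Comparing these recomputed values with the published percentage columns, allowing for rounding, would flag rows that were misread during import.
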
## Question 1
This question asks, "Does discrimination in employment exist in different industries?" The key point is discrimination, which can involve gender, race, age, and so on, so gathering data on these topics is essential. For this question our group gathered the following data sets:
### "Employment status of the civilian non-institutional population 16 years and over by sex, 1980s to date":
[link_to_this_file](data/question_01/men_vs_women_over20years.xlsx)
Gender is one dimension along which to analyze discrimination. This data set gives the employment status of men and women since the 1980s, so it is potentially useful for our analysis.
Variables: It has 11 variables and is very similar to the general-view employment data set: sex and year; civilian non-institutional population; civilian labor force (total number, employed people split into agriculture and non-agriculture, number unemployed, and these factors as percentages); and the number not in the labor force. It has around 83 records.
Issue: There are no issues in this data set.
### "Employed persons by detailed occupation, sex, race, and Hispanic or Latino ethnicity" (from 2016 to 2020):
[link_to_2016](data/question_01/occupation_sex_race/detailed/2016.xlsx)
[link_to_2017](data/question_01/occupation_sex_race/detailed/2017.xlsx)
[link_to_2018](data/question_01/occupation_sex_race/detailed/2018.xlsx)
[link_to_2019](data/question_01/occupation_sex_race/detailed/2019.xlsx)
[link_to_2020](data/question_01/occupation_sex_race/detailed/2020.xlsx)
Our group chose the 2016 to 2020 files because the website provides this topic in the same format for those years. The data set relates race to occupation, so it is potentially useful for further analysis: discrimination very likely differs across industries, and race is a key dimension of discrimination. We also chose 2016 to 2020 because the most recent five years of data are more valuable for our analysis and for giving insights to others.
Variables: Each file has 7 variables, except 2016, which lacks the information on White workers. The variables are occupation; total employed; and the percentage of total employed who are women, White, Black or African American, Asian, and Hispanic or Latino. Each file has around 560 records.
Issue: These files contain missing values.
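
One way to see how much is missing is to count the missing cells per column once a file has been read into rows. The sketch below is a minimal example. It assumes missing cells show up as `None`, an empty string, or a dash placeholder, which should be checked against the actual files. The function name, column names, and sample rows are hypothetical.

```python
# Assumed placeholders for a missing cell (empty, hyphen, en dash);
# verify these against the real files before relying on them.
MISSING_MARKERS = {"", "-", "\u2013"}


def count_missing(rows, markers=MISSING_MARKERS):
    """Return {column: number of missing cells} for a list of row dicts."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or (
                isinstance(value, str) and value.strip() in markers
            )
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts


# Hypothetical rows for illustration only.
sample = [
    {"occupation": "Occupation A", "total": 1200, "women_pct": 45.1},
    {"occupation": "Occupation B", "total": 40, "women_pct": "\u2013"},
    {"occupation": "Occupation C", "total": 310, "women_pct": None},
]
print(count_missing(sample))
```

A per-column count like this helps decide whether to drop the affected rows or keep them and exclude only the affected variables.
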
### "Employed persons by occupation, race, Hispanic or Latino ethnicity, and sex" (from 2016 to 2020):
[link_to_2016-2017](data/question_01/occupation_sex_race/general_compare/2016-2017.xlsx)
[link_to_2017-2018](data/question_01/occupation_sex_race/general_compare/2017-2018.xlsx)
[link_to_2018-2019](data/question_01/occupation_sex_race/general_compare/2018-2019.xlsx)
[link_to_2019-2020](data/question_01/occupation_sex_race/general_compare/2019-2020.xlsx)
These data sets offer a more general view of occupation, gender, and race than the previous ones, and they are also potentially valuable for our further analysis.
Variables: Each file has 7 variables: occupation and race; total for both years; men for both years; and women for both years. Each file has around 85 records.
Issue: There are no issues in these files.
### "Employment status of the civilian noninstitutional population by age, sex, and race" (from 2016 to 2020):
[link_to_2016](data/question_01/sex_race_age/2016.xlsx)
[link_to_2017](data/question_01/sex_race_age/2017.xlsx)
[link_to_2018](data/question_01/sex_race_age/2018.xlsx)
[link_to_2019](data/question_01/sex_race_age/2019.xlsx)
[link_to_2020](data/question_01/sex_race_age/2020.xlsx)
These data sets give employment status broken down by age, sex, and race, which provides another angle on employment status. Thus, they could be very useful for further analysis.
Variables: Each file has 9 variables: the group (sex and race within each age range); civilian non-institutional population; civilian labor force total; civilian labor force percentage; civilian labor force employed total; civilian labor force employed percentage; civilian labor force unemployed total; civilian labor force unemployed percentage; and not in the labor force. Each file has around 275 records.
Issues: There are some missing values in these files.
## Question 2
This question asks whether the employment situation has truly improved. From another angle, analyzing unemployment can answer it: according to the news, politicians promise that the unemployment rate will improve through their actions once they are elected. Whether they told the public the truth can only be verified against the actual data.
### "Unemployed persons by occupation and sex" (from 2016 to 2020):
[link_to_2016-2017](data/question_02/Unemployed_sex_occupation/2016-2017.xlsx)
[link_to_2017-2018](data/question_02/Unemployed_sex_occupation/2017-2018.xlsx)
[link_to_2018-2019](data/question_02/Unemployed_sex_occupation/2018-2019.xlsx)
[link_to_2019-2020](data/question_02/Unemployed_sex_occupation/2019-2020.xlsx)
These data sets give the unemployment rate by gender and occupation, which offers insight into unemployment status.
Variables: Each file has 5 variables: occupation; total number unemployed for both years; unemployment rates for both years; unemployment rates of men for both years; and unemployment rates of women for both years. Each file has around 32 records.
Issues: There are missing values in these data sets.
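
Because each file compares two consecutive years, neighboring files overlap: 2017, for example, appears in both `2016-2017.xlsx` and `2017-2018.xlsx`. When the files are combined, each year should be kept only once. The sketch below works on the file names alone and shows one way to pick a single file for every year. The function name and the rule of keeping the first file that lists a year are assumptions for illustration, not the method used in this project.

```python
from pathlib import PurePosixPath


def year_to_file(paths):
    """Map each year to the first file whose name (e.g. '2016-2017.xlsx') lists it."""
    chosen = {}
    for path in paths:
        stem = PurePosixPath(path).stem        # e.g. '2016-2017'
        for year in (int(part) for part in stem.split("-")):
            chosen.setdefault(year, path)      # keep the first file seen
    return chosen


folder = "data/question_02/Unemployed_sex_occupation"
files = [f"{folder}/{name}.xlsx"
         for name in ("2016-2017", "2017-2018", "2018-2019", "2019-2020")]
for year, path in sorted(year_to_file(files).items()):
    print(year, path)
```

The same idea applies to the other two-year comparison files used for Question 1.
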
## Question 3
The question is: as society develops, do the requirements of each industry become stricter? To answer it, we first need to define "requirements." From our point of view, the percentage of workers holding a license or certificate is an appropriate indicator. Therefore, we gathered the following data.
### "Certification and licensing status of the employed by industry and class of worker" (from 2016 to 2020):
[link_to_2016](data/question_03/certification_and_license/2016.xlsx)
[link_to_2017](data/question_03/certification_and_license/2017.xlsx)
[link_to_2018](data/question_03/certification_and_license/2018.xlsx)
[link_to_2019](data/question_03/certification_and_license/2019.xlsx)
[link_to_2020](data/question_03/certification_and_license/2020.xlsx)
These data sets contain two parts, and both will be useful for further analysis. The first part gives certificate or license holding (both, or only one) by industry. The second part offers another angle: certificate or license holding by class of worker.
Variables: Each file has 7 variables: industry and class of worker; total employed workers; percentage of total employed workers; percentage with a certificate or license; percentage with a certificate but no license; percentage with a license (possibly also a certificate); and percentage with neither a certificate nor a license. Each file has around 34 records.
Issues: There are no issues in these data files.
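
The percentage variables described above should be internally consistent: the certificate-only and license shares should add up to the share with a certificate or license, and that share plus the share with neither should come to about 100. The sketch below checks one row under that reading of the variables. The function name, the tolerance for rounding, and the sample numbers are assumptions for illustration.

```python
def certification_row_is_consistent(with_any, certificate_only, with_license,
                                    without, tolerance=0.2):
    """Check that the percentage columns of one row add up as expected.

    tolerance is an assumed allowance for rounding in the published table.
    """
    parts_match = abs(certificate_only + with_license - with_any) <= tolerance
    total_matches = abs(with_any + without - 100) <= tolerance
    return parts_match and total_matches


# Hypothetical percentages (not real BLS figures).
print(certification_row_is_consistent(24.0, 2.5, 21.5, 76.0))
print(certification_row_is_consistent(30.0, 2.5, 21.5, 70.0))
```

Rows that fail the check are worth inspecting by hand, since they may point to a misaligned column rather than a real feature of the data.
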