-
Notifications
You must be signed in to change notification settings - Fork 3
/
03-descriptive-stats.Rmd
155 lines (108 loc) · 11.2 KB
/
03-descriptive-stats.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
output:
html_document: default
pdf_document: default
---
# Descriptive Statistics {#descriptive}
** Under Construction **
```{r descriptive-local-setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE,message = FALSE)
knitr::read_chunk("bin/03-descriptive-stats-code.R")
```
```{r packages}
```
Here in we will describe
- the data set
- the methods by which we label students at risk
- the distributions of at-risk students by
- demographic indicators
- registration record indicators
## Demographics across the colleges and major programs
### Dawson
```{r demographics-dawson}
```
### John Abbott College
```{r demographics-jac}
```
### Vanier
```{r demographics-vanier}
```
## Academic Performance in Semesters leading to drop-out
```{r descriptive-stats-setup}
```
### Grades
Do students who drop out do so because of poor grades? What fraction of students are counted year after year as drop-outs and labeled as problems to be solved by the system while being exemplary students in terms of academic performance. Armed with this dataset, we can get the answer to that question. Let us begin by looking at the average grades of students who eventually dropped out compared to grades of students who haven't. The following set of graphs will look at that comparison for 3 different semesters: the semester in which they dropped out (or graduated), the semester before that and the one before that. Let us see what the data says.
```{r Average comp}
```
```{r}
x<-students_last_session$Average[which(students_last_session$status=="out")]
```
From the average total grade between the drop outs and the graduates, we can clearly see that the distributions are significantly different, but what is surprising is that `r round(sum(x>75, na.rm=T)/length(x),digits=2)*100`% of students have an average grade above 75% and that more than `r round(sum(x>59, na.rm=T)/length(x),digits=2)*100`% of the students who dropped out had an average grade above 60%. In other words, most students who drop out have passing averages. One hypothesis for this effect is that students who end up dropping out start with good semesters and have their performance decline closer to the final semester. Let us verify this hypothesis.
Let us now look at the performance of drop outs vs graduates on a semester by semester basis. Let us start by looking at the average grades of students in their last semester during which they are either graduating or dropping-out.
```{r Average-comp}
```
```{r Average-comp-ttest}
```
```{r}
x<-students_last_session$Average_all_courses[which(students_last_session$status=="out")]
```
From the average total grade between the drop outs and the graduates, we can clearly see that the distributions are significantly different, but what is surprising is that `r round(sum(x>75, na.rm=T)/length(x),digits=2)*100`% of students have an average grade above 75% and that more than `r round(sum(x>59, na.rm=T)/length(x),digits=2)*100`% of the students who dropped out had an average grade above 60%. In other words, most students who drop out have passing averages. One hypothesis for this effect is that students who end up dropping out start with good semesters and have their performance decline closer to the final semester. Let us verify this hypothesis.
Let us now look at the performance of drop outs vs graduates on a semester by semester basis. Let us start by looking at the average grades of students in their last semester during which they are either graduating or dropping-out.
```{r Last-semester-comp}
```
```{r Last-semester-comp-ttest}
```
```{r}
x<-avg_grade_per_term[,.SD[.N],by=c("student_number")][status=="out"]$avg_grade
```
In this data, we can clearly observe a stark difference between the graduates and the drop outs. First of all, note that `r round(sum(x>75, na.rm=T)/length(x),digits=2)*100`% of students still have a term average over 75% in the semester in which they drop out. To push it further `r round(sum(x>80, na.rm=T)/length(x),digits=2)*100`% of students who drop out have an average grade over 80%. The data clearly suggests that some of the drop outs aren't dropping out because of academic performance. Furthermore, the long tail of the data on the low end of performance suggests that some of the students stopped coming to class prior to the end of the semester resulting in very low grades that serve to drive their cegep average grades from the graph above even lower. `r round(sum(x<40, na.rm=T)/length(x),digits=2)*100`% of students have a grade below 40 suggesting that they have indeed stopped coming to school some time during the semester. What if these students hadn't stopped coming, would their performance be similar to that of graduates?
Let us now turn our attention to the semester before the one where they graduate or drop out.
```{r N-1-semester-comp}
```
```{r}
x<-c2[,.SD[.N],by=c("student_number")][status=="out"]$V1
```
In this data, we can clearly observe a stark difference between the graduates and the drop outs. First of all, note that `r round(sum(x>75, na.rm=T)/length(x),digits=2)*100`% of students still have a term average over 75% in the semester in which they drop out. To push it further `r round(sum(x>80, na.rm=T)/length(x),digits=2)*100`% of students who drop out have an average grade over 80%. The data clearly suggests that some of the drop outs aren't dropping out because of academic performance. Furthermore, the long tail of the data on the low end of performance suggests that some of the students stopped coming to class prior to the end of the semester resulting in very low grades that serve to drive their cegep average grades from the graph above even lower. `r round(sum(x<40, na.rm=T)/length(x),digits=2)*100`% of students have a grade below 40 suggesting that they have indeed stopped coming to school some time during the semester. What if these students hadn't stopped coming, would their performance be similar to that of graduates?
Let us now turn our attention to the semester before the one where they graduate or drop out.
```{r N-1 semester comp}
```
```{r}
x<-c2[,.SD[.N-1],by=c("student_number")][status=="out"]$V1
```
```{r N-1-semester-comp-ttest}
```
```{r}
x<-avg_grade_last_term_minus1[status=="out"]$avg_grade
```
The data clearly shows that even if there are significant differences between the groups, the drop out student population is getting closer to the graduate population. A section of `r round(sum(x>75, na.rm=T)/length(x),digits=2)*100`% is observed to have an average grade above 75%.
Finally, let us look at 2 semesters before they graduate or drop-out. We are again expecting the same trend.
```{r N-2-semester-comp}
```
```{r N-2-semester-comp-ttest}
```
```{r}
x<-c2[,.SD[.N-2],by=c("student_number")][status=="out"]$V1
```
## Conclusion
In conclusion, the data strongly suggests that there are approximately 20% of students who consistently have averages above 75%, but still drop out. Therefore, that section of the drop out population is a section that will always go undetected if only traditional academic failure metrics are used to assess which students are likely to drop out. Students who move out to the US or the rest of Canada after their second semester and those who drop out by lack of interest are two potential student types that will always drop out no matter what remedial solutions are offered for them.
One of the profile level Key Performance Indicators that colleges are supposed to specifically keep track of is third semester retention. In this light, we can somewhat flip the line of questionning above, and look to see if the distribution of grades in the third semester students looks different for those who will drop out, and those who will continue on.
```{r third-semester-attrition-by-program}
```
In conclusion, the data strongly suggests that there are approximately 20% of students who consistently have averages above 75%, but still drop out. Therefore, that section of the drop out population is a section that will always go undetected if only traditional academic failure metrics are used to assess which students are likely to drop out. Students who move out to the US or the rest of Canada after their second semester and those who drop out by lack of interest are two potential student types that will always drop out no matter what remedial solutions are offered for them.
### Passed vs. Failed Courses
The line of questionning above can be repeated, but instead of looking at the average grade of courses taken in a term, we can instead look to see if the number of courses passed, or the proportion failed, might be different for students who drop out versus those who do not.
```{r failed-courses}
```
What we remark in the table above is that the number of courses failed does not seem to differentiate students who will drop out after their third term. But perhaps, we should consider the fraction of courses that a student took in their third term, and failed?
```{r prop-failed-courses}
```
For the above graphic, we calculate, for each student in their term, the fraction of courses that they took and _failed_, and we plot the distributions for the group of students we know who dropped out, and those we know stayed in the college for a fourth term. The red boxplot shows that,of the students who dropped out after their third term, 50% of them failed more than than half of their classes in that third term (the central notch represents a 95% confidence interval on the median). Conversely, the distribution on the left, squashed down at 0, shows that almost all students who stay on for a fourth term, pass all of their third term classes (The additional dots shown in this distribution are considered outliers to the distribution, meaning there are some students who failed all of their third term classes, and decided to stay for an additional fourth term).
### Early Detection Mechanisms: The Mid-Semester Evaluation
In all colleges, the Mid-Semester Evaluation (MSE) is used by teachers to provide feedback to students concerning their performance thus far in the semester. In light of this research project, it would be useful for colleges to know which number of which kind of MSE performances are pre-emptive signs that a student is at risk of dropping out before graduating. Current methods that exist to put students on the danger list are that a particular student has two or more failing MSEs. Let us verify that this criterion is optimal.
```{r Optimal-Failing-MSE}
```
```{r Optimal-Failing-MSE-prop}
```
You should print the 3 confusion matrices. Then go to the 3 graphs.
Looking at both the F1 score and the accuracy is interesting. Accuracy actually has a peak and then decrease at 1 for the last semester whereas the F1 score just flattens out for all semesters.
As can be seen from the above data, the F1 score, which is a figure of merit for the predictive quality focusing on the maximixing the diagonal elements and minimizing the off diagonal elements, suggests that there is no extra predictability after a student crosses the threshold of 1 Failing MSE. This suggests that should colleges want to create a mechanism to optimally figure out which students are at risk of dropping out, they should place students on the danger list if these students have one or more failing MSEs in any given semester, not two.