---
title: "Statistics for segmentation benchmark (Apeep VS manual on CC4)"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width=10, fig.height=6)
library(tidyverse)
library(scales)
```
This script compares the performance of `apeep` segmentation against a ground-truth manual segmentation on 258 large `apeep` images (10240 px × 2048 px).
The `apeep` pipeline uses an adaptive gray-level segmentation and keeps particles down to 50 px.
## Prepare data
### Read data
```{r read_data}
output_dir <- "data_cc4/matches"
man_parts <- read_csv(file.path(output_dir, "man_particles_props.csv"), col_types = cols()) %>%
  select(-c("object_label", "object_bbox-0", "object_bbox-1", "object_bbox-2", "object_bbox-3"))
reg_parts <- read_csv(file.path(output_dir, "reg_particles_props.csv"), col_types = cols()) %>%
  select(-c("object_label", "object_bbox-0", "object_bbox-1", "object_bbox-2", "object_bbox-3"))
matches_reg <- read_csv(file.path(output_dir, "matches_reg.csv"), col_types = cols())
```
### Select relevant objects
Make a list of taxa in manually segmented particles.
```{r taxa}
taxa <- man_parts %>% pull(taxon) %>% unique() %>% sort()
taxa
```
The manual segmentation originally generated `r nrow(man_parts)` particles.
We will ignore objects in the `detritus` and `othertocheck` categories as well as objects smaller than 50 px.
```{r filter_objects}
ignored <- man_parts %>% filter(taxon %in% c("detritus", "othertocheck")) %>% pull(object_id)
small <- man_parts %>% filter(area < 50) %>% pull(object_id)
man_parts <- man_parts %>% filter(!(taxon %in% c("detritus", "othertocheck"))) %>% filter(area >= 50)
matches_reg <- matches_reg %>% filter((!(man_ids %in% ignored)) & (!(man_ids %in% small)))
```
**After removing non-living and small particles, `r nrow(man_parts)` manual particles are left.
`r nrow(reg_parts)` particles were generated by the `apeep` segmentation.**
Let’s inspect the taxonomic composition of the benchmark dataset.
```{r testset_comp}
man_parts %>%
  count(taxon) %>%
  arrange(-n) %>%
  # keep taxa in decreasing order of abundance on the plot
  mutate(taxon = factor(taxon, taxon)) %>%
  ggplot() +
  geom_col(aes(x = taxon, y = n, fill = n > 10)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(trans = "log1p", breaks = c(0, 10, 50, 100, 200, 400, 1000, 2000)) +
  labs(x = "Taxon", y = "Number of objects", title = "Test set composition") +
  theme(text = element_text(size = 16))
```
## Compute global statistics
Compute overall precision and recall:
* precision: among `apeep` particles, how many were matched with manual particles?
* recall: among manual particles, how many were matched with `apeep` particles?
```{r global}
# Precision
# apeep particles matched with manual particles / apeep particles
precision <- length(unique(matches_reg$reg_ids)) / length(reg_parts$object_id)
# Recall
# manual particles matched with apeep particles / manual particles
recall <- length(unique(matches_reg$man_ids)) / length(man_parts$object_id)
global_stats <- tibble(
  metric = c("precision", "recall"),
  value = c(precision, recall)
)
global_stats
```
**Apeep segmentation has a very good recall (`r percent(recall, accuracy = 0.1)`) but a very poor precision (`r percent(precision, accuracy = 0.1)`).**
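To make the two definitions concrete, here is a minimal sketch on a made-up match table. The `toy_*` objects and their identifiers are hypothetical and only mirror the column names used in this script (`man_ids`, `reg_ids`, `object_id`).

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: 4 manual particles and 5 apeep particles
toy_man <- tibble(object_id = c("m1", "m2", "m3", "m4"))
toy_reg <- tibble(object_id = c("r1", "r2", "r3", "r4", "r5"))

# One row per (manual, apeep) match: m4 is missed, r4 and r5 match nothing
toy_matches <- tibble(
  man_ids = c("m1", "m2", "m3"),
  reg_ids = c("r1", "r2", "r3")
)

# Precision: matched apeep particles / all apeep particles
toy_precision <- n_distinct(toy_matches$reg_ids) / nrow(toy_reg)
# Recall: matched manual particles / all manual particles
toy_recall <- n_distinct(toy_matches$man_ids) / nrow(toy_man)

toy_precision  # 3 / 5 = 0.6
toy_recall     # 3 / 4 = 0.75
```

Extra `apeep` particles that match nothing lower precision only, while missed manual particles lower recall only.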
## Compute statistics per taxon
We want to compute the recall of organisms per taxonomic group, but we have to deal with multiple matches.

Case 1: one `apeep` particle is matched with multiple manual particles, likely belonging to different taxa.
Two solutions:

- take the rarest taxon
- ignore the particle, as the CNN will not be able to predict it (selected solution)

Case 2: one manual particle is matched with multiple `apeep` particles. Only one taxon is involved, but the `apeep` segmentation overestimates the number of organisms in this taxon.
Solution: keep only one match.
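The two rules can be tried on a small, made-up match table before applying them to the real one. This is only a sketch: the identifiers below are hypothetical, and only the column names `man_ids` and `reg_ids` come from this script.

```r
library(dplyr)
library(tibble)

# Hypothetical toy match table (identifiers are made up for illustration)
toy_matches <- tribble(
  ~man_ids, ~reg_ids,
  "m1",     "r1",   # clean one-to-one match
  "m2",     "r2",   # case 1: r2 covers two manual particles (m2 and m3)
  "m3",     "r2",
  "m4",     "r3",   # case 2: m4 is split into two apeep particles (r3 and r4)
  "m4",     "r4"
)

toy_clean <- toy_matches %>%
  # case 1: drop every apeep particle that appears in more than one match
  add_count(reg_ids) %>%
  filter(n == 1) %>%
  select(-n) %>%
  # case 2: keep a single row (the first one) per manual particle
  distinct(man_ids, .keep_all = TRUE)

toy_clean
# the remaining rows are m1-r1 and m4-r3
```

With these rules, `m2` and `m3` end up with no match at all, so they count as missed when recall is computed.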
```{r taxo_stats}
## apeep particles
# Count matches
counts_taxo <- matches_reg %>%
  # drop all cases of duplicated matches of apeep particles in match table (solve case 1 of multiple matches)
  add_count(reg_ids) %>% filter(n == 1) %>% select(-n) %>%
  # drop duplicates of matched manual particles and keep only one (solve case 2 of multiple matches)
  distinct(man_ids, .keep_all = TRUE) %>%
  # join matched apeep particles with manual particles taxo
  left_join(man_parts, by = c("man_ids" = "object_id")) %>%
  # count matches per taxon
  count(taxon, name = "n_reg")

# Compute recall
recall_taxo <- man_parts %>%
  # count true particles per taxon
  count(taxon, name = "n_truth") %>%
  # join with matched particles
  left_join(counts_taxo, by = "taxon") %>%
  # taxa without any match are absent from counts_taxo: set their count to 0 instead of NA
  mutate(n_reg = coalesce(n_reg, 0L)) %>%
  # compute ratio of apeep matched particles over true particles (recall)
  mutate(recall = n_reg / n_truth) %>%
  arrange(n_truth) %>%
  mutate(taxon = factor(taxon, taxon))

# Plot it
recall_taxo %>%
  ggplot() +
  geom_col(aes(x = taxon, y = n_truth, fill = recall), position = "dodge") +
  scale_fill_viridis_c(direction = -1) +
  scale_y_continuous(trans = "log1p") +
  coord_flip() +
  labs(title = "Recall scores and number of organisms per taxon", fill = "Recall",
       x = "Taxon", y = "Number of true objects") +
  theme(text = element_text(size = 16))
```
Alternative view of the same results: recall as bar length, number of true objects as fill color.
```{r taxo_plot, echo=FALSE}
recall_taxo %>%
  ggplot() +
  geom_col(aes(x = taxon, y = recall, fill = n_truth)) +
  scale_fill_viridis_c(trans = "log1p", direction = -1) +
  coord_flip() +
  labs(title = "Recall scores and number of organisms per taxon", fill = "Number of \ntrue objects",
       x = "Taxon", y = "Recall") +
  theme(text = element_text(size = 16))
```
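A detail worth keeping in mind when reading per-group scores: a group with true objects but no matched particle is absent from the table of match counts, so a left join leaves a missing count for it. The sketch below uses made-up taxa and counts (all `toy_*` names are hypothetical) to show the difference between leaving that value missing and treating it as zero matches.

```r
library(dplyr)
library(tibble)

# Hypothetical counts: "taxon_b" has true objects but no matched apeep particle
toy_truth <- tibble(taxon = c("taxon_a", "taxon_b"), n_truth = c(10L, 4L))
toy_counts <- tibble(taxon = "taxon_a", n_reg = 8L)

toy_recall <- toy_truth %>%
  # taxon_b has no row in toy_counts, so its n_reg is NA after the join
  left_join(toy_counts, by = "taxon") %>%
  mutate(
    recall_raw = n_reg / n_truth,                  # NA for taxon_b
    recall_filled = coalesce(n_reg, 0L) / n_truth  # 0 for taxon_b
  )

toy_recall
```

On a plot, a missing recall is drawn with the `NA` color of the fill scale, whereas a zero recall is drawn at the low end of the scale.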
## Compute statistics per size class
We will define size classes for particles:

* [50 px, 100 px)
* [100 px, 150 px)
* [150 px, 200 px)
* [200 px, 250 px)
* [250 px, 300 px)
* [300 px, 350 px)
* [350 px, 400 px)
* [400 px, 450 px)
* [450 px, 500 px)
* ≥ 500 px

and compute recall for each size class.
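As a quick check of how the class boundaries behave, the sketch below bins a few made-up areas with base R `cut()` and `right = FALSE`, using the same break values as this script. The `toy_*` names and the area values are hypothetical.

```r
# Hypothetical particle areas, in px
toy_area <- c(50, 99, 100, 499, 500, 12000)

# 50 px wide classes from 50 px to 500 px, then one class for everything larger
toy_breaks <- c(0, seq(from = 50, to = 500, by = 50), 1000000)

# right = FALSE makes intervals closed on the left and open on the right
toy_class <- cut(toy_area, breaks = toy_breaks, right = FALSE)

data.frame(area = toy_area, class_size = toy_class)
# 50 and 99 fall in [50,100), 100 in [100,150), 499 in [450,500);
# 500 and 12000 both fall in the last class
```

Note that a particle of exactly 500 px belongs to the last class, and that an area at or above the largest break would get `NA` instead of a class.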
```{r size_stats}
# Define size classes
# - 50 px wide classes from 50 px to 500 px
# - one class for 500 px and larger
man_parts <- man_parts %>%
  mutate(class_size = cut(area, breaks = c(0, seq(from = 50, to = 500, by = 50), 1000000), right = FALSE))

## apeep particles
# Count matches
counts_size <- matches_reg %>%
  # drop all cases of duplicated matches of apeep particles in match table (solve case 1 of multiple matches)
  add_count(reg_ids) %>% filter(n == 1) %>% select(-n) %>%
  # drop duplicates of matched manual particles and keep only one (solve case 2 of multiple matches)
  distinct(man_ids, .keep_all = TRUE) %>%
  # join matched apeep particles with manual particles class size
  left_join(man_parts, by = c("man_ids" = "object_id")) %>%
  # count matches per class size
  count(class_size, name = "n_reg")

# Compute recall
recall_size <- man_parts %>%
  # count true particles per class size
  count(class_size, name = "n_truth") %>%
  # join with matched particles
  left_join(counts_size, by = "class_size") %>%
  # size classes without any match are absent from counts_size: set their count to 0 instead of NA
  mutate(n_reg = coalesce(n_reg, 0L)) %>%
  # compute ratio of apeep matched particles over true particles (recall)
  mutate(recall = n_reg / n_truth)

recall_size %>%
  ggplot() +
  geom_col(aes(x = class_size, y = n_truth, fill = recall), position = "dodge") +
  scale_fill_viridis_c(direction = -1) +
  scale_y_continuous(trans = "log1p") +
  coord_flip() +
  labs(title = "Recall scores and number of organisms per size class", fill = "Recall",
       x = "Size class (px)", y = "Number of \ntrue objects") +
  theme(text = element_text(size = 16))
```
Alternative view of the same results: recall as bar length, number of true objects as fill color.
```{r size_plot, echo=FALSE}
recall_size %>%
  ggplot() +
  geom_col(aes(x = class_size, y = recall, fill = n_truth)) +
  scale_fill_viridis_c(trans = "log1p", direction = -1) +
  coord_flip() +
  labs(title = "Recall scores and number of organisms per size class", fill = "Number of \ntrue objects",
       x = "Size class (px)", y = "Recall") +
  theme(text = element_text(size = 16))
```
Now compute precision for each size class. Here the size class is that of the `apeep` particle.
```{r size_prec}
# Define size classes
# - 50 px wide classes from 50 px to 500 px
# - one class for 500 px and larger
reg_parts <- reg_parts %>%
  mutate(class_size = cut(object_area, breaks = c(0, seq(from = 50, to = 500, by = 50), 1000000), right = FALSE))

## apeep particles
# Count matches
# NB: the same rules as for recall are applied, so apeep particles dropped by these rules
# count as unmatched here, unlike in the global precision
counts_size <- matches_reg %>%
  # drop all cases of duplicated matches of apeep particles in match table (solve case 1 of multiple matches)
  add_count(reg_ids) %>% filter(n == 1) %>% select(-n) %>%
  # drop duplicates of matched manual particles and keep only one (solve case 2 of multiple matches)
  distinct(man_ids, .keep_all = TRUE) %>%
  # join matched apeep particles with apeep particles class size
  left_join(reg_parts, by = c("reg_ids" = "object_id")) %>%
  # count matches per class size
  count(class_size, name = "n_match")

# Compute precision
precision_size <- reg_parts %>%
  # count apeep particles per class size
  count(class_size, name = "n_reg") %>%
  # join with matched particles
  left_join(counts_size, by = "class_size") %>%
  # size classes without any match are absent from counts_size: set their count to 0 instead of NA
  mutate(n_match = coalesce(n_match, 0L)) %>%
  # compute ratio of matched apeep particles over all apeep particles (precision)
  mutate(precision = n_match / n_reg) %>%
  # add count of true objects
  left_join(man_parts %>% count(class_size, name = "n_truth"), by = "class_size")

precision_size %>%
  ggplot() +
  geom_col(aes(x = class_size, y = n_truth, fill = precision), position = "dodge") +
  scale_fill_viridis_c(direction = -1) +
  scale_y_continuous(trans = "log1p") +
  coord_flip() +
  labs(title = "Precision scores and number of organisms per size class", fill = "Precision",
       x = "Size class (px)", y = "Number of \ntrue objects") +
  theme(text = element_text(size = 16))
```
Alternative view of the same results: precision as bar length, number of true objects as fill color.
```{r size_prec_plot, echo=FALSE}
precision_size %>%
  ggplot() +
  geom_col(aes(x = class_size, y = precision, fill = n_truth)) +
  scale_fill_viridis_c(trans = "log1p", direction = -1) +
  coord_flip() +
  labs(title = "Precision scores and number of organisms per size class", fill = "Number of \ntrue objects",
       x = "Size class (px)", y = "Precision") +
  theme(text = element_text(size = 16))
```