-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcuration.rmd
52 lines (43 loc) · 1.45 KB
/
curation.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
---
title: "Data curation"
output: html_notebook
---
Now where we have the assemblies and GTDB taxonomy, we will perform a bit of manual curation to end up with a high quality dataset. First let's read in the GTDB classifications and clean them up a bit.
```{r}
library(data.table)
classified <- fread("data/gtdb/assemblies.bac120.summary.tsv")
bacteria <- classified[, .(
assembly = user_genome,
classification,
fastani_ani,
fastani_af,
closest_placement_ani,
closest_placement_af,
msa_percent,
classification_method
)]
bacteria[, "species" := gsub("_[A-Z]", "", tstrsplit(classification, ";s__")[[2]])]
bacteria
```
There are 18 genomes that could not get aligned.
Let's also remove two genomes without species assignment and *Bacillus anthracis* which we will assume is a misclassification (nobody had anthrax).
```{r}
bacteria <- bacteria[!is.na(species) & species != "Bacillus anthracis"]
nrow(bacteria)
```
Okay this is our clean list that we will use.
```{r}
fwrite(bacteria, "data/curated_assemblies.csv")
```
## Species distribution
```{r, fig.width=4, fig.height=6}
library(ggplot2)
theme_minimal() |> theme_set()
bacteria[, "genus" := tstrsplit(species, " ")[[1]]]
counts <- bacteria[, .N, by="genus"]
ggplot(counts, aes(x=N, y=reorder(genus, N))) +
geom_vline(xintercept=c(0.01, 0.05) * sum(counts$N), linetype="dashed") +
geom_bar(stat="identity") +
labs(y="", x="#assemblies")
ggsave("figures/genus_counts.png", width=4, height=6)
```