README for the greengenes2 2024_09 reference files
The biocore group released an updated version of the greengenes taxonomy in September 2024, which was an update of v.2022.10 that had been published in Nature Biotechnology. If you use these files, you should cite McDonald et al. My understanding is that the biggest changes you should notice will be a change in names to follow the GTDB 220.
Looking through their ftp site you will find a number of files that are useful for classifying ASVs, metagenomes, etc. For mothur-based workflows using the naive Bayesian classifier, we are interested in the full-length sequences. It is worth noting that the taxonomy provided with greengenes2 goes out to the species level. This does not mean that you should expect to be able to classify your sequences to the species level. There is general agreement that to achieve that you will need to have genome sequences.
Here is how I generated the mothur-compatible greengenes2 files.
To download the files we want two qza files (these are zip files that are used with QIIME 2). We can download them, extract them, and get them in an easy to find location using these bash commands:
wget --no-check-certificate https://ftp.microbio.me/greengenes_release/2024.09/2024.09.backbone.full-length.fna.qza
wget --no-check-certificate https://ftp.microbio.me/greengenes_release/2024.09/2024.09.backbone.tax.qza
unzip 2024.09.backbone.full-length.fna.qza
unzip 2024.09.backbone.tax.qza
mv */data/dna-sequences.fasta dna-sequences.fasta
mv */data/taxonomy.tsv taxonomy.tsv
Then in R we will read in the fasta and taxonomy files, make sure they’re in the correct order and polish the taxonomy strings to work with mothur:
library(tidyverse)
library(vroom)
fasta_fname <- "dna-sequences.fasta"
tax_fname <- "taxonomy.tsv"
f <- vroom_lines(fasta_fname)
indices <- seq_along(f)
seq_data <- tibble(id = f[indices %% 2 == 1],
seq = f[indices %% 2 == 0]) %>%
mutate(id = stringi::stri_replace_first_fixed(id, ">", ""))
tax_data <- vroom(tax_fname, delim = "\t",
col_names = c("id", "taxonomy"),
col_types = "cc",
skip = 1)
s_t <- anti_join(seq_data, tax_data, by = "id") %>% nrow(.) == 0
t_s <- anti_join(tax_data, seq_data, by = "id") %>% nrow(.) == 0
stopifnot(s_t, t_s)
seq_tax_data <- inner_join(seq_data, tax_data, by = "id")
Let’s look at the data a bit…
parsed_tax <- seq_tax_data %>%
select(id, taxonomy) %>%
separate(taxonomy, sep = "; ", into = c("k", "p", "c", "o", "f", "g", "s"))
count(parsed_tax, k)
There are 335,412 bacterial and 2,094 archaeal sequences. Here is the number of kingdoms through species included in the reference and the number of taxa at each level that only has one sequence.
parsed_tax %>%
pivot_longer(-id) %>%
filter(!str_detect(value, "__$")) %>%
nest(data = -name) %>%
mutate(summary = map_dfr(data, \(x){
z <- count(x, value)
tibble(n_taxa = nrow(z), n_singletons = sum(z$n == 1))
})) %>%
select(name, summary) %>%
unnest(summary)
# A tibble: 7 × 3
name n_taxa n_singletons
<chr> <int> <int>
1 k 2 0
2 p 132 3
3 c 333 21
4 o 957 105
5 f 2107 315
6 g 8114 2023
7 s 22901 10316
Here’s the number of sequences with a name at each taxonomic level
parsed_tax %>%
pivot_longer(-id) %>%
mutate(name = factor(name, levels = c("k", "p", "c", "o", "f", "g", "s"))) %>%
filter(!str_detect(value, "__$")) %>%
count(name)
# A tibble: 7 × 2
name n
<fct> <int>
1 k 337506
2 p 337431
3 c 336876
4 o 330113
5 f 323589
6 g 290541
7 s 199527
About 59% of sequences have a species level name. Not all sequences have genera-level names and not all genera have species-level names.
parsed_tax %>%
select(g, s) %>%
distinct() %>%
summarize(n_species = sum(s != "s__"), .by = g) %>%
count(n_species, name = "n_genera") %>%
select(n_genera, n_species) %>%
print(n = Inf)
n_genera n_species
<int> <int>
1 5473 1
2 1037 2
3 483 3
4 278 4
5 170 5
6 106 6
7 89 7
8 68 8
9 55 9
10 40 10
11 42 11
12 37 12
13 21 13
14 23 14
15 17 15
16 9 16
17 12 17
18 12 18
19 7 19
20 6 20
21 6 21
22 4 22
23 9 23
24 5 24
25 9 25
26 7 26
27 8 27
28 2 28
29 4 29
30 2 30
31 4 31
32 1 32
33 4 33
34 6 34
35 1 35
36 1 36
37 5 37
38 2 38
39 1 39
40 4 40
41 1 41
42 1 43
43 2 44
44 3 45
45 1 46
46 1 47
47 3 48
48 2 51
49 1 52
50 1 54
51 1 57
52 1 61
53 1 63
54 1 66
55 1 67
56 1 68
57 1 70
58 2 72
59 1 75
60 1 77
61 1 79
62 2 80
63 1 82
64 2 85
65 1 90
66 1 106
67 1 114
68 1 116
69 1 119
70 1 134
71 1 136
72 1 139
73 1 166
74 1 201
75 1 266
76 1 675
This tells us that there are 5373 genera with only one species; there’s also one genus with 675 species. This tells me that if you pick one of the 5373 singleton genera, you will definitely get that species - that’s about two-thirds of the genera. I worry that this will give the sense of an unrealistic sense of specificity for sequences. With that in mind, I will post the version of the greengenes2 database without species-level names. The code for generating the results with the species-level names is also included below in case you want to go out on that limb.
Let’s do the one with species-level names first:
seq_tax_data %>%
select(id, taxonomy) %>%
mutate(
taxonomy = stringi::stri_replace_all_regex(taxonomy, " ", ""),
taxonomy = stringi::stri_replace_all_regex(taxonomy, "$", ";")
) %>%
write_tsv("greengenes2_2024_09.w_sp.taxonomy", col_names = FALSE)
seq_tax_data %>%
select(id, seq) %>%
mutate(id = stringi::stri_replace_first_regex(id, "^", ">")) %>%
unite(fasta, id, seq, sep = "\n") %>%
write_tsv("greengenes2_2024_09.w_sp.fasta", col_names = FALSE)
Let’s do the one without species-level names next:
seq_tax_data %>%
select(id, taxonomy) %>%
mutate(
taxonomy = stringi::stri_replace_all_regex(taxonomy, " ", ""),
taxonomy = stringi::stri_replace_all_regex(taxonomy, "s__.*", "")
) %>%
write_tsv("greengenes2_2024_09.wo_sp.taxonomy", col_names = FALSE)
seq_tax_data %>%
select(id, seq) %>%
mutate(id = stringi::stri_replace_first_regex(id, "^", ">")) %>%
unite(fasta, id, seq, sep = "\n") %>%
write_tsv("greengenes2_2024_09.wo_sp.fasta", col_names = FALSE)
Let’s package it all together…
mkdir greengenes2_2024_09.wo_sp
mv greengenes2_2024_09.wo_sp.* greengenes2_2024_09.wo_sp
cp README.md greengenes2_2024_09.wo_sp
tar cvzf greengenes2_2024_09.wo_sp.tgz greengenes2_2024_09.wo_sp