Pat offers mothur and R workshops through Riffomonas Professional Development. Learn more!

README for the greengenes2 2024_09 reference files

April 27, 2026 • Patrick D. Schloss • 5 min read

The biocore group released an updated version of the greengenes taxonomy in September 2024, which was an update of v.2022.10 that had been published in Nature Biotechnology. If you use these files, you should cite McDonald et al. My understanding is that the biggest changes you should notice will be a change in names to follow the GTDB 220.

Looking through their ftp site you will find a number of files that are useful for classifying ASVs, metagenomes, etc. For mothur-based workflows using the naive Bayesian classifier, we are interested in the full-length sequences. It is worth noting that the taxonomy provided with greengenes2 goes out to the species level. This does not mean that you should expect to be able to classify your sequences to the species level. There is general agreement that to achieve that you will need to have genome sequences.

Here is how I generated the mothur-compatible greengenes2 files.

To download the files we want two qza files (these are zip files that are used with QIIME 2). We can download them, extract them, and get them in an easy to find location using these bash commands:

wget --no-check-certificate https://ftp.microbio.me/greengenes_release/2024.09/2024.09.backbone.full-length.fna.qza
wget --no-check-certificate https://ftp.microbio.me/greengenes_release/2024.09/2024.09.backbone.tax.qza

unzip 2024.09.backbone.full-length.fna.qza
unzip 2024.09.backbone.tax.qza

mv */data/dna-sequences.fasta dna-sequences.fasta
mv */data/taxonomy.tsv taxonomy.tsv

Then in R we will read in the fasta and taxonomy files, make sure they’re in the correct order and polish the taxonomy strings to work with mothur:

library(tidyverse)
library(vroom)

fasta_fname <- "dna-sequences.fasta"
tax_fname <- "taxonomy.tsv"

f <- vroom_lines(fasta_fname)
indices <- seq_along(f)

seq_data <- tibble(id = f[indices %% 2 == 1],
      seq = f[indices %% 2 == 0]) %>%
  mutate(id = stringi::stri_replace_first_fixed(id, ">", ""))


tax_data <- vroom(tax_fname, delim = "\t",
                  col_names = c("id", "taxonomy"),
                  col_types = "cc",
                  skip = 1)


s_t <- anti_join(seq_data, tax_data, by = "id") %>% nrow(.) == 0
t_s <- anti_join(tax_data, seq_data, by = "id") %>% nrow(.) == 0
stopifnot(s_t, t_s)

seq_tax_data <- inner_join(seq_data, tax_data, by = "id")

Let’s look at the data a bit…

parsed_tax <- seq_tax_data %>%
  select(id, taxonomy) %>%
  separate(taxonomy, sep = "; ", into = c("k", "p", "c", "o", "f", "g", "s"))

count(parsed_tax, k)

There are 335,412 bacterial and 2,094 archaeal sequences. Here is the number of kingdoms through species included in the reference and the number of taxa at each level that only has one sequence.

parsed_tax %>%
  pivot_longer(-id) %>%
  filter(!str_detect(value, "__$")) %>%
  nest(data = -name) %>%
  mutate(summary = map_dfr(data, \(x){
    z <- count(x, value)
    tibble(n_taxa = nrow(z), n_singletons = sum(z$n == 1))
  })) %>%
  select(name, summary) %>%
  unnest(summary)
# A tibble: 7 × 3
  name  n_taxa n_singletons
  <chr>  <int>        <int>
1 k          2            0
2 p        132            3
3 c        333           21
4 o        957          105
5 f       2107          315
6 g       8114         2023
7 s      22901        10316

Here’s the number of sequences with a name at each taxonomic level

parsed_tax %>%
  pivot_longer(-id) %>%
  mutate(name = factor(name, levels = c("k", "p", "c", "o", "f", "g", "s"))) %>%
  filter(!str_detect(value, "__$")) %>%
  count(name)
# A tibble: 7 × 2
  name       n
  <fct>  <int>
1 k     337506
2 p     337431
3 c     336876
4 o     330113
5 f     323589
6 g     290541
7 s     199527

About 59% of sequences have a species level name. Not all sequences have genera-level names and not all genera have species-level names.

parsed_tax %>%
  select(g, s) %>%
  distinct() %>%
  summarize(n_species = sum(s != "s__"), .by = g) %>% 
  count(n_species, name = "n_genera") %>%
  select(n_genera, n_species) %>%
  print(n = Inf)
   n_genera n_species
      <int>     <int>
 1     5473         1
 2     1037         2
 3      483         3
 4      278         4
 5      170         5
 6      106         6
 7       89         7
 8       68         8
 9       55         9
10       40        10
11       42        11
12       37        12
13       21        13
14       23        14
15       17        15
16        9        16
17       12        17
18       12        18
19        7        19
20        6        20
21        6        21
22        4        22
23        9        23
24        5        24
25        9        25
26        7        26
27        8        27
28        2        28
29        4        29
30        2        30
31        4        31
32        1        32
33        4        33
34        6        34
35        1        35
36        1        36
37        5        37
38        2        38
39        1        39
40        4        40
41        1        41
42        1        43
43        2        44
44        3        45
45        1        46
46        1        47
47        3        48
48        2        51
49        1        52
50        1        54
51        1        57
52        1        61
53        1        63
54        1        66
55        1        67
56        1        68
57        1        70
58        2        72
59        1        75
60        1        77
61        1        79
62        2        80
63        1        82
64        2        85
65        1        90
66        1       106
67        1       114
68        1       116
69        1       119
70        1       134
71        1       136
72        1       139
73        1       166
74        1       201
75        1       266
76        1       675

This tells us that there are 5373 genera with only one species; there’s also one genus with 675 species. This tells me that if you pick one of the 5373 singleton genera, you will definitely get that species - that’s about two-thirds of the genera. I worry that this will give the sense of an unrealistic sense of specificity for sequences. With that in mind, I will post the version of the greengenes2 database without species-level names. The code for generating the results with the species-level names is also included below in case you want to go out on that limb.

Let’s do the one with species-level names first:

seq_tax_data %>%
  select(id, taxonomy) %>%
  mutate(
    taxonomy = stringi::stri_replace_all_regex(taxonomy, " ", ""),
    taxonomy = stringi::stri_replace_all_regex(taxonomy, "$", ";")
  ) %>%
  write_tsv("greengenes2_2024_09.w_sp.taxonomy", col_names = FALSE)

seq_tax_data %>%
  select(id, seq) %>%
  mutate(id = stringi::stri_replace_first_regex(id, "^", ">")) %>%
  unite(fasta, id, seq, sep = "\n") %>%
  write_tsv("greengenes2_2024_09.w_sp.fasta", col_names = FALSE)

Let’s do the one without species-level names next:

seq_tax_data %>%
  select(id, taxonomy) %>%
  mutate(
    taxonomy = stringi::stri_replace_all_regex(taxonomy, " ", ""),
    taxonomy = stringi::stri_replace_all_regex(taxonomy, "s__.*", "")
  ) %>%
  write_tsv("greengenes2_2024_09.wo_sp.taxonomy", col_names = FALSE)

seq_tax_data %>%
  select(id, seq) %>%
  mutate(id = stringi::stri_replace_first_regex(id, "^", ">")) %>%
  unite(fasta, id, seq, sep = "\n") %>%
  write_tsv("greengenes2_2024_09.wo_sp.fasta", col_names = FALSE)

Let’s package it all together…

mkdir greengenes2_2024_09.wo_sp
mv greengenes2_2024_09.wo_sp.* greengenes2_2024_09.wo_sp
cp README.md greengenes2_2024_09.wo_sp
tar cvzf greengenes2_2024_09.wo_sp.tgz greengenes2_2024_09.wo_sp