<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>mothur</title>
    <description>The website that supports the mothur software program - one of the most widely used tools for analyzing 16S rRNA gene sequence data. Step inside to learn how to use the software, get help, and join our community!
</description>
    <link>https://mothur.org/</link>
    <atom:link href="https://mothur.org/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 28 Apr 2026 20:59:45 +0000</pubDate>
    <lastBuildDate>Tue, 28 Apr 2026 20:59:45 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>README for the greengenes2 2024_09 reference files</title>
        <description>&lt;p&gt;The &lt;a href=&quot;https://github.com/biocore/greengenes2/&quot;&gt;biocore group&lt;/a&gt; released an updated version of the greengenes taxonomy in &lt;a href=&quot;https://ftp.microbio.me/greengenes_release/2024.09/00README&quot;&gt;September 2024&lt;/a&gt;, which was an update of v.2022.10 that had been published in &lt;a href=&quot;https://www.nature.com/articles/s41587-023-01845-1&quot;&gt;Nature Biotechnology&lt;/a&gt;. If you use these files, you should cite McDonald et al. My understanding is that the &lt;a href=&quot;https://ftp.microbio.me/greengenes_release/2024.09/00CHANGELOG&quot;&gt;biggest changes you should notice&lt;/a&gt; will be a change in names to follow the &lt;a href=&quot;https://gtdb.ecogenomic.org/&quot;&gt;GTDB 220&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Looking through their &lt;a href=&quot;https://ftp.microbio.me/greengenes_release/2024.09/&quot;&gt;ftp site&lt;/a&gt; you will find a number of files that are useful for classifying ASVs, metagenomes, etc. For mothur-based workflows using the naive Bayesian classifier, we are interested in the full-length sequences. It is worth noting that the taxonomy provided with greengenes2 goes out to the species level. This does not mean that you should expect to be able to classify your sequences to the species level. There is general agreement that to achieve that you will need to have genome sequences.&lt;/p&gt;

&lt;p&gt;Here is how I generated the mothur-compatible greengenes2 files.&lt;/p&gt;

&lt;p&gt;We need two qza files (zip archives used by QIIME 2): one with the full-length backbone sequences and one with their taxonomy. We can download them, extract them, and move their contents to an easy-to-find location using these bash commands:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{bash}&quot;&gt;wget --no-check-certificate https://ftp.microbio.me/greengenes_release/2024.09/2024.09.backbone.full-length.fna.qza
wget --no-check-certificate https://ftp.microbio.me/greengenes_release/2024.09/2024.09.backbone.tax.qza

unzip 2024.09.backbone.full-length.fna.qza
unzip 2024.09.backbone.tax.qza

mv */data/dna-sequences.fasta dna-sequences.fasta
mv */data/taxonomy.tsv taxonomy.tsv
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then in R we will read in the fasta and taxonomy files, make sure they’re in the correct order and polish the taxonomy strings to work with mothur:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{R}&quot;&gt;library(tidyverse)
library(vroom)

fasta_fname &amp;lt;- &quot;dna-sequences.fasta&quot;
tax_fname &amp;lt;- &quot;taxonomy.tsv&quot;

f &amp;lt;- vroom_lines(fasta_fname)
indices &amp;lt;- seq_along(f)

seq_data &amp;lt;- tibble(id = f[indices %% 2 == 1],
      seq = f[indices %% 2 == 0]) %&amp;gt;%
  mutate(id = stringi::stri_replace_first_fixed(id, &quot;&amp;gt;&quot;, &quot;&quot;))


tax_data &amp;lt;- vroom(tax_fname, delim = &quot;\t&quot;,
                  col_names = c(&quot;id&quot;, &quot;taxonomy&quot;),
                  col_types = &quot;cc&quot;,
                  skip = 1)


s_t &amp;lt;- anti_join(seq_data, tax_data, by = &quot;id&quot;) %&amp;gt;% nrow(.) == 0
t_s &amp;lt;- anti_join(tax_data, seq_data, by = &quot;id&quot;) %&amp;gt;% nrow(.) == 0
stopifnot(s_t, t_s)

seq_tax_data &amp;lt;- inner_join(seq_data, tax_data, by = &quot;id&quot;)
&lt;/code&gt;&lt;/pre&gt;
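
&lt;p&gt;The odd/even indexing above assumes that every record in dna-sequences.fasta takes exactly two lines: a header, then the whole sequence on a single line. As a minimal sketch, that assumption could be checked from bash before starting R. The function name check_two_line_fasta is made up for this example and is not part of mothur or QIIME 2:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{bash}&quot;&gt;# Hypothetical helper: succeeds only if odd-numbered lines are headers
# and even-numbered lines are sequence
check_two_line_fasta() {
  awk &apos;
    NR % 2 == 1 { if ($0 !~ /^&amp;gt;/) bad = 1 }  # odd lines must start with &quot;&amp;gt;&quot;
    NR % 2 == 0 { if ($0 ~ /^&amp;gt;/) bad = 1 }   # even lines must not
    END { if (NR % 2 == 1) bad = 1; exit bad + 0 }
  &apos; &quot;$1&quot;
}

# Usage: check_two_line_fasta dna-sequences.fasta || echo &quot;not a two-line FASTA&quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the check fails, the sequences would need to be unwrapped onto single lines before the odd/even split is valid.&lt;/p&gt;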

&lt;p&gt;Let’s look at the data a bit…&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;parsed_tax &amp;lt;- seq_tax_data %&amp;gt;%
  select(id, taxonomy) %&amp;gt;%
  separate(taxonomy, sep = &quot;; &quot;, into = c(&quot;k&quot;, &quot;p&quot;, &quot;c&quot;, &quot;o&quot;, &quot;f&quot;, &quot;g&quot;, &quot;s&quot;))

count(parsed_tax, k)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are 335,412 bacterial and 2,094 archaeal sequences. Here is the number of taxa at each level, from kingdom through species, along with the number of those taxa that are represented by only one sequence.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;parsed_tax %&amp;gt;%
  pivot_longer(-id) %&amp;gt;%
  filter(!str_detect(value, &quot;__$&quot;)) %&amp;gt;%
  nest(data = -name) %&amp;gt;%
  mutate(summary = map_dfr(data, \(x){
    z &amp;lt;- count(x, value)
    tibble(n_taxa = nrow(z), n_singletons = sum(z$n == 1))
  })) %&amp;gt;%
  select(name, summary) %&amp;gt;%
  unnest(summary)
&lt;/code&gt;&lt;/pre&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# A tibble: 7 × 3
  name  n_taxa n_singletons
  &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;        &amp;lt;int&amp;gt;
1 k          2            0
2 p        132            3
3 c        333           21
4 o        957          105
5 f       2107          315
6 g       8114         2023
7 s      22901        10316
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s the number of sequences with a name at each taxonomic level:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;parsed_tax %&amp;gt;%
  pivot_longer(-id) %&amp;gt;%
  mutate(name = factor(name, levels = c(&quot;k&quot;, &quot;p&quot;, &quot;c&quot;, &quot;o&quot;, &quot;f&quot;, &quot;g&quot;, &quot;s&quot;))) %&amp;gt;%
  filter(!str_detect(value, &quot;__$&quot;)) %&amp;gt;%
  count(name)
&lt;/code&gt;&lt;/pre&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# A tibble: 7 × 2
  name       n
  &amp;lt;fct&amp;gt;  &amp;lt;int&amp;gt;
1 k     337506
2 p     337431
3 c     336876
4 o     330113
5 f     323589
6 g     290541
7 s     199527
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;About 59% of sequences have a species-level name. Not all sequences have genus-level names and not all genera have species-level names.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;parsed_tax %&amp;gt;%
  select(g, s) %&amp;gt;%
  distinct() %&amp;gt;%
  summarize(n_species = sum(s != &quot;s__&quot;), .by = g) %&amp;gt;% 
  count(n_species, name = &quot;n_genera&quot;) %&amp;gt;%
  select(n_genera, n_species) %&amp;gt;%
  print(n = Inf)
&lt;/code&gt;&lt;/pre&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;   n_genera n_species
      &amp;lt;int&amp;gt;     &amp;lt;int&amp;gt;
 1     5473         1
 2     1037         2
 3      483         3
 4      278         4
 5      170         5
 6      106         6
 7       89         7
 8       68         8
 9       55         9
10       40        10
11       42        11
12       37        12
13       21        13
14       23        14
15       17        15
16        9        16
17       12        17
18       12        18
19        7        19
20        6        20
21        6        21
22        4        22
23        9        23
24        5        24
25        9        25
26        7        26
27        8        27
28        2        28
29        4        29
30        2        30
31        4        31
32        1        32
33        4        33
34        6        34
35        1        35
36        1        36
37        5        37
38        2        38
39        1        39
40        4        40
41        1        41
42        1        43
43        2        44
44        3        45
45        1        46
46        1        47
47        3        48
48        2        51
49        1        52
50        1        54
51        1        57
52        1        61
53        1        63
54        1        66
55        1        67
56        1        68
57        1        70
58        2        72
59        1        75
60        1        77
61        1        79
62        2        80
63        1        82
64        2        85
65        1        90
66        1       106
67        1       114
68        1       116
69        1       119
70        1       134
71        1       136
72        1       139
73        1       166
74        1       201
75        1       266
76        1       675
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This tells us that there are 5473 genera with only one species; there’s also one genus with 675 species. If a sequence classifies to one of the 5473 singleton genera, it will necessarily be assigned to that genus’s only species; that’s about two-thirds of the genera. I worry that this will give an unrealistic sense of specificity for sequences. With that in mind, I will post the version of the greengenes2 database without species-level names. The code for generating the files with the species-level names is also included below in case you want to go out on that limb.&lt;/p&gt;
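
&lt;p&gt;The two proportions quoted above can be recomputed from the counts printed in the tables. This is only a sketch of the arithmetic; pct is a throwaway helper defined here, and using the 8,114 named genera as the denominator is an assumption about what “the genera” refers to:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{bash}&quot;&gt;# pct A B prints A as a percentage of B, to one decimal place
pct() { awk -v a=&quot;$1&quot; -v b=&quot;$2&quot; &apos;BEGIN { printf &quot;%.1f\n&quot;, 100 * a / b }&apos;; }

pct 199527 337506   # sequences with a species name out of all sequences
pct 5473 8114       # genera with exactly one named species out of named genera
&lt;/code&gt;&lt;/pre&gt;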

&lt;p&gt;Let’s do the one with species-level names first:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;seq_tax_data %&amp;gt;%
  select(id, taxonomy) %&amp;gt;%
  mutate(
    taxonomy = stringi::stri_replace_all_regex(taxonomy, &quot; &quot;, &quot;&quot;),
    taxonomy = stringi::stri_replace_all_regex(taxonomy, &quot;$&quot;, &quot;;&quot;)
  ) %&amp;gt;%
  write_tsv(&quot;greengenes2_2024_09.w_sp.taxonomy&quot;, col_names = FALSE)

seq_tax_data %&amp;gt;%
  select(id, seq) %&amp;gt;%
  mutate(id = stringi::stri_replace_first_regex(id, &quot;^&quot;, &quot;&amp;gt;&quot;)) %&amp;gt;%
  unite(fasta, id, seq, sep = &quot;\n&quot;) %&amp;gt;%
  write_tsv(&quot;greengenes2_2024_09.w_sp.fasta&quot;, col_names = FALSE)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let’s do the one without species-level names next:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;seq_tax_data %&amp;gt;%
  select(id, taxonomy) %&amp;gt;%
  mutate(
    taxonomy = stringi::stri_replace_all_regex(taxonomy, &quot; &quot;, &quot;&quot;),
    taxonomy = stringi::stri_replace_all_regex(taxonomy, &quot;s__.*&quot;, &quot;&quot;)
  ) %&amp;gt;%
  write_tsv(&quot;greengenes2_2024_09.wo_sp.taxonomy&quot;, col_names = FALSE)

seq_tax_data %&amp;gt;%
  select(id, seq) %&amp;gt;%
  mutate(id = stringi::stri_replace_first_regex(id, &quot;^&quot;, &quot;&amp;gt;&quot;)) %&amp;gt;%
  unite(fasta, id, seq, sep = &quot;\n&quot;) %&amp;gt;%
  write_tsv(&quot;greengenes2_2024_09.wo_sp.fasta&quot;, col_names = FALSE)
&lt;/code&gt;&lt;/pre&gt;
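
&lt;p&gt;mothur expects each line of a taxonomy file to hold a sequence name, a tab, and a semicolon-terminated taxonomy string. Here is a rough sketch of a sanity check on the files written above, assuming every reference carries all seven ranks (so seven semicolons with species and six without). check_taxonomy is a hypothetical helper, not a mothur command:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{bash}&quot;&gt;# Usage: check_taxonomy FILE N_LEVELS
# Succeeds when every line has two tab-separated fields and the taxonomy
# has no spaces, ends in &quot;;&quot; and holds exactly N_LEVELS semicolons
check_taxonomy() {
  awk -F &apos;\t&apos; -v n=&quot;$2&quot; &apos;
    { tax = $2; levels = gsub(/;/, &quot;;&quot;, tax) }   # count the semicolons
    NF != 2 || $2 !~ /;$/ || $2 ~ / / || levels != n { bad = 1 }
    END { exit bad + 0 }
  &apos; &quot;$1&quot;
}

check_taxonomy greengenes2_2024_09.w_sp.taxonomy 7 || echo &quot;check w_sp&quot;
check_taxonomy greengenes2_2024_09.wo_sp.taxonomy 6 || echo &quot;check wo_sp&quot;
&lt;/code&gt;&lt;/pre&gt;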

&lt;p&gt;Let’s package it all together…&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{bash}&quot;&gt;mkdir greengenes2_2024_09.wo_sp
mv greengenes2_2024_09.wo_sp.* greengenes2_2024_09.wo_sp
cp README.md greengenes2_2024_09.wo_sp
tar cvzf greengenes2_2024_09.wo_sp.tgz greengenes2_2024_09.wo_sp
&lt;/code&gt;&lt;/pre&gt;
</description>
        <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
        <link>https://mothur.org/blog/2026/greengenes2_2024_09/</link>
        <guid isPermaLink="true">https://mothur.org/blog/2026/greengenes2_2024_09/</guid>
        
        
      </item>
    
      <item>
        <title>README for the SILVA v138.2 reference files</title>
        <description>&lt;p&gt;The good people at &lt;a href=&quot;http://arb-silva.de&quot;&gt;SILVA&lt;/a&gt; have released a new version of the SILVA v138 (and v138.1) database. &lt;a href=&quot;https://www.arb-silva.de/documentation/release-1382/&quot;&gt;My understanding&lt;/a&gt; is that this update removed 13 sequences from v138. The biggest change was a number of modifications to the taxonomy including applying 6 taxonomic levels and using “Incertae Sedis” instead of “unclassified”. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the &lt;a href=&quot;http://www.mothur.org/wiki/Silva_reference_files&quot;&gt;mothur-compatible reference files&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;curation-of-references&quot;&gt;Curation of references&lt;/h2&gt;

&lt;h3 id=&quot;getting-the-data-in-and-out-of-the-arb-database&quot;&gt;Getting the data in and out of the ARB database&lt;/h3&gt;

&lt;p&gt;This README file explains how we generated the silva reference files for use with mothur’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;classify.seqs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;align.seqs&lt;/code&gt; commands. I’ll assume that you have a functioning copy of arb installed on your computer. For this README we are using version 6.0. First we need to download the database and decompress it. From the command line we do the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget -N https://www.arb-silva.de/fileadmin/silva_databases/release_138_2/ARB_files/SILVA_138.2_SSURef_NR99_03_07_24_opt.arb.gz
gunzip SILVA_138.2_SSURef_NR99_03_07_24_opt.arb.gz
arb SILVA_138.2_SSURef_NR99_03_07_24_opt.arb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will launch us into the arb environment with the “Ref NR 99” database opened. This database has 510,495 sequences within it that are not more than 99% similar to each other. The release notes for &lt;a href=&quot;http://www.arb-silva.de/documentation/release-1382/&quot;&gt;this database&lt;/a&gt; as well as the idea behind the &lt;a href=&quot;http://www.arb-silva.de/projects/ssu-ref-nr/&quot;&gt;non-redundant database&lt;/a&gt; are available from the silva website. Within arb do the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Click the search button&lt;/li&gt;
  &lt;li&gt;Set the first search field to ‘ARB_color’ and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)&lt;/li&gt;
  &lt;li&gt;Click ‘Search’. This yielded 446,875 hits&lt;/li&gt;
  &lt;li&gt;Click the “Mark Listed Unmark Rest” button&lt;/li&gt;
  &lt;li&gt;Close the “Search and Query” box&lt;/li&gt;
  &lt;li&gt;Now click on File-&amp;gt;export-&amp;gt;export to external format&lt;/li&gt;
  &lt;li&gt;In this box the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Export&lt;/code&gt; option should be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;marked&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Filter&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;none&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Compression&lt;/code&gt; should be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;no&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;In the field for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Choose an output file name&lt;/code&gt;, make sure the path has you in the correct working directory and enter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.full_v138_2.fasta&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Select a format: fasta_mothur.eft. This is a custom formatting file that I have created that includes the sequence’s accession number and its taxonomy across the top line. To make one for yourself, you will need to create &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fasta_mothur.eft&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$ARBHOME/lib/export/&lt;/code&gt; folder with the following:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SUFFIX          fasta
BEGIN
&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;acc&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;name&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;align_ident_slv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;tax_slv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;|export_sequence&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Save this as silva.full_v138_2.fasta&lt;/li&gt;
  &lt;li&gt;You can now quit arb.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;screening-the-sequences&quot;&gt;Screening the sequences&lt;/h3&gt;

&lt;p&gt;Now we need to screen the sequences for those that span the 27f and 1492r primer region, have 5 or fewer ambiguous base calls, and that are unique. We’ll also extract the taxonomic information from the header line. Run the following commands from a bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#Convert from RNA to DNA sequences...
sed &apos;/^[^&amp;gt;]/s/[Uu]/T/g&apos; silva.full_v138_2.fasta &amp;gt; silva.full_v138_2_dna.fasta

mothur &quot;#screen.seqs(fasta=silva.full_v138_2_dna.fasta, start=1044, end=43116, maxambig=5);
        pcr.seqs(start=1044, end=43116, keepdots=T);
        degap.seqs();
        unique.seqs();&quot;

#identify the unique sequences without regard to their alignment
grep &quot;&amp;gt;&quot; silva.full_v138_2_dna.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- &amp;gt; silva.full_v138_2_dna.good.pcr.ng.unique.accnos

#get the unique sequences without regard to their alignment
mothur &quot;#get.seqs(fasta=silva.full_v138_2_dna.good.pcr.fasta, accnos=silva.full_v138_2_dna.good.pcr.ng.unique.accnos)&quot;

#generate alignment file
mv silva.full_v138_2_dna.good.pcr.pick.fasta silva.nr_v138_2.align

#generate taxonomy file
grep &apos;&amp;gt;&apos; silva.nr_v138_2.align | cut -f1,3 | cut -f2 -d&apos;&amp;gt;&apos; &amp;gt; silva.nr_v138.full
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The mothur commands above do several things. First the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen.seqs&lt;/code&gt; command removes sequences that are not full length or that have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB NR reference database lets in shorter (&amp;gt;900 bp) archaeal 16S rRNA gene sequences. Second, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pcr.seqs&lt;/code&gt; converts any base calls that occur before position 1044 or after position 43116 to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.&lt;/code&gt; so that the sequences only span the region between the 27f and 1492r priming sites. Finally, it is possible that weird things happen in the alignments and so we unalign the sequences (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;degap.seqs&lt;/code&gt;) and identify the unique sequences (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique.seqs&lt;/code&gt;). We then convert the resulting fasta file into an accnos file so that we can go back into mothur and pull out the unique sequences from the aligned file (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.seqs&lt;/code&gt;).&lt;/p&gt;
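
&lt;p&gt;To make the header handling concrete, here is a small sketch that runs the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sed&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grep&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cut&lt;/code&gt; calls on a single toy record laid out the way &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fasta_mothur.eft&lt;/code&gt; writes it. The accession, name, identity value and taxonomy are invented for illustration, as are the toy file names:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#Toy record: header fields are id, alignment identity, taxonomy (all made up)
printf &apos;&amp;gt;AB000001.Example\t98.5\tBacteria;Firmicutes;Bacilli;\nACGU-UGCA\n&apos; &amp;gt; toy.fasta

#RNA to DNA: only non-header lines are edited, so the &quot;u&quot; in the header survives
sed &apos;/^[^&amp;gt;]/s/[Uu]/T/g&apos; toy.fasta &amp;gt; toy_dna.fasta

#Keep header fields 1 and 3 (id and taxonomy) and drop the leading &quot;&amp;gt;&quot;
grep &apos;&amp;gt;&apos; toy_dna.fasta | cut -f1,3 | cut -f2 -d&apos;&amp;gt;&apos; &amp;gt; toy.full
cat toy.full
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The last command prints the id and the taxonomy separated by a tab, which is the two-column layout that the taxonomy-formatting step reads in.&lt;/p&gt;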

&lt;h3 id=&quot;formatting-the-taxonomy-files&quot;&gt;Formatting the taxonomy files&lt;/h3&gt;

&lt;p&gt;Now we want to make sure the taxonomy file is properly formatted for use with mothur. First we want to grab the SILVA taxa mapping file by running the following in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bash&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget https://www.arb-silva.de/fileadmin/silva_databases/release_138_2/Exports/taxonomy/tax_slv_ssu_138.2.txt.gz
&lt;span class=&quot;nb&quot;&gt;gunzip &lt;/span&gt;tax_slv_ssu_138.2.txt.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’ll run the following code from within R to clean up the taxa names and make sure everything has six levels:&lt;/p&gt;

&lt;div class=&quot;language-R highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tidyverse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;desired_levels&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;domain&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;phylum&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;class&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;order&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;family&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;genus&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;desired_levels_tbl&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tibble&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_level&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;factor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;desired_levels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;desired_levels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# this is their reference taxonomy with levels for each substring found&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# in the database&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_label_level&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_tsv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tax_slv_ssu_138.2.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
                            &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_type&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cols&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;.default&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_character&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_label&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_level&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;


&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# this is the full taxonomy for each sequence in the database&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;database_tax_label&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_tsv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;silva.nr_v138.full&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tax_label&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                               &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_type&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cols&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;.default&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_character&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# these are the unique tax_label values found in database_tax_label&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique_tax_labels&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;database_tax_label&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;distinct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left_join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_label_level&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tax_label&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;


&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# now need to get each of the substrings found in unique_tax_labels and return&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# the tax_level for each substring taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generate_substrings&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;$&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&amp;gt;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substrings&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;character&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;seq_along&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substrings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collapse&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substrings&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# replace missing levels with insertae sedis of the previous good name with&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# the taxonomic level appended&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill_ss_tbl&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  
  &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_tbl&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_tbl&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;right_join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;desired_levels_tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tax_level&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nas&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;which&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;is.na&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_tbl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;previous_good_string&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    
    &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_detect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;substring&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_insertae_sedis_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;previous_good_string&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      
      &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;substring&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;previous_good_string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;substring&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                       &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_insertae_sedis_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                       &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tax_level&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_replace_all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ss_tbl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collapse&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                  &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;


&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clean_tax_labels_lookup&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique_tax_labels&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mutate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substring&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generate_substrings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#generate substrs&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inner_join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_label_level&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;substring&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tax_label&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mutate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substring&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;^.*?([^;]+);$&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_detect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;^Incertae Sedis$&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;substring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_level&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mutate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clean_tax_label&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map_chr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill_ss_tbl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;.x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clean_tax_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;


&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;left_join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;database_tax_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clean_tax_labels_lookup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;tax_label&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clean_tax_label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write_tsv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;silva.full_v138_2.tax&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quote&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;none&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;building-the-seed-references&quot;&gt;Building the SEED references&lt;/h3&gt;

&lt;p&gt;The first thing to note is that SILVA does not release their SEED; it is private. By screening through the ARB databases we can attempt to recreate it. Our previous publications show that align.seqs with the recreated SEED does an excellent job of realigning sequences to look like they would if you used SINA and the true SEED. Now we want to figure out which sequences are part of the SEED. Earlier, when we exported the sequences from ARB, we included the align_ident_slv field from the database in our output. Let’s generate an accnos file that contains the names of the sequences with 100% identity to the SEED database and then use mothur to generate SEED fasta and taxonomy files. While we’re at it we’ll also generate the nr_v138_2 taxonomy file. The following code will be run from within a bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;grep &quot;&amp;gt;&quot; silva.nr_v138_2.align | cut -f 1,2 | grep $'\t100' | cut -f 1 | cut -c 2- &amp;gt; silva.seed_v138.accnos
mothur &quot;#get.seqs(fasta=silva.nr_v138_2.align, taxonomy=silva.full_v138_2.tax, accnos=silva.seed_v138.accnos)&quot;
mv silva.nr_v138_2.pick.align silva.seed_v138_2.align
mv silva.full_v138_2.pick.tax silva.seed_v138_2.tax

cp silva.full_v138_2.tax silva.nr_v138_2.tax
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
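
&lt;p&gt;As a quick sanity check of the header filtering above, here is a minimal sketch on a made-up three-sequence file. The sequence names, identity values, and the file name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy.align&lt;/code&gt; are hypothetical, and awk is swapped in for the grep and cut steps so that the comparison is made only on the second tab-separated field:&lt;/p&gt;

```shell
# Toy stand-in for silva.nr_v138_2.align; IDs, identities, and sequences are made up.
tmp=$(mktemp -d)
printf '>seqA\t100\tBacteria;\nAC-GT\n>seqB\t98.7\tBacteria;\nAC-GA\n>seqC\t100\tArchaea;\nAC-GC\n' > "$tmp/toy.align"

# Keep header lines, then print the name (minus the leading ">") when field 2 equals 100.
seed_ids=$(grep ">" "$tmp/toy.align" | awk -F '\t' '$2 == 100 { print substr($1, 2) }')
echo "$seed_ids"
rm -r "$tmp"
```

&lt;p&gt;Only the two sequences whose second field is 100 should be printed. Running the same kind of check on the first few headers of the real alignment is an easy way to confirm that the accnos file contains what you expect.&lt;/p&gt;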

&lt;h3 id=&quot;taxonomic-representation&quot;&gt;Taxonomic representation&lt;/h3&gt;

&lt;p&gt;Let’s look to see how many different taxa we have for each taxonomic level within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138_2.tax&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v138_2.tax&lt;/code&gt; files. To do this we’ll run the following in R:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;getNumTaxaNames &amp;lt;- function(file, kingdom){
  taxonomy &amp;lt;- read.table(file=file, row.names=1)
  sub.tax &amp;lt;- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  phyla &amp;lt;- sum(!grepl(kingdom, phyla))

  class &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  class &amp;lt;- sum(!grepl(kingdom, class))

  order &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  order &amp;lt;- sum(!grepl(kingdom, order))

  family &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  family &amp;lt;- sum(!grepl(kingdom, family))

  genus &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  genus &amp;lt;- sum(!grepl(kingdom, genus))

  n.seqs &amp;lt;- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}

kingdoms &amp;lt;- c(&quot;Bacteria&quot;, &quot;Archaea&quot;, &quot;Eukaryota&quot;)
tax.levels &amp;lt;- c(&quot;phyla&quot;, &quot;class&quot;, &quot;order&quot;, &quot;family&quot;, &quot;genus&quot;, &quot;n.seqs&quot;)

nr.file &amp;lt;- &quot;silva.nr_v138_2.tax&quot;
nr.matrix &amp;lt;- matrix(rep(0,18), nrow=3)
nr.matrix[1,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) &amp;lt;- kingdoms
colnames(nr.matrix) &amp;lt;- tax.levels
nr.matrix
#           phyla class order family genus n.seqs
# Bacteria     96   249   660   1208  4554 145520
# Archaea      14    32    62    109   251   3744
# Eukaryota   116   343  1077   1863  2694  15032

seed.file &amp;lt;- &quot;silva.seed_v138_2.tax&quot;
seed.matrix &amp;lt;- matrix(rep(0,18), nrow=3)
seed.matrix[1,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) &amp;lt;- kingdoms
colnames(seed.matrix) &amp;lt;- tax.levels
seed.matrix
#          phyla class order family genus n.seqs
#Bacteria     63   148   344    615  1547   6714
#Archaea       8    19    29     38    61    132
#Eukaryota    48   120   311    602   883   1850

seed.matrix / nr.matrix
#              phyla     class     order    family     genus     n.seqs
#Bacteria  0.6562500 0.5943775 0.5212121 0.5091060 0.3397014 0.04613799
#Archaea   0.5714286 0.5937500 0.4677419 0.3486239 0.2430279 0.03525641
#Eukaryota 0.4137931 0.3498542 0.2887651 0.3231347 0.3277654 0.12307078
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Archaea take a beating. Recall that they lost a bunch of sequences in the initial steps because many of the archaeal sequences in SILVA are between 900 and 1200 nt long. If you are interested in analyzing the Archaea and the Eukaryota, I would suggest duplicating my efforts here but modifying the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen.seqs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pcr.seqs&lt;/code&gt; steps to target your region of interest.&lt;/p&gt;

&lt;p&gt;Finally, we want to compress the resulting alignment and this README file into the full length and SEED archives using commands in the bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tar cvzf silva.nr_v138_2.tgz silva.nr_v138_2.tax silva.nr_v138_2.align README.md
tar cvzf silva.seed_v138_2.tgz silva.seed_v138_2.tax silva.seed_v138_2.align README.md
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
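
&lt;p&gt;Before sharing an archive it can be worth listing its contents. The following is a minimal sketch that uses empty placeholder files with made-up names (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy.tax&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy.align&lt;/code&gt;) rather than the real references:&lt;/p&gt;

```shell
# Build a throwaway archive from empty placeholder files (names are hypothetical).
workdir=$(mktemp -d)
touch "$workdir/toy.tax" "$workdir/toy.align" "$workdir/README.md"
tar czf "$workdir/toy.tgz" -C "$workdir" toy.tax toy.align README.md

# List the members of the archive without extracting them.
archive_listing=$(tar tzf "$workdir/toy.tgz")
echo "$archive_listing"
```

&lt;p&gt;The same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tar tzf&lt;/code&gt; call pointed at the real tgz files should list the taxonomy file, the alignment, and the README.&lt;/p&gt;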

&lt;h2 id=&quot;application&quot;&gt;Application&lt;/h2&gt;

&lt;p&gt;So… which to use for what application? If you have the RAM, I’d suggest using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138_2.align&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;align.seqs&lt;/code&gt;. It took about 10 minutes to read in the database file and a minute or so to align 1,000 full-length sequences using a single processor. Here is an example workflow for use within mothur that will get you the V4 region of the 16S rRNA gene:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mothur &quot;#pcr.seqs(fasta=silva.nr_v138_2.align, start=11894, end=25319, keepdots=F);
        unique.seqs()&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will get you down to 106,985 unique sequences to align against. Another trick to consider is using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.lineage&lt;/code&gt; to pull out the reference sequences that are from the Bacteria, although this will probably only reduce the size of the database by ~10%. You could also try using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filter.seqs&lt;/code&gt; with vertical=T; however, that might be problematic if there are insertions in your sequences (you can’t know this &lt;em&gt;a priori&lt;/em&gt;). It’s likely that you can just use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v138_2.align&lt;/code&gt; reference for aligning. For classifying sequences, I would strongly recommend using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138_2.align&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138_2.tax&lt;/code&gt; references after running pcr.seqs on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138_2.align&lt;/code&gt;. I probably wouldn’t advise using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique.seqs&lt;/code&gt; on the output.&lt;/p&gt;

&lt;h2 id=&quot;legalese&quot;&gt;Legalese&lt;/h2&gt;

&lt;p&gt;If you are going to use the files generated in this README, you should be aware that this release is available under &lt;a href=&quot;https://www.arb-silva.de/silva-license-information&quot;&gt;a CC-BY license&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Thu, 26 Sep 2024 00:00:00 +0000</pubDate>
        <link>https://mothur.org/blog/2024/SILVA-v138_2-reference-files/</link>
        <guid isPermaLink="true">https://mothur.org/blog/2024/SILVA-v138_2-reference-files/</guid>
        
        
      </item>
    
      <item>
        <title>README for the greengenes2 2022_10 reference files</title>
        <description>&lt;p&gt;The &lt;a href=&quot;https://github.com/biocore/greengenes2/&quot;&gt;biocore group&lt;/a&gt; released an updated version of the greengenes taxonomy in &lt;a href=&quot;https://ftp.microbio.me/greengenes_release/2022.10/&quot;&gt;October 2022&lt;/a&gt;, which was published in &lt;a href=&quot;https://www.nature.com/articles/s41587-023-01845-1&quot;&gt;Nature Biotechnology&lt;/a&gt;. If you use these files, you should cite McDonald et al.&lt;/p&gt;

&lt;p&gt;Looking through their &lt;a href=&quot;https://ftp.microbio.me/greengenes_release/2022.10/&quot;&gt;ftp site&lt;/a&gt; you will find a number of files that are useful for classifying ASVs, metagenomes, etc. For mothur-based workflows using the naive Bayesian classifier, we are interested in the full-length sequences. It is worth noting that the taxonomy provided with greengenes2 goes out to the species level. This does not mean that you should expect to be able to classify your sequences to the species level. There is general agreement that to achieve that you will need to have genome sequences.&lt;/p&gt;

&lt;p&gt;Here is how I generated the mothur-compatible greengenes2 files.&lt;/p&gt;

&lt;p&gt;We need to download two qza files (these are zip files that are used with QIIME 2). We can download them, extract them, and put them in an easy-to-find location using these bash commands:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{bash}&quot;&gt;wget --no-check-certificate https://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.fna.qza
wget --no-check-certificate https://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.tax.qza

unzip 2022.10.backbone.full-length.fna.qza
unzip 2022.10.backbone.tax.qza

mv */data/dna-sequences.fasta dna-sequences.fasta
mv */data/taxonomy.tsv taxonomy.tsv
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then in R we will read in the fasta and taxonomy files, make sure they contain the same set of sequence names, and join them so that each sequence is paired with its taxonomy:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{R}&quot;&gt;library(tidyverse)
library(vroom)

fasta_fname &amp;lt;- &quot;dna-sequences.fasta&quot;
tax_fname &amp;lt;- &quot;taxonomy.tsv&quot;

f &amp;lt;- vroom_lines(fasta_fname)
indices &amp;lt;- seq_along(f)

seq_data &amp;lt;- tibble(id = f[indices %% 2 == 1],
      seq = f[indices %% 2 == 0]) %&amp;gt;%
  mutate(id = stringi::stri_replace_first_fixed(id, &quot;&amp;gt;&quot;, &quot;&quot;))


tax_data &amp;lt;- vroom(tax_fname, delim = &quot;\t&quot;,
                  col_names = c(&quot;id&quot;, &quot;taxonomy&quot;),
                  col_types = &quot;cc&quot;,
                  skip = 1)


s_t &amp;lt;- anti_join(seq_data, tax_data, by = &quot;id&quot;) %&amp;gt;% nrow(.) == 0
t_s &amp;lt;- anti_join(tax_data, seq_data, by = &quot;id&quot;) %&amp;gt;% nrow(.) == 0
stopifnot(s_t, t_s)

seq_tax_data &amp;lt;- inner_join(seq_data, tax_data, by = &quot;id&quot;)
&lt;/code&gt;&lt;/pre&gt;
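
&lt;p&gt;The parsing above assumes that every record in the fasta file takes up exactly two lines: a header followed by the whole sequence on a single line. If the sequences were wrapped across several lines, the odd/even split would quietly pair the wrong lines. Here is a minimal sketch of a check for that assumption. The function name and the toy vectors are made up for illustration and stand in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;# hypothetical helper: TRUE when odd lines are headers and even lines are sequences
is_two_line_fasta &amp;lt;- function(lines) {
  is_header &amp;lt;- startsWith(lines, &quot;&amp;gt;&quot;)
  expected &amp;lt;- seq_along(lines) %% 2 == 1
  all(length(lines) %% 2 == 0, is_header == expected)
}

# toy data, not taken from the greengenes2 files
toy_ok &amp;lt;- c(&quot;&amp;gt;seq1&quot;, &quot;ACGT&quot;, &quot;&amp;gt;seq2&quot;, &quot;GGCC&quot;)
toy_wrapped &amp;lt;- c(&quot;&amp;gt;seq1&quot;, &quot;ACGT&quot;, &quot;ACGT&quot;, &quot;&amp;gt;seq2&quot;)

stopifnot(is_two_line_fasta(toy_ok))
stopifnot(!is_two_line_fasta(toy_wrapped))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the workflow above, calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stopifnot(is_two_line_fasta(f))&lt;/code&gt; right after reading &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f&lt;/code&gt; would be one way to catch a wrapped file before building &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seq_data&lt;/code&gt;.&lt;/p&gt;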

&lt;p&gt;Let’s look at the data a bit…&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;parsed_tax &amp;lt;- seq_tax_data %&amp;gt;%
  select(id, taxonomy) %&amp;gt;%
  separate(taxonomy, sep = &quot;; &quot;, into = c(&quot;k&quot;, &quot;p&quot;, &quot;c&quot;, &quot;o&quot;, &quot;f&quot;, &quot;g&quot;, &quot;s&quot;))

count(parsed_tax, k)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are 329,175 bacterial and 2,094 archaeal sequences. Here is the number of named taxa at each level, from kingdom through species, along with the number of taxa at each level that are represented by only one sequence.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;parsed_tax %&amp;gt;%
  pivot_longer(-id) %&amp;gt;%
  filter(!str_detect(value, &quot;__$&quot;)) %&amp;gt;%
  nest(data = -name) %&amp;gt;%
  mutate(summary = map_dfr(data, \(x){
    z &amp;lt;- count(x, value)
    tibble(n_taxa = nrow(z), n_singletons = sum(z$n == 1))
  })) %&amp;gt;%
  select(name, summary) %&amp;gt;%
  unnest(summary)
&lt;/code&gt;&lt;/pre&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# A tibble: 7 × 3
  name  n_taxa n_singletons
  &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;        &amp;lt;int&amp;gt;
1 k          2            0
2 p        131            2
3 c        342           22
4 o       1054          126
5 f       2224          376
6 g       8019         2006
7 s      22929        10278
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s the number of sequences with a name at each taxonomic level:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;parsed_tax %&amp;gt;%
  pivot_longer(-id) %&amp;gt;%
  mutate(name = factor(name, levels = c(&quot;k&quot;, &quot;p&quot;, &quot;c&quot;, &quot;o&quot;, &quot;f&quot;, &quot;g&quot;, &quot;s&quot;))) %&amp;gt;%
  filter(!str_detect(value, &quot;__$&quot;)) %&amp;gt;%
  count(name)
&lt;/code&gt;&lt;/pre&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# A tibble: 7 × 2
  name       n
  &amp;lt;fct&amp;gt;  &amp;lt;int&amp;gt;
1 k     331269
2 p     331216
3 c     331006
4 o     328980
5 f     322566
6 g     291792
7 s     201702
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;About 61% of sequences have a species-level name. Not all sequences have genus-level names and not all genera have species-level names.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;parsed_tax %&amp;gt;%
  select(g, s) %&amp;gt;%
  distinct() %&amp;gt;%
  summarize(n_species = sum(s != &quot;s__&quot;), .by = g) %&amp;gt;% 
  count(n_species, name = &quot;n_genera&quot;) %&amp;gt;%
  select(n_genera, n_species) %&amp;gt;%
  print(n = Inf)
&lt;/code&gt;&lt;/pre&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;   n_genera n_species
      &amp;lt;int&amp;gt;     &amp;lt;int&amp;gt;
 1        3         0
 2     5373         1
 3     1026         2
 4      487         3
 5      275         4
 6      169         5
 7      113         6
 8       92         7
 9       74         8
10       45         9
11       42        10
12       44        11
13       31        12
14       23        13
15       21        14
16       17        15
17       13        16
18        5        17
19       17        18
20       11        19
21        4        20
22        1        21
23       12        22
24        7        23
25        8        24
26        7        25
27        5        26
28        6        27
29        4        28
30        1        29
31        7        30
32        4        31
33        2        32
34        5        33
35        5        34
36        3        37
37        4        38
38        2        39
39        2        40
40        1        41
41        2        42
42        3        43
43        1        44
44        1        45
45        1        46
46        1        47
47        1        48
48        1        50
49        5        52
50        1        56
51        1        57
52        2        59
53        1        61
54        2        66
55        1        71
56        1        72
57        2        73
58        1        74
59        3        76
60        1        79
61        1        80
62        1        83
63        1        84
64        1        86
65        1        89
66        1        92
67        1       104
68        1       107
69        1       119
70        1       120
71        2       139
72        1       150
73        1       165
74        1       199
75        1       267
76        1       456
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This tells us that there are 3 genera with no species and 5373 with only one species; there’s also one genus with 456 species. This tells me that if a sequence classifies to one of the 5373 single-species genera, it will necessarily get that species name; that’s about two-thirds of the genera. I worry that this will give an unrealistic sense of specificity for sequences. With that in mind, I will post the version of the greengenes2 database without species-level names. The code for generating the files with the species-level names is also included below in case you want to go out on that limb.&lt;/p&gt;
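
&lt;p&gt;For reference, the two proportions quoted above can be recovered from the tables. This is only the arithmetic, with the counts copied by hand from the output shown earlier; using the 8,019 named genera as the denominator is my assumption:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;# counts copied from the tables above
n_seqs &amp;lt;- 331269                 # all sequences (329,175 bacterial + 2,094 archaeal)
n_seqs_with_species &amp;lt;- 201702    # sequences with a species-level name
n_genera &amp;lt;- 8019                 # named genera (assumed denominator)
n_single_species_genera &amp;lt;- 5373  # genera with exactly one species

# percent of sequences with a species-level name
round(100 * n_seqs_with_species / n_seqs)

# percent of genera that contain exactly one species
round(100 * n_single_species_genera / n_genera)
&lt;/code&gt;&lt;/pre&gt;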

&lt;p&gt;Let’s do the one with species-level names first:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;seq_tax_data %&amp;gt;%
  select(id, taxonomy) %&amp;gt;%
  mutate(
    taxonomy = stringi::stri_replace_all_regex(taxonomy, &quot; &quot;, &quot;&quot;),
    taxonomy = stringi::stri_replace_all_regex(taxonomy, &quot;$&quot;, &quot;;&quot;)
  ) %&amp;gt;%
  write_tsv(&quot;greengenes2_2020_10.w_sp.taxonomy&quot;, col_names = FALSE)

seq_tax_data %&amp;gt;%
  select(id, seq) %&amp;gt;%
  mutate(id = stringi::stri_replace_first_regex(id, &quot;^&quot;, &quot;&amp;gt;&quot;)) %&amp;gt;%
  unite(fasta, id, seq, sep = &quot;\n&quot;) %&amp;gt;%
  write_tsv(&quot;greengenes2_2020_10.w_sp.fasta&quot;, col_names = FALSE)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let’s do the one without species-level names next:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;seq_tax_data %&amp;gt;%
  select(id, taxonomy) %&amp;gt;%
  mutate(
    taxonomy = stringi::stri_replace_all_regex(taxonomy, &quot; &quot;, &quot;&quot;),
    taxonomy = stringi::stri_replace_all_regex(taxonomy, &quot;s__.*&quot;, &quot;&quot;)
  ) %&amp;gt;%
  write_tsv(&quot;greengenes2_2020_10.wo_sp.taxonomy&quot;, col_names = FALSE)

seq_tax_data %&amp;gt;%
  select(id, seq) %&amp;gt;%
  mutate(id = stringi::stri_replace_first_regex(id, &quot;^&quot;, &quot;&amp;gt;&quot;)) %&amp;gt;%
  unite(fasta, id, seq, sep = &quot;\n&quot;) %&amp;gt;%
  write_tsv(&quot;greengenes2_2020_10.wo_sp.fasta&quot;, col_names = FALSE)
&lt;/code&gt;&lt;/pre&gt;
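
&lt;p&gt;To see what the two sets of replacements do, here is a minimal sketch that applies them to a single taxonomy string. The example string is made up for illustration and only mimics the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;; &lt;/code&gt;-separated layout that was split on earlier; it is not copied from the greengenes2 files:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{r}&quot;&gt;# made-up taxonomy string for illustration
example &amp;lt;- paste(&quot;d__Bacteria&quot;, &quot;p__Proteobacteria&quot;, &quot;c__Gammaproteobacteria&quot;,
                 &quot;o__Enterobacterales&quot;, &quot;f__Enterobacteriaceae&quot;,
                 &quot;g__Escherichia&quot;, &quot;s__Escherichia coli&quot;, sep = &quot;; &quot;)

# both versions start by removing every space
no_spaces &amp;lt;- stringi::stri_replace_all_regex(example, &quot; &quot;, &quot;&quot;)

# with species: add the trailing semicolon
w_sp &amp;lt;- stringi::stri_replace_all_regex(no_spaces, &quot;$&quot;, &quot;;&quot;)

# without species: drop everything from the species field onward
wo_sp &amp;lt;- stringi::stri_replace_all_regex(no_spaces, &quot;s__.*&quot;, &quot;&quot;)

print(w_sp)
print(wo_sp)

# both versions should end in a semicolon
stopifnot(endsWith(w_sp, &quot;s__Escherichiacoli;&quot;))
stopifnot(endsWith(wo_sp, &quot;g__Escherichia;&quot;))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that removing every space also removes the space inside a species name, so in this example the species field in the with-species version reads &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s__Escherichiacoli&lt;/code&gt;.&lt;/p&gt;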

&lt;p&gt;Let’s package it all together…&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-{bash}&quot;&gt;mkdir greengenes2_2020_10.wo_sp
mv greengenes2_2020_10.wo_sp.* greengenes2_2020_10.wo_sp
cp README.md greengenes2_2020_10.wo_sp
tar cvzf greengenes2_2020_10.wo_sp.tgz greengenes2_2020_10.wo_sp
&lt;/code&gt;&lt;/pre&gt;
</description>
        <pubDate>Tue, 25 Jun 2024 00:00:00 +0000</pubDate>
        <link>https://mothur.org/blog/2024/greengenes2_2020_10/</link>
        <guid isPermaLink="true">https://mothur.org/blog/2024/greengenes2_2020_10/</guid>
        
        
      </item>
    
      <item>
        <title>README for the RDP v19 reference files</title>
        <description>&lt;p&gt;The good people at the &lt;a href=&quot;http://rdp.cme.msu.edu&quot;&gt;RDP&lt;/a&gt; have released a new version of the RDP database. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the &lt;a href=&quot;http://mothur.org/wiki/RDP_reference_files&quot;&gt;mothur-compatible reference files&lt;/a&gt;. The original files are available from the RDP’s &lt;a href=&quot;http://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/&quot;&gt;SourceForge server&lt;/a&gt; and were used as the starting point for this README.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;http://rdp.cme.msu.edu/misc/rel10info.jsp#release11_history&quot;&gt;release notes&lt;/a&gt; indicate the following:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The Bacteria and Archaea hierarchy model used by RDP Classifier has been updated to training set No. 19. The new version has over 600 new genera and 2500 new species added since last version No. 18 released in July 2020. The information that is used to update the RDP taxonomy to training set version No. 19, and RDP Classifier version 2.14 came from publicly available scientific articles and public sequence repository, mostly from International Journal of Systematic and Evolutionary Microbiology (IJSEM), the All-Species Living Tree Project (LTP) and GenBank.&lt;/p&gt;

  &lt;p&gt;It is worth noting that most of the phyla have new names, according to article “
Oren A, Garrity GM. Valid publication of the names of forty-two phyla of prokaryotes. Int J Syst Evol Microbiol. 2021 Oct;71(10). doi: 10.1099/ijsem.0.005056. PMID: 34694987.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s get going…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-rf&lt;/span&gt; RDPClassifier_16S_trainsetNo19_rawtrainingdata&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;

wget &lt;span class=&quot;nt&quot;&gt;-N&lt;/span&gt; http://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/RDPClassifier_16S_trainsetNo19_rawtrainingdata.zip
unzip &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; RDPClassifier_16S_trainsetNo19_rawtrainingdata.zip
&lt;span class=&quot;nb&quot;&gt;mv &lt;/span&gt;RDPClassifier_16S_trainsetNo19_rawtrainingdata/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; ./&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now we’d like to start to form the taxonomy file and the fasta file that will be our reference. Again, using bash commands…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&amp;gt;&quot;&lt;/span&gt; trainset19_072023_speciesrank.fa | &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; 2- &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; trainset19_072023_rmdup.tax
&lt;span class=&quot;nb&quot;&gt;cp &lt;/span&gt;trainset19_072023_speciesrank.fa trainset19_072023.rdp.fasta&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Next, we’d like to get our taxonomy file properly formatted. First we’ll read in the taxonomy data. Then we’ll output the taxonomy data to a file we’ll call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trainset19_072023.rdp.tax&lt;/code&gt; to have a consistent naming scheme with previous versions of those files. The following steps are done in R…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tidyverse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;incertae_sedis&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_fix&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_detect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;domain__&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)][&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;is.na&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_fix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_fix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;if_else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_detect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_fix&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;incertae_sedis&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_fix&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                      &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_fix&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;-1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_incertae_sedis&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;incertae_sedis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse_taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

  &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;domain&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.*domain__([^;]*);.*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;phylum&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.* phylum__([^;]*);.*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.* class__([^;]*);.*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.* order__([^;]*);.*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;family&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.* family__([^;]*);.*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;genus&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.* genus__([^;]*).*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;incertae_sedis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as_tibble_row&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mutate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;across&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;phylum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;genus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                              &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str_replace_all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;.x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                                          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replacement&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_tsv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;trainset19_072023_rmdup.tax&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;accession&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;species_strain&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;taxonomy&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                    &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_types&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cols&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;.default&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_character&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;accession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_data&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mutate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;.data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse_taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unnest&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mutate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;domain&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;phylum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                          &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;family&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;genus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;accession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&amp;gt;%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write_tsv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;trainset19_072023.rdp.tax&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col_names&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quote&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;none&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The RDP training sets do not include mitochondria or sequences from eukaryotes. We find that it is helpful to have these sequences because we can get non-specific amplification at times and would like to be able to remove these lineages. Let’s go ahead and pull down the pds version of training set v.10 and copy those sequences over to our new training set. The following steps will be done in bash:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;wget &lt;span class=&quot;nt&quot;&gt;-N&lt;/span&gt; https://mothur.s3.us-east-2.amazonaws.com/wiki/trainset10_082014.pds.tgz
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;xvzf trainset10_082014.pds.tgz
&lt;span class=&quot;nb&quot;&gt;mv &lt;/span&gt;trainset10_082014.pds/trainset10_082014&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; ./
&lt;span class=&quot;nb&quot;&gt;rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-rf&lt;/span&gt; trainset10_082014.pds trainset10_082014.pds.tgz&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now let’s run a mothur command to pull out the extra sequences that are in the pds files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;mothur &lt;span class=&quot;s2&quot;&gt;&quot;#get.lineage(fasta=trainset10_082014.pds.fasta, taxonomy=trainset10_082014.pds.tax, taxon=Eukaryota-Mitochondria)&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This last command gets us the extra “pds” sequences, which we can now append to the end of the normal RDP training set:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;trainset19_072023.rdp.tax trainset10_082014.pds.pick.tax &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; trainset19_072023.pds.tax
&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;trainset19_072023.rdp.fasta trainset10_082014.pds.pick.fasta &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; trainset19_072023.pds.fasta&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
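
&lt;p&gt;Because the taxonomy and fasta files are concatenated separately, it can be worth confirming that they still describe the same number of sequences. The following is a minimal sketch of such a check, not part of the original workflow; it assumes one taxonomy line per sequence and one header line per fasta record:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# count the fasta records (header lines) and the taxonomy lines; the two numbers should match
grep -c &quot;^&amp;gt;&quot; trainset19_072023.pds.fasta
wc -l trainset19_072023.pds.tax&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;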

&lt;p&gt;While we’ve got the old version of the training set, it might be nice to see what the differences are. It would have been nice for them to provide a README indicating what changed, but, well, no, they didn’t.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;wc&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.pds.tax&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##   10773 trainset10_082014.pds.tax
##   24765 trainset19_072023.pds.tax
##   35538 total&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now we’re ready to compress the taxonomy files. First we do the RDP files…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;trainset19_072023.rdp
&lt;span class=&quot;nb&quot;&gt;cp &lt;/span&gt;README.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; trainset19_072023.rdp.fasta trainset19_072023.rdp.tax trainset19_072023.rdp
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;cvzf trainset19_072023.rdp.tgz  trainset19_072023.rdp/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;##   a trainset19_072023.rdp/README.md&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;##   a trainset19_072023.rdp/trainset19_072023.rdp.fasta&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;##   a trainset19_072023.rdp/trainset19_072023.rdp.tax&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;… and then the pds files…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;trainset19_072023.pds
&lt;span class=&quot;nb&quot;&gt;cp &lt;/span&gt;README.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; trainset19_072023.pds.fasta trainset19_072023.pds.tax trainset19_072023.pds
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;cvzf trainset19_072023.pds.tgz  trainset19_072023.pds/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;##   a trainset19_072023.pds/README.md&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;##   a trainset19_072023.pds/trainset19_072023.pds.fasta&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;##   a trainset19_072023.pds/trainset19_072023.pds.tax&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

</description>
        <pubDate>Tue, 12 Mar 2024 00:00:00 +0000</pubDate>
        <link>https://mothur.org/blog/2024/RDP-v19-reference-files/</link>
        <guid isPermaLink="true">https://mothur.org/blog/2024/RDP-v19-reference-files/</guid>
        
        
      </item>
    
      <item>
        <title>README for the SILVA v138.1 reference files</title>
        <description>&lt;p&gt;The good people at &lt;a href=&quot;http://arb-silva.de&quot;&gt;SILVA&lt;/a&gt; have released a new version of the SILVA v138 database. My understanding is that this is a minor update to correct some taxonomic information. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the &lt;a href=&quot;http://www.mothur.org/wiki/Silva_reference_files&quot;&gt;mothur-compatible reference files&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;curation-of-references&quot;&gt;Curation of references&lt;/h2&gt;

&lt;h3 id=&quot;getting-the-data-in-and-out-of-the-arb-database&quot;&gt;Getting the data in and out of the ARB database&lt;/h3&gt;

&lt;p&gt;This README file explains how we generated the silva reference files for use with mothur’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;classify.seqs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;align.seqs&lt;/code&gt; commands. I’ll assume that you have a functioning copy of arb installed on your computer. For this README we are using version 6.0. First we need to download the database and decompress it. From the command line we do the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget -N https://www.arb-silva.de/fileadmin/arb_web_db/release_138_1/ARB_files/SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz
gunzip SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz
arb SILVA_138.1_SSURef_NR99_12_06_20_opt.arb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will launch us into the arb environment with the “Ref NR 99” database opened. This database has 510,508 sequences within it that are not more than 99% similar to each other. The release notes for &lt;a href=&quot;http://www.arb-silva.de/documentation/release-1381/&quot;&gt;this database&lt;/a&gt; as well as the idea behind the &lt;a href=&quot;http://www.arb-silva.de/projects/ssu-ref-nr/&quot;&gt;non-redundant database&lt;/a&gt; are available from the silva website. Within arb do the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Click the search button&lt;/li&gt;
  &lt;li&gt;Set the first search field to ‘ARB_color’ and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)&lt;/li&gt;
  &lt;li&gt;Click ‘Search’. This yielded 446,881 hits&lt;/li&gt;
  &lt;li&gt;Click the “Mark Listed Unmark Rest” button&lt;/li&gt;
  &lt;li&gt;Close the “Search and Query” box&lt;/li&gt;
  &lt;li&gt;Now click on File-&amp;gt;export-&amp;gt;export to external format&lt;/li&gt;
  &lt;li&gt;In this box the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Export&lt;/code&gt; option should be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;marked&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Filter&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;none&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Compression&lt;/code&gt; should be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;no&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;In the field for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Choose an output file name&lt;/code&gt;, make sure the path has you in the correct working directory and enter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.full_v138_1.fasta&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Select a format: fasta_mothur.eft. This is a custom formatting file that I have created that includes the sequence’s accession number and its taxonomy across the top line. To create one for yourself, you will need to create &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fasta_mothur.eft&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$ARBHOME/lib/export/&lt;/code&gt; folder with the following:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SUFFIX          fasta
BEGIN
&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;acc&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;name&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;align_ident_slv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;tax_slv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;|export_sequence&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Save this as silva.full_v138_1.fasta&lt;/li&gt;
  &lt;li&gt;You can now quit arb.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;screening-the-sequences&quot;&gt;Screening the sequences&lt;/h3&gt;

&lt;p&gt;Now we need to screen the sequences for those that span the 27f and 1492r primer region, have 5 or fewer ambiguous base calls, and that are unique. We’ll also extract the taxonomic information from the header line. Run the following commands from a bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mothur &quot;#screen.seqs(fasta=silva.full_v138_1.fasta, start=1044, end=43116, maxambig=5);
        pcr.seqs(start=1044, end=43116, keepdots=T);
        degap.seqs();
        unique.seqs();&quot;

#identify the unique sequences without regard to their alignment
grep &quot;&amp;gt;&quot; silva.full_v138_1.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- &amp;gt; silva.full_v138_1.good.pcr.ng.unique.accnos

#get the unique sequences without regard to their alignment
mothur &quot;#get.seqs(fasta=silva.full_v138_1.good.pcr.fasta, accnos=silva.full_v138_1.good.pcr.ng.unique.accnos)&quot;

#generate alignment file
mv silva.full_v138_1.good.pcr.pick.fasta silva.nr_v138_1.align

#generate taxonomy file
grep &apos;&amp;gt;&apos; silva.nr_v138_1.align | cut -f1,3 | cut -f2 -d&apos;&amp;gt;&apos; &amp;gt; silva.nr_v138.full
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The mothur commands above do several things. First, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen.seqs&lt;/code&gt; command removes sequences that are not full length or that have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB NR reference database lets in shorter (&amp;gt;900 bp) archaeal 16S rRNA gene sequences. Second, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pcr.seqs&lt;/code&gt; converts any base calls that occur before position 1044 and after 43116 to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.&lt;/code&gt; to make them only span the region between the 27f and 1492r priming sites. Finally, it is possible that weird things happen in the alignments and so we unalign the sequences (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;degap.seqs&lt;/code&gt;) and identify the unique sequences (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique.seqs&lt;/code&gt;). We then convert the resulting fasta file into an accnos file so that we can go back into mothur and pull out the unique sequences from the aligned file (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.seqs&lt;/code&gt;).&lt;/p&gt;

&lt;h3 id=&quot;formatting-the-taxonomy-files&quot;&gt;Formatting the taxonomy files&lt;/h3&gt;

&lt;p&gt;Now we want to make sure the taxonomy file is properly formatted for use with mothur. First we want to grab the SILVA taxa mapping file by running the following in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bash&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/tax_slv_ssu_138.1.txt.gz
&lt;span class=&quot;nb&quot;&gt;gunzip &lt;/span&gt;tax_slv_ssu_138.1.txt.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Thanks to &lt;a href=&quot;https://forum.mothur.org/t/18s-rdna-classification-issues/2303/14&quot;&gt;Eric Collins at the University of Alaska Fairbanks&lt;/a&gt;, we have some nice R code to map all of the taxa names to the six Linnean levels (kingdom, phylum, class, order, family, and genus). We’ll run the following code from within R:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;map.in &amp;lt;- read.table(&quot;tax_slv_ssu_138.1.txt&quot;,header=F,sep=&quot;\t&quot;,stringsAsFactors=F)
map.in &amp;lt;- map.in[,c(1,3)]
colnames(map.in) &amp;lt;- c(&quot;taxlabel&quot;,&quot;taxlevel&quot;)
# map.in &amp;lt;- rbind(map.in, c(&quot;Bacteria;RsaHf231;&quot;, &quot;phylum&quot;)) #wasn&apos;t in tax_slv_ssu_138.txt

#fix Escherichia nonsense
# map.in$taxlevel[which(map.in$taxlabel==&quot;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;&quot;)] &amp;lt;- &quot;genus&quot;

taxlevels &amp;lt;- c(&quot;root&quot;,&quot;domain&quot;,&quot;major_clade&quot;,&quot;superkingdom&quot;,&quot;kingdom&quot;,&quot;subkingdom&quot;,&quot;infrakingdom&quot;,&quot;superphylum&quot;,&quot;phylum&quot;,&quot;subphylum&quot;,&quot;infraphylum&quot;,&quot;superclass&quot;,&quot;class&quot;,&quot;subclass&quot;,&quot;infraclass&quot;,&quot;superorder&quot;,&quot;order&quot;,&quot;suborder&quot;,&quot;superfamily&quot;,&quot;family&quot;,&quot;subfamily&quot;,&quot;genus&quot;)
taxabb &amp;lt;- c(&quot;ro&quot;,&quot;do&quot;,&quot;mc&quot;,&quot;pk&quot;,&quot;ki&quot;,&quot;bk&quot;,&quot;ik&quot;,&quot;pp&quot;,&quot;ph&quot;,&quot;bp&quot;,&quot;ip&quot;,&quot;pc&quot;,&quot;cl&quot;,&quot;bc&quot;,&quot;ic&quot;,&quot;po&quot;,&quot;or&quot;,&quot;bo&quot;,&quot;pf&quot;,&quot;fa&quot;,&quot;bf&quot;,&quot;ge&quot;)
tax.mat &amp;lt;- matrix(data=&quot;&quot;,nrow=nrow(map.in),ncol=length(taxlevels))
tax.mat[,1] &amp;lt;- &quot;root&quot;
colnames(tax.mat) &amp;lt;- taxlevels

outlevels &amp;lt;- c(&quot;domain&quot;,&quot;phylum&quot;,&quot;class&quot;,&quot;order&quot;,&quot;family&quot;,&quot;genus&quot;)

for(i in 1:nrow(map.in)) {
	taxname &amp;lt;- unlist(strsplit(as.character(map.in[i,1]), split=&apos;;&apos;))
	#print(taxname);

	while ( length(taxname) &amp;gt; 0) {
		#regex to look for exact match

		tax.exp &amp;lt;- paste(paste(taxname,collapse=&quot;;&quot;),&quot;;&quot;,sep=&quot;&quot;)
		tax.match &amp;lt;- match(tax.exp,map.in$taxlabel)
		tax.mat[i,map.in[tax.match,2]] &amp;lt;- tail(taxname,1)
		taxname &amp;lt;- head(taxname,-1)
	}
}

for(i in 1:nrow(tax.mat)) {
	#this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
	#if you don&apos;t want this behavior, cut it out
	for(j in 1:ncol(tax.mat)) {
		if(tax.mat[i,j] == &quot;&quot;) { tax.mat[i,j] &amp;lt;- paste(tmptax,taxabb[j],sep=&quot;_&quot;)}
		else { tmptax &amp;lt;- tax.mat[i,j]}
	}

	#this maps the new name to the input taxonomic levels
	map.in[i,&quot;taxout&quot;] &amp;lt;- paste(paste(tax.mat[i,outlevels],collapse=&quot;;&quot;),&quot;;&quot;,sep=&quot;&quot;)
}

# replace spaces with underscores
map.in$taxout &amp;lt;- gsub(&quot; &quot;,&quot;_&quot;,map.in$taxout)

# bring in the old taxonomic levels from SILVA and remap them using the new levels
tax.in &amp;lt;- read.table(&quot;silva.nr_v138.full&quot;,header=F,stringsAsFactors=F,sep=&quot;\t&quot;)
colnames(tax.in) &amp;lt;- c(&quot;taxid&quot;,&quot;taxlabel&quot;)

# The following (commented-out) line corrects the Bacteria;Bacteroidetes;Bacteroidia;Flavobacteriales;Flavobacteriaceae;Polaribacter;Polaribacter; problem
# tax.in$taxlabel &amp;lt;- gsub(&quot;Polaribacter;Polaribacter;&quot;, &quot;Polaribacter;&quot;, tax.in$taxlabel)
tax.in$taxlabel &amp;lt;- gsub(&quot;;[[:space:]]+$&quot;, &quot;;&quot;, tax.in$taxlabel)

tax.in$id &amp;lt;- 1:nrow(tax.in)

tax.write &amp;lt;- merge(tax.in,map.in,all.x=T,sort=F)
tax.write &amp;lt;- tax.write[order(tax.write$id),]


#we want to see whether everything has 6 taxonomic levels (kingdom to genus)
getDepth &amp;lt;- function(taxonString){
	initial &amp;lt;- nchar(taxonString)
	removed &amp;lt;- nchar(gsub(&quot;;&quot;, &quot;&quot;, taxonString))
	return(initial-removed)
}

depth &amp;lt;- getDepth(tax.write$taxout)
summary(depth) #should all be 6 and there should be no NAs
bacteria &amp;lt;- grepl(&quot;Bacteria;&quot;, tax.write$taxout)
archaea &amp;lt;- grepl(&quot;Archaea;&quot;, tax.write$taxout)
eukarya &amp;lt;- grepl(&quot;Eukaryota;&quot;, tax.write$taxout)

tax.write[depth &amp;gt; 6 &amp;amp; bacteria,] #if zero, we&apos;re good to go
tax.write[depth &amp;gt; 6 &amp;amp; archaea,]  #if zero, we&apos;re good to go
tax.write[depth &amp;gt; 6 &amp;amp; eukarya,]  #if zero, we&apos;re good to go

write.table(tax.write[,c(&quot;taxid&quot;,&quot;taxout&quot;)], file=&quot;silva.full_v138_1.tax&quot;,sep=&quot;\t&quot;,row.names=F,quote=F,col.names=F)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
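
&lt;p&gt;The same depth check can be repeated from bash on the file that was just written. This is a minimal sketch rather than part of the original workflow; it assumes the two-column, tab-separated layout produced by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;write.table&lt;/code&gt; above and prints any line whose taxonomy string does not contain exactly six semicolons:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# print taxonomy lines that do not have exactly six semicolon-terminated levels
# (no output means every sequence has a six-level taxonomy)
awk -F&apos;\t&apos; &apos;{ t = $2; n = gsub(/;/, &quot;&quot;, t); if (n != 6) print }&apos; silva.full_v138_1.tax
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;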

&lt;h3 id=&quot;building-the-seed-references&quot;&gt;Building the SEED references&lt;/h3&gt;

&lt;p&gt;The first thing to note is that SILVA does not release their SEED; it is private. By screening through the ARB databases we can attempt to recreate it. Our previous publications show that align.seqs with the recreated SEED does an excellent job of realigning sequences to look like they would if you used SINA and the true SEED. Now we want to try to figure out which sequences are part of the seed. Earlier, when we exported the sequences from ARB, we included the align_ident_slv field from the database in our output. Let’s generate an accnos file that contains the names of the sequences with 100% identity to the SEED database and then use mothur to generate SEED fasta and taxonomy files. While we’re at it we’ll also generate the nr_v138_1 taxonomy file. The following code will be run from within a bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;grep &quot;&amp;gt;&quot; silva.nr_v138_1.align | cut -f 1,2 | grep &quot;\t100&quot; | cut -f 1 | cut -c 2- &amp;gt; silva.seed_v138.accnos
mothur &quot;#get.seqs(fasta=silva.nr_v138_1.align, taxonomy=silva.full_v138_1.tax, accnos=silva.seed_v138.accnos)&quot;
mv silva.nr_v138_1.pick.align silva.seed_v138_1.align
mv silva.full_v138_1.pick.tax silva.seed_v138_1.tax

mothur &quot;#get.seqs(taxonomy=silva.full_v138_1.tax, accnos=silva.full_v138_1.good.pcr.ng.unique.accnos)&quot;
mv silva.full_v138_1.pick.tax silva.nr_v138_1.tax
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
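
&lt;p&gt;Whether &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grep &quot;\t100&quot;&lt;/code&gt; matches a tab followed by 100 depends on the grep implementation; some versions do not treat &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\t&lt;/code&gt; as a tab. If the accnos file comes out empty or looks wrong, an equivalent filter using awk is sketched here. It assumes the header layout defined in fasta_mothur.eft (accession and name, a tab, align_ident_slv, a tab, taxonomy) and is not the command that was used to build the released files:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# keep headers whose second tab-separated field (align_ident_slv) starts with 100,
# then strip the leading &quot;&amp;gt;&quot; from the first field
awk -F&apos;\t&apos; &apos;/^&amp;gt;/ &amp;amp;&amp;amp; $2 ~ /^100/ { print substr($1, 2) }&apos; silva.nr_v138_1.align &amp;gt; silva.seed_v138.accnos
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;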

&lt;h3 id=&quot;taxonomic-representation&quot;&gt;Taxonomic representation&lt;/h3&gt;

&lt;p&gt;Let’s look to see how many different taxa we have for each taxonomic level within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138_1.tax&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v138_1.tax&lt;/code&gt; files. To do this we’ll run the following in R:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;getNumTaxaNames &amp;lt;- function(file, kingdom){
  taxonomy &amp;lt;- read.table(file=file, row.names=1)
  sub.tax &amp;lt;- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  phyla &amp;lt;- sum(!grepl(kingdom, phyla))

  class &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  class &amp;lt;- sum(!grepl(kingdom, class))

  order &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  order &amp;lt;- sum(!grepl(kingdom, order))

  family &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  family &amp;lt;- sum(!grepl(kingdom, family))

  genus &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  genus &amp;lt;- sum(!grepl(kingdom, genus))

  n.seqs &amp;lt;- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}

kingdoms &amp;lt;- c(&quot;Bacteria&quot;, &quot;Archaea&quot;, &quot;Eukaryota&quot;)
tax.levels &amp;lt;- c(&quot;phyla&quot;, &quot;class&quot;, &quot;order&quot;, &quot;family&quot;, &quot;genus&quot;, &quot;n.seqs&quot;)

nr.file &amp;lt;- &quot;silva.nr_v138_1.tax&quot;
nr.matrix &amp;lt;- matrix(rep(0,18), nrow=3)
nr.matrix[1,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) &amp;lt;- kingdoms
colnames(nr.matrix) &amp;lt;- tax.levels
nr.matrix
#          phyla class order family genus n.seqs
#Bacteria     87   238   631   1139  3955 128884
#Archaea      15    33    57     97   222   2846
#Eukaryota    92   243   644    871  2682  14871


seed.file &amp;lt;- &quot;silva.seed_v138_1.tax&quot;
seed.matrix &amp;lt;- matrix(rep(0,18), nrow=3)
seed.matrix[1,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) &amp;lt;- kingdoms
colnames(seed.matrix) &amp;lt;- tax.levels
seed.matrix
#          phyla class order family genus n.seqs
#Bacteria     51   123   299    523  1182   5736
#Archaea       7    17    23     30    44     81
#Eukaryota    40    98   272    422   855   1824

seed.matrix / nr.matrix
#              phyla     class     order    family     genus     n.seqs
#Bacteria  0.5862069 0.5168067 0.4738510 0.4591747 0.2988622 0.04450514
#Archaea   0.4666667 0.5151515 0.4035088 0.3092784 0.1981982 0.02846100
#Eukaryota 0.4347826 0.4032922 0.4223602 0.4845006 0.3187919 0.12265483
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Archaea take a beating; recall that they lost a bunch of sequences in the initial steps since many of the archaeal sequences in SILVA are between 900 and 1200 nt long. If you are interested in analyzing the Archaea and the Eukaryota, I would suggest duplicating my efforts here but modifying the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen.seqs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pcr.seqs&lt;/code&gt; steps to target your region of interest.&lt;/p&gt;

&lt;p&gt;Finally, we want to compress the resulting alignment and this README file into the full length and SEED archives using commands in the bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tar cvzf silva.nr_v138_1.tgz silva.nr_v138_1.tax silva.nr_v138_1.align README.md
tar cvzf silva.seed_v138_1.tgz silva.seed_v138_1.tax silva.seed_v138_1.align README.md
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;application&quot;&gt;Application&lt;/h2&gt;

&lt;p&gt;So… which to use for what application? If you have the RAM, I’d suggest using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138_1.align&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;align.seqs&lt;/code&gt;. It took about 10 minutes to read in the database file and a minute or so to align 1,000 full-length sequences using a single processor. Here is an example workflow for use within mothur that will get you the V4 region of the 16S rRNA gene:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mothur &quot;#pcr.seqs(fasta=silva.nr_v138_1.align, start=11894, end=25319, keepdots=F);
        unique.seqs()&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will get you down to 106,985 unique sequences to then align against. Another trick to consider would be to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.lineage&lt;/code&gt; to pull out the reference sequences that are from the Bacteria, although this will probably only reduce the size of the database by ~10%. You could also try using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filter.seqs&lt;/code&gt; with vertical=T; however, that might be problematic if there are insertions in your sequences (can’t know &lt;em&gt;a priori&lt;/em&gt;). It’s likely that you can just use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v138_1.align&lt;/code&gt; reference for aligning. For classifying sequences, I would strongly recommend using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138_1.align&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138_1.tax&lt;/code&gt; references after running pcr.seqs on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138_1.align&lt;/code&gt;. I probably wouldn’t advise using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique.seqs&lt;/code&gt; on the output.&lt;/p&gt;
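
&lt;p&gt;As a rough sketch of the classification recommendation above, the commands might look like the following. The file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;my_seqs.fasta&lt;/code&gt; is a hypothetical placeholder for your own sequences, the coordinates are the V4 coordinates from the example above, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cutoff=80&lt;/code&gt; is an illustrative choice rather than a setting taken from this README:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mothur &quot;#pcr.seqs(fasta=silva.nr_v138_1.align, start=11894, end=25319, keepdots=F);
        classify.seqs(fasta=my_seqs.fasta, reference=silva.nr_v138_1.pcr.align, taxonomy=silva.nr_v138_1.tax, cutoff=80)&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;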

&lt;h2 id=&quot;legalese&quot;&gt;Legalese&lt;/h2&gt;

&lt;p&gt;If you are going to use the files generated in this README, you should be aware that this release is available under &lt;a href=&quot;https://www.arb-silva.de/silva-license-information&quot;&gt;a CC-BY license&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Tue, 23 Feb 2021 00:00:00 +0000</pubDate>
        <link>https://mothur.org/blog/2021/SILVA-v138_1-reference-files/</link>
        <guid isPermaLink="true">https://mothur.org/blog/2021/SILVA-v138_1-reference-files/</guid>
        
        
      </item>
    
      <item>
        <title>README for the RDP v18 reference files</title>
        <description>&lt;p&gt;The good people at the &lt;a href=&quot;http://rdp.cme.msu.edu&quot;&gt;RDP&lt;/a&gt; have released a new version of the RDP database. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the &lt;a href=&quot;http://mothur.org/wiki/RDP_reference_files&quot;&gt;mothur-compatible reference files&lt;/a&gt;. The original files are available from the RDP’s &lt;a href=&quot;http://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/&quot;&gt;sourceforge server&lt;/a&gt; and were used as the starting point for this README.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://sourceforge.net/p/rdp-classifier/news/2020/07/rdp-classifier-213-july-2020-release-note/&quot;&gt;release notes&lt;/a&gt; indicate the following:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The Bacteria and Archaea hierarchy model used by RDP Classifier has been updated to training set No. 18. The new version has over 800 new genera and 4000 new species added. Major rearrangements for Classifier training set No. 18 include the following: (go check out the release notes that are linked above for the list of changes).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s get going…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-rf&lt;/span&gt; RDPClassifier_16S_trainsetNo18_rawtrainingdata&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;

wget &lt;span class=&quot;nt&quot;&gt;-N&lt;/span&gt; http://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/RDPClassifier_16S_trainsetNo18_rawtrainingdata.zip
unzip &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; RDPClassifier_16S_trainsetNo18_rawtrainingdata.zip
&lt;span class=&quot;nb&quot;&gt;mv &lt;/span&gt;RDPClassifier_16S_trainsetNo18_rawtrainingdata/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; ./&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now we’d like to start to form the taxonomy file and the fasta file that will be our reference. Again, using bash commands…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;mv &lt;/span&gt;trainset18_062020.fa trainset18_062020.rdp.fasta
&lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&amp;gt;&quot;&lt;/span&gt; trainset18_062020.rdp.fasta | &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; 2- &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; trainset18_062020_rmdup.tax&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Next, we’d like to get our taxonomy file properly formatted. First we’ll read in the taxonomy data. The following steps are done in R…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;n&quot;&gt;tax_file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;trainset18_062020_rmdup.tax&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;what&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quiet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;accession&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;^(\\S*).*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#some are separated by tabs or spaces or both&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.*(Root.*)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;	&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#remove spaces and replace with &apos;_&apos;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\t&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;	&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#remove extra tab characters&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;[^;]*_incertae_sedis$&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;\&quot;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#remove quote marks&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
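
&lt;p&gt;The regular expressions above can be tried on a single line before running them on the whole file. The following is only a rough bash sketch on one made-up header (the accession and lineage are hypothetical, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sed&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tr&lt;/code&gt; stand in for the R &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gsub&lt;/code&gt; calls). It is not part of the process used to build the reference files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Hypothetical header line: accession, a tab, then the lineage
line=$(printf &apos;X00001|S0001\tRoot;Bacteria;&quot;Toyphylum&quot;;Toy genus&apos;)

# Keep everything up to the first whitespace character
accession=$(printf &apos;%s\n&apos; &quot;$line&quot; | sed -E &apos;s/^([^[:space:]]*).*/\1/&apos;)

# Keep the lineage from Root onward, swap spaces for underscores, drop tabs and quote marks
taxonomy=$(printf &apos;%s\n&apos; &quot;$line&quot; | sed -E &apos;s/.*(Root.*)/\1/&apos; | tr &apos; &apos; &apos;_&apos; | tr -d &apos;\t&quot;&apos;)

printf &apos;%s\t%s\n&apos; &quot;$accession&quot; &quot;$taxonomy&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;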

&lt;p&gt;The RDP inserts a variety of sub taxonomic levels (e.g. suborder) that will get in the way of us having a consistent number of taxonomic levels for our analyses. Let’s use the data in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trainset18_db_taxid.txt&lt;/code&gt; to remove these extra taxonomic levels:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;n&quot;&gt;levels&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read.table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;trainset18_db_taxid.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stringsAsFactors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subs&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;levels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grep&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sub&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;levels&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;V5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sub.names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;V2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax.split&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strsplit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;remove.subs&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax.vector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
	&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax.vector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;which&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax.vector&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%in%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sub.names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax.split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;remove.subs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs.str&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collapse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs.str&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;^Root;(.*)$&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs.str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Finally, we can output the taxonomy data to a file we’ll call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trainset18_062020.rdp.tax&lt;/code&gt; to have a consistent naming scheme with previous versions of those files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;n&quot;&gt;write.table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;as.character&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;accession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs.str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;trainset18_062020.rdp.tax&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col.names&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quote&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\t&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The RDP training sets do not include mitochondria or sequences from eukaryotes. We find that it is helpful to have these sequences because we can get non-specific amplification at times and would like to be able to remove these lineages. Let’s go ahead and pull down the pds version of training set v.10 and copy those sequences over to our new training set. The following steps will be done in bash:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;wget &lt;span class=&quot;nt&quot;&gt;-N&lt;/span&gt; https://mothur.s3.us-east-2.amazonaws.com/wiki/trainset10_082014.pds.tgz
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;xvzf trainset10_082014.pds.tgz
&lt;span class=&quot;nb&quot;&gt;mv &lt;/span&gt;trainset10_082014.pds/trainset10_082014&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; ./
&lt;span class=&quot;nb&quot;&gt;rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-rf&lt;/span&gt; trainset10_082014.pds trainset10_082014.pds.tgz&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now let’s run a mothur command to pull out the extra sequences that are in the pds files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;mothur &lt;span class=&quot;s2&quot;&gt;&quot;#get.lineage(fasta=trainset10_082014.pds.fasta, taxonomy=trainset10_082014.pds.tax, taxon=Eukaryota-Mitochondria)&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This last command gets us the extra “pds” sequences, which we can now paste onto the end of the normal RDP training set:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;trainset18_062020.rdp.tax trainset10_082014.pds.pick.tax &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; trainset18_062020.pds.tax
&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;trainset18_062020.rdp.fasta trainset10_082014.pds.pick.fasta &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; trainset18_062020.pds.fasta&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
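
&lt;p&gt;Each sequence in the fasta file should have a matching line in the taxonomy file, so a quick count is a cheap way to catch a bad concatenation. The sketch below runs the comparison on two toy files with made-up names and contents. With the real data you would point it at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trainset18_062020.pds.fasta&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trainset18_062020.pds.tax&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Toy stand-ins for the real files (hypothetical names and contents)
workdir=$(mktemp -d)
printf &apos;&amp;gt;seq1\nACGT\n&amp;gt;seq2\nGGCC\n&apos; &amp;gt; &quot;$workdir/toy.pds.fasta&quot;
printf &apos;seq1\tBacteria;\nseq2\tEukaryota;\n&apos; &amp;gt; &quot;$workdir/toy.pds.tax&quot;

# Count the header lines in the fasta file and the rows in the taxonomy file
n_fasta=$(grep -c &apos;^&amp;gt;&apos; &quot;$workdir/toy.pds.fasta&quot;)
n_tax=$(wc -l &amp;lt; &quot;$workdir/toy.pds.tax&quot; | tr -d &apos; &apos;)

if [ &quot;$n_fasta&quot; -eq &quot;$n_tax&quot; ]; then
  echo &quot;OK: $n_fasta sequences and $n_tax taxonomy lines&quot;
else
  echo &quot;Mismatch: $n_fasta sequences but $n_tax taxonomy lines&quot;
fi&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;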

&lt;p&gt;While we’ve got the old version of the training set, it might be nice to see what the differences are. It would have been nice for them to provide a README indicating what changed, but, well, no, they didn’t.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;wc&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.pds.tax&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 10773 trainset10_082014.pds.tax
## 21318 trainset18_062020.pds.tax
## 32091 total&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now we’re ready to package and compress the reference files. First we do the RDP files…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;trainset18_062020.rdp
&lt;span class=&quot;nb&quot;&gt;cp &lt;/span&gt;README.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; trainset18_062020.rdp.fasta trainset18_062020.rdp.tax trainset18_062020.rdp
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;cvzf trainset18_062020.rdp.tgz  trainset18_062020.rdp/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;## a trainset18_062020.rdp/README.md&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;## a trainset18_062020.rdp/trainset18_062020.rdp.fasta&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;## a trainset18_062020.rdp/trainset18_062020.rdp.tax&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;… and then the pds files…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;trainset18_062020.pds
&lt;span class=&quot;nb&quot;&gt;cp &lt;/span&gt;README.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; trainset18_062020.pds.fasta trainset18_062020.pds.tax trainset18_062020.pds
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;cvzf trainset18_062020.pds.tgz  trainset18_062020.pds/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;## a trainset18_062020.pds/README.md&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;## a trainset18_062020.pds/trainset18_062020.pds.fasta&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;## a trainset18_062020.pds/trainset18_062020.pds.tax&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
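
&lt;p&gt;If you want to confirm what ended up in an archive, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tar&lt;/code&gt; can list its members without extracting them. Here is a minimal sketch on a toy archive (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toy.pds&lt;/code&gt; names are made up for illustration):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Build a small archive in a scratch directory (hypothetical file names)
workdir=$(mktemp -d)
mkdir &quot;$workdir/toy.pds&quot;
printf &apos;&amp;gt;seq1\nACGT\n&apos; &amp;gt; &quot;$workdir/toy.pds/toy.pds.fasta&quot;
printf &apos;seq1\tBacteria;\n&apos; &amp;gt; &quot;$workdir/toy.pds/toy.pds.tax&quot;
tar czf &quot;$workdir/toy.pds.tgz&quot; -C &quot;$workdir&quot; toy.pds

# t lists the members, z reads through gzip, f names the archive
members=$(tar tzf &quot;$workdir/toy.pds.tgz&quot;)
printf &apos;%s\n&apos; &quot;$members&quot;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;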

</description>
        <pubDate>Thu, 04 Feb 2021 00:00:00 +0000</pubDate>
        <link>https://mothur.org/blog/2021/RDP-v18-reference-files/</link>
        <guid isPermaLink="true">https://mothur.org/blog/2021/RDP-v18-reference-files/</guid>
        
        
      </item>
    
      <item>
        <title>README for the SILVA v138 reference files</title>
        <description>&lt;p&gt;The good people at &lt;a href=&quot;https://arb-silva.de&quot;&gt;SILVA&lt;/a&gt; have released a new version of the SILVA database. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the &lt;a href=&quot;/wiki/Silva_reference_files&quot;&gt;mothur-compatible reference files&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;getting-the-data-in-and-out-of-the-arb-database&quot;&gt;Getting the data in and out of the ARB database&lt;/h2&gt;

&lt;p&gt;This README file explains how we generated the silva reference files for use with mothur’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;classify.seqs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;align.seqs&lt;/code&gt; commands. I’ll assume that you have a functioning copy of arb installed on your computer. For this README we are using version 6.0. First we need to download the database and decompress it. From the command line we do the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget -N https://www.arb-silva.de/fileadmin/silva_databases/release_138/ARB_files/SILVA_138_SSURef_NR99_05_01_20_opt.arb.gz
gunzip SILVA_138_SSURef_NR99_05_01_20_opt.arb.gz
arb SILVA_138_SSURef_NR99_05_01_20_opt.arb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will launch us into the arb environment with the “Ref NR 99” database opened. This database has 510,984 sequences within it that are not more than 99% similar to each other. The release notes for &lt;a href=&quot;https://www.arb-silva.de/documentation/release-138/&quot;&gt;this database&lt;/a&gt; as well as the idea behind the &lt;a href=&quot;https://www.arb-silva.de/projects/ssu-ref-nr/&quot;&gt;non-redundant database&lt;/a&gt; are available from the silva website. Within arb do the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Click the search button&lt;/li&gt;
  &lt;li&gt;Set the first search field to ‘ARB_color’ and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)&lt;/li&gt;
  &lt;li&gt;Click ‘Search’. This yielded 447,349 hits&lt;/li&gt;
  &lt;li&gt;Click the “Mark Listed Unmark Rest” button&lt;/li&gt;
  &lt;li&gt;Close the “Search and Query” box&lt;/li&gt;
  &lt;li&gt;Now click on File-&amp;gt;export-&amp;gt;export to external format&lt;/li&gt;
  &lt;li&gt;In this box the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Export&lt;/code&gt; option should be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;marked&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Filter&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;none&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Compression&lt;/code&gt; should be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;no&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;In the field for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Choose an output file name&lt;/code&gt; make sure the path has you in the correct working directory and enter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.full_v138.fasta&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Select a format: fasta_mothur.eft. This is a custom formatting file that I have created that includes the sequence’s accession number and its taxonomy across the top line. To create one, you will need to create &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fasta_mothur.eft&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$ARBHOME/lib/export/&lt;/code&gt; folder with the following:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SUFFIX          fasta    
BEGIN    
&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;acc&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;name&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;align_ident_slv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;tax_slv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;    
&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;|export_sequence&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;    
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Save this as silva.full_v138.fasta&lt;/li&gt;
  &lt;li&gt;You can now quit arb.&lt;/li&gt;
&lt;/ol&gt;
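
&lt;p&gt;One way to create the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fasta_mothur.eft&lt;/code&gt; file described in the list above is with a here-document. This is only a sketch: it writes into a scratch directory (an assumption made here so that nothing in an existing arb installation is overwritten), and you would then copy the file into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$ARBHOME/lib/export/&lt;/code&gt; yourself:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Write into a scratch directory so an existing arb installation is left untouched
scratch=$(mktemp -d)
eft_file=&quot;$scratch/fasta_mothur.eft&quot;

# Quoting the EOF delimiter keeps \t and *(...) literal
cat &amp;gt; &quot;$eft_file&quot; &amp;lt;&amp;lt;&apos;EOF&apos;
SUFFIX          fasta
BEGIN
&amp;gt;*(acc).*(name)\t*(align_ident_slv)\t*(tax_slv);
*(|export_sequence)
EOF

# Show the result; copy it into $ARBHOME/lib/export/ once it looks right
cat &quot;$eft_file&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;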

&lt;h2 id=&quot;screening-the-sequences&quot;&gt;Screening the sequences&lt;/h2&gt;

&lt;p&gt;Now we need to screen the sequences for those that span the 27f and 1492r primer region, have 5 or fewer ambiguous base calls, and that are unique. We’ll also extract the taxonomic information from the header line. Run the following commands from a bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mothur &quot;#screen.seqs(fasta=silva.full_v138.fasta, start=1044, end=43116, maxambig=5, processors=8);
        pcr.seqs(start=1044, end=43116, keepdots=T);
        degap.seqs();
        unique.seqs();&quot;

#identify the unique sequences without regard to their alignment
grep &quot;&amp;gt;&quot; silva.full_v138.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- &amp;gt; silva.full_v138.good.pcr.ng.unique.accnos

#get the unique sequences without regard to their alignment
mothur &quot;#get.seqs(fasta=silva.full_v138.good.pcr.fasta, accnos=silva.full_v138.good.pcr.ng.unique.accnos)&quot;

#generate alignment file
mv silva.full_v138.good.pcr.pick.fasta silva.nr_v138.align

#generate taxonomy file
grep &apos;&amp;gt;&apos; silva.nr_v138.align | cut -f1,3 | cut -f2 -d&apos;&amp;gt;&apos; &amp;gt; silva.nr_v138.full
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The mothur commands above do several things. First the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen.seqs&lt;/code&gt; command removes sequences that are not full length or that have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB NR reference database lets in shorter (&amp;gt;900 bp) archaeal 16S rRNA gene sequences. Second, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pcr.seqs&lt;/code&gt; converts any base calls that occur before position 1044 and after 43116 to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.&lt;/code&gt; to make them only span the region between the 27f and 1492r priming sites. Finally, it is possible that weird things happen in the alignments and so we unalign the sequences (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;degap.seqs&lt;/code&gt;) and identify the unique sequences (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique.seqs&lt;/code&gt;). We then convert the resulting fasta file into an accnos file so that we can go back into mothur and pull out the unique sequences from the aligned file (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.seqs&lt;/code&gt;).&lt;/p&gt;
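
&lt;p&gt;To see what the two &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cut&lt;/code&gt; pipelines above pull out of a header line, here is a small sketch on one made-up header. The accession, name, and identity value are hypothetical; only the tab-delimited layout follows &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fasta_mothur.eft&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical header line: name, alignment identity, taxonomy (tab separated)
header=$(printf &apos;&amp;gt;AB000001.ExampleName\t98.5\tBacteria;Firmicutes;Bacilli;&apos;)

# First tab-delimited field minus the leading &apos;&amp;gt;&apos; gives the sequence name for the accnos file
acc=$(printf &apos;%s\n&apos; &quot;$header&quot; | cut -f 1 | cut -c 2-)

# Fields 1 and 3, then everything after &apos;&amp;gt;&apos;, give the name and its taxonomy
tax=$(printf &apos;%s\n&apos; &quot;$header&quot; | cut -f1,3 | cut -f2 -d&apos;&amp;gt;&apos;)

printf &apos;%s\n&apos; &quot;$acc&quot;
printf &apos;%s\n&apos; &quot;$tax&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;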

&lt;h2 id=&quot;formatting-the-taxonomy-files&quot;&gt;Formatting the taxonomy files&lt;/h2&gt;

&lt;p&gt;Now we want to make sure the taxonomy file is properly formatted for use with mothur. First we want to grab the SILVA taxa mapping file by running the following in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bash&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/taxonomy/tax_slv_ssu_138.txt.gz
&lt;span class=&quot;nb&quot;&gt;gunzip &lt;/span&gt;tax_slv_ssu_138.txt.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Thanks to &lt;a href=&quot;https://forum.mothur.org/t/18s-rdna-classification-issues/2303/14&quot;&gt;Eric Collins at the University of Alaska Fairbanks&lt;/a&gt;, we have some nice R code to map all of the taxa names to the six Linnean levels (kingdom, phylum, class, order, family, and genus). We’ll run the following code from within R:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;map.in &amp;lt;- read.table(&quot;tax_slv_ssu_138.txt&quot;,header=F,sep=&quot;\t&quot;,stringsAsFactors=F)
map.in &amp;lt;- map.in[,c(1,3)]
colnames(map.in) &amp;lt;- c(&quot;taxlabel&quot;,&quot;taxlevel&quot;)
map.in &amp;lt;- rbind(map.in, c(&quot;Bacteria;RsaHf231;&quot;, &quot;phylum&quot;)) #wasn&apos;t in tax_slv_ssu_138.txt

#fix Escherichia nonsense
map.in$taxlevel[which(map.in$taxlabel==&quot;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;&quot;)] &amp;lt;- &quot;genus&quot;

taxlevels &amp;lt;- c(&quot;root&quot;,&quot;domain&quot;,&quot;major_clade&quot;,&quot;superkingdom&quot;,&quot;kingdom&quot;,&quot;subkingdom&quot;,&quot;infrakingdom&quot;,&quot;superphylum&quot;,&quot;phylum&quot;,&quot;subphylum&quot;,&quot;infraphylum&quot;,&quot;superclass&quot;,&quot;class&quot;,&quot;subclass&quot;,&quot;infraclass&quot;,&quot;superorder&quot;,&quot;order&quot;,&quot;suborder&quot;,&quot;superfamily&quot;,&quot;family&quot;,&quot;subfamily&quot;,&quot;genus&quot;)
taxabb &amp;lt;- c(&quot;ro&quot;,&quot;do&quot;,&quot;mc&quot;,&quot;pk&quot;,&quot;ki&quot;,&quot;bk&quot;,&quot;ik&quot;,&quot;pp&quot;,&quot;ph&quot;,&quot;bp&quot;,&quot;ip&quot;,&quot;pc&quot;,&quot;cl&quot;,&quot;bc&quot;,&quot;ic&quot;,&quot;po&quot;,&quot;or&quot;,&quot;bo&quot;,&quot;pf&quot;,&quot;fa&quot;,&quot;bf&quot;,&quot;ge&quot;)
tax.mat &amp;lt;- matrix(data=&quot;&quot;,nrow=nrow(map.in),ncol=length(taxlevels))
tax.mat[,1] &amp;lt;- &quot;root&quot;
colnames(tax.mat) &amp;lt;- taxlevels

outlevels &amp;lt;- c(&quot;domain&quot;,&quot;phylum&quot;,&quot;class&quot;,&quot;order&quot;,&quot;family&quot;,&quot;genus&quot;)

for(i in 1:nrow(map.in)) {
	taxname &amp;lt;- unlist(strsplit(as.character(map.in[i,1]), split=&apos;;&apos;))
	#print(taxname);

	while ( length(taxname) &amp;gt; 0) {
		#regex to look for exact match

		tax.exp &amp;lt;- paste(paste(taxname,collapse=&quot;;&quot;),&quot;;&quot;,sep=&quot;&quot;)
		tax.match &amp;lt;- match(tax.exp,map.in$taxlabel)
		tax.mat[i,map.in[tax.match,2]] &amp;lt;- tail(taxname,1)
		taxname &amp;lt;- head(taxname,-1)
	}
}

for(i in 1:nrow(tax.mat)) {
	#this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
	#if you don&apos;t want this behavior, cut it out
	for(j in 1:ncol(tax.mat)) {
		if(tax.mat[i,j] == &quot;&quot;) { tax.mat[i,j] &amp;lt;- paste(tmptax,taxabb[j],sep=&quot;_&quot;)}
		else { tmptax &amp;lt;- tax.mat[i,j]}
	}

	#this maps the new name to the input taxonomic levels
	map.in[i,&quot;taxout&quot;] &amp;lt;- paste(paste(tax.mat[i,outlevels],collapse=&quot;;&quot;),&quot;;&quot;,sep=&quot;&quot;)
}

# replace spaces with underscores
map.in$taxout &amp;lt;- gsub(&quot; &quot;,&quot;_&quot;,map.in$taxout)

# bring in the old taxonomic levels from SILVA and remap them using the new levels
tax.in &amp;lt;- read.table(&quot;silva.nr_v138.full&quot;,header=F,stringsAsFactors=F,sep=&quot;\t&quot;)
colnames(tax.in) &amp;lt;- c(&quot;taxid&quot;,&quot;taxlabel&quot;)

# Following line corrects the Bacteria;Bacteroidetes;Bacteroidia;Flavobacteriales;Flavobacteriaceae;Polaribacter;Polaribacter; problem
tax.in$taxlabel &amp;lt;- gsub(&quot;Polaribacter;Polaribacter;&quot;, &quot;Polaribacter;&quot;, tax.in$taxlabel)
tax.in$taxlabel &amp;lt;- gsub(&quot;;[[:space:]]+$&quot;, &quot;;&quot;, tax.in$taxlabel)

tax.in$id &amp;lt;- 1:nrow(tax.in)

tax.write &amp;lt;- merge(tax.in,map.in,all.x=T,sort=F)
tax.write &amp;lt;- tax.write[order(tax.write$id),]


#we want to see whether everything has 6 taxonomic levels (kingdom to genus)
getDepth &amp;lt;- function(taxonString){
	initial &amp;lt;- nchar(taxonString)
	removed &amp;lt;- nchar(gsub(&quot;;&quot;, &quot;&quot;, taxonString))
	return(initial-removed)
}

depth &amp;lt;- getDepth(tax.write$taxout)
summary(depth) #should all be 6 and there should be no NAs
bacteria &amp;lt;- grepl(&quot;Bacteria;&quot;, tax.write$taxout)
archaea &amp;lt;- grepl(&quot;Archaea;&quot;, tax.write$taxout)
eukarya &amp;lt;- grepl(&quot;Eukaryota;&quot;, tax.write$taxout)

tax.write[depth &amp;gt; 6 &amp;amp; bacteria,] #if zero, we&apos;re good to go
tax.write[depth &amp;gt; 6 &amp;amp; archaea,]  #if zero, we&apos;re good to go
tax.write[depth &amp;gt; 6 &amp;amp; eukarya,]  #if zero, we&apos;re good to go

write.table(tax.write[,c(&quot;taxid&quot;,&quot;taxout&quot;)],file=&quot;silva.full_v138.tax&quot;,sep=&quot;\t&quot;,row.names=F,quote=F,col.names=F)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
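
&lt;p&gt;The same depth check can be repeated from a bash terminal on the file that R just wrote. This is a minimal sketch rather than part of the original workflow: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;check_tax_depth&lt;/code&gt; is made up for illustration, and it assumes the two-column, tab-delimited layout produced by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;write.table&lt;/code&gt; above:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hypothetical helper: print the name and depth of every sequence whose
# taxonomy string does not hold exactly six semicolon-terminated levels
check_tax_depth() {
  awk -F&apos;\t&apos; &apos;{ depth = gsub(/;/, &quot;;&quot;, $2); if (depth != 6) print $1, depth }&apos; &quot;$1&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;check_tax_depth silva.full_v138.tax&lt;/code&gt; should print nothing if every sequence has six levels. A sequence that failed to merge, and was therefore written as NA, would show up with a depth of 0.&lt;/p&gt;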

&lt;h2 id=&quot;building-the-seed-references&quot;&gt;Building the SEED references&lt;/h2&gt;

&lt;p&gt;The first thing to note is that SILVA does not release their SEED; it is private. By screening through the ARB databases we can attempt to recreate it. Our previous publications show that align.seqs with the recreated SEED does an excellent job of realigning sequences to look like they would if you used SINA and the true SEED. Now we want to try to figure out which sequences are part of the SEED. Earlier, when we exported the sequences from ARB, we included the align_ident_slv field from the database in our output. Let’s generate an accnos file that contains the names of the sequences with 100% identity to the SEED database and then use mothur to generate SEED fasta and taxonomy files. While we’re at it, we’ll also generate the nr_138 taxonomy file. The following code will be run from within a bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;grep &quot;&amp;gt;&quot; silva.nr_v138.align | cut -f 1,2 | grep $&apos;\t100&apos; | cut -f 1 | cut -c 2- &amp;gt; silva.seed_v138.accnos
mothur &quot;#get.seqs(fasta=silva.nr_v138.align, taxonomy=silva.full_v138.tax, accnos=silva.seed_v138.accnos)&quot;
mv silva.nr_v138.pick.align silva.seed_v138.align
mv silva.full_v138.pick.tax silva.seed_v138.tax

mothur &quot;#get.seqs(taxonomy=silva.full_v138.tax, accnos=silva.full_v138.good.pcr.ng.unique.accnos)&quot;
mv silva.full_v138.pick.tax silva.nr_v138.tax
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
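
&lt;p&gt;Because the accnos and taxonomy files come out of different steps, it is worth confirming that every SEED name picked up a taxonomy entry. The following is a minimal sketch and not part of the original workflow; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;missing_from_tax&lt;/code&gt; is a hypothetical name, and the sketch assumes one sequence name per line in the accnos file and the name in the first tab-delimited column of the taxonomy file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hypothetical helper: list the names in an accnos file ($1) that have no
# entry in a taxonomy file ($2); the taxonomy file is read first
missing_from_tax() {
  awk -F&apos;\t&apos; &apos;FILENAME == ARGV[1] { seen[$1] = 1; next } !($1 in seen)&apos; &quot;$2&quot; &quot;$1&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;missing_from_tax silva.seed_v138.accnos silva.seed_v138.tax&lt;/code&gt; prints nothing, every name in the accnos file is present in the taxonomy file.&lt;/p&gt;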

&lt;h2 id=&quot;taxonomic-representation&quot;&gt;Taxonomic representation&lt;/h2&gt;

&lt;p&gt;Let’s see how many different taxa we have at each taxonomic level within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138.tax&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v138.tax&lt;/code&gt; files. To do this we’ll run the following in R:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;getNumTaxaNames &amp;lt;- function(file, kingdom){
  taxonomy &amp;lt;- read.table(file=file, row.names=1)
  sub.tax &amp;lt;- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  phyla &amp;lt;- sum(!grepl(kingdom, phyla))

  class &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  class &amp;lt;- sum(!grepl(kingdom, class))

  order &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  order &amp;lt;- sum(!grepl(kingdom, order))

  family &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  family &amp;lt;- sum(!grepl(kingdom, family))

  genus &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  genus &amp;lt;- sum(!grepl(kingdom, genus))

  n.seqs &amp;lt;- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}

kingdoms &amp;lt;- c(&quot;Bacteria&quot;, &quot;Archaea&quot;, &quot;Eukaryota&quot;)
tax.levels &amp;lt;- c(&quot;phyla&quot;, &quot;class&quot;, &quot;order&quot;, &quot;family&quot;, &quot;genus&quot;, &quot;n.seqs&quot;)

nr.file &amp;lt;- &quot;silva.nr_v138.tax&quot;
nr.matrix &amp;lt;- matrix(rep(0,18), nrow=3)
nr.matrix[1,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) &amp;lt;- kingdoms
colnames(nr.matrix) &amp;lt;- tax.levels
nr.matrix
#          phyla class order family genus n.seqs
#Bacteria     87   239   646   1138  3897 129063
#Archaea      15    33    57     97   219   2846
#Eukaryota    91   242   557    766  1727  14887

seed.file &amp;lt;- &quot;silva.seed_v138.tax&quot;
seed.matrix &amp;lt;- matrix(rep(0,18), nrow=3)
seed.matrix[1,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) &amp;lt;- kingdoms
colnames(seed.matrix) &amp;lt;- tax.levels
seed.matrix
#          phyla class order family genus n.seqs
#Bacteria     51   124   312    522  1172   5741
#Archaea       7    17    23     30    44     81
#Eukaryota    40    98   238    380   727   1834

seed.matrix / nr.matrix
#              phyla     class     order    family     genus     n.seqs
#Bacteria  0.5862069 0.5188285 0.4829721 0.4586995 0.3007442 0.04448215
#Archaea   0.4666667 0.5151515 0.4035088 0.3092784 0.2009132 0.02846100
#Eukaryota 0.4395604 0.4049587 0.4272890 0.4960836 0.4209612 0.12319473
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Archaea take a beating; recall that they lost a bunch of sequences in the initial steps since many of the archaeal sequences in SILVA are between 900 and 1200 nt long. If you are interested in analyzing the Archaea and the Eukaryota, I would suggest duplicating my efforts here but modifying the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen.seqs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pcr.seqs&lt;/code&gt; steps to target your region of interest.&lt;/p&gt;

&lt;p&gt;Finally, we want to compress the resulting alignment and this README file into the full length and SEED archives using commands in the bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tar cvzf silva.nr_v138.tgz silva.nr_v138.tax silva.nr_v138.align README.md
tar cvzf silva.seed_v138.tgz silva.seed_v138.tax silva.seed_v138.align README.md
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
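
&lt;p&gt;Before posting the archives it does not hurt to confirm that each one holds the files you expect. This is only a sketch; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_has&lt;/code&gt; is a hypothetical helper built on the listing mode of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tar&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hypothetical helper: succeed only if the gzipped archive ($1) lists a
# member whose name is exactly $2
archive_has() {
  tar tzf &quot;$1&quot; | grep -Fxq &quot;$2&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;archive_has silva.seed_v138.tgz silva.seed_v138.align&lt;/code&gt; returns a zero exit status when the alignment made it into the SEED archive.&lt;/p&gt;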

&lt;h2 id=&quot;application&quot;&gt;Application&lt;/h2&gt;

&lt;p&gt;So… which to use for what application? If you have the RAM, I’d suggest using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138.align&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;align.seqs&lt;/code&gt;. It took about 10 minutes to read in the database file and a minute or so to align 1,000 full-length sequences using a single processor. Here is an example workflow for use within mothur that will get you the V4 region of the 16S rRNA gene:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mothur &quot;#pcr.seqs(fasta=silva.nr_v138.align, start=11894, end=25319, keepdots=F, processors=8);
        unique.seqs()&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will get you down to 107,001 unique sequences to then align against. Other tricks to consider would be to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.lineage&lt;/code&gt; to pull out the reference sequences that are from the Bacteria, although this will probably only reduce the size of the database by ~10%. You could also try using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filter.seqs&lt;/code&gt; with vertical=T; however, that might be problematic if there are insertions in your sequences (can’t know &lt;em&gt;a priori&lt;/em&gt;). It’s likely that you can just use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v138.align&lt;/code&gt; reference for aligning. For classifying sequences, I would strongly recommend using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138.align&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138.tax&lt;/code&gt; references after running pcr.seqs on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v138.align&lt;/code&gt;. I probably wouldn’t advise using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique.seqs&lt;/code&gt; on the output.&lt;/p&gt;

&lt;h2 id=&quot;legalese&quot;&gt;Legalese&lt;/h2&gt;

&lt;p&gt;If you are going to use the files generated in this README, you should be aware that this release is available under &lt;a href=&quot;https://www.arb-silva.de/silva-license-information&quot;&gt;a CC-BY license&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Wed, 04 Mar 2020 00:00:00 +0000</pubDate>
        <link>https://mothur.org/blog/2020/SILVA-v138-reference-files/</link>
        <guid isPermaLink="true">https://mothur.org/blog/2020/SILVA-v138-reference-files/</guid>
        
        
      </item>
    
      <item>
        <title>README for the SILVA v132 reference files</title>
        <description>&lt;p&gt;The good people at &lt;a href=&quot;https://arb-silva.de&quot;&gt;SILVA&lt;/a&gt; have released a new version of the SILVA database. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the &lt;a href=&quot;/wiki/Silva_reference_files&quot;&gt;mothur-compatible reference files&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;curation-of-references&quot;&gt;Curation of references&lt;/h2&gt;

&lt;h3 id=&quot;getting-the-data-in-and-out-of-the-arb-database&quot;&gt;Getting the data in and out of the ARB database&lt;/h3&gt;

&lt;p&gt;This README file explains how we generated the silva reference files for use with mothur’s classify.seqs and align.seqs commands. I’ll assume that you have a functioning copy of arb installed on your computer. For this README we are using version 6.0. First we need to download the database and decompress it. From the command line we do the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget -N https://www.arb-silva.de/fileadmin/arb_web_db/release_132/ARB_files/SILVA_132_SSURef_NR99_13_12_17_opt.arb.gz
gunzip SILVA_132_SSURef_NR99_13_12_17_opt.arb.gz
arb SILVA_132_SSURef_NR99_13_12_17_opt.arb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will launch us into the arb environment with the “Ref NR 99” database opened. This database has 695,171 sequences within it that are not more than 99% similar to each other. The release notes for &lt;a href=&quot;https://www.arb-silva.de/documentation/release-132/&quot;&gt;this database&lt;/a&gt; as well as the idea behind the &lt;a href=&quot;https://www.arb-silva.de/projects/ssu-ref-nr/&quot;&gt;non-redundant database&lt;/a&gt; are available from the SILVA website. Within arb do the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Click the search button&lt;/li&gt;
  &lt;li&gt;Set the first search field to ‘ARB_color’ and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)&lt;/li&gt;
  &lt;li&gt;Click ‘Search’. This yielded 629,211 hits&lt;/li&gt;
  &lt;li&gt;Click the “Mark Listed Unmark Rest” button&lt;/li&gt;
  &lt;li&gt;Close the “Search and Query” box&lt;/li&gt;
  &lt;li&gt;Now click on File-&amp;gt;export-&amp;gt;export to external format&lt;/li&gt;
  &lt;li&gt;In this box the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Export&lt;/code&gt; option should be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;marked&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Filter&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;none&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Compression&lt;/code&gt; should be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;no&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;In the field for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Choose an output file name&lt;/code&gt; make sure the path has you in the correct working directory and enter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.full_v132.fasta&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Select a format: fasta_mothur.eft. This is a custom formatting file that I have created that includes the sequence’s accession number and its taxonomy across the top line. To create one, you will need to create &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fasta_mothur.eft&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/opt/local/share/arb/lib/export/&lt;/code&gt; folder with the following:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SUFFIX          fasta    
BEGIN    
&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;acc&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;name&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;align_ident_slv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;tax_slv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;    
&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;|export_sequence&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;    
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Save this as silva.full_v132.fasta&lt;/li&gt;
  &lt;li&gt;You can now quit arb.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;screening-the-sequences&quot;&gt;Screening the sequences&lt;/h3&gt;

&lt;p&gt;Now we need to screen the sequences for those that span the 27f and 1492r primer region, have 5 or fewer ambiguous base calls, and that are unique. We’ll also extract the taxonomic information from the header line. Run the following commands from a bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mothur &quot;#screen.seqs(fasta=silva.full_v132.fasta, start=1044, end=43116, maxambig=5, processors=8);
        pcr.seqs(start=1044, end=43116, keepdots=T);
        degap.seqs();
        unique.seqs();&quot;

#identify the unique sequences without regard to their alignment
grep &quot;&amp;gt;&quot; silva.full_v132.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- &amp;gt; silva.full_v132.good.pcr.ng.unique.accnos

#get the unique sequences without regard to their alignment
mothur &quot;#get.seqs(fasta=silva.full_v132.good.pcr.fasta, accnos=silva.full_v132.good.pcr.ng.unique.accnos)&quot;

#generate alignment file
mv silva.full_v132.good.pcr.pick.fasta silva.nr_v132.align

#generate taxonomy file
grep &apos;&amp;gt;&apos; silva.nr_v132.align | cut -f1,3 | cut -f2 -d&apos;&amp;gt;&apos; &amp;gt; silva.nr_v132.full
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The mothur commands above do several things. First the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen.seqs&lt;/code&gt; command removes sequences that are not full length or that have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB NR reference database lets in shorter (&amp;gt;900 bp) archaeal 16S rRNA gene sequences. Second, pcr.seqs converts any base calls that occur before position 1044 and after 43116 to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.&lt;/code&gt; to make them only span the region between the 27f and 1492r priming sites. Finally, it is possible that weird things happen in the alignments and so we unalign the sequences (degap.seqs) and identify the unique sequences (unique.seqs). We then convert the resulting fasta file into an accnos file so that we can go back into mothur and pull out the unique sequences from the aligned file (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.seqs&lt;/code&gt;).&lt;/p&gt;
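
&lt;p&gt;The last line of the block relies on the header layout set by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fasta_mothur.eft&lt;/code&gt;: the sequence name, then align_ident_slv, then tax_slv, separated by tabs. As an illustration only, the same extraction can be written so that the three fields are explicit. The helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;headers_to_tax&lt;/code&gt; is hypothetical and is not part of the original workflow:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hypothetical helper: for each header line of the exported fasta file ($1),
# print the sequence name (field 1 without the leading marker), a tab, and
# the taxonomy string (field 3)
headers_to_tax() {
  awk -F&apos;\t&apos; &apos;substr($0, 1, 1) == &quot;&amp;gt;&quot; { print substr($1, 2) &quot;\t&quot; $3 }&apos; &quot;$1&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Called as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;headers_to_tax silva.nr_v132.align&lt;/code&gt;, this should produce the same two columns that are written to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v132.full&lt;/code&gt;, provided the headers follow that layout.&lt;/p&gt;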

&lt;h3 id=&quot;formatting-the-taxonomy-files&quot;&gt;Formatting the taxonomy files&lt;/h3&gt;

&lt;p&gt;Now we want to make sure the taxonomy file is properly formatted for use with mothur. First we want to grab the SILVA taxa mapping file by running the following in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bash&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_ssu_132.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
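
&lt;p&gt;It can be useful to see which rank labels the mapping file actually contains before remapping them. The following is a minimal sketch, assuming the rank label sits in the third tab-delimited column (the column that the R code below reads in as taxlevel); &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;list_ranks&lt;/code&gt; is a hypothetical name:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hypothetical helper: count how many taxa in the mapping file ($1) carry
# each rank label found in column 3
list_ranks() {
  cut -f 3 &quot;$1&quot; | sort | uniq -c
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Any label printed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;list_ranks tax_slv_ssu_132.txt&lt;/code&gt; that is missing from the taxlevels vector in the R code would need to be added there before the mapping is run.&lt;/p&gt;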

&lt;p&gt;Thanks to &lt;a href=&quot;https://forum.mothur.org/viewtopic.php?f=3&amp;amp;t=3652&amp;amp;p=20249#p12680&quot;&gt;Eric Collins at the University of Alaska Fairbanks&lt;/a&gt;, we have some nice R code to map all of the taxa names to the six Linnaean levels (kingdom, phylum, class, order, family, and genus). We’ll run the following code from within R:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;map.in &amp;lt;- read.table(&quot;tax_slv_ssu_132.txt&quot;,header=F,sep=&quot;\t&quot;,stringsAsFactors=F)
map.in &amp;lt;- map.in[,c(1,3)]
colnames(map.in) &amp;lt;- c(&quot;taxlabel&quot;,&quot;taxlevel&quot;)
map.in &amp;lt;- rbind(map.in, c(&quot;Bacteria;RsaHf231;&quot;, &quot;phylum&quot;)) #wasn&apos;t in tax_slv_ssu_132.txt

#fix Escherichia nonsense
map.in$taxlevel[which(map.in$taxlabel==&quot;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;&quot;)] &amp;lt;- &quot;genus&quot;

taxlevels &amp;lt;- c(&quot;root&quot;,&quot;domain&quot;,&quot;major_clade&quot;,&quot;superkingdom&quot;,&quot;kingdom&quot;,&quot;subkingdom&quot;,&quot;infrakingdom&quot;,&quot;superphylum&quot;,&quot;phylum&quot;,&quot;subphylum&quot;,&quot;infraphylum&quot;,&quot;superclass&quot;,&quot;class&quot;,&quot;subclass&quot;,&quot;infraclass&quot;,&quot;superorder&quot;,&quot;order&quot;,&quot;suborder&quot;,&quot;superfamily&quot;,&quot;family&quot;,&quot;subfamily&quot;,&quot;genus&quot;)
taxabb &amp;lt;- c(&quot;ro&quot;,&quot;do&quot;,&quot;mc&quot;,&quot;pk&quot;,&quot;ki&quot;,&quot;bk&quot;,&quot;ik&quot;,&quot;pp&quot;,&quot;ph&quot;,&quot;bp&quot;,&quot;ip&quot;,&quot;pc&quot;,&quot;cl&quot;,&quot;bc&quot;,&quot;ic&quot;,&quot;po&quot;,&quot;or&quot;,&quot;bo&quot;,&quot;pf&quot;,&quot;fa&quot;,&quot;bf&quot;,&quot;ge&quot;)
tax.mat &amp;lt;- matrix(data=&quot;&quot;,nrow=nrow(map.in),ncol=length(taxlevels))
tax.mat[,1] &amp;lt;- &quot;root&quot;
colnames(tax.mat) &amp;lt;- taxlevels

outlevels &amp;lt;- c(&quot;domain&quot;,&quot;phylum&quot;,&quot;class&quot;,&quot;order&quot;,&quot;family&quot;,&quot;genus&quot;)

for(i in 1:nrow(map.in)) {
	taxname &amp;lt;- unlist(strsplit(as.character(map.in[i,1]), split=&apos;;&apos;))
	#print(taxname);

	while ( length(taxname) &amp;gt; 0) {
		#regex to look for exact match

		tax.exp &amp;lt;- paste(paste(taxname,collapse=&quot;;&quot;),&quot;;&quot;,sep=&quot;&quot;)
		tax.match &amp;lt;- match(tax.exp,map.in$taxlabel)
		tax.mat[i,map.in[tax.match,2]] &amp;lt;- tail(taxname,1)
		taxname &amp;lt;- head(taxname,-1)
	}
}

for(i in 1:nrow(tax.mat)) {
	#this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
	#if you don&apos;t want this behavior, cut it out
	for(j in 1:ncol(tax.mat)) {
		if(tax.mat[i,j] == &quot;&quot;) { tax.mat[i,j] &amp;lt;- paste(tmptax,taxabb[j],sep=&quot;_&quot;)}
		else { tmptax &amp;lt;- tax.mat[i,j]}
	}

	#this maps the new name to the input taxonomic levels
	map.in[i,&quot;taxout&quot;] &amp;lt;- paste(paste(tax.mat[i,outlevels],collapse=&quot;;&quot;),&quot;;&quot;,sep=&quot;&quot;)
}

# replace spaces with underscores
map.in$taxout &amp;lt;- gsub(&quot; &quot;,&quot;_&quot;,map.in$taxout)

# bring in the old taxonomic levels from SILVA and remap them using the new levels
tax.in &amp;lt;- read.table(&quot;silva.nr_v132.full&quot;,header=F,stringsAsFactors=F,sep=&quot;\t&quot;)
colnames(tax.in) &amp;lt;- c(&quot;taxid&quot;,&quot;taxlabel&quot;)

# Following line corrects the Bacteria;Bacteroidetes;Bacteroidia;Flavobacteriales;Flavobacteriaceae;Polaribacter;Polaribacter; problem
tax.in$taxlabel &amp;lt;- gsub(&quot;Polaribacter;Polaribacter;&quot;, &quot;Polaribacter;&quot;, tax.in$taxlabel)
tax.in$taxlabel &amp;lt;- gsub(&quot;;[[:space:]]+$&quot;, &quot;;&quot;, tax.in$taxlabel)

tax.in$id &amp;lt;- 1:nrow(tax.in)

tax.write &amp;lt;- merge(tax.in,map.in,all.x=T,sort=F)
tax.write &amp;lt;- tax.write[order(tax.write$id),]


#we want to see whether everything has 6 taxonomic levels (kingdom to genus)
getDepth &amp;lt;- function(taxonString){
	initial &amp;lt;- nchar(taxonString)
	removed &amp;lt;- nchar(gsub(&quot;;&quot;, &quot;&quot;, taxonString))
	return(initial-removed)
}

depth &amp;lt;- getDepth(tax.write$taxout)
summary(depth) #should all be 6 and there should be no NAs
bacteria &amp;lt;- grepl(&quot;Bacteria;&quot;, tax.write$taxout)
archaea &amp;lt;- grepl(&quot;Archaea;&quot;, tax.write$taxout)
eukarya &amp;lt;- grepl(&quot;Eukaryota;&quot;, tax.write$taxout)

tax.write[depth &amp;gt; 6 &amp;amp; bacteria,] #good to go
tax.write[depth &amp;gt; 6 &amp;amp; archaea,]  #good to go
tax.write[depth &amp;gt; 6 &amp;amp; eukarya,]  #good to go

write.table(tax.write[,c(&quot;taxid&quot;,&quot;taxout&quot;)],file=&quot;silva.full_v132.tax&quot;,sep=&quot;\t&quot;,row.names=F,quote=F,col.names=F)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;building-the-seed-references&quot;&gt;Building the SEED references&lt;/h3&gt;

&lt;p&gt;The first thing to note is that SILVA does not release their SEED; it is private. By screening through the ARB databases we can attempt to recreate it. Our previous publications show that align.seqs with the recreated SEED does an excellent job of realigning sequences to look like they would if you used SINA and the true SEED. Now we want to try to figure out which sequences are part of the SEED. Earlier, when we exported the sequences from ARB, we included the align_ident_slv field from the database in our output. Let’s generate an accnos file that contains the names of the sequences with 100% identity to the SEED database and then use mothur to generate SEED fasta and taxonomy files. While we’re at it, we’ll also generate the nr_132 taxonomy file. The following code will be run from within a bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;grep &quot;&amp;gt;&quot; silva.nr_v132.align | cut -f 1,2 | grep $&apos;\t100&apos; | cut -f 1 | cut -c 2- &amp;gt; silva.seed_v132.accnos
mothur &quot;#get.seqs(fasta=silva.nr_v132.align, taxonomy=silva.full_v132.tax, accnos=silva.seed_v132.accnos)&quot;
mv silva.nr_v132.pick.align silva.seed_v132.align
mv silva.full_v132.pick.tax silva.seed_v132.tax

mothur &quot;#get.seqs(taxonomy=silva.full_v132.tax, accnos=silva.full_v132.good.pcr.ng.unique.accnos)&quot;
mv silva.full_v132.pick.tax silva.nr_v132.tax
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
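
&lt;p&gt;The first command picks SEED sequences by matching the text that follows the first tab in each header. As an alternative sketch, the second field can be compared as a number instead, which treats 100 and 100.0 the same way. This assumes that align_ident_slv was exported as a plain percentage; the helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seed_names&lt;/code&gt; is hypothetical:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hypothetical helper: print the names of the sequences in a fasta file ($1)
# whose second header field (align_ident_slv) is numerically equal to 100
seed_names() {
  awk -F&apos;\t&apos; &apos;substr($0, 1, 1) == &quot;&amp;gt;&quot; { if ($2 + 0 == 100) print substr($1, 2) }&apos; &quot;$1&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seed_names silva.nr_v132.align&lt;/code&gt; can be compared against &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v132.accnos&lt;/code&gt; as a cross-check.&lt;/p&gt;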

&lt;h3 id=&quot;taxonomic-representation&quot;&gt;Taxonomic representation&lt;/h3&gt;

&lt;p&gt;Let’s see how many different taxa we have at each taxonomic level within the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v132.tax&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v132.tax&lt;/code&gt; files. To do this we’ll run the following in R:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;getNumTaxaNames &amp;lt;- function(file, kingdom){
  taxonomy &amp;lt;- read.table(file=file, row.names=1)
  sub.tax &amp;lt;- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  phyla &amp;lt;- sum(!grepl(kingdom, phyla))

  class &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  class &amp;lt;- sum(!grepl(kingdom, class))

  order &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  order &amp;lt;- sum(!grepl(kingdom, order))

  family &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  family &amp;lt;- sum(!grepl(kingdom, family))

  genus &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  genus &amp;lt;- sum(!grepl(kingdom, genus))

  n.seqs &amp;lt;- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}

kingdoms &amp;lt;- c(&quot;Bacteria&quot;, &quot;Archaea&quot;, &quot;Eukaryota&quot;)
tax.levels &amp;lt;- c(&quot;phyla&quot;, &quot;class&quot;, &quot;order&quot;, &quot;family&quot;, &quot;genus&quot;, &quot;n.seqs&quot;)

nr.file &amp;lt;- &quot;silva.nr_v132.tax&quot;
nr.matrix &amp;lt;- matrix(rep(0,18), nrow=3)
nr.matrix[1,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) &amp;lt;- kingdoms
colnames(nr.matrix) &amp;lt;- tax.levels
nr.matrix
#          phyla class order family genus n.seqs
#Bacteria     80   204   580   1052  3971 188247
#Archaea      11    30    52     85   210   4626
#Eukaryota    93   240   648    923  3018  20246

seed.file &amp;lt;- &quot;silva.seed_v132.tax&quot;
seed.matrix &amp;lt;- matrix(rep(0,18), nrow=3)
seed.matrix[1,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) &amp;lt;- kingdoms
colnames(seed.matrix) &amp;lt;- tax.levels
seed.matrix
#          phyla class order family genus n.seqs
#Bacteria     50   110   301    530  1436   8517
#Archaea       7    15    26     39    62    147
#Eukaryota    41   100   287    478  1040   2516

seed.matrix / nr.matrix
#              phyla     class     order    family     genus     n.seqs
#Bacteria  0.6250000 0.5392157 0.5189655 0.5038023 0.3616218 0.04524375
#Archaea   0.6363636 0.5000000 0.5000000 0.4588235 0.2952381 0.03177691
#Eukaryota 0.4408602 0.4166667 0.4429012 0.5178765 0.3445991 0.12427146
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Archaea take a beating; recall that they lost a bunch of sequences in the initial steps since many of the archaeal sequences in SILVA are between 900 and 1200 nt long. If you are interested in analyzing the Archaea and the Eukaryota, I would suggest duplicating my efforts here but modifying the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen.seqs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pcr.seqs&lt;/code&gt; steps to target your region of interest.&lt;/p&gt;
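
&lt;p&gt;If you rebuild the references for another region, a quick tally from the bash terminal can serve as a rough cross-check on the R counts. This is a sketch under a few assumptions: the taxonomy file has the name in column 1 and a six-level, semicolon-delimited string in column 2, and, like the R function, names that contain the domain name are skipped. The helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_taxa&lt;/code&gt; is hypothetical, and because it selects rows by an exact match on the first level, the numbers may not match the R output exactly:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hypothetical helper: count the distinct names at one level for one domain
# $1 = taxonomy file, $2 = domain name, $3 = level (2 = phylum ... 6 = genus)
count_taxa() {
  cut -f 2 &quot;$1&quot; | awk -F&apos;;&apos; -v dom=&quot;$2&quot; -v lvl=&quot;$3&quot; &apos;
    $1 == dom { if (index($lvl, dom) == 0) seen[$lvl] = 1 }
    END { n = 0; for (name in seen) n++; print n }&apos;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_taxa silva.seed_v132.tax Archaea 6&lt;/code&gt; prints the number of distinct archaeal genus names in the SEED taxonomy file.&lt;/p&gt;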

&lt;p&gt;Finally, we want to compress the resulting alignment and this README file into the full length and SEED archives using commands in the bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tar cvzf silva.nr_v132.tgz silva.nr_v132.tax silva.nr_v132.align README.md
tar cvzf silva.seed_v132.tgz silva.seed_v132.tax silva.seed_v132.align README.md
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;application&quot;&gt;Application&lt;/h2&gt;

&lt;p&gt;So… which to use for what application? If you have the RAM, I’d suggest using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v132.align&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;align.seqs&lt;/code&gt;. It took about 10 minutes to read in the database file and a minute or so to align 1,000 full-length sequences. Here is an example workflow for use within mothur that will get you the V4 region of the 16S rRNA gene:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mothur &quot;#pcr.seqs(fasta=silva.nr_v132.align, start=11894, end=25319, keepdots=F, processors=8);
        unique.seqs()&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will get you 139,321 unique sequences to then align against (meh.). Other tricks to consider would be to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.lineage&lt;/code&gt; to pull out the reference sequences that are from the Bacteria; this will probably only reduce the size of the database by ~10%. You could also try using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filter.seqs&lt;/code&gt; with vertical=T; however, that might be problematic if there are insertions in your sequences (can’t know &lt;em&gt;a priori&lt;/em&gt;). It’s likely that you can just use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v132.align&lt;/code&gt; reference for aligning. For classifying sequences, I would strongly recommend using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v132.align&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v132.tax&lt;/code&gt; references after running pcr.seqs on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v132.align&lt;/code&gt;. I probably wouldn’t advise using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique.seqs&lt;/code&gt; on the output.&lt;/p&gt;

&lt;h2 id=&quot;legalese&quot;&gt;Legalese&lt;/h2&gt;

&lt;p&gt;If you are going to use the files generated in this README, you should be aware of &lt;a href=&quot;https://www.arb-silva.de/silva-license-information&quot;&gt;SILVA’s dual use license&lt;/a&gt;. We’ll leave it to you to work out the details.&lt;/p&gt;
</description>
        <pubDate>Wed, 10 Jan 2018 00:00:00 +0000</pubDate>
        <link>https://mothur.org/blog/2018/SILVA-v132-reference-files/</link>
        <guid isPermaLink="true">https://mothur.org/blog/2018/SILVA-v132-reference-files/</guid>
        
        
      </item>
    
      <item>
        <title>README for the SILVA v128 reference files</title>
        <description>&lt;p&gt;The good people at &lt;a href=&quot;https://arb-silva.de&quot;&gt;SILVA&lt;/a&gt; have released a new version of the SILVA database. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the &lt;a href=&quot;/wiki/Silva_reference_files&quot;&gt;mothur-compatible reference files&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;curation-of-references&quot;&gt;Curation of references&lt;/h2&gt;

&lt;h3 id=&quot;getting-the-data-in-and-out-of-the-arb-database&quot;&gt;Getting the data in and out of the ARB database&lt;/h3&gt;

&lt;p&gt;This README file explains how we generated the silva reference files for use with mothur’s classify.seqs and align.seqs commands. I’ll assume that you have a functioning copy of arb installed on your computer. For this README we are using version 6.0. First we need to download the database and decompress it. From the command line we do the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget -N https://www.arb-silva.de/fileadmin/arb_web_db/release_128/ARB_files/SSURef_NR99_128_SILVA_07_09_16_opt.arb.gz
gunzip SSURef_NR99_128_SILVA_07_09_16_opt.arb.gz
arb SSURef_NR99_128_SILVA_07_09_16_opt.arb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will launch us into the arb environment with the “Ref NR 99” database opened. This database has 597,607 sequences within it that are not more than 99% similar to each other. The release notes for &lt;a href=&quot;https://www.arb-silva.de/documentation/release-128/&quot;&gt;this database&lt;/a&gt; as well as the idea behind the &lt;a href=&quot;https://www.arb-silva.de/projects/ssu-ref-nr/&quot;&gt;non-redundant database&lt;/a&gt; are available from the silva website. Within arb do the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Click the search button&lt;/li&gt;
  &lt;li&gt;Set the first search field to ‘ARB_color’ and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)&lt;/li&gt;
  &lt;li&gt;Click ‘Search’. This yielded 577,832 hits&lt;/li&gt;
  &lt;li&gt;Click the “Mark Listed Unmark Rest” button&lt;/li&gt;
  &lt;li&gt;Close the “Search and Query” box&lt;/li&gt;
  &lt;li&gt;Now click on File-&amp;gt;export-&amp;gt;export to external format&lt;/li&gt;
  &lt;li&gt;In this box the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Export&lt;/code&gt; option should be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;marked&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Filter&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;none&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Compression&lt;/code&gt; should be set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;no&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;In the field for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Choose an output file name enter&lt;/code&gt; make sure the path has you in the correct working directory and enter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.full_v128.fasta&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Select a format: fasta_mothur.eft. This is a custom formatting file that I have created that includes the sequence’s accession number and its taxonomy across the top line. To create one, you will need to create &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fasta_mothur.eft&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/opt/local/share/arb/lib/export/&lt;/code&gt; folder with the following:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SUFFIX          fasta    
BEGIN    
&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;acc&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;name&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;align_ident_slv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;tax_slv&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;    
&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;|export_sequence&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;    
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Save this as silva.full_v128.fasta&lt;/li&gt;
  &lt;li&gt;You can now quit arb.&lt;/li&gt;
&lt;/ol&gt;
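
&lt;p&gt;Before moving on, it can be worth confirming that the export produced the header layout that the later &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cut&lt;/code&gt; commands rely on. The following is a minimal sketch and not part of the original workflow: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_bad_headers&lt;/code&gt; is a hypothetical helper name, and it assumes that each header line holds exactly three tab-separated fields (accession and name, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;align_ident_slv&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tax_slv&lt;/code&gt;), as laid out in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fasta_mothur.eft&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical helper (an assumption, not from the original README):
# print how many header lines in a fasta file do not have exactly
# three tab-separated fields.
count_bad_headers() {
    grep &quot;^&amp;gt;&quot; &quot;$1&quot; | awk -F&apos;\t&apos; &apos;NF != 3 { bad++ } END { print bad + 0 }&apos;
}

# Example usage once the export has finished:
# count_bad_headers silva.full_v128.fasta
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A result of 0 suggests that every header can be split safely on tabs; anything else is worth inspecting by hand before running the screening steps.&lt;/p&gt;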

&lt;h3 id=&quot;screening-the-sequences&quot;&gt;Screening the sequences&lt;/h3&gt;

&lt;p&gt;Now we need to screen the sequences for those that span the 27f and 1492r primer region, have 5 or fewer ambiguous base calls, and that are unique. We’ll also extract the taxonomic information from the header line. Run the following commands from a bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mothur &quot;#screen.seqs(fasta=silva.full_v128.fasta, start=1044, end=43116, maxambig=5, processors=8);
        pcr.seqs(start=1044, end=43116, keepdots=T);
        degap.seqs();
        unique.seqs();&quot;

#identify the unique sequences without regard to their alignment
grep &quot;&amp;gt;&quot; silva.full_v128.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- &amp;gt; silva.full_v128.good.pcr.ng.unique.accnos

#get the unique sequences without regard to their alignment
mothur &quot;#get.seqs(fasta=silva.full_v128.good.pcr.fasta, accnos=silva.full_v128.good.pcr.ng.unique.accnos)&quot;

#generate alignment file
mv silva.full_v128.good.pcr.pick.fasta silva.nr_v128.align

#generate taxonomy file
grep &apos;&amp;gt;&apos; silva.nr_v128.align | cut -f1,3 | cut -f2 -d&apos;&amp;gt;&apos; &amp;gt; silva.nr_v128.full
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The mothur commands above do several things. First, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen.seqs&lt;/code&gt; command removes sequences that are not full length or that have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB NR reference database lets in shorter (&amp;gt;900 bp) archaeal 16S rRNA gene sequences. Second, pcr.seqs converts any base calls that occur before position 1044 and after 43116 to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.&lt;/code&gt; to make them only span the region between the 27f and 1492r priming sites. Finally, it is possible that weird things happen in the alignments and so we unalign the sequences (degap.seqs) and identify the unique sequences (unique.seqs). We then convert the resulting fasta file into an accnos file so that we can go back into mothur and pull out the unique sequences from the aligned file (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.seqs&lt;/code&gt;).&lt;/p&gt;
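
&lt;p&gt;To see how many sequences survive each of these steps, you can count the header lines in the intermediate files. This is a small sketch rather than part of the original workflow; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_seqs&lt;/code&gt; is a hypothetical helper name, and it assumes ordinary fasta files in which every sequence has a single header line beginning with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical helper (an assumption, not from the original README):
# print &quot;file&amp;lt;TAB&amp;gt;count&quot; for each fasta file given as an argument,
# where count is the number of header lines in that file.
count_seqs() {
    for f in &quot;$@&quot;; do
        printf &apos;%s\t%s\n&apos; &quot;$f&quot; &quot;$(grep -c &quot;^&amp;gt;&quot; &quot;$f&quot;)&quot;
    done
}

# Example usage with files named in the commands above:
# count_seqs silva.full_v128.fasta silva.full_v128.good.pcr.fasta silva.nr_v128.align
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;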

&lt;h3 id=&quot;formatting-the-taxonomy-files&quot;&gt;Formatting the taxonomy files&lt;/h3&gt;

&lt;p&gt;Now we want to make sure the taxonomy file is properly formatted for use with mothur. First we want to grab the SILVA taxa mapping file by running the following in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bash&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_ssu_128.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Thanks to &lt;a href=&quot;https://forum.mothur.org/viewtopic.php?f=3&amp;amp;t=3652&amp;amp;p=20249#p12680&quot;&gt;Eric Collins at the University of Alaska Fairbanks&lt;/a&gt;, we have some nice R code to map all of the taxa names to the six Linnaean levels (kingdom, phylum, class, order, family, and genus). We’ll run the following code from within R:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;map.in &amp;lt;- read.table(&quot;tax_slv_ssu_128.txt&quot;,header=F,sep=&quot;\t&quot;,stringsAsFactors=F)
map.in &amp;lt;- map.in[,c(1,3)]
colnames(map.in) &amp;lt;- c(&quot;taxlabel&quot;,&quot;taxlevel&quot;)
map.in &amp;lt;- rbind(map.in, c(&quot;Bacteria;RsaHf231;&quot;, &quot;phylum&quot;)) #wasn&apos;t in tax_slv_ssu_128.txt

#fix Escherichia nonsense
map.in$taxlevel[which(map.in$taxlabel==&quot;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;&quot;)] &amp;lt;- &quot;genus&quot;

taxlevels &amp;lt;- c(&quot;root&quot;,&quot;domain&quot;,&quot;major_clade&quot;,&quot;superkingdom&quot;,&quot;kingdom&quot;,&quot;subkingdom&quot;,&quot;infrakingdom&quot;,&quot;superphylum&quot;,&quot;phylum&quot;,&quot;subphylum&quot;,&quot;infraphylum&quot;,&quot;superclass&quot;,&quot;class&quot;,&quot;subclass&quot;,&quot;infraclass&quot;,&quot;superorder&quot;,&quot;order&quot;,&quot;suborder&quot;,&quot;superfamily&quot;,&quot;family&quot;,&quot;subfamily&quot;,&quot;genus&quot;)
taxabb &amp;lt;- c(&quot;ro&quot;,&quot;do&quot;,&quot;mc&quot;,&quot;pk&quot;,&quot;ki&quot;,&quot;bk&quot;,&quot;ik&quot;,&quot;pp&quot;,&quot;ph&quot;,&quot;bp&quot;,&quot;ip&quot;,&quot;pc&quot;,&quot;cl&quot;,&quot;bc&quot;,&quot;ic&quot;,&quot;po&quot;,&quot;or&quot;,&quot;bo&quot;,&quot;pf&quot;,&quot;fa&quot;,&quot;bf&quot;,&quot;ge&quot;)
tax.mat &amp;lt;- matrix(data=&quot;&quot;,nrow=nrow(map.in),ncol=length(taxlevels))
tax.mat[,1] &amp;lt;- &quot;root&quot;
colnames(tax.mat) &amp;lt;- taxlevels

outlevels &amp;lt;- c(&quot;domain&quot;,&quot;phylum&quot;,&quot;class&quot;,&quot;order&quot;,&quot;family&quot;,&quot;genus&quot;)

for (i in 1:nrow(map.in)) {
	taxname &amp;lt;- unlist(strsplit(as.character(map.in[i,1]), split=&apos;;&apos;))
	#print(taxname);

	while ( length(taxname) &amp;gt; 0) {
		#regex to look for exact match

		tax.exp &amp;lt;- paste(paste(taxname,collapse=&quot;;&quot;),&quot;;&quot;,sep=&quot;&quot;)
		tax.match &amp;lt;- match(tax.exp,map.in$taxlabel)
		tax.mat[i,map.in[tax.match,2]] &amp;lt;- tail(taxname,1)
		taxname &amp;lt;- head(taxname,-1)
	}
}

for (i in 1:nrow(tax.mat)) {
	#this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
	#if you don&apos;t want this behavior, cut it out
	for (j in 1:ncol(tax.mat)) {
		if(tax.mat[i,j] == &quot;&quot;) { tax.mat[i,j] &amp;lt;- paste(tmptax,taxabb[j],sep=&quot;_&quot;)}
		else { tmptax &amp;lt;- tax.mat[i,j]}
	}

	#this maps the new name to the input taxonomic levels
	map.in[i,&quot;taxout&quot;] &amp;lt;- paste(paste(tax.mat[i,outlevels],collapse=&quot;;&quot;),&quot;;&quot;,sep=&quot;&quot;)
}

# replace spaces with underscores
map.in$taxout &amp;lt;- gsub(&quot; &quot;,&quot;_&quot;,map.in$taxout)

# bring in the old taxonomic levels from SILVA and remap them using the new levels
tax.in &amp;lt;- read.table(&quot;silva.nr_v128.full&quot;,header=F,stringsAsFactors=F,sep=&quot;\t&quot;)
colnames(tax.in) &amp;lt;- c(&quot;taxid&quot;,&quot;taxlabel&quot;)

tax.in$taxlabel &amp;lt;- gsub(&quot;[[:space:]]+;&quot;, &quot;;&quot;, tax.in$taxlabel) #fix extra space in &quot;Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Pezizomycotina;Dothideomycetes;Pleosporales;Phaeosphaeriaceae;Parastagonospora ;&quot;
tax.in$taxlabel &amp;lt;- gsub(&quot;;[[:space:]]+$&quot;, &quot;;&quot;, tax.in$taxlabel)

tax.in$id &amp;lt;- 1:nrow(tax.in)

tax.write &amp;lt;- merge(tax.in,map.in,all.x=T,sort=F)
tax.write &amp;lt;- tax.write[order(tax.write$id),]


#we want to see whether everything has 6 taxonomic levels (kingdom to genus)
getDepth &amp;lt;- function(taxonString){
  initial &amp;lt;- nchar(taxonString)
    removed &amp;lt;- nchar(gsub(&quot;;&quot;, &quot;&quot;, taxonString))
    return(initial-removed)
}

depth &amp;lt;- getDepth(tax.write$taxout)
summary(depth) #should all be 6
bacteria &amp;lt;- grepl(&quot;Bacteria;&quot;, tax.write$taxout)
archaea &amp;lt;- grepl(&quot;Archaea;&quot;, tax.write$taxout)
eukarya &amp;lt;- grepl(&quot;Eukaryota;&quot;, tax.write$taxout)

tax.write[depth &amp;gt; 6 &amp;amp; bacteria,] #good to go
tax.write[depth &amp;gt; 6 &amp;amp; archaea,]  #good to go
tax.write[depth &amp;gt; 6 &amp;amp; eukarya,]  #good to go

write.table(tax.write[,c(&quot;taxid&quot;,&quot;taxout&quot;)],file=&quot;silva.full_v128.tax&quot;,sep=&quot;\t&quot;,row.names=F,quote=F,col.names=F)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
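
&lt;p&gt;The same depth check can be repeated from the shell on the taxonomy file that was just written. This is only a sketch under stated assumptions: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find_bad_depth&lt;/code&gt; is a hypothetical helper name, and the file is assumed to have two tab-separated columns with the taxonomy string in the second.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical helper (an assumption, not from the original README):
# print every line whose taxonomy string (column 2) does not contain
# exactly six semicolons. The count is taken on a copy of the column so
# that the printed line is left unchanged.
find_bad_depth() {
    awk -F&apos;\t&apos; &apos;{ tax = $2; n = gsub(/;/, &quot;&quot;, tax); if (n != 6) print }&apos; &quot;$1&quot;
}

# Example usage:
# find_bad_depth silva.full_v128.tax
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;No output means that every sequence has a six-level taxonomy string.&lt;/p&gt;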

&lt;h3 id=&quot;building-the-seed-references&quot;&gt;Building the SEED references&lt;/h3&gt;

&lt;p&gt;The first thing to note is that SILVA does not release their SEED; it is private. By screening through the ARB databases we can attempt to recreate it. Our previous publications show that align.seqs with the recreated SEED does an excellent job of realigning sequences to look like they would if you used SINA and the true SEED. Now we want to try to figure out which sequences are part of the seed. Earlier, when we exported the sequences from ARB, we included the align_ident_slv field from the database in our output. Let’s generate an accnos file that contains the names of the sequences with 100% identity to the SEED database and then use mothur to generate SEED fasta and taxonomy files. While we’re at it we’ll also generate the nr_128 taxonomy file. The following code will be run from within a bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;grep &quot;&amp;gt;&quot; silva.nr_v128.align | cut -f 1,2 | grep $&apos;\t100&apos; | cut -f 1 | cut -c 2- &amp;gt; silva.seed_v128.accnos
mothur &quot;#get.seqs(fasta=silva.nr_v128.align, taxonomy=silva.full_v128.tax, accnos=silva.seed_v128.accnos)&quot;
mv silva.nr_v128.pick.align silva.seed_v128.align
mv silva.full_v128.pick.tax silva.seed_v128.tax

mothur &quot;#get.seqs(taxonomy=silva.full_v128.tax, accnos=silva.full_v128.good.pcr.ng.unique.accnos)&quot;
mv silva.full_v128.pick.tax silva.nr_v128.tax
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
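
&lt;p&gt;The selection of SEED sequences hinges on the align_ident_slv value in the second tab-separated field of each header. Because tab matching in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grep&lt;/code&gt; can vary between shells and grep implementations, here is a hedged alternative that splits on tabs explicitly with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;awk&lt;/code&gt;. It is a sketch and not the method used to build the released files: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seed_names&lt;/code&gt; is a hypothetical helper name, and it compares the field numerically to 100 rather than matching text.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical helper (an assumption, not from the original README):
# for every header line whose second tab-separated field is numerically
# equal to 100, print the first field without its leading &quot;&amp;gt;&quot;.
seed_names() {
    awk -F&apos;\t&apos; &apos;/^&amp;gt;/ &amp;amp;&amp;amp; $2 + 0 == 100 { print substr($1, 2) }&apos; &quot;$1&quot;
}

# Example usage:
# seed_names silva.nr_v128.align &amp;gt; silva.seed_v128.accnos
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Comparing the line count of the resulting accnos file with the number of sequences in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v128.align&lt;/code&gt; is a quick way to confirm that both approaches select the same set.&lt;/p&gt;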

&lt;h3 id=&quot;taxonomic-representation&quot;&gt;Taxonomic representation&lt;/h3&gt;

&lt;p&gt;Let’s look to see how many different taxa we have for each taxonomic level within &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v128.tax&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v128.tax&lt;/code&gt;. To do this we’ll run the following in R:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;getNumTaxaNames &amp;lt;- function(file, kingdom){
  taxonomy &amp;lt;- read.table(file=file, row.names=1)
  sub.tax &amp;lt;- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  phyla &amp;lt;- sum(!grepl(kingdom, phyla))

  class &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  class &amp;lt;- sum(!grepl(kingdom, class))

  order &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  order &amp;lt;- sum(!grepl(kingdom, order))

  family &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  family &amp;lt;- sum(!grepl(kingdom, family))

  genus &amp;lt;- as.vector(levels(as.factor(gsub(&quot;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*&quot;, &quot;\\1&quot;, sub.tax))))
  genus &amp;lt;- sum(!grepl(kingdom, genus))

  n.seqs &amp;lt;- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}

kingdoms &amp;lt;- c(&quot;Bacteria&quot;, &quot;Archaea&quot;, &quot;Eukaryota&quot;)
tax.levels &amp;lt;- c(&quot;phyla&quot;, &quot;class&quot;, &quot;order&quot;, &quot;family&quot;, &quot;genus&quot;, &quot;n.seqs&quot;)

nr.file &amp;lt;- &quot;silva.nr_v128.tax&quot;
nr.matrix &amp;lt;- matrix(rep(0,18), nrow=3)
nr.matrix[1,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] &amp;lt;- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) &amp;lt;- kingdoms
colnames(nr.matrix) &amp;lt;- tax.levels
nr.matrix
#          phyla class order family genus n.seqs
#Bacteria     74   261   500   1001  3478 168111
#Archaea      24    52    59    101   217   4337
#Eukaryota   102   252   654    912  2673  18213

seed.file &amp;lt;- &quot;silva.seed_v128.tax&quot;
seed.matrix &amp;lt;- matrix(rep(0,18), nrow=3)
seed.matrix[1,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] &amp;lt;- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) &amp;lt;- kingdoms
colnames(seed.matrix) &amp;lt;- tax.levels
seed.matrix
#          phyla class order family genus n.seqs
#Bacteria     54   146   252    471  1375   8512
#Archaea       9    17    24     37    62    147
#Eukaryota    38    96   273    465   957   2554

seed.matrix / nr.matrix
#              phyla     class     order    family     genus     n.seqs
#Bacteria  0.7297297 0.5593870 0.5040000 0.4705295 0.3953422 0.05063321
#Archaea   0.3750000 0.3269231 0.4067797 0.3663366 0.2857143 0.03389440
#Eukaryota 0.3725490 0.3809524 0.4174312 0.5098684 0.3580247 0.14022951
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Archaea take a beating; recall that they lost a bunch of sequences in the initial steps since many of the archaeal sequences in SILVA are between 900 and 1200 nt long. If you are interested in analyzing the Archaea and the Eukaryota, I would suggest duplicating my efforts here but modifying the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;screen.seqs&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pcr.seqs&lt;/code&gt; steps to target your region of interest.&lt;/p&gt;

&lt;p&gt;Finally, we want to compress the resulting alignment and this README file into the full length and SEED archives using commands in the bash terminal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tar cvzf silva.nr_v128.tgz silva.nr_v128.tax silva.nr_v128.align README.*
tar cvzf silva.seed_v128.tgz silva.seed_v128.tax silva.seed_v128.align README.*
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;application&quot;&gt;Application&lt;/h2&gt;

&lt;p&gt;So… which to use for what application? If you have the RAM, I’d suggest using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v128.align&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;align.seqs&lt;/code&gt;. It took about 10 minutes to read in the database file and a minute or so to align 1,000 full-length sequences. Here is an example workflow for use within mothur that will get you the V4 region of the 16S rRNA gene:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mothur &quot;#pcr.seqs(fasta=silva.nr_v128.align, start=11894, end=25319, keepdots=F, processors=8);
        unique.seqs()&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will get you 104,711 unique sequences to then align against (meh.). Other tricks to consider would be to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.lineage&lt;/code&gt; to pull out the reference sequences that are from the Bacteria; this will probably only reduce the size of the database by ~10%. You could also try using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filter.seqs&lt;/code&gt; with vertical=T; however, that might be problematic if there are insertions in your sequences (can’t know &lt;em&gt;a priori&lt;/em&gt;). It’s likely that you can just use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.seed_v128.align&lt;/code&gt; reference for aligning. For classifying sequences, I would strongly recommend using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v128.align&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v128.tax&lt;/code&gt; references after running pcr.seqs on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silva.nr_v128.align&lt;/code&gt;. I probably wouldn’t advise using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique.seqs&lt;/code&gt; on the output.&lt;/p&gt;

&lt;h2 id=&quot;legalese&quot;&gt;Legalese&lt;/h2&gt;

&lt;p&gt;If you are going to use the files generated in this README, you should be aware of &lt;a href=&quot;https://www.arb-silva.de/silva-license-information&quot;&gt;SILVA’s dual use license&lt;/a&gt;. We’ll leave it to you to work out the details.&lt;/p&gt;
</description>
        <pubDate>Wed, 22 Mar 2017 00:00:00 +0000</pubDate>
        <link>https://mothur.org/blog/2017/SILVA-v128-reference-files/</link>
        <guid isPermaLink="true">https://mothur.org/blog/2017/SILVA-v128-reference-files/</guid>
        
        
      </item>
    
      <item>
        <title>README for the RDP v16 reference files</title>
        <description>&lt;p&gt;The good people at the &lt;a href=&quot;https://rdp.cme.msu.edu&quot;&gt;RDP&lt;/a&gt; have released a new version of the RDP database. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the &lt;a href=&quot;https://mothur.org/wiki/RDP_reference_files&quot;&gt;mothur-compatible reference files&lt;/a&gt;. The original files are available from the RDP’s &lt;a href=&quot;https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/&quot;&gt;sourceforge server&lt;/a&gt; and were used as the starting point for this README.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://rdp.cme.msu.edu/misc/rel10info.jsp#release11_history&quot;&gt;release notes&lt;/a&gt; indicate the following:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;RDP Release 11.5 consists of 3,356,809 aligned and annotated 16S rRNA sequences and 125,525 Fungal 28S rRNA sequences. The Bacteria and Archaea hierarchy model used by RDP Classifier and RDP Hierarchy Browser have been updated to training set No. 16. This new training set has over 300 new genera and 2000 new sequences added. There are some rearrangements in genera Gp1, Gp3 and Gp4 of the Acidobacteria due to addition of recently proposed new genera.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s get going…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-rf&lt;/span&gt; RDPClassifier_16S_trainsetNo16_rawtrainingdata&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;

wget &lt;span class=&quot;nt&quot;&gt;-N&lt;/span&gt; https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/RDPClassifier_16S_trainsetNo16_rawtrainingdata.zip
unzip &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; RDPClassifier_16S_trainsetNo16_rawtrainingdata.zip
&lt;span class=&quot;nb&quot;&gt;mv &lt;/span&gt;RDPClassifier_16S_trainsetNo16_rawtrainingdata/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; ./&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now we’d like to start to form the taxonomy file and the fasta file that will be our reference. Again, using bash commands…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;mv &lt;/span&gt;trainset16_022016.fa trainset16_022016.rdp.fasta
&lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&amp;gt;&quot;&lt;/span&gt; trainset16_022016.rdp.fasta | &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; 2- &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; trainset16_022016_rmdup.tax&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Next, we’d like to get our taxonomy file properly formatted. First we’ll read in the taxonomy data. The following steps are done in R…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;n&quot;&gt;tax_file&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;trainset16_022016_rmdup.tax&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;what&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quiet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;TRUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;accession&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;^(\\S*).*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#some are separated by tabs or spaces or both&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;.*(Root.*)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;	&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#remove spaces and replace with &apos;_&apos;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\t&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;	&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#remove extra tab characters&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;[^;]*_incertae_sedis$&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;\&quot;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;#remove quote marks&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The RDP inserts a variety of sub-taxonomic levels (e.g. suborder) that prevent us from having a consistent number of taxonomic levels for our analyses. Let’s use the data in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trainset16_db_taxid.txt&lt;/code&gt; to remove these extra taxonomic levels:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;n&quot;&gt;levels&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read.table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;trainset16_db_taxid.txt&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stringsAsFactors&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;FALSE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subs&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;levels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grep&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sub&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;levels&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;V5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sub.names&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;V2&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax.split&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strsplit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;taxonomy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;remove.subs&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax.vector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
	&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax.vector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;which&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax.vector&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%in%&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sub.names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;

&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tax.split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;remove.subs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs.str&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unlist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paste&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collapse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs.str&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;^Root;(.*)$&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\\1;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs.str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Finally, we can output the taxonomy data to a file we’ll call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;trainset16_022016.rdp.tax&lt;/code&gt; to have a consistent naming scheme with previous versions of those files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;n&quot;&gt;write.table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cbind&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;as.character&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;accession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no.subs.str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;trainset16_022016.rdp.tax&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row.names&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col.names&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quote&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;\t&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
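
&lt;p&gt;Before moving on, it can be worth confirming that removing the sub levels left every lineage with the same depth. The following is a minimal sketch to run from a bash prompt and is not part of the original workflow. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level_counts&lt;/code&gt; is hypothetical, and it assumes the two-column, tab-separated layout written above, with each lineage ending in a semicolon:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Hypothetical helper: tabulate how many lineages have each number of levels.
# Assumes column 2 is a semicolon-terminated lineage such as &quot;Bacteria;Firmicutes;&quot;.
level_counts() {
  cut -f 2 &quot;$1&quot; |
    awk -F &apos;;&apos; &apos;{ print NF - 1 }&apos; |
    sort -n | uniq -c |
    awk &apos;{ print $2 &quot;\t&quot; $1 }&apos;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level_counts trainset16_022016.rdp.tax&lt;/code&gt; prints one line per depth (number of levels, then number of lineages). A single line of output would suggest that the number of levels is consistent.&lt;/p&gt;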

&lt;p&gt;The RDP training sets do not include mitochondria or sequences from eukaryotes. We find that it is helpful to have these sequences because we sometimes get non-specific amplification and would like to be able to remove these lineages. Let’s go ahead and pull down the pds version of training set v.10 and copy those sequences over to our new training set. The following steps will be done in bash:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;wget &lt;span class=&quot;nt&quot;&gt;-N&lt;/span&gt; https://mothur.org/w/images/2/24/Trainset10_082014.pds.tgz
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;xvzf Trainset10_082014.pds.tgz
&lt;span class=&quot;nb&quot;&gt;mv &lt;/span&gt;trainset10_082014.pds/trainset10_082014&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; ./
&lt;span class=&quot;nb&quot;&gt;rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-rf&lt;/span&gt; trainset10_082014.pds Trainset10_082014.pds.tgz&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now let’s run a mothur command to pull out the extra sequences that are in the pds files:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;mothur &lt;span class=&quot;s2&quot;&gt;&quot;#get.lineage(fasta=trainset10_082014.pds.fasta, taxonomy=trainset10_082014.pds.tax, taxon=Eukaryota-Mitochondria)&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
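
&lt;p&gt;One way to check that only the requested lineages were kept is to count any lines in the resulting pick file that mention neither taxon. This is a sketch under the assumption that the taxon names appear verbatim in the lineage strings; the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_other_lineages&lt;/code&gt; is hypothetical:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Hypothetical helper: count taxonomy lines that match neither requested taxon.
# Prints 0 when every line mentions Eukaryota or Mitochondria.
count_other_lineages() {
  awk &apos;!/Eukaryota|Mitochondria/ { n++ } END { print n + 0 }&apos; &quot;$1&quot;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_other_lineages trainset10_082014.pds.pick.tax&lt;/code&gt; should print 0 if the selection worked as intended.&lt;/p&gt;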

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get.lineage&lt;/code&gt; command gives us the extra “pds” sequences, which we can now append to the end of the normal RDP training set:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;trainset16_022016.rdp.tax trainset10_082014.pds.pick.tax &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; trainset16_022016.pds.tax
&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;trainset16_022016.rdp.fasta trainset10_082014.pds.pick.fasta &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; trainset16_022016.pds.fasta&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
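
&lt;p&gt;Because the fasta and taxonomy files are concatenated separately, it may be worth checking that they still describe the same number of sequences. The sketch below is an assumption-laden sanity check rather than part of the original workflow; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;check_counts&lt;/code&gt; is a hypothetical name, and it assumes one taxonomy line per sequence:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Hypothetical helper: print the number of FASTA headers and taxonomy lines,
# and return non-zero if the two counts differ.
check_counts() {
  n_fasta=$(awk &apos;/^&amp;gt;/ { n++ } END { print n + 0 }&apos; &quot;$1&quot;)
  n_tax=$(awk &apos;END { print NR }&apos; &quot;$2&quot;)
  echo &quot;$n_fasta $n_tax&quot;
  [ &quot;$n_fasta&quot; -eq &quot;$n_tax&quot; ]
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;check_counts trainset16_022016.pds.fasta trainset16_022016.pds.tax&lt;/code&gt; prints the two counts and fails if they disagree.&lt;/p&gt;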

&lt;p&gt;While we’ve got the old version of the training set, it might be nice to see what the differences are. It would have been nice for them to provide a README indicating what changed, but, well, no, they didn’t.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;wc&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;.pds.tax&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##    10773 trainset10_082014.pds.tax
##    13335 trainset16_022016.pds.tax
##    24108 total&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
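
&lt;p&gt;The line counts only tell us that the newer training set is larger. To see where the growth is, one option is to tabulate the number of sequences per phylum in each file. This sketch assumes the lineage in the second column starts with the domain followed by the phylum; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phylum_counts&lt;/code&gt; is a hypothetical name:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Hypothetical helper: count sequences per domain;phylum pair in a taxonomy file.
# Assumes column 2 starts with the domain and phylum, separated by semicolons.
phylum_counts() {
  cut -f 2 &quot;$1&quot; |
    cut -d &apos;;&apos; -f 1,2 |
    sort | uniq -c |
    awk &apos;{ print $2 &quot;\t&quot; $1 }&apos;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Saving the output for each of the two &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.pds.tax&lt;/code&gt; files and comparing the tables, for example with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;diff&lt;/code&gt;, would show which phyla account for the change in size.&lt;/p&gt;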

&lt;p&gt;Now we’re ready to package and compress the training set files. First we do the RDP files…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;trainset16_022016.rdp
&lt;span class=&quot;nb&quot;&gt;cp &lt;/span&gt;README.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; trainset16_022016.rdp.fasta trainset16_022016.rdp.tax trainset16_022016.rdp
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;cvzf trainset16_022016.rdp.tgz  trainset16_022016.rdp/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;

a trainset16_022016.rdp/README.md
a trainset16_022016.rdp/trainset16_022016.rdp.fasta
a trainset16_022016.rdp/trainset16_022016.rdp.tax&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;… and then the pds files…&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;trainset16_022016.pds
&lt;span class=&quot;nb&quot;&gt;cp &lt;/span&gt;README.&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; trainset16_022016.pds.fasta trainset16_022016.pds.tax trainset16_022016.pds
&lt;span class=&quot;nb&quot;&gt;tar &lt;/span&gt;cvzf trainset16_022016.pds.tgz  trainset16_022016.pds/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;

a trainset16_022016.pds/README.md
a trainset16_022016.pds/trainset16_022016.pds.fasta
a trainset16_022016.pds/trainset16_022016.pds.tax&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
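
&lt;p&gt;As a final check, the contents of each archive can be listed without unpacking it. The helper below is a small sketch, with the hypothetical name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tarball_has&lt;/code&gt;, that succeeds only when a given member is present:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;# Hypothetical helper: succeed only if the archive lists the named member.
# tar tzf prints one member path per line without extracting anything.
tarball_has() {
  tar tzf &quot;$1&quot; | grep -x -F &quot;$2&quot; &amp;gt; /dev/null
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tarball_has trainset16_022016.pds.tgz trainset16_022016.pds/trainset16_022016.pds.tax&lt;/code&gt; returns success if the taxonomy file made it into the archive.&lt;/p&gt;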

</description>
        <pubDate>Wed, 15 Mar 2017 00:00:00 +0000</pubDate>
        <link>https://mothur.org/blog/2017/RDP-v16-reference_files/</link>
        <guid isPermaLink="true">https://mothur.org/blog/2017/RDP-v16-reference_files/</guid>
        
        
      </item>
    
  </channel>
</rss>
