mothur

README for the RDP v19 reference files

Tue, 12 Mar 2024 00:00:00 +0000

The good people at the RDP have released a new version of the RDP database. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the mothur-compatible reference files. The original files are available from the RDP's SourceForge server and were used as the starting point for this README.

The release notes indicate the following:

The Bacteria and Archaea hierarchy model used by RDP Classifier has been updated to training set No. 19. The new version has over 600 new genera and 2500 new species added since last version No. 18 released in July 2020. The information that is used to update the RDP taxonomy to training set version No. 19, and RDP Classifier version 2.14 came from publicly available scientific articles and public sequence repository, mostly from International Journal of Systematic and Evolutionary Microbiology (IJSEM), the All-Species Living Tree Project (LTP) and GenBank.

It is worth noting that most of the phyla have new names, according to the article “Oren A, Garrity GM. Valid publication of the names of forty-two phyla of prokaryotes. Int J Syst Evol Microbiol. 2021 Oct;71(10). doi: 10.1099/ijsem.0.005056. PMID: 34694987.”

Let’s get going…

rm -rf RDPClassifier_16S_trainsetNo19_rawtrainingdata*

wget -N http://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/RDPClassifier_16S_trainsetNo19_rawtrainingdata.zip
unzip -o RDPClassifier_16S_trainsetNo19_rawtrainingdata.zip
mv RDPClassifier_16S_trainsetNo19_rawtrainingdata/* ./

Now we’d like to start to form the taxonomy file and the fasta file that will be our reference. Again, using bash commands…

grep ">" trainset19_072023_speciesrank.fa | cut -c 2- > trainset19_072023_rmdup.tax
cp trainset19_072023_speciesrank.fa trainset19_072023.rdp.fasta
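
To make the header extraction concrete, here is a minimal sketch run on a made-up two-record FASTA file. The accessions, strain names, and lineage strings are invented for illustration and are only assumed to resemble the RDP headers (three tab-separated fields). It shows that `grep ">"` keeps only the header lines and `cut -c 2-` drops the leading `>`:

```shell
# Toy FASTA file; every record here is invented for illustration.
workdir=$(mktemp -d)
printf '>ACC001\tToy strain A\tdomain__Bacteria; phylum__Toyphylum\nACGT\n>ACC002\tToy strain B\tdomain__Archaea; phylum__Otherphylum\nTTGA\n' > "$workdir/toy.fa"

# Keep the header lines only, then drop the leading ">" (character 1).
headers=$(grep ">" "$workdir/toy.fa" | cut -c 2-)
printf '%s\n' "$headers"

rm -rf "$workdir"
```

Each remaining line starts with the accession, which is what the R code in the next step expects in the first column.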

Next, we’d like to get our taxonomy file properly formatted. First we’ll read in the taxonomy data. Then we’ll output the taxonomy data to a file we’ll call trainset19_072023.rdp.tax to have a consistent naming scheme with previous versions of those files. The following steps are done in R…

library(tidyverse)

incertae_sedis <- function(x) {

  to_fix <- c(1:6)[str_detect(x, "domain__")][1]

  if(is.na(to_fix)) { return(x) }

  x[to_fix] <- if_else(str_detect(x[to_fix - 1], "incertae_sedis"),
                      x[to_fix - 1],
                      paste0(x[to_fix-1], "_incertae_sedis")
              )

  incertae_sedis(x)

}

parse_taxonomy <- function(x) {

  c(domain = str_replace(x, ".*domain__([^;]*);.*", "\\1"),
    phylum = str_replace(x, ".* phylum__([^;]*);.*", "\\1"),
    class = str_replace(x, ".* class__([^;]*);.*", "\\1"),
    order = str_replace(x, ".* order__([^;]*);.*", "\\1"),
    family = str_replace(x, ".* family__([^;]*);.*", "\\1"),
    genus = str_replace(x, ".* genus__([^;]*).*", "\\1")) %>%
                incertae_sedis() %>%
                as_tibble_row(.) %>%
                mutate(across(phylum:genus,
                              ~str_replace(.x,
                                          pattern = " ",
                                          replacement = "_")))

}

tax_data <- read_tsv(file="trainset19_072023_rmdup.tax",
                    col_names = c("accession", "species_strain", "taxonomy"),
                    col_types = cols(.default = col_character())) %>%
            select(accession, taxonomy)

tax_data %>%
  mutate(parsed = map(.data$taxonomy, parse_taxonomy)) %>%
  select(-taxonomy) %>%
  unnest(parsed) %>%
  mutate(taxonomy = paste(domain, phylum, class,
                          order, family, genus, "", sep = ";")) %>%
  select(accession, taxonomy) %>%
  write_tsv("trainset19_072023.rdp.tax", col_names=FALSE, quote="none")
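
Before moving on, it can be worth checking that every line of the new taxonomy file really has six semicolon-terminated levels. The following is a rough sketch of such a check, not part of the original workflow; it runs on a toy file with invented names, and you would point awk at trainset19_072023.rdp.tax instead:

```shell
# Toy mothur-style taxonomy file: accession, tab, semicolon-terminated levels.
# All names are invented; the second record is deliberately too short.
workdir=$(mktemp -d)
printf 'ACC001\tBacteria;Toyphylum;Toyclass;Toyorder;Toyfamily;Toygenus;\nACC002\tBacteria;Toyphylum;Toyclass;Toyorder;\n' > "$workdir/toy.tax"

# gsub() returns the number of substitutions it made, which here is the
# number of semicolons in the taxonomy column; print accessions where it is not 6.
bad=$(awk -F '\t' '{ n = gsub(/;/, ";", $2); if (n != 6) print $1 }' "$workdir/toy.tax")
printf 'Accessions without six levels: %s\n' "$bad"

rm -rf "$workdir"
```

On the toy input only ACC002 is reported. An empty result on the real file would suggest the parsing step filled in every level.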

The RDP training sets do not include mitochondria or sequences from eukaryotes. We find that it is helpful to have these sequences because we can get non-specific amplification at times and would like to be able to remove these lineages. Let’s go ahead and pull down the pds version of training set v.10 and copy those sequences over to our new training set. The following steps will be done in bash:

wget -N https://mothur.s3.us-east-2.amazonaws.com/wiki/trainset10_082014.pds.tgz
tar xvzf trainset10_082014.pds.tgz
mv trainset10_082014.pds/trainset10_082014* ./
rm -rf trainset10_082014.pds trainset10_082014.pds.tgz

Now let’s run a mothur command to pull out the extra sequences that are in the pds files:

mothur "#get.lineage(fasta=trainset10_082014.pds.fasta, taxonomy=trainset10_082014.pds.tax, taxon=Eukaryota-Mitochondria)"

This last command gets us the extra “pds” sequences that we can now use to paste on to the end of the normal RDP training set

cat trainset19_072023.rdp.tax trainset10_082014.pds.pick.tax > trainset19_072023.pds.tax
cat trainset19_072023.rdp.fasta trainset10_082014.pds.pick.fasta > trainset19_072023.pds.fasta
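
After pasting files together like this, a quick sanity check is to confirm that the fasta and taxonomy files describe the same number of sequences. This is a minimal sketch on toy files with invented contents; for the real data you would substitute trainset19_072023.pds.fasta and trainset19_072023.pds.tax:

```shell
# Toy fasta and taxonomy files standing in for the real pds files.
workdir=$(mktemp -d)
printf '>ACC001\nACGT\n>ACC002\nTTGA\n' > "$workdir/toy.fasta"
printf 'ACC001\tBacteria;\nACC002\tArchaea;\n' > "$workdir/toy.tax"

# One ">" header per sequence in the fasta, one line per sequence in the tax file.
n_fasta=$(grep -c ">" "$workdir/toy.fasta")
n_tax=$(wc -l < "$workdir/toy.tax" | tr -d ' ')

if [ "$n_fasta" -eq "$n_tax" ]; then
  status="match"
else
  status="mismatch"
fi
echo "$n_fasta sequences, $n_tax taxonomy lines: $status"

rm -rf "$workdir"
```

A mismatch would suggest that one of the two cat commands picked up the wrong file.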

While we’ve got the old version of the training set, it might be nice to see what the differences are. It would have been nice for them to provide a README indicating what changed, but, well, no, they didn’t.

wc -l *.pds.tax

##   10773 trainset10_082014.pds.tax
##   24765 trainset19_072023.pds.tax
##   35538 total

Now we’re ready to compress the taxonomy files. First we do the RDP files…

mkdir trainset19_072023.rdp
cp README.* trainset19_072023.rdp.fasta trainset19_072023.rdp.tax trainset19_072023.rdp
tar cvzf trainset19_072023.rdp.tgz  trainset19_072023.rdp/*

##   a trainset19_072023.rdp/README.md
##   a trainset19_072023.rdp/trainset19_072023.rdp.fasta
##   a trainset19_072023.rdp/trainset19_072023.rdp.tax

… and then the pds files…

mkdir trainset19_072023.pds
cp README.* trainset19_072023.pds.fasta trainset19_072023.pds.tax trainset19_072023.pds
tar cvzf trainset19_072023.pds.tgz  trainset19_072023.pds/*

##   a trainset19_072023.pds/README.md
##   a trainset19_072023.pds/trainset19_072023.pds.fasta
##   a trainset19_072023.pds/trainset19_072023.pds.tax

README for the SILVA v138.1 reference files

Tue, 23 Feb 2021 00:00:00 +0000

The good people at SILVA have released a new version of the SILVA v138 database. My understanding is that this is a minor update to correct some taxonomic information. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the mothur-compatible reference files.

Curation of references

Getting the data in and out of the ARB database

This README file explains how we generated the silva reference files for use with mothur’s classify.seqs and align.seqs commands. I’ll assume that you have a functioning copy of arb installed on your computer. For this README we are using version 6.0. First we need to download the database and decompress it. From the command line we do the following:

wget -N https://www.arb-silva.de/fileadmin/arb_web_db/release_138_1/ARB_files/SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz
gunzip SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz
arb SILVA_138.1_SSURef_NR99_12_06_20_opt.arb

This will launch us into the arb environment with the ‘‘Ref NR 99’’ database opened. This database has 510,508 sequences within it that are not more than 99% similar to each other. The release notes for this database as well as the idea behind the non-redundant database are available from the silva website. Within arb do the following:

Click the search button
Set the first search field to ‘ARB_color’ and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)
Click ‘Search’. This yielded 446,881 hits
Click the “Mark Listed Unmark Rest” button
Close the “Search and Query” box
Now click on File->export->export to external format
In this box the Export option should be set to marked, Filter to none, and Compression should be set to no.
In the field for Choose an output file name make sure the path has you in the correct working directory and enter silva.full_v138_1.fasta.
Select a format: fasta_mothur.eft. This is a custom formatting file that I have created that includes the sequence’s accession number and its taxonomy across the top line. To create one, you will need to create fasta_mothur.eft in the $ARBHOME/lib/export/ folder with the following:
```
SUFFIX          fasta
BEGIN
>*(acc).*(name)\t*(align_ident_slv)\t*(tax_slv);
*(|export_sequence)
```
Save this as silva.full_v138_1.fasta
You can now quit arb.

Screening the sequences

Now we need to screen the sequences for those that span the 27f and 1492r primer region, have 5 or fewer ambiguous base calls, and that are unique. We’ll also extract the taxonomic information from the header line. Run the following commands from a bash terminal:

mothur "#screen.seqs(fasta=silva.full_v138_1.fasta, start=1044, end=43116, maxambig=5);
        pcr.seqs(start=1044, end=43116, keepdots=T);
        degap.seqs();
        unique.seqs();"

#identify the unique sequences without regard to their alignment
grep ">" silva.full_v138_1.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_v138_1.good.pcr.ng.unique.accnos

#get the unique sequences without regard to their alignment
mothur "#get.seqs(fasta=silva.full_v138_1.good.pcr.fasta, accnos=silva.full_v138_1.good.pcr.ng.unique.accnos)"

#generate alignment file
mv silva.full_v138_1.good.pcr.pick.fasta silva.nr_v138_1.align

#generate taxonomy file
grep '>' silva.nr_v138_1.align | cut -f1,3 | cut -f2 -d'>' > silva.nr_v138.full

The mothur commands above do several things. First, the screen.seqs command removes sequences that are not full length or that have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB NR reference database lets in shorter (>900 bp) archaeal 16S rRNA gene sequences. Second, pcr.seqs converts any base calls that occur before position 1044 and after position 43116 to . so that the sequences only span the region between the 27f and 1492r priming sites. Finally, it is possible that weird things happen in the alignments, so we unalign the sequences (degap.seqs) and identify the unique sequences (unique.seqs). We then convert the resulting fasta file into an accnos file so that we can go back into mothur and pull out the unique sequences from the aligned file (get.seqs).
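
The two grep/cut pipelines above are easier to follow on a small example. Here is a sketch using a toy alignment file whose headers follow the fasta_mothur.eft layout (accession.name, identity, taxonomy, separated by tabs). The records themselves are invented:

```shell
# Toy file with headers shaped like the ARB export; contents are invented.
workdir=$(mktemp -d)
printf '>AB000001.ToyA\t100\tBacteria;Toyphylum;Toyclass;\nACGT\n>AB000002.ToyB\t98.7\tArchaea;Toyphylum;Toyclass;\nTTGA\n' > "$workdir/toy.align"

# accnos list: first tab-separated field of each header, minus the ">"
accnos=$(grep ">" "$workdir/toy.align" | cut -f 1 | cut -c 2-)

# taxonomy table: fields 1 and 3 of each header, then everything after the ">"
full=$(grep '>' "$workdir/toy.align" | cut -f1,3 | cut -f2 -d'>')

printf '%s\n' "$accnos"
printf '%s\n' "$full"

rm -rf "$workdir"
```

The first pipeline gives one sequence name per line, which is the accnos format that get.seqs reads; the second gives the name and the raw SILVA taxonomy separated by a tab.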

Formatting the taxonomy files

Now we want to make sure the taxonomy file is properly formatted for use with mothur. First we want to grab the SILVA taxa mapping file by running the following in bash:

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/tax_slv_ssu_138.1.txt.gz
gunzip tax_slv_ssu_138.1.txt.gz

Thanks to Eric Collins at the University of Alaska Fairbanks, we have some nice R code to map all of the taxa names to the six Linnaean levels (kingdom, phylum, class, order, family, and genus). We’ll run the following code from within R:

map.in <- read.table("tax_slv_ssu_138.1.txt",header=F,sep="\t",stringsAsFactors=F)
map.in <- map.in[,c(1,3)]
colnames(map.in) <- c("taxlabel","taxlevel")
# map.in <- rbind(map.in, c("Bacteria;RsaHf231;", "phylum")) #wasn't in tax_slv_ssu_138.txt

#fix Escherichia nonsense (commented out for this release)
# map.in$taxlevel[which(map.in$taxlabel=="Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;")] <- "genus"

taxlevels <- c("root","domain","major_clade","superkingdom","kingdom","subkingdom","infrakingdom","superphylum","phylum","subphylum","infraphylum","superclass","class","subclass","infraclass","superorder","order","suborder","superfamily","family","subfamily","genus")
taxabb <- c("ro","do","mc","pk","ki","bk","ik","pp","ph","bp","ip","pc","cl","bc","ic","po","or","bo","pf","fa","bf","ge")
tax.mat <- matrix(data="",nrow=nrow(map.in),ncol=length(taxlevels))
tax.mat[,1] <- "root"
colnames(tax.mat) <- taxlevels

outlevels <- c("domain","phylum","class","order","family","genus")

for(i in 1:nrow(map.in)) {
	taxname <- unlist(strsplit(as.character(map.in[i,1]), split=';'))
	#print(taxname);

	while ( length(taxname) > 0) {
		#regex to look for exact match

		tax.exp <- paste(paste(taxname,collapse=";"),";",sep="")
		tax.match <- match(tax.exp,map.in$taxlabel)
		tax.mat[i,map.in[tax.match,2]] <- tail(taxname,1)
		taxname <- head(taxname,-1)
	}
}

for(i in 1:nrow(tax.mat)) {
	#this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
	#if you don't want this behavior, cut it out
	for(j in 1:ncol(tax.mat)) {
		if(tax.mat[i,j] == "") { tax.mat[i,j] <- paste(tmptax,taxabb[j],sep="_")}
		else { tmptax <- tax.mat[i,j]}
	}

	#this maps the new name to the input taxonomic levels
	map.in[i,"taxout"] <- paste(paste(tax.mat[i,outlevels],collapse=";"),";",sep="")
}

# replace spaces with underscores
map.in$taxout <- gsub(" ","_",map.in$taxout)

# bring in the old taxonomic levels from SILVA and remap them using the new levels
tax.in <- read.table("silva.nr_v138.full",header=F,stringsAsFactors=F,sep="\t")
colnames(tax.in) <- c("taxid","taxlabel")

# Following line corrects the Bacteria;Bacteroidetes;Bacteroidia;Flavobacteriales;Flavobacteriaceae;Polaribacter;Polaribacter; problem (commented out for this release)
# tax.in$taxlabel <- gsub("Polaribacter;Polaribacter;", "Polaribacter;", tax.in$taxlabel)
tax.in$taxlabel <- gsub(";[[:space:]]+$", ";", tax.in$taxlabel)

tax.in$id <- 1:nrow(tax.in)

tax.write <- merge(tax.in,map.in,all.x=T,sort=F)
tax.write <- tax.write[order(tax.write$id),]


#we want to see whether everything has 6 taxonomic levels (kingdom to genus)
getDepth <- function(taxonString){
	initial <- nchar(taxonString)
	removed <- nchar(gsub(";", "", taxonString))
	return(initial-removed)
}

depth <- getDepth(tax.write$taxout)
summary(depth) #should all be 6 and there should be no NAs
bacteria <- grepl("Bacteria;", tax.write$taxout)
archaea <- grepl("Archaea;", tax.write$taxout)
eukarya <- grepl("Eukaryota;", tax.write$taxout)

tax.write[depth > 6 & bacteria,] #if zero, we're good to go
tax.write[depth > 6 & archaea,]  #if zero, we're good to go
tax.write[depth > 6 & eukarya,]  #if zero, we're good to go

write.table(tax.write[,c("taxid","taxout")], file="silva.full_v138_1.tax",sep="\t",row.names=F,quote=F,col.names=F)
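
The gap-filling rule in the R code above can be hard to picture, so here is a simplified sketch of it in awk. It assumes a single lineage that has already been reduced to the six output levels (the R code works across all 22 SILVA levels before subsetting), and the taxon names are invented. An empty level takes the name of the closest named level above it plus the abbreviation of the level being filled:

```shell
# One toy lineage with missing class, order, and genus (names are invented).
filled=$(printf 'Bacteria;Toyphylum;;;Toyfamily;;\n' | awk -F ';' '
BEGIN { split("do;ph;cl;or;fa;ge", abb, ";") }   # abbreviations for the six levels
{
  out = ""
  for (j = 1; j <= 6; j++) {
    if ($j == "") {
      name = last "_" abb[j]      # borrow the closest named level above
    } else {
      name = $j
      last = $j                   # remember the most recent named level
    }
    out = out name ";"
  }
  print out
}')
echo "$filled"
```

For this input the result is Bacteria;Toyphylum;Toyphylum_cl;Toyphylum_or;Toyfamily;Toyfamily_ge; so consecutive gaps all point back to the same named parent rather than chaining suffixes.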

Building the SEED references

The first thing to note is that SILVA does not release their SEED; it is private. By screening through the ARB databases we can attempt to recreate it. Our previous publications show that align.seqs with the recreated SEED does an excellent job of realigning sequences to look like they would if you used SINA and the true SEED. Now we want to try to figure out which sequences are part of the SEED. Earlier, when we exported the sequences from ARB, we included the align_ident_slv field from the database in our output. Let’s generate an accnos file that contains the names of the sequences with 100% identity to the SEED database and then use mothur to generate SEED fasta and taxonomy files. While we’re at it, we’ll also generate the silva.nr_v138_1.tax taxonomy file. The following code will be run from within a bash terminal:

# $'\t100' makes bash pass a literal tab to grep; a plain "\t100" relies on grep interpreting \t, which not every grep does
grep ">" silva.nr_v138_1.align | cut -f 1,2 | grep $'\t100' | cut -f 1 | cut -c 2- > silva.seed_v138.accnos
mothur "#get.seqs(fasta=silva.nr_v138_1.align, taxonomy=silva.full_v138_1.tax, accnos=silva.seed_v138.accnos)"
mv silva.nr_v138_1.pick.align silva.seed_v138_1.align
mv silva.full_v138_1.pick.tax silva.seed_v138_1.tax

mothur "#get.seqs(taxonomy=silva.full_v138_1.tax, accnos=silva.full_v138_1.good.pcr.ng.unique.accnos)"
mv silva.full_v138_1.pick.tax silva.nr_v138_1.tax
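
If you would rather not depend on how grep treats tabs, the same selection can be sketched with awk, which splits on tabs explicitly and compares the identity field as a number. This is an alternative illustration rather than the command used to build the released files. The toy records are invented, and I am assuming align_ident_slv is written as a plain number such as 100:

```shell
# Toy alignment file: accession.name, identity to the SEED, taxonomy (invented).
workdir=$(mktemp -d)
printf '>AB000001.ToyA\t100\tBacteria;\nACGT\n>AB000002.ToyB\t98.7\tBacteria;\nTTGA\n>AB000003.ToyC\t100\tArchaea;\nGGCC\n' > "$workdir/toy.align"

# For header lines whose second field equals 100, print field 1 without the ">".
seed=$(awk -F '\t' '/^>/ && $2 == 100 { print substr($1, 2) }' "$workdir/toy.align")
printf '%s\n' "$seed"

rm -rf "$workdir"
```

On the toy file this keeps AB000001.ToyA and AB000003.ToyC and drops the 98.7% record. Because the comparison is numeric, a value written as 100.0 would be kept as well.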

Taxonomic representation

Let’s look to see how many different taxa we have for each taxonomic level within the silva.nr_v138_1.tax and silva.seed_v138_1.tax files. To do this we’ll run the following in R:

getNumTaxaNames <- function(file, kingdom){
  taxonomy <- read.table(file=file, row.names=1)
  sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla <- as.vector(levels(as.factor(gsub("[^;]*;([^;]*;).*", "\\1", sub.tax))))
  phyla <- sum(!grepl(kingdom, phyla))

  class <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  class <- sum(!grepl(kingdom, class))

  order <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  order <- sum(!grepl(kingdom, order))

  family <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  family <- sum(!grepl(kingdom, family))

  genus <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  genus <- sum(!grepl(kingdom, genus))

  n.seqs <- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}

kingdoms <- c("Bacteria", "Archaea", "Eukaryota")
tax.levels <- c("phyla", "class", "order", "family", "genus", "n.seqs")

nr.file <- "silva.nr_v138_1.tax"
nr.matrix <- matrix(rep(0,18), nrow=3)
nr.matrix[1,] <- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] <- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] <- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) <- kingdoms
colnames(nr.matrix) <- tax.levels
nr.matrix
#          phyla class order family genus n.seqs
#Bacteria     87   238   631   1139  3955 128884
#Archaea      15    33    57     97   222   2846
#Eukaryota    92   243   644    871  2682  14871


seed.file <- "silva.seed_v138_1.tax"
seed.matrix <- matrix(rep(0,18), nrow=3)
seed.matrix[1,] <- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] <- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] <- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) <- kingdoms
colnames(seed.matrix) <- tax.levels
seed.matrix
#          phyla class order family genus n.seqs
#Bacteria     51   123   299    523  1182   5736
#Archaea       7    17    23     30    44     81
#Eukaryota    40    98   272    422   855   1824

seed.matrix / nr.matrix
#              phyla     class     order    family     genus     n.seqs
#Bacteria  0.5862069 0.5168067 0.4738510 0.4591747 0.2988622 0.04450514
#Archaea   0.4666667 0.5151515 0.4035088 0.3092784 0.1981982 0.02846100
#Eukaryota 0.4347826 0.4032922 0.4223602 0.4845006 0.3187919 0.12265483

The Archaea take a beating; recall that they lost a bunch of sequences in the initial steps since many of the archaeal sequences in SILVA are between 900 and 1200 nt long. If you are interested in analyzing the Archaea and the Eukaryota, I would suggest duplicating my efforts here but modifying the screen.seqs and pcr.seqs steps to target your region of interest.

Finally, we want to compress the resulting alignment and this README file into the full length and SEED archives using commands in the bash terminal:

tar cvzf silva.nr_v138_1.tgz silva.nr_v138_1.tax silva.nr_v138_1.align README.md
tar cvzf silva.seed_v138_1.tgz silva.seed_v138_1.tax silva.seed_v138_1.align README.md

Application

So… which to use for what application? If you have the RAM, I’d suggest using silva.nr_v138_1.align in align.seqs. It took about 10 minutes to read in the database file and a minute or so to align 1,000 full-length sequences if you only use a single processor. Here is an example workflow for use within mothur that will get you the V4 region of the 16S rRNA gene:

mothur "#pcr.seqs(fasta=silva.nr_v138_1.align, start=11894, end=25319, keepdots=F);
        unique.seqs()"

This will get you down to 106,985 unique sequences to then align against. Other tricks to consider would be to use get.lineage to pull out the reference sequences that are from the Bacteria, although this will probably only reduce the size of the database by ~10%. You could also try using filter.seqs with vertical=T; however, that might be problematic if there are insertions in your sequences (you can’t know a priori). It’s likely that you can just use the silva.seed_v138_1.align reference for aligning. For classifying sequences, I would strongly recommend using the silva.nr_v138_1.align and silva.nr_v138_1.tax references after running pcr.seqs on silva.nr_v138_1.align. I probably wouldn’t advise using unique.seqs on the output.

Legalese

If you are going to use the files generated in this README, you should be aware that this release is available under a CC-BY license.

README for the RDP v18 reference files

Thu, 04 Feb 2021 00:00:00 +0000

The release notes indicate the following:

The Bacteria and Archaea hierarchy model used by RDP Classifier has been updated to training set No. 18. The new version has over 800 new genera and 4000 new species added. Major rearrangements for Classifier training set No. 18 include the following: (go check out the release notes that are linked above for the list of changes).

Let’s get going…

rm -rf RDPClassifier_16S_trainsetNo18_rawtrainingdata*

wget -N http://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/RDPClassifier_16S_trainsetNo18_rawtrainingdata.zip
unzip -o RDPClassifier_16S_trainsetNo18_rawtrainingdata.zip
mv RDPClassifier_16S_trainsetNo18_rawtrainingdata/* ./

Now we’d like to start to form the taxonomy file and the fasta file that will be our reference. Again, using bash commands…

mv trainset18_062020.fa trainset18_062020.rdp.fasta
grep ">" trainset18_062020.rdp.fasta | cut -c 2- > trainset18_062020_rmdup.tax

Next, we’d like to get our taxonomy file properly formatted. First we’ll read in the taxonomy data. The following steps are done in R…

tax_file <- scan(file="trainset18_062020_rmdup.tax", what="", sep="\n", quiet=TRUE)

accession <- gsub("^(\\S*).*", "\\1", tax_file) #some are separated by tabs or spaces or both

taxonomy <- gsub(".*(Root.*)", "\\1", tax_file)
taxonomy <- gsub(" ", "_", taxonomy)	#remove spaces and replace with '_'
taxonomy <- gsub("\t", "", taxonomy)	#remove extra tab characters
taxonomy <- gsub("[^;]*_incertae_sedis$", "", taxonomy)
taxonomy <- gsub('\"', '', taxonomy) #remove quote marks

The RDP inserts a variety of sub taxonomic levels (e.g. suborder) that will get in the way of us having a consistent number of taxonomic levels for our analyses. Let’s use the data in trainset18_db_taxid.txt to remove these extra taxonomic levels:

levels <- read.table(file="trainset18_db_taxid.txt", sep="*", stringsAsFactors=FALSE)
subs <- levels[grep("sub", levels$V5),]
sub.names <- subs$V2

tax.split <- strsplit(taxonomy, split=";")

remove.subs <- function(tax.vector){
	return(tax.vector[which(!tax.vector %in% sub.names)])
}

no.subs <- lapply(tax.split, remove.subs)
no.subs.str <- unlist(lapply(no.subs, paste, collapse=";"))
no.subs.str <- gsub("^Root;(.*)$", "\\1;", no.subs.str)
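
For anyone who wants to see what the sub-level removal does to a single lineage, here is a small awk sketch of the same idea. The lineage and the two sub-level names are invented; in the real workflow the names to drop come from trainset18_db_taxid.txt as shown in the R code above:

```shell
# One toy lineage that includes a subclass and a suborder (names are invented).
no_subs=$(printf 'Root;Bacteria;Toyphylum;Toyclass;Toysubclass;Toyorder;Toysuborder;Toyfamily;Toygenus\n' | awk -F ';' '
BEGIN { drop["Toysubclass"] = 1; drop["Toysuborder"] = 1 }   # hypothetical sub-level names
{
  out = ""
  for (j = 2; j <= NF; j++) {          # start at 2 to skip the leading "Root"
    if (!($j in drop)) out = out $j ";"
  }
  print out
}')
echo "$no_subs"
```

The result is Bacteria;Toyphylum;Toyclass;Toyorder;Toyfamily;Toygenus; which has the six levels and trailing semicolon that mothur expects.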

Finally, we can output the taxonomy data to a file we’ll call trainset18_062020.rdp.tax to have a consistent naming scheme with previous versions of those files:

write.table(cbind(as.character(accession), no.subs.str), "trainset18_062020.rdp.tax", row.names=F, col.names=F, quote=F, sep="\t")

wget -N https://mothur.s3.us-east-2.amazonaws.com/wiki/trainset10_082014.pds.tgz
tar xvzf trainset10_082014.pds.tgz
mv trainset10_082014.pds/trainset10_082014* ./
rm -rf trainset10_082014.pds trainset10_082014.pds.tgz

Now let’s run a mothur command to pull out the extra sequences that are in the pds files:

mothur "#get.lineage(fasta=trainset10_082014.pds.fasta, taxonomy=trainset10_082014.pds.tax, taxon=Eukaryota-Mitochondria)"

This last command gets us the extra “pds” sequences that we can now use to paste on to the end of the normal RDP training set

cat trainset18_062020.rdp.tax trainset10_082014.pds.pick.tax > trainset18_062020.pds.tax
cat trainset18_062020.rdp.fasta trainset10_082014.pds.pick.fasta > trainset18_062020.pds.fasta

wc -l *.pds.tax

## 10773 trainset10_082014.pds.tax
## 21318 trainset18_062020.pds.tax
## 32091 total

Now we’re ready to compress the taxonomy files. First we do the RDP files…

mkdir trainset18_062020.rdp
cp README.* trainset18_062020.rdp.fasta trainset18_062020.rdp.tax trainset18_062020.rdp
tar cvzf trainset18_062020.rdp.tgz  trainset18_062020.rdp/*

## a trainset18_062020.rdp/README.md
## a trainset18_062020.rdp/trainset18_062020.rdp.fasta
## a trainset18_062020.rdp/trainset18_062020.rdp.tax

… and then the pds files…

mkdir trainset18_062020.pds
cp README.* trainset18_062020.pds.fasta trainset18_062020.pds.tax trainset18_062020.pds
tar cvzf trainset18_062020.pds.tgz  trainset18_062020.pds/*

## a trainset18_062020.pds/README.md
## a trainset18_062020.pds/trainset18_062020.pds.fasta
## a trainset18_062020.pds/trainset18_062020.pds.tax

README for the SILVA v138 reference files

Wed, 04 Mar 2020 00:00:00 +0000

The good people at SILVA have released a new version of the SILVA database. A little bit of tweaking is needed to get their files to be compatible with mothur. This README document describes the process that I used to generate the mothur-compatible reference files.

Getting the data in and out of the ARB database

wget -N https://www.arb-silva.de/fileadmin/silva_databases/release_138/ARB_files/SILVA_138_SSURef_NR99_05_01_20_opt.arb.gz
gunzip SILVA_138_SSURef_NR99_05_01_20_opt.arb.gz
arb SILVA_138_SSURef_NR99_05_01_20_opt.arb

This will launch us into the arb environment with the ‘‘Ref NR 99’’ database opened. This database has 510,984 sequences within it that are not more than 99% similar to each other. The release notes for this database as well as the idea behind the non-redundant database are available from the silva website. Within arb do the following:

Click the search button
Set the first search field to ‘ARB_color’ and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)
Click ‘Search’. This yielded 447,349 hits
Click the “Mark Listed Unmark Rest” button
Close the “Search and Query” box
Now click on File->export->export to external format
In this box the Export option should be set to marked, Filter to none, and Compression should be set to no.
In the field for Choose an output file name make sure the path has you in the correct working directory and enter silva.full_v138.fasta.
Select a format: fasta_mothur.eft. This is a custom formatting file that I have created that includes the sequence’s accession number and its taxonomy across the top line. To create one, you will need to create fasta_mothur.eft in the $ARBHOME/lib/export/ folder with the following:
```
SUFFIX          fasta    
BEGIN    
>*(acc).*(name)\t*(align_ident_slv)\t*(tax_slv);    
*(|export_sequence)    
```
Save this as silva.full_v138.fasta
You can now quit arb.

Screening the sequences

mothur "#screen.seqs(fasta=silva.full_v138.fasta, start=1044, end=43116, maxambig=5, processors=8);
        pcr.seqs(start=1044, end=43116, keepdots=T);
        degap.seqs();
        unique.seqs();"

#identify the unique sequences without regard to their alignment
grep ">" silva.full_v138.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_v138.good.pcr.ng.unique.accnos

#get the unique sequences without regard to their alignment
mothur "#get.seqs(fasta=silva.full_v138.good.pcr.fasta, accnos=silva.full_v138.good.pcr.ng.unique.accnos)"

#generate alignment file
mv silva.full_v138.good.pcr.pick.fasta silva.nr_v138.align

#generate taxonomy file
grep '>' silva.nr_v138.align | cut -f1,3 | cut -f2 -d'>' > silva.nr_v138.full

Formatting the taxonomy files

Now we want to make sure the taxonomy file is properly formatted for use with mothur. First we want to grab the SILVA taxa mapping file by running the following in bash:

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/taxonomy/tax_slv_ssu_138.txt.gz
gunzip tax_slv_ssu_138.txt.gz

map.in <- read.table("tax_slv_ssu_138.txt",header=F,sep="\t",stringsAsFactors=F)
map.in <- map.in[,c(1,3)]
colnames(map.in) <- c("taxlabel","taxlevel")
map.in <- rbind(map.in, c("Bacteria;RsaHf231;", "phylum")) #wasn't in tax_slv_ssu_138.txt

#fix Escherichia nonsense
map.in$taxlevel[which(map.in$taxlabel=="Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;")] <- "genus"

taxlevels <- c("root","domain","major_clade","superkingdom","kingdom","subkingdom","infrakingdom","superphylum","phylum","subphylum","infraphylum","superclass","class","subclass","infraclass","superorder","order","suborder","superfamily","family","subfamily","genus")
taxabb <- c("ro","do","mc","pk","ki","bk","ik","pp","ph","bp","ip","pc","cl","bc","ic","po","or","bo","pf","fa","bf","ge")
tax.mat <- matrix(data="",nrow=nrow(map.in),ncol=length(taxlevels))
tax.mat[,1] <- "root"
colnames(tax.mat) <- taxlevels

outlevels <- c("domain","phylum","class","order","family","genus")

for(i in 1:nrow(map.in)) {
	taxname <- unlist(strsplit(as.character(map.in[i,1]), split=';'))
	#print(taxname);

	while ( length(taxname) > 0) {
		#regex to look for exact match

		tax.exp <- paste(paste(taxname,collapse=";"),";",sep="")
		tax.match <- match(tax.exp,map.in$taxlabel)
		tax.mat[i,map.in[tax.match,2]] <- tail(taxname,1)
		taxname <- head(taxname,-1)
	}
}

for(i in 1:nrow(tax.mat)) {
	#this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
	#if you don't want this behavior, cut it out
	for(j in 1:ncol(tax.mat)) {
		if(tax.mat[i,j] == "") { tax.mat[i,j] <- paste(tmptax,taxabb[j],sep="_")}
		else { tmptax <- tax.mat[i,j]}
	}

	#this maps the new name to the input taxonomic levels
	map.in[i,"taxout"] <- paste(paste(tax.mat[i,outlevels],collapse=";"),";",sep="")
}

# replace spaces with underscores
map.in$taxout <- gsub(" ","_",map.in$taxout)

# bring in the old taxonomic levels from SILVA and remap them using the new levels
tax.in <- read.table("silva.nr_v138.full",header=F,stringsAsFactors=F,sep="\t")
colnames(tax.in) <- c("taxid","taxlabel")

# Following line corrects the Bacteria;Bacteroidetes;Bacteroidia;Flavobacteriales;Flavobacteriaceae;Polaribacter;Polaribacter; problem
tax.in$taxlabel <- gsub("Polaribacter;Polaribacter;", "Polaribacter;", tax.in$taxlabel)
tax.in$taxlabel <- gsub(";[[:space:]]+$", ";", tax.in$taxlabel)

tax.in$id <- 1:nrow(tax.in)

tax.write <- merge(tax.in,map.in,all.x=T,sort=F)
tax.write <- tax.write[order(tax.write$id),]


#we want to see whether everything has 6 taxonomic levels (kingdom to genus)
getDepth <- function(taxonString){
	initial <- nchar(taxonString)
	removed <- nchar(gsub(";", "", taxonString))
	return(initial-removed)
}

depth <- getDepth(tax.write$taxout)
summary(depth) #should all be 6 and there should be no NAs
bacteria <- grepl("Bacteria;", tax.write$taxout)
archaea <- grepl("Archaea;", tax.write$taxout)
eukarya <- grepl("Eukaryota;", tax.write$taxout)

tax.write[depth > 6 & bacteria,] #if zero, we're good to go
tax.write[depth > 6 & archaea,]  #if zero, we're good to go
tax.write[depth > 6 & eukarya,]  #if zero, we're good to go

write.table(tax.write[,c("taxid","taxout")],file="silva.full_v138.tax",sep="\t",row.names=F,quote=F,col.names=F)

Building the SEED references

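# $'\t100' makes bash pass a literal tab to grep; a plain "\t100" relies on grep interpreting \t, which not every grep does
grep ">" silva.nr_v138.align | cut -f 1,2 | grep $'\t100' | cut -f 1 | cut -c 2- > silva.seed_v138.accnos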
mothur "#get.seqs(fasta=silva.nr_v138.align, taxonomy=silva.full_v138.tax, accnos=silva.seed_v138.accnos)"
mv silva.nr_v138.pick.align silva.seed_v138.align
mv silva.full_v138.pick.tax silva.seed_v138.tax

mothur "#get.seqs(taxonomy=silva.full_v138.tax, accnos=silva.full_v138.good.pcr.ng.unique.accnos)"
mv silva.full_v138.pick.tax silva.nr_v138.tax

Taxonomic representation

Let’s look to see how many different taxa we have for each taxonomic level within the silva.nr_v138.tax, silva.seed_v138.tax. To do this we’ll run the following in R:

getNumTaxaNames <- function(file, kingdom){
  taxonomy <- read.table(file=file, row.names=1)
  sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla <- as.vector(levels(as.factor(gsub("[^;]*;([^;]*;).*", "\\1", sub.tax))))
  phyla <- sum(!grepl(kingdom, phyla))

  class <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  class <- sum(!grepl(kingdom, class))

  order <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  order <- sum(!grepl(kingdom, order))

  family <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  family <- sum(!grepl(kingdom, family))

  genus <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  genus <- sum(!grepl(kingdom, genus))

  n.seqs <- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}

kingdoms <- c("Bacteria", "Archaea", "Eukaryota")
tax.levels <- c("phyla", "class", "order", "family", "genus", "n.seqs")

nr.file <- "silva.nr_v138.tax"
nr.matrix <- matrix(rep(0,18), nrow=3)
nr.matrix[1,] <- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] <- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] <- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) <- kingdoms
colnames(nr.matrix) <- tax.levels
nr.matrix
#          phyla class order family genus n.seqs
#Bacteria     87   239   646   1138  3897 129063
#Archaea      15    33    57     97   219   2846
#Eukaryota    91   242   557    766  1727  14887

seed.file <- "silva.seed_v138.tax"
seed.matrix <- matrix(rep(0,18), nrow=3)
seed.matrix[1,] <- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] <- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] <- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) <- kingdoms
colnames(seed.matrix) <- tax.levels
seed.matrix
#          phyla class order family genus n.seqs
#Bacteria     51   124   312    522  1172   5741
#Archaea       7    17    23     30    44     81
#Eukaryota    40    98   238    380   727   1834

seed.matrix / nr.matrix
#              phyla     class     order    family     genus     n.seqs
#Bacteria  0.5862069 0.5188285 0.4829721 0.4586995 0.3007442 0.04448215
#Archaea   0.4666667 0.5151515 0.4035088 0.3092784 0.2009132 0.02846100
#Eukaryota 0.4395604 0.4049587 0.4272890 0.4960836 0.4209612 0.12319473

Finally, we want to compress the resulting alignment and this README file into the full length and SEED archives using commands in the bash terminal:

tar cvzf silva.nr_v138.tgz silva.nr_v138.tax silva.nr_v138.align README.md
tar cvzf silva.seed_v138.tgz silva.seed_v138.tax silva.seed_v138.align README.md

Application

So… which to use for what application? If you have the RAM, I’d suggest using silva.nr_v138.align in align.seqs. It took about 10 minutes to read in the database file and a minute or so to align 1,000 full-length sequences if you only use a single processor. Here is an example workflow for use within mothur that will get you the V4 region of the 16S rRNA gene:

mothur "#pcr.seqs(fasta=silva.nr_v138.align, start=11894, end=25319, keepdots=F, processors=8);
        unique.seqs()"

This will get you down to 107,001 unique sequences to then align against. Another trick to consider is to use get.lineage to pull out the reference sequences that are from the Bacteria, although this will probably only reduce the size of the database by ~10%. You could also try using filter.seqs with vertical=T; however, that might be problematic if there are insertions in your sequences (you can’t know a priori). It’s likely that you can just use the silva.seed_v138.align reference for aligning. For classifying sequences, I would strongly recommend using the silva.nr_v138.align and silva.nr_v138.tax references after running pcr.seqs on silva.nr_v138.align. I probably wouldn’t advise using unique.seqs on the output.

Legalese

If you are going to use the files generated in this README, you should be aware that this release is available under a CC-BY license.

README for the SILVA v132 reference files

Wed, 10 Jan 2018 00:00:00 +0000

Curation of references

Getting the data in and out of the ARB database

This README file explains how we generated the silva reference files for use with mothur’s classify.seqs and align.seqs commands. I’ll assume that you have a functioning copy of arb installed on your computer. For this README we are using version 6.0. First we need to download the database and decompress it. From the command line we do the following:

wget -N https://www.arb-silva.de/fileadmin/arb_web_db/release_132/ARB_files/SILVA_132_SSURef_NR99_13_12_17_opt.arb.gz
gunzip SILVA_132_SSURef_NR99_13_12_17_opt.arb.gz
arb SILVA_132_SSURef_NR99_13_12_17_opt.arb

This will launch us into the arb environment with the “Ref NR 99” database opened. This database has 695,171 sequences within it that are not more than 99% similar to each other. The release notes for this database as well as the idea behind the non-redundant database are available from the silva website. Within arb do the following:

Click the search button
Set the first search field to ‘ARB_color’ and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)
Click ‘Search’. This yielded 629,211 hits
Click the “Mark Listed Unmark Rest” button
Close the “Search and Query” box
Now click on File->export->export to external format
In this box the Export option should be set to marked, Filter to none, and Compression should be set to no.
In the field for Choose an output file name, make sure the path has you in the correct working directory and enter silva.full_v132.fasta.
Select a format: fasta_mothur.eft. This is a custom formatting file that I have created that includes the sequence’s accession number and its taxonomy across the top line. To create one, you will need to create fasta_mothur.eft in the /opt/local/share/arb/lib/export/ folder with the following:
```
SUFFIX          fasta    
BEGIN    
>*(acc).*(name)\t*(align_ident_slv)\t*(tax_slv);    
*(|export_sequence)    
```
Save this as silva.full_v132.fasta
You can now quit arb.
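
Before moving on, it can be worth confirming that the export really wrote the three tab-separated header fields (name, align_ident_slv, taxonomy) that the later steps rely on. Here is a minimal sketch in bash. It runs on a made-up two-sequence file rather than the real export, and the accessions and sequences are invented for illustration:

```shell
# Toy stand-in for silva.full_v132.fasta; these records are invented.
tmp=$(mktemp -d)
printf '>AB000001.Seq1\t100\tBacteria;Firmicutes;Bacilli;\nACGU\n' > "$tmp/toy.fasta"
printf '>AB000002.Seq2\t97.5\tArchaea;Euryarchaeota;\nGGCU\n' >> "$tmp/toy.fasta"

# Count the tab-separated fields on every header line and collapse duplicates.
# A single value of 3 means every header has name, identity, and taxonomy.
field_counts=$(grep ">" "$tmp/toy.fasta" | awk -F'\t' '{print NF}' | sort -u)
echo "$field_counts"
```

If more than one value is printed, some headers are missing a field and the cut commands used later would silently pick up the wrong column.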

Screening the sequences

mothur "#screen.seqs(fasta=silva.full_v132.fasta, start=1044, end=43116, maxambig=5, processors=8);
        pcr.seqs(start=1044, end=43116, keepdots=T);
        degap.seqs();
        unique.seqs();"

#identify the unique sequences without regard to their alignment
grep ">" silva.full_v132.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_v132.good.pcr.ng.unique.accnos

#get the unique sequences without regard to their alignment
mothur "#get.seqs(fasta=silva.full_v132.good.pcr.fasta, accnos=silva.full_v132.good.pcr.ng.unique.accnos)"

#generate alignment file
mv silva.full_v132.good.pcr.pick.fasta silva.nr_v132.align

#generate taxonomy file
grep '>' silva.nr_v132.align | cut -f1,3 | cut -f2 -d'>' > silva.nr_v132.full

The mothur commands above do several things. First, the screen.seqs command removes sequences that are not full length or that have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB NR reference database lets in shorter (>900 bp) archaeal 16S rRNA gene sequences. Second, pcr.seqs converts any base calls that occur before position 1044 and after position 43116 to . so that the sequences only span the region between the 27f and 1492r priming sites. Finally, it is possible that weird things happen in the alignments, so we unalign the sequences (degap.seqs) and identify the unique sequences (unique.seqs). We then convert the resulting fasta file into an accnos file so that we can go back into mothur and pull out the unique sequences from the aligned file (get.seqs).
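
To make the fasta-to-accnos step concrete, here is a small sketch of the same grep and cut pipeline run on an invented two-sequence file (the file names and records are placeholders, not the real data). It also checks that we end up with one name per sequence:

```shell
tmp=$(mktemp -d)
# Invented records that use the same header layout as the exported file.
printf '>AB000001.Seq1\t100\tBacteria;Firmicutes;\nACGU\n' > "$tmp/toy.unique.fasta"
printf '>AB000002.Seq2\t97.5\tArchaea;Euryarchaeota;\nGGCU\n' >> "$tmp/toy.unique.fasta"

# Keep the header lines, keep the first tab-separated field, drop the leading '>'.
grep ">" "$tmp/toy.unique.fasta" | cut -f 1 | cut -c 2- > "$tmp/toy.accnos"

# The accnos file should contain exactly one name per sequence.
n_seqs=$(grep -c ">" "$tmp/toy.unique.fasta")
n_names=$(wc -l < "$tmp/toy.accnos" | tr -d ' ')
first_name=$(head -n 1 "$tmp/toy.accnos")
echo "$n_seqs sequences, $n_names names, first name: $first_name"
```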

Formatting the taxonomy files

Now we want to make sure the taxonomy file is properly formatted for use with mothur. First we want to grab the SILVA taxa mapping file by running the following in bash:

wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_ssu_132.txt

map.in <- read.table("tax_slv_ssu_132.txt",header=F,sep="\t",stringsAsFactors=F)
map.in <- map.in[,c(1,3)]
colnames(map.in) <- c("taxlabel","taxlevel")
map.in <- rbind(map.in, c("Bacteria;RsaHf231;", "phylum")) #wasn't in tax_slv_ssu_132.txt

#fix Escherichia nonsense
map.in$taxlevel[which(map.in$taxlabel=="Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;")] <- "genus"

taxlevels <- c("root","domain","major_clade","superkingdom","kingdom","subkingdom","infrakingdom","superphylum","phylum","subphylum","infraphylum","superclass","class","subclass","infraclass","superorder","order","suborder","superfamily","family","subfamily","genus")
taxabb <- c("ro","do","mc","pk","ki","bk","ik","pp","ph","bp","ip","pc","cl","bc","ic","po","or","bo","pf","fa","bf","ge")
tax.mat <- matrix(data="",nrow=nrow(map.in),ncol=length(taxlevels))
tax.mat[,1] <- "root"
colnames(tax.mat) <- taxlevels

outlevels <- c("domain","phylum","class","order","family","genus")

for(i in 1:nrow(map.in)) {
	taxname <- unlist(strsplit(as.character(map.in[i,1]), split=';'))
	#print(taxname);

	while ( length(taxname) > 0) {
		#regex to look for exact match

		tax.exp <- paste(paste(taxname,collapse=";"),";",sep="")
		tax.match <- match(tax.exp,map.in$taxlabel)
		tax.mat[i,map.in[tax.match,2]] <- tail(taxname,1)
		taxname <- head(taxname,-1)
	}
}

for(i in 1:nrow(tax.mat)) {
	#this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
	#if you don't want this behavior, cut it out
	for(j in 1:ncol(tax.mat)) {
		if(tax.mat[i,j] == "") { tax.mat[i,j] <- paste(tmptax,taxabb[j],sep="_")}
		else { tmptax <- tax.mat[i,j]}
	}

	#this maps the new name to the input taxonomic levels
	map.in[i,"taxout"] <- paste(paste(tax.mat[i,outlevels],collapse=";"),";",sep="")
}

# replace spaces with underscores
map.in$taxout <- gsub(" ","_",map.in$taxout)

# bring in the old taxonomic levels from SILVA and remap them using the new levels
tax.in <- read.table("silva.nr_v132.full",header=F,stringsAsFactors=F,sep="\t")
colnames(tax.in) <- c("taxid","taxlabel")

# Following line corrects the Bacteria;Bacteroidetes;Bacteroidia;Flavobacteriales;Flavobacteriaceae;Polaribacter;Polaribacter; problem
tax.in$taxlabel <- gsub("Polaribacter;Polaribacter;", "Polaribacter;", tax.in$taxlabel)
tax.in$taxlabel <- gsub(";[[:space:]]+$", ";", tax.in$taxlabel)

tax.in$id <- 1:nrow(tax.in)

tax.write <- merge(tax.in,map.in,all.x=T,sort=F)
tax.write <- tax.write[order(tax.write$id),]


#we want to see whether everything has 6 taxonomic levels (kingdom to genus)
getDepth <- function(taxonString){
	initial <- nchar(taxonString)
	removed <- nchar(gsub(";", "", taxonString))
	return(initial-removed)
}

depth <- getDepth(tax.write$taxout)
summary(depth) #should all be 6 and there should be no NAs
bacteria <- grepl("Bacteria;", tax.write$taxout)
archaea <- grepl("Archaea;", tax.write$taxout)
eukarya <- grepl("Eukaryota;", tax.write$taxout)

tax.write[depth > 6 & bacteria,] #good to go
tax.write[depth > 6 & archaea,]  #good to go
tax.write[depth > 6 & eukarya,]  #good to go

write.table(tax.write[,c("taxid","taxout")],file="silva.full_v132.tax",sep="\t",row.names=F,quote=F,col.names=F)
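
If you would rather double check the depth of the file that was just written from the bash terminal, the sketch below does the same semicolon count with awk. It assumes a two-column, tab-separated taxonomy file like the one written above, and it runs on an invented toy file so that it can be tried without the real data:

```shell
tmp=$(mktemp -d)
# Invented accessions; the second line is deliberately too shallow.
printf 'AB000001.Seq1\tBacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;\n' > "$tmp/toy.tax"
printf 'AB000002.Seq2\tArchaea;Euryarchaeota;Halobacteria;\n' >> "$tmp/toy.tax"

# gsub returns the number of ';' it replaced, which is the depth of that line.
# Print the name and depth of any line that does not have exactly 6 levels.
bad=$(awk -F'\t' '{d = gsub(/;/, "", $2); if (d != 6) print $1 "\t" d}' "$tmp/toy.tax")
echo "$bad"
```

An empty result means every line has 6 levels. Here the second toy sequence is reported with a depth of 3.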

Building the SEED references

The first thing to note is that SILVA does not release their SEED; it is private. By screening through the ARB databases we can attempt to recreate it. Our previous publications show that align.seqs with the recreated SEED does an excellent job of realigning sequences to look like they would if you used SINA and the true SEED. Now we want to try to figure out which sequences are part of the seed. Earlier, when we exported the sequences from ARB, we included the align_ident_slv field from the database in our output. Let’s generate an accnos file that contains the names of the sequences with 100% identity to the SEED database and then use mothur to generate SEED fasta and taxonomy files. While we’re at it, we’ll also generate the nr_132 taxonomy file. The following code will be run from within a bash terminal:

grep ">" silva.nr_v132.align | cut -f 1,2 | grep $'\t100' | cut -f 1 | cut -c 2- > silva.seed_v132.accnos
mothur "#get.seqs(fasta=silva.nr_v132.align, taxonomy=silva.full_v132.tax, accnos=silva.seed_v132.accnos)"
mv silva.nr_v132.pick.align silva.seed_v132.align
mv silva.full_v132.pick.tax silva.seed_v132.tax

mothur "#get.seqs(taxonomy=silva.full_v132.tax, accnos=silva.full_v132.good.pcr.ng.unique.accnos)"
mv silva.full_v132.pick.tax silva.nr_v132.tax
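
The identity filter above keys on a tab followed by 100 in the header line. As a sketch of a stricter alternative (not what was used to build the released files), awk can compare the second field numerically, assuming that field holds align_ident_slv. The records below are invented:

```shell
tmp=$(mktemp -d)
printf '>AB000001.Seq1\t100\tBacteria;Firmicutes;\nACGU\n' > "$tmp/toy.align"
printf '>AB000002.Seq2\t97.5\tArchaea;Euryarchaeota;\nGGCU\n' >> "$tmp/toy.align"
printf '>AB000003.Seq3\t100\tBacteria;Proteobacteria;\nUUGA\n' >> "$tmp/toy.align"

# On header lines only, keep records whose second field equals 100,
# then print the first field without its leading '>'.
awk -F'\t' '/^>/ && $2 == 100 {print substr($1, 2)}' "$tmp/toy.align" > "$tmp/toy.seed.accnos"
n_seed=$(wc -l < "$tmp/toy.seed.accnos" | tr -d ' ')
cat "$tmp/toy.seed.accnos"
```

Two of the three toy sequences are kept. Whether the numeric comparison selects the same set as the grep on the real data depends on how the identity values are written in the export, so treat it as a cross-check rather than a drop-in replacement.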

Taxonomic representation

Let’s look to see how many different taxa we have for each taxonomic level within the silva.nr_v132.tax and silva.seed_v132.tax files. To do this we’ll run the following in R:

getNumTaxaNames <- function(file, kingdom){
  taxonomy <- read.table(file=file, row.names=1)
  sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla <- as.vector(levels(as.factor(gsub("[^;]*;([^;]*;).*", "\\1", sub.tax))))
  phyla <- sum(!grepl(kingdom, phyla))

  class <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  class <- sum(!grepl(kingdom, class))

  order <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  order <- sum(!grepl(kingdom, order))

  family <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  family <- sum(!grepl(kingdom, family))

  genus <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  genus <- sum(!grepl(kingdom, genus))

  n.seqs <- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}

kingdoms <- c("Bacteria", "Archaea", "Eukaryota")
tax.levels <- c("phyla", "class", "order", "family", "genus", "n.seqs")

nr.file <- "silva.nr_v132.tax"
nr.matrix <- matrix(rep(0,18), nrow=3)
nr.matrix[1,] <- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] <- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] <- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) <- kingdoms
colnames(nr.matrix) <- tax.levels
nr.matrix
#          phyla class order family genus n.seqs
#Bacteria     80   204   580   1052  3971 188247
#Archaea      11    30    52     85   210   4626
#Eukaryota    93   240   648    923  3018  20246

seed.file <- "silva.seed_v132.tax"
seed.matrix <- matrix(rep(0,18), nrow=3)
seed.matrix[1,] <- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] <- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] <- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) <- kingdoms
colnames(seed.matrix) <- tax.levels
seed.matrix
#          phyla class order family genus n.seqs
#Bacteria     50   110   301    530  1436   8517
#Archaea       7    15    26     39    62    147
#Eukaryota    41   100   287    478  1040   2516

seed.matrix / nr.matrix
#              phyla     class     order    family     genus     n.seqs
#Bacteria  0.6250000 0.5392157 0.5189655 0.5038023 0.3616218 0.04524375
#Archaea   0.6363636 0.5000000 0.5000000 0.4588235 0.2952381 0.03177691
#Eukaryota 0.4408602 0.4166667 0.4429012 0.5178765 0.3445991 0.12427146

Finally, we want to compress the resulting alignment and this README file into the full length and SEED archives using commands in the bash terminal:

tar cvzf silva.nr_v132.tgz silva.nr_v132.tax silva.nr_v132.align README.md
tar cvzf silva.seed_v132.tgz silva.seed_v132.tax silva.seed_v132.align README.md

Application

So… which to use for what application? If you have the RAM, I’d suggest using silva.nr_v132.align in align.seqs. It took about 10 minutes to read in the database file and a minute or so to align 1,000 full-length sequences. Here is an example workflow for use within mothur that will get you the V4 region of the 16S rRNA gene:

mothur "#pcr.seqs(fasta=silva.nr_v132.align, start=11894, end=25319, keepdots=F, processors=8);
        unique.seqs()"

This will get you 139,321 unique sequences to then align against (meh.). Another trick to consider is to use get.lineage to pull out the reference sequences that are from the Bacteria, although this will probably only reduce the size of the database by ~10%. You could also try using filter.seqs with vertical=T; however, that might be problematic if there are insertions in your sequences (you can’t know a priori). It’s likely that you can just use the silva.seed_v132.align reference for aligning. For classifying sequences, I would strongly recommend using the silva.nr_v132.align and silva.nr_v132.tax references after running pcr.seqs on silva.nr_v132.align. I probably wouldn’t advise using unique.seqs on the output.

Legalese

If you are going to use the files generated in this README, you should be aware of SILVA’s dual use license. We’ll leave it to you to work out the details.

README for the SILVA v128 reference files

Wed, 22 Mar 2017 00:00:00 +0000

Curation of references

Getting the data in and out of the ARB database

wget -N https://www.arb-silva.de/fileadmin/arb_web_db/release_128/ARB_files/SSURef_NR99_128_SILVA_07_09_16_opt.arb.gz
gunzip SSURef_NR99_128_SILVA_07_09_16_opt.arb.gz
arb SSURef_NR99_128_SILVA_07_09_16_opt.arb

This will launch us into the arb environment with the “Ref NR 99” database opened. This database has 597,607 sequences within it that are not more than 99% similar to each other. The release notes for this database as well as the idea behind the non-redundant database are available from the silva website. Within arb do the following:

Click the search button
Set the first search field to ‘ARB_color’ and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)
Click ‘Search’. This yielded 577,832 hits
Click the “Mark Listed Unmark Rest” button
Close the “Search and Query” box
Now click on File->export->export to external format
In this box the Export option should be set to marked, Filter to none, and Compression should be set to no.
In the field for Choose an output file name, make sure the path has you in the correct working directory and enter silva.full_v128.fasta.
Select a format: fasta_mothur.eft. This is a custom formatting file that I have created that includes the sequence’s accession number and its taxonomy across the top line. To create one, you will need to create fasta_mothur.eft in the /opt/local/share/arb/lib/export/ folder with the following:
```
SUFFIX          fasta    
BEGIN    
>*(acc).*(name)\t*(align_ident_slv)\t*(tax_slv);    
*(|export_sequence)    
```
Save this as silva.full_v128.fasta
You can now quit arb.

Screening the sequences

mothur "#screen.seqs(fasta=silva.full_v128.fasta, start=1044, end=43116, maxambig=5, processors=8);
        pcr.seqs(start=1044, end=43116, keepdots=T);
        degap.seqs();
        unique.seqs();"

#identify the unique sequences without regard to their alignment
grep ">" silva.full_v128.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_v128.good.pcr.ng.unique.accnos

#get the unique sequences without regard to their alignment
mothur "#get.seqs(fasta=silva.full_v128.good.pcr.fasta, accnos=silva.full_v128.good.pcr.ng.unique.accnos)"

#generate alignment file
mv silva.full_v128.good.pcr.pick.fasta silva.nr_v128.align

#generate taxonomy file
grep '>' silva.nr_v128.align | cut -f1,3 | cut -f2 -d'>' > silva.nr_v128.full

The mothur commands above do several things. First, the screen.seqs command removes sequences that are not full length or that have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB NR reference database lets in shorter (>900 bp) archaeal 16S rRNA gene sequences. Second, pcr.seqs converts any base calls that occur before position 1044 and after position 43116 to . so that the sequences only span the region between the 27f and 1492r priming sites. Finally, it is possible that weird things happen in the alignments, so we unalign the sequences (degap.seqs) and identify the unique sequences (unique.seqs). We then convert the resulting fasta file into an accnos file so that we can go back into mothur and pull out the unique sequences from the aligned file (get.seqs).

Formatting the taxonomy files

Now we want to make sure the taxonomy file is properly formatted for use with mothur. First we want to grab the SILVA taxa mapping file by running the following in bash:

wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_ssu_128.txt

map.in <- read.table("tax_slv_ssu_128.txt",header=F,sep="\t",stringsAsFactors=F)
map.in <- map.in[,c(1,3)]
colnames(map.in) <- c("taxlabel","taxlevel")
map.in <- rbind(map.in, c("Bacteria;RsaHf231;", "phylum")) #wasn't in tax_slv_ssu_128.txt

#fix Escherichia nonsense
map.in$taxlevel[which(map.in$taxlabel=="Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;")] <- "genus"

taxlevels <- c("root","domain","major_clade","superkingdom","kingdom","subkingdom","infrakingdom","superphylum","phylum","subphylum","infraphylum","superclass","class","subclass","infraclass","superorder","order","suborder","superfamily","family","subfamily","genus")
taxabb <- c("ro","do","mc","pk","ki","bk","ik","pp","ph","bp","ip","pc","cl","bc","ic","po","or","bo","pf","fa","bf","ge")
tax.mat <- matrix(data="",nrow=nrow(map.in),ncol=length(taxlevels))
tax.mat[,1] <- "root"
colnames(tax.mat) <- taxlevels

outlevels <- c("domain","phylum","class","order","family","genus")

for (i in 1:nrow(map.in)) {
	taxname <- unlist(strsplit(as.character(map.in[i,1]), split=';'))
	#print(taxname);

	while ( length(taxname) > 0) {
		#regex to look for exact match

		tax.exp <- paste(paste(taxname,collapse=";"),";",sep="")
		tax.match <- match(tax.exp,map.in$taxlabel)
		tax.mat[i,map.in[tax.match,2]] <- tail(taxname,1)
		taxname <- head(taxname,-1)
	}
}

for (i in 1:nrow(tax.mat)) {
	#this fills in the empty gaps by using the closest higher taxonomic level appended with an abbreviation for the current taxonomic level
	#if you don't want this behavior, cut it out
	for (j in 1:ncol(tax.mat)) {
		if(tax.mat[i,j] == "") { tax.mat[i,j] <- paste(tmptax,taxabb[j],sep="_")}
		else { tmptax <- tax.mat[i,j]}
	}

	#this maps the new name to the input taxonomic levels
	map.in[i,"taxout"] <- paste(paste(tax.mat[i,outlevels],collapse=";"),";",sep="")
}

# replace spaces with underscores
map.in$taxout <- gsub(" ","_",map.in$taxout)

# bring in the old taxonomic levels from SILVA and remap them using the new levels
tax.in <- read.table("silva.nr_v128.full",header=F,stringsAsFactors=F,sep="\t")
colnames(tax.in) <- c("taxid","taxlabel")

tax.in$taxlabel <- gsub("[[:space:]]+;", ";", tax.in$taxlabel) #fix extra space in "Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Pezizomycotina;Dothideomycetes;Pleosporales;Phaeosphaeriaceae;Parastagonospora ;"
tax.in$taxlabel <- gsub(";[[:space:]]+$", ";", tax.in$taxlabel)

tax.in$id <- 1:nrow(tax.in)

tax.write <- merge(tax.in,map.in,all.x=T,sort=F)
tax.write <- tax.write[order(tax.write$id),]


#we want to see whether everything has 6 taxonomic levels (kingdom to genus)
getDepth <- function(taxonString){
	initial <- nchar(taxonString)
	removed <- nchar(gsub(";", "", taxonString))
	return(initial-removed)
}

depth <- getDepth(tax.write$taxout)
summary(depth) #should all be 6
bacteria <- grepl("Bacteria;", tax.write$taxout)
archaea <- grepl("Archaea;", tax.write$taxout)
eukarya <- grepl("Eukaryota;", tax.write$taxout)

tax.write[depth > 6 & bacteria,] #good to go
tax.write[depth > 6 & archaea,]  #good to go
tax.write[depth > 6 & eukarya,]  #good to go

write.table(tax.write[,c("taxid","taxout")],file="silva.full_v128.tax",sep="\t",row.names=F,quote=F,col.names=F)

Building the SEED references

The first thing to note is that SILVA does not release their SEED; it is private. By screening through the ARB databases we can attempt to recreate it. Our previous publications show that align.seqs with the recreated SEED does an excellent job of realigning sequences to look like they would if you used SINA and the true SEED. Now we want to try to figure out which sequences are part of the seed. Earlier, when we exported the sequences from ARB, we included the align_ident_slv field from the database in our output. Let’s generate an accnos file that contains the names of the sequences with 100% identity to the SEED database and then use mothur to generate SEED fasta and taxonomy files. While we’re at it, we’ll also generate the nr_128 taxonomy file. The following code will be run from within a bash terminal:

grep ">" silva.nr_v128.align | cut -f 1,2 | grep $'\t100' | cut -f 1 | cut -c 2- > silva.seed_v128.accnos
mothur "#get.seqs(fasta=silva.nr_v128.align, taxonomy=silva.full_v128.tax, accnos=silva.seed_v128.accnos)"
mv silva.nr_v128.pick.align silva.seed_v128.align
mv silva.full_v128.pick.tax silva.seed_v128.tax

mothur "#get.seqs(taxonomy=silva.full_v128.tax, accnos=silva.full_v128.good.pcr.ng.unique.accnos)"
mv silva.full_v128.pick.tax silva.nr_v128.tax

Taxonomic representation

Let’s look to see how many different taxa we have for each taxonomic level within the silva.nr_v128.tax and silva.seed_v128.tax files. To do this we’ll run the following in R:

getNumTaxaNames <- function(file, kingdom){
  taxonomy <- read.table(file=file, row.names=1)
  sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla <- as.vector(levels(as.factor(gsub("[^;]*;([^;]*;).*", "\\1", sub.tax))))
  phyla <- sum(!grepl(kingdom, phyla))

  class <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  class <- sum(!grepl(kingdom, class))

  order <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  order <- sum(!grepl(kingdom, order))

  family <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  family <- sum(!grepl(kingdom, family))

  genus <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  genus <- sum(!grepl(kingdom, genus))

  n.seqs <- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}

kingdoms <- c("Bacteria", "Archaea", "Eukaryota")
tax.levels <- c("phyla", "class", "order", "family", "genus", "n.seqs")

nr.file <- "silva.nr_v128.tax"
nr.matrix <- matrix(rep(0,18), nrow=3)
nr.matrix[1,] <- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] <- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] <- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) <- kingdoms
colnames(nr.matrix) <- tax.levels
nr.matrix
#          phyla class order family genus n.seqs
#Bacteria     74   261   500   1001  3478 168111
#Archaea      24    52    59    101   217   4337
#Eukaryota   102   252   654    912  2673  18213

seed.file <- "silva.seed_v128.tax"
seed.matrix <- matrix(rep(0,18), nrow=3)
seed.matrix[1,] <- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] <- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] <- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) <- kingdoms
colnames(seed.matrix) <- tax.levels
seed.matrix
#          phyla class order family genus n.seqs
#Bacteria     54   146   252    471  1375   8512
#Archaea       9    17    24     37    62    147
#Eukaryota    38    96   273    465   957   2554

seed.matrix / nr.matrix
#              phyla     class     order    family     genus     n.seqs
#Bacteria  0.7297297 0.5593870 0.5040000 0.4705295 0.3953422 0.05063321
#Archaea   0.3750000 0.3269231 0.4067797 0.3663366 0.2857143 0.03389440
#Eukaryota 0.3725490 0.3809524 0.4174312 0.5098684 0.3580247 0.14022951

Finally, we want to compress the resulting alignment and this README file into the full length and SEED archives using commands in the bash terminal:

tar cvzf silva.nr_v128.tgz silva.nr_v128.tax silva.nr_v128.align README.*
tar cvzf silva.seed_v128.tgz silva.seed_v128.tax silva.seed_v128.align README.*

Application

So… which to use for what application? If you have the RAM, I’d suggest using silva.nr_v128.align in align.seqs. It took about 10 minutes to read in the database file and a minute or so to align 1,000 full-length sequences. Here is an example workflow for use within mothur that will get you the V4 region of the 16S rRNA gene:

mothur "#pcr.seqs(fasta=silva.nr_v128.align, start=11894, end=25319, keepdots=F, processors=8);
        unique.seqs()"

This will get you 104,711 unique sequences to then align against (meh.). Another trick to consider is to use get.lineage to pull out the reference sequences that are from the Bacteria, although this will probably only reduce the size of the database by ~10%. You could also try using filter.seqs with vertical=T; however, that might be problematic if there are insertions in your sequences (you can’t know a priori). It’s likely that you can just use the silva.seed_v128.align reference for aligning. For classifying sequences, I would strongly recommend using the silva.nr_v128.align and silva.nr_v128.tax references after running pcr.seqs on silva.nr_v128.align. I probably wouldn’t advise using unique.seqs on the output.

Legalese

If you are going to use the files generated in this README, you should be aware of SILVA’s dual use license. We’ll leave it to you to work out the details.

README for the RDP v16 reference files

Wed, 15 Mar 2017 00:00:00 +0000

The release notes indicate the following:

RDP Release 11.5 consists of 3,356,809 aligned and annotated 16S rRNA sequences and 125,525 Fungal 28S rRNA sequences. The Bacteria and Archaea hierarchy model used by RDP Classifier and RDP Hierarchy Browser have been updated to training set No. 16. This new training set has over 300 new genera and 2000 new sequences added. There are some rearrangements in genera Gp1, Gp3 and Gp4 of the Acidobacteria due to addition of recently proposed new genera.

Let’s get going…

rm -rf RDPClassifier_16S_trainsetNo16_rawtrainingdata*

wget -N https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/RDPClassifier_16S_trainsetNo16_rawtrainingdata.zip
unzip -o RDPClassifier_16S_trainsetNo16_rawtrainingdata.zip
mv RDPClassifier_16S_trainsetNo16_rawtrainingdata/* ./

Now we’d like to start to form the taxonomy file and the fasta file that will be our reference. Again, using bash commands…

mv trainset16_022016.fa trainset16_022016.rdp.fasta
grep ">" trainset16_022016.rdp.fasta | cut -c 2- > trainset16_022016_rmdup.tax

Next, we’d like to get our taxonomy file properly formatted. First we’ll read in the taxonomy data. The following steps are done in R…

tax_file <- scan(file="trainset16_022016_rmdup.tax", what="", sep="\n", quiet=TRUE)

accession <- gsub("^(\\S*).*", "\\1", tax_file) #some are separated by tabs or spaces or both

taxonomy <- gsub(".*(Root.*)", "\\1", tax_file)
taxonomy <- gsub(" ", "_", taxonomy)	#remove spaces and replace with '_'
taxonomy <- gsub("\t", "", taxonomy)	#remove extra tab characters
taxonomy <- gsub("[^;]*_incertae_sedis$", "", taxonomy)
taxonomy <- gsub('\"', '', taxonomy) #remove quote marks

The RDP inserts a variety of sub taxonomic levels (e.g. suborder) that will get in the way of us having a consistent number of taxonomic levels for our analyses. Let’s use the data in trainset16_db_taxid.txt to remove these extra taxonomic levels:

levels <- read.table(file="trainset16_db_taxid.txt", sep="*", stringsAsFactors=FALSE)
subs <- levels[grep("sub", levels$V5),]
sub.names <- subs$V2

tax.split <- strsplit(taxonomy, split=";")

remove.subs <- function(tax.vector){
	return(tax.vector[which(!tax.vector %in% sub.names)])
}

no.subs <- lapply(tax.split, remove.subs)
no.subs.str <- unlist(lapply(no.subs, paste, collapse=";"))
no.subs.str <- gsub("^Root;(.*)$", "\\1;", no.subs.str)
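
For anyone who prefers to stay in the bash terminal, the same idea can be sketched with awk. This is an illustration on invented toy files, not the code used to build the released files. It assumes the taxid file is laid out as taxid*name*parent*depth*rank (which is what reading it with sep="*" and using columns 2 and 5 implies), and that the second file holds one semicolon-separated taxonomy string per line:

```shell
tmp=$(mktemp -d)
# Toy rows in the assumed taxid*name*parent*depth*rank layout.
printf '0*Root*-1*0*rootrank\n1*Bacteria*0*1*domain\n' > "$tmp/toy_taxid.txt"
printf '2*Actinobacteria*1*2*phylum\n3*Actinobacteridae*2*3*subclass\n' >> "$tmp/toy_taxid.txt"
printf 'Root;Bacteria;Actinobacteria;Actinobacteridae\n' > "$tmp/toy.tax"

# First file: remember every name whose rank contains "sub".
# Second file: rebuild each taxonomy string without those names,
# then drop the leading "Root;" so the string ends with a ';'.
cleaned=$(awk -F'*' '
  NR == FNR { if ($5 ~ /sub/) sub_names[$2] = 1; next }
  {
    n = split($0, parts, ";"); out = ""
    for (i = 1; i <= n; i++) if (!(parts[i] in sub_names)) out = out parts[i] ";"
    sub(/^Root;/, "", out)
    print out
  }' "$tmp/toy_taxid.txt" "$tmp/toy.tax")
echo "$cleaned"
```

The subclass Actinobacteridae is dropped from the toy lineage and the remaining levels are kept in order.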

Finally, we can output the taxonomy data to a file we’ll call trainset16_022016.rdp.tax to have a consistent naming scheme with previous versions of those files:

write.table(cbind(as.character(accession), no.subs.str), "trainset16_022016.rdp.tax", row.names=F, col.names=F, quote=F, sep="\t")

wget -N https://mothur.org/w/images/2/24/Trainset10_082014.pds.tgz
tar xvzf Trainset10_082014.pds.tgz
mv trainset10_082014.pds/trainset10_082014* ./
rm -rf trainset10_082014.pds Trainset10_082014.pds.tgz

Now let’s run a mothur command to pull out the extra sequences that are in the pds files:

mothur "#get.lineage(fasta=trainset10_082014.pds.fasta, taxonomy=trainset10_082014.pds.tax, taxon=Eukaryota-Mitochondria)"

This last command gets us the extra “pds” sequences that we can now paste on to the end of the normal RDP training set:

cat trainset16_022016.rdp.tax trainset10_082014.pds.pick.tax > trainset16_022016.pds.tax
cat trainset16_022016.rdp.fasta trainset10_082014.pds.pick.fasta > trainset16_022016.pds.fasta

wc -l *.pds.tax

##    10773 trainset10_082014.pds.tax
##    13335 trainset16_022016.pds.tax
##    24108 total
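One extra sanity check, which is my addition rather than part of the original workflow, is to confirm that a fasta file and its taxonomy file describe the same number of sequences. The helper name check_pair is hypothetical.

```shell
#!/bin/sh
# check_pair FASTA TAX
# Compare the number of FASTA header lines with the number of taxonomy lines.
check_pair() {
  n_seqs=$(grep -c '^>' "$1")
  n_tax=$(wc -l < "$2" | tr -d ' ')
  if [ "$n_seqs" -eq "$n_tax" ]; then
    echo "OK: $n_seqs sequences, $n_tax taxonomy lines"
  else
    echo "MISMATCH: $n_seqs sequences, $n_tax taxonomy lines"
    return 1
  fi
}

# Usage on the files built above (uncomment once they exist):
# check_pair trainset16_022016.pds.fasta trainset16_022016.pds.tax
```

This only compares counts. It does not check that the accessions in the two files match or appear in the same order.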

Now we’re ready to compress the taxonomy files. First we do the RDP files…

mkdir trainset16_022016.rdp
cp README.* trainset16_022016.rdp.fasta trainset16_022016.rdp.tax trainset16_022016.rdp
tar cvzf trainset16_022016.rdp.tgz  trainset16_022016.rdp/*

a trainset16_022016.rdp/README.md
a trainset16_022016.rdp/trainset16_022016.rdp.fasta
a trainset16_022016.rdp/trainset16_022016.rdp.tax

… and then the pds files…

mkdir trainset16_022016.pds
cp README.* trainset16_022016.pds.fasta trainset16_022016.pds.tax trainset16_022016.pds
tar cvzf trainset16_022016.pds.tgz  trainset16_022016.pds/*

a trainset16_022016.pds/README.md
a trainset16_022016.pds/trainset16_022016.pds.fasta
a trainset16_022016.pds/trainset16_022016.pds.tax

The mothur AMI

Tue, 12 Jul 2016 00:00:00 +0000

We get asked a lot of questions by mothur users. Perhaps the one I hate the most is, “What type of computer should I get?” I hate this question because I don’t want to spend other people’s money and because I honestly don’t have the answer. I used to encourage people to get the biggest, baddest computer they could afford. I’ve followed this advice myself.

Over the years, we have literally spent upwards of $50,000 on a high performance computer cluster with a ton of processors, RAM, and storage. Then the System Administrator told us that we were really only using 10% of the cluster’s capacity. In other words, we were effectively spending $50,000 to get $5,000 worth of service. I’ve come to realize that you can do amazing and very affordable bioinformatics on a pretty crappy computer. Just to make the point clear, I’ve run mothur using my iPhone. The caveat, of course, is that you are able to log into a remote high performance computer cluster. Many institutions have high performance computing clusters (HPCCs) that they make very cheap for their constituents. Not everyone is so fortunate. For this latter group of researchers, there is Amazon Web Services (AWS). Although this tends to be a bit more expensive than institutional HPCCs, it is a very powerful and well-supported option.

Think of AWS as your computer, but it’s off in the ether - the cloud. You can trick it out with all sorts of applications and settings. Think yours is pretty cool? Well, you can take a snapshot of that computer and then make it available for others to use. This is what is called an Amazon Machine Image (AMI). This has the potential to be a very powerful tool for reproducibility. Think of it - you use AWS to do your analysis. Once your analysis is done, you want to make those files and your code available to others. You can make an AMI of the final product and then share the name of the AMI in your manuscript. I could then take that AMI and add data or add an analysis to supplement yours. But we’re getting ahead of ourselves. I’ve done step one - creating an AMI that is tricked out for mothur users that you can build upon for your own use. Head on over to the wiki to follow the tutorial on how to set up and use our AMI that comes with mothur and RStudio installed.

I’m curious what people think of this AMI. I hope to achieve a few goals with this. First, we want to provide an easier on-ramp for analyzing large datasets for people that don’t have access to large amounts of computing power. Part of this involves putting mothur into the path, preloading the AMI with various references, and throwing in RStudio so people can work with their data where it lives in the cloud. Second, we want to be a bit opinionated on how people set up their data analyses by doing things like separating reference, raw, and processed data and keeping data separate from the code. These steps are considered to be pretty good data hygiene habits. Third, by creating an AMI that researchers can modify to do their own analyses, they can create a derivative AMI that could be made publicly accessible to other researchers. The result could be clearer documentation and more reproducible analyses.

Let us know what you think! If there are other tools that you would like to see loaded with the AMI, let us know.

Customize your reference alignment for your favorite region

Thu, 07 Jul 2016 00:00:00 +0000

One of the surprisingly unique aspects of the mothur-based SOP that I encourage people to follow for their MiSeq data, and previously their 454 and Sanger data, is its use of an alignment-based approach to analyze 16S rRNA gene sequence data. I laid this out in a commentary responding to a misguided article that claimed otherwise. The basic point is that you need to ensure that your alignment preserves positional homology across the sequences. In addition to resulting in better inter-sequence comparisons, this also makes sure that your sequences actually overlap with each other. When you sequence DNA to high depth, you realize that non-specific PCR products also get sequenced and need to be culled from the analysis.

To carry out these steps we align 16S rRNA gene sequence data against a reference alignment that is based on the SILVA alignment using the mothur align.seqs command. As I’ve pointed out elsewhere (here and here), this reference is the superior alignment and there are no excuses for using a different reference. The problem with the SILVA reference alignment is that it is 50,000 columns wide. Keep in mind that there are only about 1,500 nt in the 16S rRNA gene and we are only generating a few hundred bases of sequence data. Why all the extra columns? One explanation is that the developers wanted to make it so people could compare all SSU rRNAs across the tree of life and the 18S rRNA gene is longer than the 16S rRNA gene. Also, some lineages (e.g. TM7) have introns. Also, we haven’t sequenced everything so we can’t anticipate where all of the insertions will be along the gene. To accommodate all of these contingencies the alignment has ballooned to 50,000 columns. If you removed every column from the reference that only had a gap character that provides the extra padding, the alignment would be a more reasonable few thousand columns wide. When we align a 250 nt sequence against a 50,000 column wide alignment, we get a ridiculously long aligned sequence. When you repeat this a few hundred thousand times for MiSeq data, you get a ridiculously large file. To circumvent this, we have encouraged people to use pcr.seqs to trim the reference alignment to the region of the gene that they actually sequenced. In the MiSeq SOP we provide the coordinates for doing this with the V4 region of the gene.
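To make the idea of gap-only columns concrete, here is a minimal awk sketch that drops every column in which all sequences have "-" or ".". It assumes an aligned fasta file with each sequence on a single line, and the function name strip_gap_columns is hypothetical. For real data, mothur's own filter.seqs command with vertical=T is the tool to reach for.

```shell
#!/bin/sh
# strip_gap_columns ALIGNED_FASTA
# Assumes every sequence sits on one line. The file is read twice:
# pass 1 marks columns that hold a base in at least one sequence,
# pass 2 prints only those columns.
strip_gap_columns() {
  awk '
    NR == FNR {
      if ($0 !~ /^>/) {
        for (i = 1; i <= length($0); i++) {
          c = substr($0, i, 1)
          if (c != "-" && c != ".") keep[i] = 1
        }
      }
      next
    }
    /^>/ { print; next }          # headers pass through unchanged
    {
      out = ""
      for (i = 1; i <= length($0); i++)
        if (i in keep) out = out substr($0, i, 1)
      print out
    }
  ' "$1" "$1"
}

# Usage (toy.align is a hypothetical file name):
# strip_gap_columns toy.align > toy.filtered.align
```

For a toy alignment with the two sequences A--C-G and A-TC--, columns 2 and 5 are gaps in both, so the sketch prints A-CG and ATC-.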

Despite my best efforts, not everyone has taken our paper developing a MiSeq-based protocol for sequencing the V4 region of the 16S rRNA gene seriously. Not a week goes by that I don’t end up referring someone to my rant on why you really don’t want to deviate from this protocol. The tl;dr is that unless your sequence reads fully overlap, you are going to get a wicked high error rate and all sorts of horrible problems. You really want to sequence the V4 region with paired 250 nt reads. Honest. This hasn’t changed with the advent of the v3 chemistry that generates paired 300 nt reads; that just seems to make things worse. Whatever. You’re stuck with the data you’ve got. Your PI screwed up. The previous post-doc or previous student in your lab thought they knew best. Alternatively, you’re working with a group that has been sequencing a specific region forever and they aren’t interested in changing because they want to compare new data to data generated on a first generation 454 machine. Ok, enough snark. How do you figure out the coordinates for your region? I have purposely been obtuse on the forum about this question because I really think it’s a mistake to use regions like the V4-V5 or V3-V4 regions for the above-mentioned reasons. Let me be a little less obtuse and show you how I’d figure this out for the V3 region and you can adapt it accordingly for your region.

First things first, get your PCR primer sequences. I’ll assume that to amplify the V3 region you used the CCTACGGGAGGCAGCAG and TTACCGCGGCKGCTGGCAC primer pair. Let’s take an E. coli 16S rRNA gene sequence. This will do…

>E.coli
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTCGCTGACGAGT
GGCGGACGGGTGAGTAATGTCTGGGAAGCTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCA
TAATGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTTGTTGG
TGGGGTAACGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACG
GTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGT
ATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGAC
GTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGA
ATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCA
TCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGG
AGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGA
TTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCT
AACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAG
CGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAG
ATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTT
AAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGA
TAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGC
GCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAA
CTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTAC
ACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTT
TGTGATTCATGACTGGGGTG

Now we want to trim our sequence to start and end with our primer pair. Note that you’ll need to get the reverse complement of the reverse primer sequence (i.e. GTGCCAGCMGCCGCGGTAA).
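If you would rather not do this by hand, the following shell sketch shows one way to get the reverse complement and cut a sequence down to the primer-bounded region. The function names are hypothetical, rev is assumed to be installed (it is on most Linux and Mac systems), and to_regex only expands the two degenerate bases that occur in these primers.

```shell
#!/bin/sh
# revcomp PRIMER: reverse the primer, then complement each base.
# Degenerate codes pair up as K<->M, R<->Y, B<->V, D<->H; S, W and N
# are their own complements and are left unchanged.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTKMRYBVDH' 'TGCAMKYRVBHD'
}

# to_regex PRIMER: expand K (G or T) and M (A or C) into bracket
# expressions. Other degenerate codes are NOT handled in this sketch.
to_regex() {
  printf '%s\n' "$1" | sed -e 's/K/[GT]/g' -e 's/M/[AC]/g'
}

# trim_to_amplicon SEQUENCE FORWARD_PRIMER REVERSE_PRIMER
# Prints SEQUENCE from the first forward-primer match through the first
# match, at or after that point, of the reverse primer's reverse complement.
# Returns non-zero if either primer is not found.
trim_to_amplicon() {
  fwd_re=$(to_regex "$2")
  rev_re=$(to_regex "$(revcomp "$3")")
  printf '%s\n' "$1" | awk -v fwd="$fwd_re" -v rev="$rev_re" '{
    if (!match($0, fwd)) exit 1
    rest = substr($0, RSTART)
    if (!match(rest, rev)) exit 1
    print substr(rest, 1, RSTART + RLENGTH - 1)
  }'
}

revcomp TTACCGCGGCKGCTGGCAC   # prints GTGCCAGCMGCCGCGGTAA
```

The sequence has to be on one line first, for example seq=$(grep -v '>' ecoli.fasta | tr -d '\n'), where ecoli.fasta is a hypothetical file holding the E. coli sequence above. Running trim_to_amplicon "$seq" CCTACGGGAGGCAGCAG TTACCGCGGCKGCTGGCAC should then give the trimmed fragment.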

>E.coli.v3
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAG
GCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGC
AGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAA

Save the content of this gray box to a new file that we’ll call ecoli_v3.fasta. Also, you should feel free to remove the primers themselves. I tend to leave them in for no real good reason. Alrighty, now you’ll want to download the SILVA seed reference file. Put ecoli_v3.fasta and silva.seed_v123.align (available here) in the same folder and align the former to the latter. Then run summary.seqs on ecoli_v3.align:

mothur > align.seqs(fasta=ecoli_v3.fasta, reference=silva.seed_v123.align)

mothur > summary.seqs(fasta=ecoli_v3.align)

You’ll see the output of summary.seqs indicates the starting position was 6388 and the end position was 13861. Voilà. You know what these are? They’re the coordinates you’ll use in pcr.seqs to trim silva.seed_v123.align to the V3 region.

mothur > pcr.seqs(fasta=silva.seed_v123.align, start=6388, end=13861, keepdots=FALSE)

Finally, rename silva.seed_v123.pcr.fasta to silva.v3.align and you’re good to go.

mothur and QIIME

Tue, 12 Jan 2016 00:00:00 +0000

Despite their differences in philosophy, most of the differences in mothur and QIIME are cosmetic. Both packages have been successful. Having both of them around is good for microbial ecology. Within both packages there are warts - inconveniences to the users and antiquated/bad ideas. Within both packages there are strengths. If you are going to criticize someone for their choice of software, do it for some specific point. If you are going to campaign for mothur or QIIME, do your best to accurately represent the strengths of your pet package.

When I teach workshops or field users’ questions, I am often asked what I think of QIIME. I suspect that because I direct the development of mothur, people are expecting me to come out with guns blazing to blow up QIIME. In fact, people ask and then kind of step aside to avoid the onslaught. Meh. I pause and say it’s a successful program, but that I obviously like mothur better. After this non-answer, people then tell me the analogies that they use to compare the two: Patagonia vs Columbia, Mac vs. PC, or Coke vs. Pepsi are common (it’s never clear or consistent which brand is to be preferred or why). I suppose this is fair with the caveat that all analogies are limited. I think these analogies reflect the point that a lot of the differences are cosmetic. Both programs were released within months of each other in 2009 and 2010. I often get the impression from reviewers and other software developers that mothur is a distant second fiddle to QIIME. Yet, the Web of Science shows that mothur has 3,410 citations and QIIME has 2,779 (as of January 8, 2016). Although I cite these numbers in grant proposals to sway reviewers that mothur is the leading tool used by microbial ecologists to analyze 16S rRNA genes, citation counts really say little about quality and once you get over a few hundred citations you’re hot stuff.

I have gotten this question with growing frequency over the past months (hence this blog post) and I know that the topic recently came up over at the QIIME forum. Oftentimes the person asking the question or making the dogmatic statement doesn’t seem to have a full appreciation for the differences and similarities and whether the differences are really meaningful. I frequently work with researchers to craft rebuttals to reviewers who think that an otherwise benign paper is the battle ground for such a debate. Take this gem, for example…

Finally, I like mothur for some of its unique features but QIIME is better for others and is backed up by a very strong group of bioinformatics tools and researchers, in my opinion you should consider using QIIME in future studies.

It’s really hard to know where to begin with that kind of statement. I hope that this would even make the QIIME developers cringe. It would make me cringe if the software names were switched. Sadly, this isn’t the only example of this that I have seen. Such comments beg the question, “Which mothur features were used in this paper that would have been better in QIIME?” There may be some, but what is the author to do with such a vague comment from the reviewer? How do these types of statements help anyone? But I digress… My point is that to this reviewer and many others the differences that they think are huge are largely cosmetic. For example, I received feedback a year ago that someone didn’t like using the mothur wiki site because it had a very modest picture of my wife nursing our 4th child, John, and that grossed them out. For. Reals. Many people are QIIME or mothur people because their PI or collaborators are, not for any deep-seated philosophical justification.

So, what are the differences? Are they meaningful or just cosmetic? Is this all a matter of personal preference? In full disclosure, my experience with QIIME came from a recent paper I wrote in which we compared various clustering algorithms including those implemented in QIIME. I’ve looked over their SOPs, talked with a number of QIIME users, and feel that I generally have a good sense of what their strategy is about. I’m not looking to get into a mothur vs. QIIME debate (note: the title of this post is mothur and QIIME), but if I misrepresent something or am being unfair, please let me know and I’ll edit this post. In what follows allow me to lay out what I see as the similarities, differences, strengths, and weaknesses of each program. To reduce my bias, I’ll leave it to you to determine whether any of it matters. I’m happy to extend the list if you think I’m missing something critical. My hope is that people getting started in the field, and the person who forwarded that reviewer comment to me (and others with similar reviewers), will find the discussion useful.

Development strategy

mothur. When you run a function from within mothur, you are running mothur code. When you run classify.seqs, you are not running the code developed by the RDP. You are running our reimplementation of their code. We have done this with several functions to make them operating system (OS)-independent, make them open source, parallelize them, make them faster, generalize their application, and expand their features. Our first draft of any function is to translate the code from its original language to C++ and make sure we get the same output with the same input. In some cases (e.g. chimera.uchime and metastats) the original authors have made their C code open and we have directly integrated that code into mothur. In most cases, however, you are running code that we developed from scratch. This strategy has strengths (we think the code is better, more uniform, and easier to maintain), but also has weaknesses (it can be hard to incorporate new tools if they weren’t written in C/C++). Regardless, we seem to do a good job of keeping current with the needs and wants of the community.

QIIME. QIIME is essentially a big wrapper that helps users to transition data between independent packages. Certainly, a lot of the functionality within QIIME is written by the QIIME developers, but much of the heavy lifting (e.g. OTU clustering, classification) is from code written by others. In fact, you can actually run mothur from within QIIME. It’s a very old version of mothur, but you can use mothur to cluster sequences in QIIME. As you might expect, the advantage of this approach is that if the developers can write a light wrapper for a new package in Python, then it is pretty painless to bring in other people’s software. Of course, that software comes in warts and all and creates dependency hell. When I’ve heard QIIME developers talk at conferences, one comment they make is that they allow people to use methods the way they were originally implemented. I get that. At the same time, if you told people that you had to use Gosset’s original implementation of Student’s T-test and it couldn’t be ported to SAS, SPSS, R, Excel, etc. then you’d rightly be laughed at. I would be curious to know how much dead wood exists in QIIME - I could foresee functions that were developed by a contributor who then graduates or loses interest and there is no one left to maintain or update the code. The QIIME development team may have a mechanism to deal with this type of problem, but the fact that they are using mothur v1.25, which was released in May 2012, suggests that there is some slippage. A big part of what we do with mothur development is to modify functions to work with bigger and more diverse datasets and we continue to maintain everything within mothur.

Language

mothur. We write mothur in C and C++. C and C++ are compiled languages, which means that once the code is compiled, you don’t need another program to run it. Pretty much everything you likely have on your computer right now is a compiled program. There are a few reasons we do this. First, C/C++ runs much faster than other languages including R, Python, Perl, or Java. I suppose we could be writing it in Fortran, but I put my last punch card in my daughter’s bicycle spokes. Most of the source code has been written by Sarah Westcott and myself. It is an open source package that others are free to contribute to or build upon for other applications. It is somewhat disappointing that we haven’t had more contributions, but then again, how many microbiologists know C or C++? If we had written it in Python or made it an R package, we would have far more contributors. But of course, then we’d have performance issues. We might make an R wrapper … wait for it … mothuR, but it will just be wrapping our C++ code.

QIIME. QIIME is written in Python, which is a very powerful, popular, and well-developed language. I look for reasons not to learn Python (I’m on team R), but deep down know that I should learn Python. My kids are learning Python. Through the efforts of Software Carpentry, Codecademy, and other groups, many biologists are learning Python. You should definitely learn Python. For all of these reasons, I think QIIME has gotten a lot of code contributions from their user base. Python is also great for doing light lifting functions like wrapping functionality and converting file formats. It’s not so great at heavy lifting. Part of this is because it is not a compiled language - the language itself is written in C. As an example of this consider our aligners, which both implement the NAST algorithm. The paper describing QIIME’s Python-based aligner, pynast, states that it can align a full-length sequence in 1.46 seconds. In contrast, our paper describing mothur’s C++-based aligner, align.seqs, could align 15 full-length sequences per second (21.9-times faster).
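The 21.9-fold figure follows directly from the two rates quoted above: at 1.46 seconds per sequence, 15 sequences would take 15 × 1.46 seconds, versus one second for align.seqs. A one-line check (the function name speedup is just for illustration):

```shell
#!/bin/sh
# speedup SECONDS_PER_SEQUENCE_SLOW SEQUENCES_PER_SECOND_FAST
# Ratio of the two throughputs, printed to one decimal place.
speedup() {
  LC_ALL=C awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow * fast }'
}

speedup 1.46 15   # prints 21.9
```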

Installation

mothur. Because of our overall development strategy we have worked very hard to make mothur a standalone software package. When you download mothur, you have mothur. All of it. You don’t have to chase down external dependencies or worry about software licenses. The only thing you have to go get are the databases that are required for aligning or classifying sequences. As described above, this is possible because all of the functionality is baked into the source code. You can get the executable binaries or the source code from our project’s GitHub releases page. You can even download the code that we are working on for the next release through that GitHub repository.

QIIME. Installation is one of the things that seems to drive people nuts about QIIME and to their credit, I think their developers have worked hard to overcome these problems. These problems are largely because of their development strategy. If you have Python installed and are running it on your local computer, a simple pip install qiime should suffice to install QIIME. But then you have to get all of the guts (admittedly, many of the dependencies can be installed using an installer). Some of the more important guts (e.g. USEARCH) require separate downloads and may be proprietary and require you to pay a fee. To overcome some of these problems, the developers have created virtual machines and other abstractions to make it easier to install. Alas, installing and running a virtual machine on one’s computer is not trivial and results in a hit in processing speed. Although some of the individual packages within QIIME may be closed source and pricey, they make their source code available through their GitHub repository like we do.

Accessibility

mothur. When we survey our users and run workshops, we consistently find that more than half of them are using computers that run the Windows operating system. Guess which operating systems most bioinformatics software packages are designed for? Linux and Mac. It has been very important to us to make mothur as platform-independent as possible. One advantage of rewriting software is that we can make sure it compiles and works in Windows. For some reason the same commands run with the same data do run a bit slower on a Windows machine than a Mac or Linux. Because of this, we think that people will eventually want to move towards a Linux-based cloud solution, but we also want to meet people where they are with their existing hardware without excluding them because of their choice of operating system.

QIIME. Considering many of the tools wrapped within QIIME were designed to run on Linux, QIIME runs easiest on a machine running Linux. The developers have made a Mac-based port and as mentioned above, they have created virtual machines to run on Windows. Difficulties installing and running virtual machines are not trivial for people just learning bioinformatics and they will experience hits in performance.

Openness

mothur. All of the functionality of mothur is available as source code under the GPL v3 License. If you want to know how mothur does something (and can read C/C++), it’s right there on our GitHub repository. I suspect that 99% of our users have no interest in going through the source code, but it’s there. mothur is free as in freedom as well as beer. You don’t have to pay a dime to use any component of mothur if you are an academic, garage scientist, or work for big pharma. This also goes for our online materials and technical support.

QIIME. For most purposes, QIIME is just as free as mothur and is available under the GPL v2 License. The one caveat is their use of USEARCH. Robert Edgar, the developer of USEARCH, provides the 32-bit version for free to academics and non-profits after registering their use; however, if you’re at a for-profit or need the 64-bit version, you’re going to have to pay for it. This is not entirely trivial considering the primary clustering methods in QIIME are based on USEARCH. I’ve heard that QIIME is working towards replacing USEARCH with the free VSEARCH, but I believe this is still in the testing stages.

User community

mothur. As indicated by the number of citations, both software packages have large and loyal followings and some people actually use both on the same project. Both groups have discussion lists, user forums, online documentation, instructional materials, swag, and devotees. Engaging those devotees and potential users to advance the software and surrounding resources is a challenge for all open source software efforts. For example, take a look at the list of co-authors on the mothur paper. There are 15 names, 3 of us (Westcott, Ryabin, Schloss) wrote code and most of the wiki (Ryabin was an undergrad). The remaining 12 co-authors took me up on an offer to get their name on the paper if they contributed a wiki page describing how they used mothur for their application. These are the posts on the Analysis Examples wiki page. Unfortunately, none of them have been updated since they were posted and having created this mechanism to share, no one else has contributed their analysis. On a similar note, I will regularly get emails from people telling me that there’s a typo on the wiki. Apparently they don’t know that the point of a wiki is that anyone can edit it! Oh well. We also have a very active user forum where people mostly ask questions and very few people (mostly Sarah and I) answer them. Although I’d love for more people to be involved in this, I think the questions and answers do provide others a form of very useful documentation. Ultimately, the lack of engagement is probably more a product of culture that our users aren’t used to. We’re open to suggestions. Still it’s pretty awesome what we’ve been able to do over the past several years with 1.0 FTE working on mothur.

QIIME. Greg Caporaso rightly points out that QIIME has a great collaborative network of developers. I think a lot of this is because they develop in Python, which a lot of people know and that the strong computer science background of Knight, Caporaso, and the other developers has collaboration baked into it. I also think that because the Knight lab is heavily involved in a lot of big science projects they have an amazing list of collaborators that go on to use QIIME, publish in high-impact papers, and reinforce collaboration with the QIIME developers. What can I say? I’m jealous.

Method transparency

mothur. There are currently 145 commands in mothur. Many of these commands implement various methods to do the same thing while generating the same output formats. For example, the cluster command implements three ways of clustering sequences into OTUs. To run cluster, you first have to run a number of other steps. Each step is a different step in the pipeline. By making each step discrete like this, users have very fine control on the knobs of their pipeline and they know exactly what is going on. Of course, we give people the default parameter values and usually have papers to back up the defaults, but people are free to alter the commands at each step. This gives users great control, but at the same time can be somewhat overwhelming if they feel the need to do something different. For example, we have 8 chimera functions that each implement a different algorithm - I would only ever suggest using chimera.uchime.

QIIME. In contrast to the mothur approach, my experience and that of people I’ve talked with is that most QIIME users tend to treat functions as a “black box”. If you want to use the open reference clustering algorithm that command will align, classify, and assign sequences to OTUs. Although it is possible to tweak parameters for each of those steps, it isn’t always clear how. It’s also not entirely clear how one might add steps for making sure sequences overlap the same region or to identify and remove chimeras.

Reproducibility

mothur. An ongoing problem in science that has recently gotten a lot of important attention has been the ability to reproduce work of other scientists. One place where we can hopefully make progress on this is in the world of computational analysis. I should be able to take your data and reproduce a figure from your paper. Sadly this isn’t always possible or as easy as we think it should be. As a reviewer I see this problem frequently where people will say they used mothur/QIIME to analyze their data. Um… there are an infinite number of permutations of functions and parameter values that one could use. Help? We have worked to help users make their results more reproducible by outputting log files and posting SOPs. Over the last two years we have also worked really hard to put all of our data and code online for others to reproduce. There are currently two primary tools that people use - Jupyter (previously IPython Notebooks) and R Markdown documents. We have created a mothur hook for use with IPython notebooks and are developing hooks for use with R Markdown documents. I personally prefer R Markdown because I can embed results in my text to write a paper. In contrast, Jupyter is a notebook, which is useful for demonstrating how you did an analysis and the results, but isn’t really able to produce a manuscript ready to submit. Needless to say this is an area of active development.

QIIME. Likely because of its strong roots in Python, the QIIME developers are making great use of Jupyter to demonstrate how to use QIIME and disseminate their methods. My understanding is that QIIME v2.0 will make extensive use of this format.

Data Accessibility

mothur. Related to the previous point, analyses cannot be reproducible if the data are not available. Previously, submitting 16S rRNA gene sequences to NCBI’s Sequence Read Archive (SRA) has been a pain in the tuckus. This resulted in labs posting data to their personal websites or to 3rd party sites such as MG-RAST. I’m guilty of the former, although we’re working to correct this. The problem with these approaches is that often people are not depositing their raw data (i.e. sff and fastq files), only their processed data and may not be depositing their metadata. Also, although the SRA is difficult to search and access, it is a breeze to use compared to MG-RAST. To overcome this problem, we worked with the curators at the SRA to develop the make.sra command, which helps to simplify the process. This feature has been live since March 2015 and has been widely used by microbial ecologists. These are low estimates, but as of the beginning of January there have been 86 submissions from 61 studies containing 6367 runs representing 116 GBp submitted using make.sra. There really is no excuse at this point to use anything but the SRA for depositing raw sequence data. We are also in the process of developing an sra.info command that will convert data out of SRA format.

QIIME. As an alternative to the SRA, the developers of QIIME also developed QIITA. QIITA is an online database for storing and analyzing 16S rRNA gene sequence data. The goal appears to involve applying a common pipeline to datasets so that they can be compared. This makes use of the open and closed-reference clustering algorithms that are critiqued below. It also provides researchers with the ability to deposit raw data. I recently tried to access the Earth Microbiome Project (EMP) data that was used in one of their open-reference clustering papers. I failed. There was no obvious way to download a large number of files like one can with the SRA. When I asked some of the EMP researchers for help, it was clear QIITA is still under development and that it really isn’t designed to do what I wanted. My understanding is that they are in the process of uploading the data to either the European Nucleotide Archive (ENA) or the SRA.

Data visualization

mothur. We initially attempted to develop functions that would build heatmaps and Venn diagrams as SVG files. Although these data visualization tools are useful, I don’t feel like we did a great job of making the output from these functions as elegant as it could be. After experimenting a bit, we decided we would never be able to generate figures as nice as one could in R or Python using the extensive codebase that has been developed there. Instead, we focus on outputting data in formats that people can manipulate in other packages. To that end, all of our output files are text files and we can output a shared file as a BIOM-formatted file for integration with other microbial ecology tools that use that format.

QIIME. I applaud the QIIME developers’ efforts to build data visualization tools for analyzing microbial ecology data. I’m not personally a big fan of their black background ordinations or 2D depictions of 3D ordinations. The demos I’ve seen do a nice job of showing how users can re-color points in ordinations by metadata. Of course, this is something you can also do in R, but you need to know R first.

Clustering / OTU picking

mothur. We got started with creating DOTUR, which was the first open source tool for assigning sequences to OTUs. The plan for mothur was to make DOTUR able to process 454 data, but then we got to having so much fun… mothur currently implements hierarchical clustering algorithms in the cluster command including the average (the default), weighted, nearest, and furthest neighbor. We also have cluster.split, which is a way of dividing your data by taxonomy and then clustering. The output of cluster.split is the same as cluster, but it is faster and can be parallelized. We have done a lot of benchmarking to show that the average neighbor clustering algorithm gives the best clusters. It may be slower than other methods, but the data suggest it is consistently the best approach. You can read those papers here and here.
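
To make the average neighbor idea concrete, here is a minimal Python sketch. It is not mothur's implementation: the function name, the toy distances, and the 0.03 cutoff are all assumptions chosen for illustration. Two clusters are merged only while the mean of all pairwise distances between them stays at or below the cutoff.

```python
from itertools import combinations


def average_neighbor(dist, cutoff):
    """Merge clusters while the smallest average between-cluster distance <= cutoff.

    dist maps frozenset({a, b}) -> distance for every pair of sequence names.
    This is a toy illustration, not mothur's cluster command.
    """
    names = sorted({name for pair in dist for name in pair})
    clusters = [frozenset([n]) for n in names]

    def avg(c1, c2):
        # Average neighbor: mean of all pairwise distances between two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        d, c1, c2 = min(
            ((avg(x, y), x, y) for x, y in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if d > cutoff:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


# Hypothetical pairwise distances between four sequences.
toy = {
    frozenset(("A", "B")): 0.01,
    frozenset(("A", "C")): 0.02,
    frozenset(("B", "C")): 0.035,
    frozenset(("A", "D")): 0.10,
    frozenset(("B", "D")): 0.12,
    frozenset(("C", "D")): 0.11,
}
otus = average_neighbor(toy, cutoff=0.03)
print(otus)  # [['A', 'B', 'C'], ['D']]
```

In this toy case C joins the A/B cluster because its average distance to A and B is 0.0275. Under furthest neighbor, the 0.035 distance between B and C would have kept it out at the same cutoff, which is the kind of difference the benchmarking is meant to quantify.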

QIIME. As mentioned above, you can run an old version of mothur from within QIIME. Looking at their papers and online documentation, it is clear they want people to use their greedy de novo or their open reference OTU assignment commands, which are both based on USEARCH. In our 2015 paper that benchmarked a diverse collection of clustering algorithms, we showed that in some cases their distance-based greedy clustering algorithm could be as good as the average neighbor algorithm. However, we point out myriad problems with their open and closed reference clustering algorithms. It is very hard for me to encourage people to use these algorithms in QIIME. A common rejoinder is that some datasets are too large for the average neighbor algorithm. Our experience has been that this is more a product of sequencing error than anything else. As we point out in that paper, speed and memory usage are important, but cannot be used as the basis to say one method is better than another when there are clear differences in OTU quality. High quality clustering is a problem that will continue to plague us as datasets grow, even if they have a very low sequencing error rate.

Illumina sequence processing

mothur. The make.contigs command is our method of assembling paired sequence reads into contigs. If your reads fully overlap then it is possible to use this command with the rest of the pipeline to get an error rate below 0.02%. This is easily an order of magnitude lower than what we have seen other groups describe. Actually, very few other groups (any?) are sequencing mock communities to report an error rate. You can find the wetlab protocol at our GitHub repository. This approach and the bioinformatics benchmarking were published in AEM. Even with the advance in Illumina’s MiSeq technology to paired 300 nt reads, we are sticking with the paired 250 nt reads to sequence the V4 region because the new chemistry sucks.

QIIME. Greg Caporaso was the first author on the original method describing how to sequence 16S rRNA genes on an Illumina machine. Unfortunately, much of their benchmarking consisted of showing it was possible and that they got similar results to 454 data. They didn’t actually report error rates. In subsequent work, they suggested that it is possible to make comparisons with single reads or using HiSeq data. From our work and experience, we know that these data are very noisy and problematic. If you want to distinguish very different communities, use HiSeq data; if you are looking at more similar communities, you need high quality data. With this as background, it really isn’t clear what QIIME proposes for assembling reads into contigs or how it compares to what we’re doing with make.contigs. Their primary tutorial uses their Moving Pictures of the Human Microbiome dataset and is disseminated as a Jupyter notebook. Interestingly, that tutorial does not mention what to do with paired reads. Other parts of their website discuss assembly, but it isn’t clear to the typical user how the output fits in with the rest of the pipeline. They indicate that a paper is forthcoming so hopefully they’ll expand on this in the future and be able to compare their results to ours.

Classification

mothur. Taxonomic classification of sequences is handled within the classify.seqs command. The default method is to use the naive Bayesian classifier that was originally developed by the RDP team. We also enable researchers to use the k-Nearest Neighbor algorithm based on distance, blast, and kmer-based distances. The naive Bayesian classifier uses a pseudo-bootstrapping procedure to generate a confidence score. The RDP website uses 80%. By default, classify.seqs does not apply a threshold, although our SOPs all tell people to use 80%. The next release of mothur (v1.37.0) will use 80% as the default threshold for classify.seqs.
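
As a rough illustration of what applying a confidence threshold means, here is a small Python sketch. It assumes a taxonomy string in the style of a mothur taxonomy file, with each rank followed by its confidence score in parentheses; the function name and the example lineage are made up, and this is not how mothur itself implements the cutoff.

```python
import re


def apply_cutoff(taxonomy, cutoff=80):
    """Keep ranks until the first one whose confidence score falls below cutoff."""
    kept = []
    for rank in filter(None, taxonomy.split(";")):
        # Each rank is assumed to look like "Name(score)".
        match = re.fullmatch(r"(.+)\((\d+)\)", rank)
        if match is None or int(match.group(2)) < cutoff:
            break
        kept.append(match.group(1))
    return ";".join(kept) + ";" if kept else "unclassified;"


# Hypothetical classification with confidence scores at each rank.
line = "Bacteria(100);Firmicutes(97);Clostridia(85);Clostridiales(62);Lachnospiraceae(41);"
print(apply_cutoff(line))      # Bacteria;Firmicutes;Clostridia;
print(apply_cutoff(line, 50))  # Bacteria;Firmicutes;Clostridia;Clostridiales;
```

The same sequence is reported to the class level at 80% and to the order level at 50%, so the choice of threshold directly controls how deep a classification you are willing to trust.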

QIIME. The default classification method in QIIME is to use USEARCH to find the closest match in a reference database using the assign_taxonomy.py script. This appears to be a quasi nearest neighbors algorithm and if you use the defaults, then it requires 2 of the 3 top matches to have the same classification. I haven’t seen this approach published or validated anywhere. This would appear to be problematic since many taxa in the references only have one representative - so you would never get a sequence to classify to that taxon. I would also worry about the problems we saw with using USEARCH for database searching and closed reference clustering - basically, classification will change with the order of the database sequences. QIIME also provides the naive Bayesian classifier as an option in this script. As the default threshold, they require a confidence score of 50%. Although the original paper suggests a threshold of 80%, the RDP site curiously suggests 50% for short sequences; however, this makes no sense since your confidence should not vary with sequence length. In our own unpublished analyses, quality of assignment is proportional to the confidence score.
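
The single-representative problem is easy to see in a toy Python sketch of a 2-of-3 vote. This is my simplification, not QIIME's code: it votes on flat genus labels at a single rank, and the function name and the taxon "RareGenus" are hypothetical.

```python
from collections import Counter


def consensus_taxon(top_hits, min_agree=2):
    """Return the taxon shared by at least min_agree of the top hits, else None."""
    taxon, count = Counter(top_hits).most_common(1)[0]
    return taxon if count >= min_agree else None


# Well-represented genus: two of the three best hits agree.
print(consensus_taxon(["Bacteroides", "Bacteroides", "Prevotella"]))  # Bacteroides
# Genus with one reference sequence: even a perfect best hit is outvoted.
print(consensus_taxon(["RareGenus", "Prevotella", "Bacteroides"]))    # None
```

Because a taxon with one reference sequence can occupy at most one of the three slots, it can never reach two votes no matter how good the match is.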

Databases

mothur. For whatever reason, when I talk with people about the differences between mothur and QIIME, one of the things people think makes a huge difference between the packages is the databases we use. Supposedly, I like RDP and QIIME likes greengenes. Given that QIIME ships with greengenes, I suspect that the QIIME developers do like greengenes; however, we have no particular affinity for the RDP taxonomy. We use it in the SOP because it is smaller and easier to work with. I try to make it clear that you can use the RDP, SILVA, or greengenes for classification. We actually make the three reference taxonomies publicly available. When it comes to the alignment database, I do think the only way to go is to use a reference alignment based on the SILVA database. As I tell people, the alignment within the variable regions in the greengenes alignment looks like the person was somewhat drunk at the time while the SILVA alignment looks like it was done by Germans. Oh wait… :) This isn’t a matter of personal conjecture; we actually quantified the difference in an earlier paper and then looked again recently and found the same thing. As the database curators update their databases, we also update the references. I don’t know that they change much, but people get vexed when we don’t keep up with the Joneses.

QIIME. QIIME comes prepackaged with the greengenes database from 2013. This is a nice feature, because it limits the difficulties of keeping track of things. This also makes for bigger downloads. The QIIME commands appear to allow people to use their own database if they aren’t interested in the greengenes database. There are two important points to make with regard to QIIME’s use of databases. First, as I’ve mentioned elsewhere in this posting, USEARCH is sensitive to the ordering of the sequences in your reference. As we showed in that paper, for whatever reason, the default ordering of the greengenes database produces high error rates and could be substantially improved by randomizing the sequences before using USEARCH. Second, to build trees for phylogenetic approaches, they use sequences that are aligned to the greengenes alignment that I described above. This is somewhat disconcerting as the poorly aligned greengenes sequences artificially increase the distances between sequences. But, as their documentation indicates “FILTERING ALIGNMENTS WHICH WERE BUILT WITH PYNAST AGAINST THE GREENGENES CORE SET ALIGNMENT SHOULD BE CONSIDERED AN ESSENTIAL STEP (caps in original)”. This filtering is no doubt necessary to remove the poorly aligned variable regions. In the same publication, we showed that applying filters like the traditional Lane mask significantly mutes the differences between sequences. Although I agree that such filtering is necessary for building phylogenies where you are trying to propose new lineages of taxa, it seems mistaken to filter when you are looking for differences between sequences. It really would be nice to see them remove so much emphasis on the greengenes database and make the SILVA reference alignment the prepackaged default instead.
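
For anyone who wants to try randomizing the order of a reference before searching it, here is a minimal Python sketch. The function names and the three toy records are hypothetical, it works on FASTA text held in memory, and it assumes that ">" only appears at the start of header lines. It illustrates the idea and is not the procedure we used in the paper.

```python
import random


def read_fasta(text):
    """Split FASTA text into (header, sequence) records."""
    records = []
    for chunk in text.split(">")[1:]:
        header, *seq_lines = chunk.splitlines()
        records.append((header, "".join(seq_lines)))
    return records


def shuffle_fasta(text, seed=1):
    """Return FASTA text with the same records in a seeded random order."""
    records = read_fasta(text)
    # A seeded generator keeps the shuffled order reproducible between runs.
    random.Random(seed).shuffle(records)
    return "".join(f">{h}\n{s}\n" for h, s in records)


# Hypothetical three-sequence reference.
reference = ">ref1 taxA\nACGT\nACGT\n>ref2 taxB\nGGCC\n>ref3 taxC\nTTAA\n"
shuffled = shuffle_fasta(reference)
print(shuffled)
```

Fixing the seed matters here: if the results depend on database order, you at least want to be able to report which order you used.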

Conclusions

I hope that you have found this comparison to be useful. As much as possible, I have tried to be balanced in my critiques of both mothur and QIIME. Hopefully, you will find that most of the differences between the two programs are pretty cosmetic. The more substantive differences are in aspects of the programs that are admittedly under active development. One of the things that absolutely drives me crazy is when people say they like program X because it gives them “good results”. I am unclear what that means. Unless you have objective criteria or know the correct answer, you can’t be certain that you have “good results”. Be honest and admit that you use program X because you just like it better for cosmetic reasons, or have some actual data to suggest that it is better. Of course, if X happens to be QIIME, please let me know - my group uses mothur and we want to make sure we’re using the best software possible. I know you want the same thing. Finally, I really appreciate the input that I received from a number of people that I asked to review this review before I posted it. I worked very hard to remove any snark, cynicism, sarcasm, etc. to provide as balanced a review as possible. These people have held me to this goal and I appreciate their feedback.