Dereplicate parameter example

This page includes a short example to highlight the dereplicate parameter used by several of mothur’s chimera command.

The dereplicate parameter can be used when checking for chimeras by group. When the dereplicate parameter is false, if one group finds the sequence to be chimeric, then all groups find it to be chimeric, default=f. If you set dereplicate=t, and then when a sequence is found to be chimeric it is removed from it’s group, not the entire dataset.

Note: When you set dereplicate=t, mothur generates a new count table with the chimeras removed and counts adjusted by sample.

Let’s look at a small example. Lets assume that seq1 and seq5 are flagged as chimeric in group2, seq1 is chimeric in group1, seq8 is chimeric in group3.

small.fasta file (original):

>seq1
……
>seq4
…….
>seq5
…….
>seq8
…….

small.count_table file (original):

Representative_Sequence	total	group1	group2	group3
Seq1	25	15	10	0
Seq4	10	5	3	2
Seq5	15	0	7	8
Seq8	20	10	5	5


mothur > chimera.vsearch(fasta=small.fasta, count=small.count_table, dereplicate=f)

When derepilcate=f, no modified count file is created. The accnos file will contain seq1, seq5, and seq8. Next you will run the remove.seqs command WITH the count and fasta files

mothur > remove.seqs(fasta=small.fasta, count=small.count_table, accnos=current)

and the resulting files will look like:

Fasta file (derepilcate=f, remove.seqs with count file):

>seq8
…….

Count file:

Representative_Sequence	total	group1	group2	group3
Seq4	10	5	3	2

When derepilcate=t, mothur creates a modified count file. The accnos file will contain seq1, the only read where all samples found seq1 to be chimeric. You will run the remove.seqs command with ONLY the fasta file because the chimeras were already removed.

mothur > remove.seqs(fasta=small.fasta, accnos=current)

and the resulting files will look like:

Fasta file (dereplicate=t, remove.seqs WITHOUT count file):

>seq4
…….
>seq5
…….
>seq8
…….

Count file (modified -> dereplicate=t, created by chimera command):

Representative_Sequence	total	group1	group2	group3
Seq4	10	5	3	2
Seq5	8	0	0	8
Seq8	15	10	5	0

You can see dereplicate=t is a much more conservative approach to chimera removal, and the method we recommend.

One common mistake when using the dereplicate=t parameter and the remove.seqs command that may result in unintended results:

Using the original fasta and count file in the remove.seqs command. Consider the results from the example above:

As above, lets assume that seq1 and seq5 are flagged as chimeric in group2, seq1 is chimeric in group1, seq8 is chimeric in group3. The accnos file will contain seq1, the only read where all samples found seq1 to be chimeric.

Running remove.seqs with the original count file

mothur > remove.seqs(fasta=small.fasta, small.count_table, accnos=current)

results in this:

Fasta file (original):

seq4
…….
seq5
…….
seq8
…….

Count file (after remove.seqs):

Representative_Sequence	total	group1	group2	group3
Seq4	10	5	3	2
Seq5	15	0	7	8
Seq8	20	10	5	5

Instead of

Count file (modified -> dereplicate=t, created by chimera command):

Representative_Sequence	total	group1	group2	group3
Seq4	10	5	3	2
Seq5	8	0	0	8
Seq8	15	10	5	0

As you can see, the sequences flagged as chimeric in some samples, but not all samples, are not removed from the dataset. This results in a falsely inflated number of good reads, 45 instead of 33.