pre.cluster

The pre.cluster command implements a pseudo-single linkage algorithm with the goal of removing sequences that are likely due to pyrosequencing errors. A version of this algorithm was developed by Sue Huse (paper here). The basic idea is that abundant sequences are more likely to generate erroneous sequences than rare sequences. With that in mind, the algorithm proceeds by ranking sequences in order of their abundance. Then we walk through the list of sequences looking for rarer sequences that are within some threshold of the original sequence. Those that are within the threshold are merged with the larger sequence. The original Huse method performs this task on a distance matrix, whereas we do it based on the original sequences. The advantage of our approach is that the algorithm works on aligned sequences instead of a distance matrix. This is advantageous because by pre-clustering you remove a large number of sequences making the distance calculation much faster.

Default settings

The pre.cluster command expects a fasta-formatted file and a names or count file and that the sequences are in the same order in both files. Both of these files can be generated by the unique.seqs command. For example, if you are following along with the Sogin data analysis example and have aligned, filtered, and unique’d your sequences, then enter the following to perform the pre.clustering command:

mothur > unique.seqs(fasta=sogin.unique.filter.fasta, name=sogin.names)
mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, name=sogin.unique.filter.names)

or

mothur > unique.seqs(fasta=sogin.unique.filter.fasta, count=sogin.count_table)
mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, count=sogin.unique.filter.count_table)

Will result in the following output:

0  21821   86
100    20286   1621
200    19824   2083
...
21700  16380   5527
21800  16377   5530
21900  16376   5531

Total number of sequences before precluster was 21907. pre.cluster removed 5531 sequences.

This output indicates, by column, the number of sequences processed, the number of sequences that will be found in the final dataset, and the number of sequences that have been clustered away. This should accelerate as the function runs. In this example, this step merged 5,531 sequences with other sequences, leaving you with a set of 16,376 sequences to work with. As an additional benefit to removing potentially erroneous sequences, the reduced dataset will run about 1.8 times faster through dist.seqs than the original and should cluster much faster as well. Two files are created - a *.precluster.fasta and a *.precluster.name file containing the new sequence and name file or *.precluster.count_table for further processing.

Options

count

If you provide a count file that contains group information, mothur will pre.cluster sample by sample.

mothur > pre.cluster(fasta=stool.trim.unique.good.filter.unique.fasta, count=stool.trim.unique.good.filter.count_table)

The screen output will look like:

Processing group F11Fcsw:
0  355 5
100    292 68
200    284 76
300    281 79
360    281 79
Total number of sequences before pre.cluster was 360.
pre.cluster removed 79 sequences.
...

diffs

By default the pre.cluster command will look for sequences that are within 1 mismatch of the sequence being considered. With the diffs option you can change this threshold. For example:

mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, name=sogin.unique.filter.names, diffs=2)

0  21777   130
100    19165   2742
200    18419   3488
...
21700  13273   8634
21800  13270   8637
21900  13267   8640

Total number of sequences before precluster was 21907. pre.cluster removed 8640 sequences.

align

When using unaligned sequences, the pre.cluster command allows you to select between two alignment methods - gotoh and needleman - needleman is the default setting:

mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, name=sogin.unique.filter.names, diffs=2, align=needleman)

The needleman algorithm penalizes the same amount for opening and extending a gap. Alternatively, you could use the gotoh algorithm, which charges a different penalty for opening (default=-2) and extending (default=-1) gaps:

mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, name=sogin.unique.filter.names, diffs=2, align=gotoh)

Our experience has shown that the added parameters in the gotoh algorithm do not improve the pairwise alignment and increases the time required for the alignment.

match, mismatch, gapopen, and gapextend

If you are using unaligned sequences, in the pairwise alignment portion of the aligning procedure, the default reward for a match is +1 and the penalties for a mismatch, opening and extending a gap are -1, -2, and -1. Our experience has shown that these produce the best alignments for 16S rRNA gene sequences. You are encouraged to play around with these to suit your own purposes as shown below:

mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, name=sogin.unique.filter.names, diffs=2, match=1, mismatch=-3)

or

mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, name=sogin.unique.filter.names, diffs=2, gapopen=-5)

etc.

method

The method parameter allows you to specify the algorithm to use to complete the preclustering step. Possible methods include simple, tree, unoise, and deblur. Default=simple. There are subtle, but important differences between our implementation of the uniose and deblur algorithms. First, we do not impose an abundance threshold like the original implementations do. Second, original unoise and deblur algorithms take in unalinged sequences to run the algorithm. The mothur implementations assume that the sequences are already aligned; this significantly speeds up the deblur algorithm. The deblur algorithm trims all of the sequences to a common length. Our implementation of the algorithm does not do this. Finally, the original implementations of unoise and deblur combine denoising with chimera checking. To remove the black box aspect of this and give greater control over parameter settings our implementation does not do chimera checking

clump

The clump parameter allows you to specify which reads can be combined. Possible options include lessthan and lessthanequal. lessthan -> merge reads with less abundance. lessthanequal -> merge reads with less than or equal abundance. Default=lessthan.

alpha

The alpha parameter allows you to specify the alpha value for the beta formula, which is used in the unoise algorithm. The default is 2.0.

delta

The delta parameter allows you to specify the delta value, which describes the amount of amplification between rounds of PCR. It is used in the tree algorithm. The default is 2.0.

error_rate, indel_prob, max_indels, max_indels and error_dist

These are parameters that are used with the deblur algorithm. The default values we use are taken from the original deblur implementation, which were determined using 1x150 nt reads of mock communities. They are most certainly wrong if you are sequencing any other way.

  • error_rate is the expected mean error rate, as a fraction, of the data going into this command.
  • indel_prob is the expected fraction of sequences that have an insertion or deletion, of the data going into this command.
  • max_indels is the maximum number of insertions or deletions you expect to be in the data going into this command.
  • error_dist is the fraction of sequences you expect to have 0, 1, 2, 3, etc. errors. Should start with 1 and be separated by hyphens (e.g. 1-0.06-0.02-0.02-0.01-0.005-0.005-0.005-0.001-0.001-0.001-0.0005). Alternatively, you can use error_dist=binomial and the command will determine the distribution for you based on a binomial distribution usin gthe error_rate value.

fasta only

As shown above, pre.cluster expects you to provide a count file so that it can acquire the abundance information from each sequence. If you do not provide the count file the command will automatically run your data through unique.seqs to generate to get the information it needs.

processors

The processors option enables you to accelerate the clustering by using multiple processors. Default processors=Autodetect number of available processors and use all available.

The name option allows you to provide a name file associated with your fasta file.

We DO NOT recommend using the name file. Instead we recommend using a count file. The count file reduces the time and resources needed to process commands. It is a smaller file and can contain group information.

The group parameter allows you to provide a group file.

We DO NOT recommend using the name / group file combination. Instead we recommend using a count file. The count file reduces the time and resources needed to process commands. It is a smaller file and can contain group information.

Caveat emptor

Something to keep in mind is that when you set the number of mismatches to 2, you are allowing that the maximum difference between sequences within a cluster to be 4 (2 from the dominant sequence in one direction, and 2 in any other direction). This difference of 4 bases, could compromise your ability to distinguish signal from noise. For example, with this Sogin datasets, the sequences are ~60 bp V6 pyrotags. A difference of 4 bases is 6.7%! Alternatively, when using diffs=1, the difference of 2 bases is 3.3%. The assumption of the algorithm is that these mismatches are noise; however, it doesn’t make sense to then analyze your data at 3% by either level of diffs. Remember that this method does not actually remove the noise, it just clusters sequences that are likely to be noisy. To remove the noise you would need to use a program like Chris Quince’s pyronoise or ampliconnoise. Considering using PyroNoise/AmpliconNoise may not be practical for many people, the pre.cluster option may be your best bet.

Revisions

  • 1.22.0 Added group option.
  • 1.23.0 Added processors option for by group processing.
  • 1.23.0 Added map file to output.
  • 1.28.0 Added count option
  • 1.30.0 Added topdown parameter
  • 1.33.0 Improved work balance load between processors.
  • 1.36.0 Added cluster method for unaligned sequences. Added align, mismatch, match, gapopen, gapextend parameters.
  • 1.39.0 Major speed improvement for pre.cluster command with aligned files.
  • 1.39.0 Major memory requirement reduction for pre.cluster command with by group option.
  • 1.39.1 Pre.cluster(without groups) not clustering properly
  • 1.40.0 Rewrite of threaded code. Default processors=Autodetect number of available processors and use all available.
  • 1.40.4 Bug Fix: Pre.cluster name / group error. #461
  • 1.40.4 Bug Fix: Pre.cluster single sample crash. #451
  • 1.41.0 Adds unoise, tree, deblur methods.
  • 1.42.0 Reduces memory needed for pre.cluster #589
  • 1.43.0 Makes current file class thread safe. Caused random crashes in pre.cluster. #643
  • 1.45.0 Adds clump paramter. #749
  • 1.45.0 Fixes bug with pre.cluster’s unoise method.
  • 1.47.0 Removes blast #801
  • 1.48.0 Significant speed improvements to command’s split by sample process.