The sff.multiple command reads a file containing sff filenames and optional oligos filenames. It runs the files through sffinfo, trim.flows, shhh.flows and trim.seqs combining the results.
To get going with sff.multiple you first need to create a file containing the names of the sff files you would like to process. Here’s an example:
F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.sff F5MMO9001.oligos F5MMO9001.1.F5MMO9001.water_blank_2.WUGSC.v13.sff F5MMO9001.oligos #F5MMO9001.1.F5MMO9001.positive_mock_2.WUGSC.v13.sff F5MMO9001.oligos
The ‘#’ character indicates to mothur you wish to ignore that line in the file. The oligos file name is optional.
mothur > sff.multiple(file=sfffiles.txt)
The output to the screen will look like:
mothur > sff.multiple(file=sfffiles.txt) Using 1 processors. >>>>> Processing F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.sff (file 1 of 3) <<<<< /******************************************/ Running command: sffinfo(sff=F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.sff, flow=T, trim=T) Extracting info from F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.sff ... 102 It took 0 secs to extract 102. Output File Names: F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.fasta F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.qual release/F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.flow Running command: summary.seqs(fasta=F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.fasta, processors=1) Using 1 processors. Start End NBases Ambigs Polymer NumSeqs Minimum: 1 50 50 0 3 1 2.5%-tile: 1 53 53 0 3 3 25%-tile: 1 377 377 0 4 26 Median: 1 535 535 0 4 52 75%-tile: 1 549 549 0 5 77 97.5%-tile: 1 571 571 0 6 100 Maximum: 1 574 574 1 7 102 Mean: 1 432.912 432.912 0.00980392 4.4902 # of Seqs: 102 Output File Name: F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.summary Running command: trim.flows(flow=F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.flow, oligos=/F5MMO9001.oligos, maxhomop=9, maxflows=450, minflows=450, pdiffs=0, bdiffs=0, ldiffs=0, sdiffs=0, tdiffs=0, signal=0.5, noise=0.7, order=TACG, processors=1) Using 1 processors. 102 Output File Names: F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.trim.flow F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.scrap.flow F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.F5MMO9001.water_blank_1.WUGSC.v35.flow F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.flow.files Running command: shhh.flows(file=F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.flow.files, lookup=LookUp_Titanium.pat, cutoff=0.01, sigma=60, mindelta=0.000001, order=TACG, processors=1) Using 1 processors. >>>>> Processing F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.F5MMO9001.water_blank_1.WUGSC.v35.flow (file 1 of 1) <<<<< Reading flowgrams... Identifying unique flowgrams... Calculating distances between flowgrams... ... Output File Names: F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.F5MMO9001.water_blank_1.WUGSC.v35.shhh.qual F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.F5MMO9001.water_blank_1.WUGSC.v35.shhh.fasta F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.F5MMO9001.water_blank_1.WUGSC.v35.shhh.names F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.F5MMO9001.water_blank_1.WUGSC.v35.shhh.counts F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.F5MMO9001.water_blank_1.WUGSC.v35.shhh.groups F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.fasta F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.names Running command: trim.seqs(fasta=F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.fasta, name=F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.names, oligos=F5MMO9001.oligos, allfiles=f, flip=f, keepforward=f, pdiffs=0, bdiffs=0, ldiffs=0, sdiffs=0, tdiffs=0, maxambig=-1, minlength=0, maxlength=0, processors=1) Using 1 processors. 37 Group count: F5MMO9001.water_blank_1.WUGSC.v35 66 Total of all groups is 66 Output File Names: F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.trim.fasta F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.scrap.fasta F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.trim.names F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.scrap.names F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.groups Running command: summary.seqs(fasta=F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.trim.fasta, processors=1, name=F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.trim.names) Using 1 processors. Start End NBases Ambigs Polymer NumSeqs Minimum: 1 263 263 0 4 1 2.5%-tile: 1 263 263 0 4 2 25%-tile: 1 267 267 0 4 17 Median: 1 276 276 0 4 34 75%-tile: 1 283 283 0 4 50 97.5%-tile: 1 305 305 0 6 65 Maximum: 1 305 305 0 7 66 Mean: 1 278.636 278.636 0 4.33333 # of unique seqs: 37 total # of seqs: 66 Output File Name: F5MMO9001.1.F5MMO9001.water_blank_1.WUGSC.v35.shhh.trim.summary /******************************************/ ...
minflows & maxflows
The minflows parameter will set the minimum number of flows that each sequence must contain to make it in to a “trim” file while running the trim.flows command. By default this is set to 450; Chris Quince has preferred 360 in his documentation for processing GSFLX and Titanium data.
bdiffs & pdiffs & ldiffs & sdiffs & tdiffs
These parameters are used to allow differences in the barcode, primers, linkers and spacers. pdiffs is maximum number of differences to the primer sequence, default=0. bdiffs is maximum number of differences to the barcode sequence, default=0. ldiffs is maximum number of differences to the linker sequence, default=0. sdiffs is maximum number of differences to the spacer sequence, default=0. tdiffs is maximum total number of differences to the barcode, primer, linker and spacer (default to pdiffs + bdiffs + ldiffs + sdiffs).
signal & noise
By default, trim.flows will treat any intensity signal greater than 0.50 as a real signal and any intensity less than 0.70 as noise. If an intensity falls between 0.50 and 0.70, it is treated as ambiguous and set as a trim point. The settings of 0.50 and 0.70 are suggested by Quince and we really see no need to change it, but in case people want to play with the values here is how you’d do it:
mothur > sff.multiple(file=sfffiles.txt, signal=0.60, noise=0.65)
Looking at the summary.seqs output for your dataset you may notice that the longest homopolymer in the dataset is 31 bases long. This is highly suspect as it is well-established that 454 technology struggles with homopolymers. To cap the homopolymer length you use the maxhomop option:
mothur > sff.multiple(file=sfffiles.txt, maxhomop=9)
The order parameter is used to select the flow order. Options are A, B and I. Default=A, meaning flow order of TACG.
mothur > sff.multiple(file=sfffiles.txt, order=A)
A lookup file is required to run shhh.flows and it needs to be located either in the same folder as your data, next to your executable, or in the path you give this option. You can obtain the various lookup files that are compatible with mothur.
The maxiter option tells shhh.flows the maximum iterations to run if the delta value does not first drop below the mindelta value. The minimum number of iterations is 10. By default maxiter is set to 1000. If you set maxiter to 0, then the number of iterations do not matter and the mindelta criteria will be used. To change the value of maxiter, do so as follows:
mothur > sff.multiple(file=sfffiles.txt, maxiter=1000)
The mindelta sets a threshold for determining how much change in the flowgram correction is allowed before saying that the job is done. By default this is set to 0.000001 (i.e. 10\^-6). We took this default value from Chris Quince. It can be changed as follows:
mothur > sff.multiple(file=sfffiles.txt, mindelta=0.000001)
The cutoff option is used in the initial clustering step to seed the expectation-maximizaton step. Quince suggests a default value of 0.01 and we have no reason to suggest otherwise:
mothur > sff.multiple(file=sfffiles.txt, cutoff=0.01)
The sigma option is used to set the dispersion of the data in the expectation-maximization step of the algorithm. Quince suggests a default value of 0.06 and we have no reason to suggest otherwise:
mothur > sff.multiple(file=sfffiles.txt, sigma=0.06)
minlength & maxlength
Huse and colleagues also found that large variations in sequence length was an indicator of poor sequence quality. Of course, they were working with sequences that they knew the correct sequence lengths, so you should be cautious about screening based on length. Looking at these data, you might predict that sequences shorter than 200 bp or longer than 300 bp are probably bad, but it is a judgement call:
mothur > sff.multiple(file=sfffiles.txt, minlength=200, maxlength=300)
keepfirst & removelast
The keepfirst parameter in trim.seqs trims the sequence to the first keepfirst number of bases after the barcode or primers are removed, before the sequence is checked to see if it meets the other requirements. The removelast removes the last removelast number of bases after the barcode or primers are removed, before the sequence is checked to see if it meets the other requirements.
With the barcodes it is possible to sequence many different samples in a single 454 run. It may happen that some of these samples should not be analyzed together for your analysis (e.g. comparing fecal to rhizosphere microbial communities). To parse apart the sequences that belong to each sample, use the allfiles option. This will generate a fasta and groups file for each barcode defined in your oligos file:
mothur > sff.multiple(file=sfffiles.txt, allfiles=T)
The keepforward parameter allows you indicate you want to keep the primer for the trim.seqs command. Default=F.
The processors parameter allows you to run the command with multiple processors. Default processors=Autodetect number of available processors and use all available.
mothur > sff.multiple(file=sfffiles.txt, processors=2)
- 1.28.0 - First introduced
- 1.30.0 - Bug Fix: - end of file character added to oligos filenames when the input file did not end in blank line, resulting in cannot find file error.
- 1.31.0 - Bug Fix: fixed order parameter to use A, B and I.
- 1.40.0 - Rewrite of threaded code. Default processors=Autodetect number of available processors and use all available.