screen.seqs

The screen.seqs command enables you to keep sequences that fulfill certain user defined criteria. Furthermore, it enables you to cull those sequences not meeting the criteria from a names, group, contigsreport, alignreport and summary file. As an example, we will use the example from the Sogin data analysis example.

Default settings

First, we can get a summary of the characteristics of the unaligned or aligned sequences:

mothur > summary.seqs(fasta=sogin.unique.align)

       Start   End NBases  Ambigs  Polymer
Minimum:   109 222 40  0   2
2.5%-tile: 4648    4875    57  0   3
25%-tile:  4655    4928    60  0   3
Median:    4655    4928    62  0   4
75%-tile:  4655    4928    65  0   4
97.5%-tile:    4655    4939    73  0   6
Maximum:   6793    6850    100 0   31
# of Seqs: 21907

You will be able to screen for each of these characteristics using the start, end, maxambig, maxhomop, minlength, and maxlength options. Any of these options can be combined in a single command. Problematic sequence names can also be removed from the name and group files with the name, and group options.

Options

start & end

You may have noticed that when you make an alignment there are some sequences that do not align in the same region as most of the sequences that you are analyzing. For example, in the Sogin dataset, there was at least one sequence whose alignment started at position 109 and a sequence that ended by position 222 while most sequences started by 4655 and ended by position 4928. The following command will remove sequences that start after position 4655:

mothur > screen.seqs(fasta=sogin.unique.align, start=4655)

To remove the sequences that end before position 4928, the following command will work:

mothur > screen.seqs(fasta=sogin.unique.align, end=4928)

Putting it all together the following command will remove those sequences that don’t start by position 4655 and do end before position 4928:

mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928)
mothur > summary.seqs(fasta=sogin.unique.good.align)

       Start   End NBases  Ambigs  Polymer
Minimum:   4615    4928    50  0   2
2.5%-tile: 4652    4928    57  0   3
25%-tile:  4655    4928    60  0   3
Median:    4655    4928    62  0   4
75%-tile:  4655    4928    65  0   4
97.5%-tile:    4655    4938    72  0   6
Maximum:   4655    5014    100 0   31
# of Seqs: 20670

minlength & maxlength

In some studies an investigator may get a bunch of 5’ sequences and some full length sequences, in others an investigator might want to make sure that pyrosequences are within some window for length as a quality control measure. As an example, the Sogin data set had 16S rRNA V6 pyrosequences ranging in length between 40 and 100 bp. To remove sequences shorter than 50 bp, the minlength option should be used:

mothur > screen.seqs(fasta=sogin.unique.align, minlength=50)

Alternatively, to remove sequences longer than 70 bp, the maxlength option should be used:

mothur > screen.seqs(fasta=sogin.unique.align, maxlength=70)

Putting it together to keep only those sequences that are between 50 and 70 bp:

mothur > screen.seqs(fasta=sogin.unique.align, minlength=50, maxlength=70)
mothur > summary.seqs(fasta=sogin.unique.good.align)

       Start   End NBases  Ambigs  Polymer
Minimum:   109 222 50  0   2
2.5%-tile: 4648    4875    57  0   3
25%-tile:  4655    4928    60  0   3
Median:    4655    4928    62  0   4
75%-tile:  4655    4928    64  0   4
97.5%-tile:    4655    4939    69  0   6
Maximum:   6785    6850    70  0   9
# of Seqs: 20160

maxambig

As part of their initial screening procedure, Sogin and colleagues removed any sequences with any ambiguous bases. You can do this with the maxambig option:

mothur > screen.seqs(fasta=sogin.unique.align, maxambig=0)

maxn

The maxn parameter has the same function as maxambig. It is retained as a separate command to avoid problems for users running scripts developed on older versions of the program. As a sequence is read in, any degenerate bases are converted to ‘N’s.

maxhomop

While we don’t necessarily know the longest acceptable homopolymer for a 16S rRNA gene, the max length of 31 is clearly a sequencing artifact. If you are interested in removing sequences with excessively long homopolymers, then you should use the maxhomop option:

mothur > screen.seqs(fasta=sogin.unique.align, maxhomop=8)
mothur > summary.seqs(fasta=sogin.unique.good.align)

       Start   End NBases  Ambigs  Polymer
Minimum:   109 222 40  0   2
2.5%-tile: 4648    4875    57  0   3
25%-tile:  4655    4928    60  0   3
Median:    4655    4928    62  0   4
75%-tile:  4655    4928    65  0   4
97.5%-tile:    4655    4939    73  0   6
Maximum:   6793    6850    100 0   8
# of Seqs: 21896

group, name, count, qfile, taxonomy

If you apply screen.seqs as we have in the previous sections and then attempt to run any downstream commands that need a name, group, qfile, count and taxnomy file, the files will be incompatible since those files will still contain information for sequences that you culled. To get around this you can use the group and name option:

mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928, maxhomop=6, name=sogin.names, group=sogin.groups, qfile=sogin.qual, alignreport=sogin.unique.align.report, taxonomy=sogin.unique.rdp.taxonomy)
mothur > summary.seqs(fasta=sogin.unique.good.align)

       Start   End NBases  Ambigs  Polymer
Minimum:   4615    4928    50  0   2
2.5%-tile: 4652    4928    57  0   3
25%-tile:  4655    4928    60  0   3
Median:    4655    4928    62  0   4
75%-tile:  4655    4928    65  0   4
97.5%-tile:    4655    4938    72  0   6
Maximum:   4655    5014    100 0   6
# of Seqs: 20530

Looking at the sogin.good.name files and sogin.good.unique.good.align.report files we see that there are now 20,530 entries representing a total of 216,243 sequences, which agrees with sogin.unique.good.align. Next, looking at sogin.good.groups we see that there are a total of 216,243 sequences, which also agrees.

or with a count file

mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928, maxhomop=6, count=sogin.count_table, qfile=sogin.qual, alignreport=sogin.unique.align.report, taxonomy=sogin.unique.rdp.taxonomy)

optimize && criteria

The optimize and criteria parameters allow you set the start, end, maxabig, maxhomop, minlength, maxlength, minoverlap, ostart, oend, mismatches, maxn, minscore, maxinsert and minsim parameters relative to your set of sequences. The following command would remove all sequences that started after the position that 90% of the sequences do.

mothur > screen.seqs(fasta=sogin.unique.align, optimize=start, criteria=90) 

Using the optimize parameter trumps the value given for the start parameter if you have provided it. You can optimize to several parameters by separating by dashes. Example: optimize=start-end-minlength, would remove any sequence that starts after the position that 90% of the sequences do, or ends before the position that 90% of the sequences do, or whose length is shorter than 90% of the sequences.

summary

The summary file allows you to enter the summary file created by summary.seqs to save processing time when screening with parameters in the summary file. The parameters in the summary file are: start, end, maxambig, maxhomop, minlength and maxlength.

mothur > summary.seqs(fasta=sogin.unique.align)
mothur > screen.seqs(fasta=sogin.unique.align, summary=sogin.unique.summary, minlength=50, maxlength=70)

contigsreport

The contigs report file is created by the make.contigs command. If you provide the contigs report file you can screen your sequences using the following parameters: minoverlap, ostart, oend and mismatches.

minoverlap

The minoverlap parameter is used to set a minimum overlap length. It can only be used with a contigsreport file.

mothur > screen.seqs(fasta=stability.trim.contigs.fasta, contigsreport=stability.trim.contigs.report, minoverlap=25)

ostart && oend

The ostart and oend parameters are similar to the start and end parameters. Setting the ostart parameter sets a position the overlap must start by. Setting the oend parameter sets a position the overlap must end after. They can only be used with a contigsreport file.

mismatches

The mismatches parameter allows you to set and maximum mismatches in the *.contigs.report.

mothur > screen.seqs(fasta=stability.trim.contigs.fasta, contigsreport=stability.trim.contigs.report, mismatches=5)

alignreport

The align report file is created by the align.seqs command. If you provide the align report file you can screen your sequences using the following parameters: minscore, maxinsert and minsim.

minscore

The minscore parameter allows you to set the minimum search score during alignment. Found in column ‘SearchScore’ in align.report file.

maxinsert

The maxinsert parameter allows you to set the maximum number of insertions during alignment. Found in column ‘LongestInsert’ in align.report file.

minsim

The minsim parameter allows you to set the minimum similarity to template sequences during alignment. Found in column ‘SimBtwnQuery&Template’ in align.report file.

processors

The processors parameter allows you to run the command with multiple processors. Default processors=Autodetect number of available processors and use all available.

mothur > screen.seqs(fasta=sogin.unique.align, start=4655, end=4928, maxhomop=6, name=sogin.names, group=sogin.groups, alignreport=sogin.unique.align.report, processors=2)

Revisions

  • 1.23.0 Added taxonomy parameter
  • 1.24.0 Paralellized for Windows.
  • 1.28.0 Added count parameter
  • 1.30.0 Added contigsreport, alignreport and summary files to allow files to be screened using their inputs. added minoverlap, ostart, oend, mismatches, maxn, minscore, maxinsert, minsim parameters and optimization types of minoverlap, ostart, oend, mismatches, maxn, minscore, maxinsert, minsim.
  • 1.32.0 Bug Fix: for count tables without group information, mothur was not creating the *good.count_table correctly. Which could cause fasta/count file mismatch errors downstream.
  • 1.32.0 Allowed for percent in the criteria parameter.
  • Changes minlength default to 10. #160
  • 1.39.0 Fixes bug with optimize parameter using alignreport file
  • 1.40.0 Reduces memory usage #319
  • 1.40.0 Rewrite of threaded code. Default processors=Autodetect number of available processors and use all available.
  • 1.40.4 - Bug Fix: pcr.seqs and screen.seqs (with no bad reads detected), causing accnos file issue.
  • 1.40.5 - Bug Fix: not making *good.groups or *good.count_table files. #476