collect.shared

The collect.shared command generates collector’s curves for calculators that describe the similarity between communities or their shared richness. Collector’s curves describe how richness or diversity changes as you sample additional individuals. If a collector’s curve becomes parallel to the x-axis, you can be reasonably confident that you have sampled thoroughly and can trust the last value in the curve. Otherwise, you need to keep sampling. mothur can generate data for collector’s curves in much the same way that sons did; however, sons presented the data in the sons file, which was virtually impossible for novices to parse. mothur addresses this by generating a separate file for each estimator. For this tutorial you should download and decompress patient70data.zip.
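To make the idea of a collector’s curve for shared richness concrete, here is a minimal Python sketch. It is not mothur’s implementation: the function name shared_sobs_curve, the pooling of both groups into a single shuffled list, and the toy OTU identifiers are all assumptions made for illustration.

```python
import random


def shared_sobs_curve(group_a, group_b, freq=100, seed=None):
    """Return (n_sampled, shared_otus) points for two groups.

    group_a and group_b are lists of OTU identifiers, one entry per
    sequence. Pooling and shuffling both groups together is an assumption
    of this sketch, not a description of mothur's sampling scheme.
    """
    rng = random.Random(seed)
    pool = [("a", otu) for otu in group_a] + [("b", otu) for otu in group_b]
    rng.shuffle(pool)  # individuals are collected in random order

    seen = {"a": set(), "b": set()}
    shared = 0
    curve = []
    for n, (grp, otu) in enumerate(pool, start=1):
        other = "b" if grp == "a" else "a"
        if otu not in seen[grp]:
            seen[grp].add(otu)
            # The OTU becomes shared the first time it is seen in both groups.
            if otu in seen[other]:
                shared += 1
        # Record the first point, every freq-th point, and the final point.
        if n == 1 or n % freq == 0 or n == len(pool):
            curve.append((n, shared))
    return curve


if __name__ == "__main__":
    tissue = ["otu1", "otu2", "otu2", "otu3"]  # toy data
    stool = ["otu2", "otu3", "otu4"]           # toy data
    for n, shared in shared_sobs_curve(tissue, stool, freq=2, seed=1):
        print(n, shared)
```

Because the order is random, intermediate points differ between seeds, but the final point always equals the number of OTUs present in both groups.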

Default settings

To execute the collect.shared() command you first need to have run the make.shared command with the list and group options. For example:

mothur > make.shared(list=patient70.fn.list, group=patient70.tissue_stool.groups)

By default, the collect.shared() command will randomize the order in which the individuals are sampled. So if you run collect.shared() multiple times, you will get slightly different results. The collector’s curve data for several calculators is generated by default with the following command:

mothur > collect.shared(shared=patient70.fn.shared)

This will result in output to the screen looking like:

unique 1
0.00   2
0.01   3
0.02   4
0.03   5
0.04   6
0.05   7
0.06   8
0.07   9
0.08   10
0.09   11
0.10   12

The left column indicates the label for each line in the data set and the right column indicates the row number in the data set. Execution of collect.shared() will generate 11 files, one for each of the calculators. If you look at patient70.fn.shared.sobs you will see something like:

sampled    uniquetissuestool   0.00tissuestool 0.01tissuestool 0.02tissuestool 0.03tissuestool
1  1           1       1       1       1
100    90          78      56      35      31
200    173         123     75      58      46
300    250         163     93      66      57
400    324         203     106     74      61
500    395         233     124     80      68
600    463         263     131     89      70
...

In this file the first column tells you how many sequences have been sampled; this would typically be plotted on the x-axis of a graph. Each subsequent column has a heading made up of the label for the line being analyzed from your OTU data file concatenated with the names of the two groups being compared. The data in each column are the estimator values indicated by the name of the file. For example, in the example above, after sampling 600 sequences there were 463 OTUs shared between the tissue and stool samples when an OTU was defined as a group of identical sequences. By default, collect.shared() prints output every 100 sequences.
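A file laid out like this can be read into columns with a few lines of Python. The sketch below assumes the file is whitespace-delimited with a single header row. The helper names read_collect_file and has_leveled_off, and the window and tolerance defaults, are hypothetical choices and not part of mothur.

```python
def read_collect_file(text):
    """Parse whitespace-delimited collector's curve text into columns.

    Returns a dict mapping each column heading to a list of floats.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    columns = {name: [] for name in header}
    for row in rows:
        for name, value in zip(header, row):
            columns[name].append(float(value))
    return columns


def has_leveled_off(values, window=3, tolerance=0.01):
    """Rough plateau check: small relative gain over the last `window` points."""
    if len(values) <= window:
        return False
    last, earlier = values[-1], values[-1 - window]
    return abs(last - earlier) <= tolerance * max(abs(last), 1.0)


if __name__ == "__main__":
    # Path taken from the tutorial; adjust to wherever your output was written.
    with open("patient70.fn.shared.sobs") as handle:
        curves = read_collect_file(handle.read())
    for name, values in curves.items():
        if name != "sampled":
            print(name, has_leveled_off(values))
```

The plateau test is only a heuristic stand-in for inspecting the plotted curve; a curve that still climbs between the last few rows, as the unique column does above, would return False.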

Options

calc

If you are not interested in producing collector’s curves for all of the calculators, it is possible to select the calculators you want using the calc option:

mothur > collect.shared(shared=patient70.fn.shared, calc=sharedsobs-sharedchao)

This command would only generate the files patient70.fn.shared.sobs and patient70.fn.shared.chao.

label

There may only be a couple of lines in your OTU data that you are interested in generating collector’s curves for. There are two options. You could (i) manually delete the lines you aren’t interested in from your list file, or (ii) use the label option. To use the label option with the collect.shared() command you need to know the labels you are interested in. If you want the collector’s curve data for the lines labeled unique, 0.03, 0.05 and 0.10 you would enter:

mothur > collect.shared(shared=patient70.fn.shared, label=unique-0.03-0.05-0.10)

In the file patient70.fn.shared.sobs you would see something like:

sampled    uniquetissuestool   0.03tissuestool 0.05tissuestool 0.10tissuestool 
1  1           1       1       1
100    92          26      24      16
200    179         39      34      24
300    258         51      38      27
400    332         57      47      29
500    404         64      49      29
600    473         67      51      31
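If you script your analyses, the hyphen-separated option values used above can be assembled programmatically. The following is a small sketch; the helper build_collect_shared is hypothetical and only formats a command string that you would then paste into or pass to mothur.

```python
def build_collect_shared(shared, **options):
    """Format a collect.shared command string.

    List or tuple values are joined with hyphens, matching the way the
    label and calc options are written in this tutorial.
    """
    parts = [f"shared={shared}"]
    for name, value in options.items():
        if isinstance(value, (list, tuple)):
            value = "-".join(str(v) for v in value)
        parts.append(f"{name}={value}")
    return "collect.shared(" + ", ".join(parts) + ")"


if __name__ == "__main__":
    print(build_collect_shared(
        "patient70.fn.shared",
        label=["unique", "0.03", "0.05", "0.10"],
    ))
```

The helper does not validate option names or values; it simply reproduces the syntax shown in the commands above.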

freq

For larger datasets you might not be interested in obtaining data for every sequence sampled. For instance, if you have 100,000 sequences, you may only want to output the data every 100 sequences. Alternatively, if you only have 100 sequences, you may want to output all of the data. The default setting is to output data every 100 sequences. By altering the freq option you can set how often the data are output:

mothur > collect.shared(shared=patient70.fn.shared, freq=1)

or

mothur > collect.shared(shared=patient70.fn.shared, freq=1000)

Or you can set the frequency as a percentage of the number of sequences. For example, to output after every 25%:

mothur > collect.shared(shared=patient70.fn.shared, freq=0.25)

The second command (freq=1000) would generate data such as this in the patient70.fn.shared.sobs file:

sampled    uniquetissuestool   0.00tissuestool 0.01tissuestool 0.02tissuestool 0.03tissuestool 0.04tissuestool 0.05tissuestool     
1  1           1       1       1       1       1       1
1000   711         361     167     102     80      69      60
2000   1351            506     205     119     94      80      74
3000   1976            619     237     129     99      88      83
4000   2519            688     255     138     107     91      84
4392   2742            713     257     138     111     94      84
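The rows in this table fall at 1, at each multiple of freq, and at the final number of sequences. A sketch of that pattern in Python is below. The function output_points is hypothetical, and treating a value below 1 as a fraction of the total (rounded down) is an assumption about how the percentage form behaves.

```python
def output_points(total, freq=100):
    """Return the sample sizes at which a row would be written.

    freq >= 1 is treated as a sequence count; 0 < freq < 1 is treated as
    a fraction of `total` (assumed to round down, at least 1).
    """
    if freq <= 0:
        raise ValueError("freq must be positive")
    step = int(freq) if freq >= 1 else max(1, int(total * freq))
    points = {1, total}  # first and last samples are always reported
    points.update(range(step, total + 1, step))
    return sorted(points)


if __name__ == "__main__":
    print(output_points(4392, 1000))
    print(output_points(4392, 0.25))
```

With total=4392 and freq=1000 this gives the same first column as the table above.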

groups

If you had started this tutorial with the following commands:

mothur > make.shared(list=patient70.fn.list, group=patient70.sites.groups)
mothur > get.group(shared=patient70.fn.shared)

You would have seen that there were 7 groups here: 70A-70F and 70S. The sequences from 70S were collected from Patient 70’s stool sample; those from samples 70A-70F were from their mucosa. These 7 groups would yield 21 pairwise comparisons if you ran the collect.shared command; however, if you were only interested in the comparisons between each mucosa site and the stool sample you could use the groups option:

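mothur > collect.shared(shared=patient70.fn.shared, calc=sharedsobs, groups=70A-70S)
mothur > collect.shared(shared=patient70.fn.shared, calc=sharedsobs, groups=70B-70S)
mothur > collect.shared(shared=patient70.fn.shared, calc=sharedsobs, groups=70C-70S)
mothur > collect.shared(shared=patient70.fn.shared, calc=sharedsobs, groups=70D-70S)
mothur > collect.shared(shared=patient70.fn.shared, calc=sharedsobs, groups=70E-70S)
mothur > collect.shared(shared=patient70.fn.shared, calc=sharedsobs, groups=70F-70S)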

Alternatively, if you want all of the pairwise comparisons you can either omit the groups option or set it equal to “all”:

mothur > collect.shared(shared=patient70.fn.shared, calc=sharedsobs, groups=all)
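The count of 21 comparisons is simply the number of unordered pairs among 7 groups. The short Python sketch below enumerates them and picks out the pairs that involve the stool sample. The helper names are hypothetical, and the group names are the ones used in this tutorial.

```python
from itertools import combinations

GROUPS = ["70A", "70B", "70C", "70D", "70E", "70F", "70S"]


def pairwise(groups):
    """All unordered pairs of groups."""
    return list(combinations(groups, 2))


def pairs_with(groups, target):
    """Only the pairs that include `target`."""
    return [pair for pair in pairwise(groups) if target in pair]


if __name__ == "__main__":
    print(len(pairwise(GROUPS)))  # number of pairwise comparisons
    for a, b in pairs_with(GROUPS, "70S"):
        # Value to pass to the groups option for this comparison.
        print(f"groups={a}-{b}")
```

Each printed value can be supplied to the groups option to restrict a run to that one comparison.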

all

The sharedsobs and sharedchao calculators not only do the pairwise estimates, but also estimate the shared richness of all the groups in your file. This calculation is RAM intensive. If your RAM is limited and you have a large number of groups this may result in a crash, so by default the all parameter is set to false. To calculate the shared richness of all your groups, set the all parameter to true.

mothur > collect.shared(shared=patient70.fn.shared, calc=sharedsobs-sharedchao, all=true)
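As a rough illustration of what shared richness across all groups means for observed OTUs, the sketch below counts the OTUs that occur in every group. It is an assumption-laden toy: the function shared_sobs_all and the abundance table are invented for the example, and this is not mothur’s code.

```python
def shared_sobs_all(abundances):
    """Count OTUs observed (abundance > 0) in every group.

    `abundances` maps group name -> {otu name -> sequence count}.
    """
    present = [
        {otu for otu, count in counts.items() if count > 0}
        for counts in abundances.values()
    ]
    if not present:
        return 0
    return len(set.intersection(*present))


if __name__ == "__main__":
    toy = {  # made-up counts for illustration only
        "70A": {"otu1": 5, "otu2": 0, "otu3": 2},
        "70B": {"otu1": 1, "otu2": 4, "otu3": 7},
        "70S": {"otu1": 3, "otu2": 2, "otu3": 0},
    }
    print(shared_sobs_all(toy))
```

In this toy table only otu1 has a nonzero count in all three groups.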

Revisions

  • 1.40.0 - Speed and memory improvements for shared files. #357, #347
  • 1.41.0 - Bug Fix “[error]: requesting groups not present in files, aborting.” error #497