column-formatted distance matrix

Most currently used software packages hold all of the pairwise distances in RAM before writing them to disk. This can consume so much RAM that it becomes prohibitive for data sets of the size generated by pyrosequencing. An alternative format, used by mothur, represents the matrix as a three-column table. The first and second columns contain the sequence names and the third column contains the distance between those two sequences. An example is provided in 96_sq_column_amazon.dist, which can be obtained from the amazondata.zip file:

U68589 U68589  0.0000
U68589 U68590  0.3371
U68589 U68591  0.3609
U68589 U68592  0.4155
U68589 U68593  0.2872
U68589 U68594  0.2970
U68589 U68595  0.3922
U68589 U68596  0.3093
...

This file contains all 9,216 (96 x 96) pairwise distances, including each sequence's distance to itself and both orderings of every pair. Note that this file is actually larger than the 98_sq_phylip_amazon.dist file. mothur can also read a lower- or upper-triangle matrix in column format, as shown in 96_lt_column_amazon.dist, which has 4,560 rows (96 x 95 / 2):

U68590 U68589  0.337144
U68591 U68589  0.360977
U68591 U68590  0.378254
U68592 U68589  0.415506
U68592 U68590  0.319757
U68592 U68591  0.414843
U68593 U68589  0.287299
U68593 U68590  0.169021
...

This file is similar in size to 96_lt_phylip_amazon.dist. Note that the file has no rows with U68589 in the first column. Because of this, a name file must be used with the read.column() command.
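
To make the role of the list of names concrete, here is a minimal Python sketch of how a lower-triangle column file could be expanded into a full symmetric matrix. It is an illustration only, not mothur's implementation: the function names read_column_distances and lower_triangle_rows are hypothetical, and a plain Python list of sequence names stands in for the name file (an assumption made for simplicity, not mothur's name-file format). Pairs that do not appear in the input are left as None rather than guessed.

```python
from typing import Dict, Iterable, List, Optional


def lower_triangle_rows(n: int) -> int:
    """Number of rows in a lower-triangle column file for n sequences."""
    return n * (n - 1) // 2


def read_column_distances(
    lines: Iterable[str], names: List[str]
) -> List[List[Optional[float]]]:
    """Expand three-column rows (name, name, distance) into a square matrix.

    `names` fixes the row/column order. It is needed because the first
    column of a lower-triangle file does not list every sequence.
    """
    index: Dict[str, int] = {name: i for i, name in enumerate(names)}
    n = len(names)
    # Self-distances are 0.0; pairs never seen in the input stay None.
    matrix: List[List[Optional[float]]] = [
        [0.0 if i == j else None for j in range(n)] for i in range(n)
    ]
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and the "..." elision marker
        i, j = index[fields[0]], index[fields[1]]
        dist = float(fields[2])
        # A distance is symmetric, so fill both halves of the matrix.
        matrix[i][j] = dist
        matrix[j][i] = dist
    return matrix


# The rows shown above from 96_lt_column_amazon.dist (the file continues).
rows = [
    "U68590 U68589  0.337144",
    "U68591 U68589  0.360977",
    "U68591 U68590  0.378254",
    "U68592 U68589  0.415506",
    "U68592 U68590  0.319757",
    "U68592 U68591  0.414843",
    "U68593 U68589  0.287299",
    "U68593 U68590  0.169021",
    "...",
]
names = ["U68589", "U68590", "U68591", "U68592", "U68593"]

matrix = read_column_distances(rows, names)
print(matrix[0][1])             # U68589 vs U68590: 0.337144
print(matrix[4][2])             # U68593 vs U68591 is not in the excerpt: None
print(lower_triangle_rows(96))  # 4560
```

Without the names list, a reader that collected sequence names from the first column alone would never see U68589, which is why the sketch takes the full list as a separate argument.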