Column-formatted distance matrix
Most of the currently used software packages store all of the pairwise distances in RAM before writing them to disk. This can consume enough RAM to be prohibitive for data sets of the size generated by pyrosequencing. An alternative format, used by mothur, represents the matrix as a three-column table. The first and second columns hold the sequence names and the third column holds the distance between those two sequences. An example is provided in 96_sq_column_amazon.dist, which can be obtained from the AmazonData.zip file:
U68589 U68589 0.0000
U68589 U68590 0.3371
U68589 U68591 0.3609
U68589 U68592 0.4155
U68589 U68593 0.2872
U68589 U68594 0.2970
U68589 U68595 0.3922
U68589 U68596 0.3093
...
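The three-column layout can be read one row at a time, which is what keeps memory use low. The following is a minimal Python sketch of such a reader, assuming whitespace-separated columns as in the listing above. The function name `read_column_distances` is hypothetical and is not part of mothur.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_column_distances(handle: TextIO) -> Iterator[Tuple[str, str, float]]:
    """Yield (name_a, name_b, distance) one row at a time.

    Hypothetical helper for illustration; assumes whitespace-separated columns.
    """
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        name_a, name_b, distance = fields
        yield name_a, name_b, float(distance)


# The first three rows of the example listing, used as in-memory sample input.
sample = io.StringIO(
    "U68589 U68589 0.0000\n"
    "U68589 U68590 0.3371\n"
    "U68589 U68591 0.3609\n"
)
rows = list(read_column_distances(sample))
```

Because the reader is a generator, a caller that processes each row as it arrives never needs to hold the whole matrix in memory. Collecting the rows into a list, as done here for the small sample, gives up that advantage.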
96_sq_column_amazon.dist contains all 9,216 (96x96) pairwise distances, including the distance of each sequence to itself. Note that this file is actually bigger than the 98_sq_phylip_amazon.dist file. mothur can also take in a lower or upper triangle matrix representation in column format, as shown in 96_lt_column_amazon.dist, which has 4,560 (96x95/2) rows:
U68590 U68589 0.337144
U68591 U68589 0.360977
U68591 U68590 0.378254
U68592 U68589 0.415506
U68592 U68590 0.319757
U68592 U68591 0.414843
U68593 U68589 0.287299
U68593 U68590 0.169021
...
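The row counts quoted above follow from the number of sequences: a square table has n x n rows and a lower triangle has n x (n - 1) / 2. The Python sketch below checks that arithmetic and shows one way a square column table could be reduced to a lower triangle. The helper names are hypothetical, the three-sequence input is a toy example built from the first rows of the listing, and the row ordering is only an assumption that happens to match the listing.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, str, float]


def square_row_count(n: int) -> int:
    """Rows in a square column table: every ordered pair, self-pairs included."""
    return n * n


def lower_triangle_row_count(n: int) -> int:
    """Rows in a lower-triangle column table: one row per unordered pair."""
    return n * (n - 1) // 2


def to_lower_triangle(rows: Iterable[Row], order: List[str]) -> Iterator[Row]:
    """Keep only rows whose first name comes later in `order` than the second."""
    rank = {name: i for i, name in enumerate(order)}
    for name_a, name_b, distance in rows:
        if rank[name_a] > rank[name_b]:
            yield name_a, name_b, distance


# Toy input: three sequences and their distances from the listing above.
order = ["U68589", "U68590", "U68591"]
pair_distance = {
    ("U68590", "U68589"): 0.337144,
    ("U68591", "U68589"): 0.360977,
    ("U68591", "U68590"): 0.378254,
}


def lookup(a: str, b: str) -> float:
    """Symmetric lookup; a sequence is at distance 0 from itself."""
    if a == b:
        return 0.0
    return pair_distance.get((a, b), pair_distance.get((b, a)))


square = [(a, b, lookup(a, b)) for a in order for b in order]
lower = list(to_lower_triangle(square, order))
```

For 96 sequences, `square_row_count(96)` and `lower_triangle_row_count(96)` give the 9,216 and 4,560 rows mentioned in the text.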
96_lt_column_amazon.dist is similar in size to 96_lt_phylip_amazon.dist. You'll note that the file does not have any rows where U68589 is in the first column. Because of this, a name file must be used with the read.column() command.
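The observation about U68589 can be checked with a short Python sketch. It collects the names seen in the first column of a lower-triangle table and compares them with the names seen in either column. The rows are the first three from the listing above. This only illustrates why the first column alone is an incomplete list of sequences; it does not describe how mothur itself uses the name file.

```python
# First three rows of the lower-triangle listing, as (name_a, name_b, distance).
lt_rows = [
    ("U68590", "U68589", 0.337144),
    ("U68591", "U68589", 0.360977),
    ("U68591", "U68590", 0.378254),
]

# Names that appear in the first column only.
first_column_names = {name_a for name_a, _, _ in lt_rows}

# Names that appear in either column.
all_names = first_column_names | {name_b for _, name_b, _ in lt_rows}

# Sequences that never appear in the first column.
missing_from_first_column = all_names - first_column_names
```

Here `missing_from_first_column` contains only U68589, the one sequence that never leads a row.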