
Column-formatted distance matrix


Most currently used software packages store all of the pairwise distances in RAM before writing them to disk. This can consume so much RAM that it becomes prohibitive for data sets of the size generated by pyrosequencing. An alternative format, used by mothur, represents the matrix as a three-column table. The first and second columns hold the sequence names and the third column holds the distance between those two sequences. An example is provided in 96_sq_column_amazon.dist, which can be obtained from the AmazonData.zip file:

U68589	U68589	0.0000
U68589	U68590	0.3371
U68589	U68591	0.3609
U68589	U68592	0.4155
U68589	U68593	0.2872
U68589	U68594	0.2970
U68589	U68595	0.3922
U68589	U68596	0.3093

This file contains all 9,216 (96 × 96) pairwise distances, including each sequence's distance to itself and both orderings of every pair. Note that this file is actually bigger than the 96_sq_phylip_amazon.dist file. mothur can also take in a lower or upper triangle matrix representation in column format, as shown in 96_lt_column_amazon.dist, which has 4,560 (96 × 95 / 2) rows:

U68590	U68589	0.337144
U68591	U68589	0.360977
U68591	U68590	0.378254
U68592	U68589	0.415506
U68592	U68590	0.319757
U68592	U68591	0.414843
U68593	U68589	0.287299
U68593	U68590	0.169021

This file is similar in size to 96_lt_phylip_amazon.dist. Note that no row in the file has U68589 in the first column. Because of this, a name file must be supplied with the read.column() command.
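To make the format concrete, here is a minimal Python sketch of how a column-formatted file could be parsed and how the row counts quoted above arise. It is not mothur's own code: the function names `read_column_distances` and `expected_rows` are hypothetical, and the sketch assumes whitespace-separated fields with exactly three columns per row.

```python
import io


def read_column_distances(handle):
    """Parse a three-column distance table into {(name_a, name_b): distance}.

    Hypothetical helper for illustration; assumes whitespace-separated fields.
    """
    distances = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        name_a, name_b, value = fields
        distances[(name_a, name_b)] = float(value)
    return distances


def expected_rows(n_sequences, square=True):
    """Row count for a square table (n * n) or a strict triangle (n * (n - 1) / 2)."""
    if square:
        return n_sequences * n_sequences
    return n_sequences * (n_sequences - 1) // 2


# First three rows of the lower-triangle example shown above.
sample = io.StringIO(
    "U68590\tU68589\t0.337144\n"
    "U68591\tU68589\t0.360977\n"
    "U68591\tU68590\t0.378254\n"
)
distances = read_column_distances(sample)

# Names seen in the first column only, versus names seen in either column.
first_column = {name_a for name_a, _ in distances}
all_names = {name for pair in distances for name in pair}
```

With these sample rows, `first_column` lacks U68589 while `all_names` includes it, which mirrors the observation about the lower-triangle file. For 96 sequences, `expected_rows(96)` gives the square count and `expected_rows(96, square=False)` gives the triangle count.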