R tutorial

The following tutorial makes use of the files available in the Rtutorial files

#===============================================================================
#
#  Introduction to R for use with output from mothur and other tools commonly
#  used in computational microbial ecology
#
#  Pat Schloss
#  pschloss@umich.edu
#
#
#  These notes are inspired and loosely built upon a tutorial developed by
#  Pawel Gajer (pgajer@gmail.com) at the University of Maryland Institute for
#  Genomic Sciences
#
#
#  The purpose of this tutorial is to introduce microbial ecologsist to the R
#  statistical programming enviroment.  You should create a folder that will
#  hold the data files that we will use and creat in the tutorial.  I will be
#  assuming that you either have never used R or used it in a rather basic
#  manner.
#
#  At the end you will find a list of resources that you may find useful for
#  learning more about R
#
#  This tutorial can be run by entering the following command within R...
#
#      source(file="tutorial.R", echo=T)
#
#===============================================================================
#
#  Introduction to the R environment
#
#===============================================================================
#  
#  The R environment is a command line where you enter commands.  When you 
#  fire up R, you will see the following:
#  
#      R version 2.10.0 (2009-10-26)
#      Copyright (C) 2009 The R Foundation for Statistical Computing
#      ISBN 3-900051-07-0
#      
#      R is free software and comes with ABSOLUTELY NO WARRANTY.
#      You are welcome to redistribute it under certain conditions.
#      Type 'license()' or 'licence()' for distribution details.
#      
#        Natural language support but running in an English locale
#      
#      R is a collaborative project with many contributors.
#      Type 'contributors()' for more information and
#      'citation()' on how to cite R or R packages in publications.
#      
#      Type 'demo()' for some demos, 'help()' for on-line help, or
#      'help.start()' for an HTML browser interface to help.
#      Type 'q()' to quit R.
#  
#      >
#
#  The ">" is the prompt where you enter the commands.  I will not be putting
#  the prompt in the tutorial.  Let's get started by entering the following

2+2

#  You will see the following:
#
#      > 2+2
#      [1] 4
#
#  This output tells you the obvious result that 2+2 is 4.  R can be used as an
#  over-grown calculator, but it is so much more than that.  It's really a high
#  level programming language to give you publication worthy graphics and 
#  perform complex statistical analyses.
#
#  If you want to quit R, run the following:
   
#q()

#  You will encounter the follwing prompt:
#
#      Save workspace image? [y/n/c]: 
#
#  If you select "y" then the next time you run R, it will recall your previous
#  commands.

#y

#  Fire up R again and hit the up arrow twice.  You will see the following:
#
#      > 2+2;
#
#  As a short cut you can also type the following to get the same effect when
#  quitting R:

#q("yes")

#  Something to note as you go through the tutorial is the use of the pound (#)
#  symbol.  This indicates that what follows is a comment.  For example:

2+2            #this is a comment

#  Another useful feature is the inline help feature:

?mean

#  This will output a help file for the 'mean' function.  You can scroll
#  through it with the arrow keys as well as the space bar.  At the end of the 
#  help text are one or more examples of how to use the function.  You can
#  execute this example by copying and pasting the text.  Alternatively, you 
#  can use the 'example' command:

example(plot)

#  Not sure what the function is that you want?  Use the '??' function...

??pca

#  This will tell you the list of functions and packages that have some
#  reference to "pca".
#
#  Running "2+2" gets you the answer of "4".  But what if you want to store
#  that value?  Save it as a variable...

x <- 2+2

#  or

x = 2+2
x

#  The left arrow is the preferred method of assigning values to variables.
#  Now you can manipulate that variable:

x + 2

#  You can run two commands on the same line by separating the commands with a
#  semicolon:

x<-2+2;x+2

#  To see what variables have been declared:

ls()

#  To remove variables you no longer need :

rm(x)
#x

#  Combining the last two commands will remove all of the variables stored in
#  memory:

rm(list=ls())
ls()

#  To keep our examples relevant, let's quickly discuss how to read in a table.
#  The frist two commands below are not necessarily needed if you know you are
#  in the same folder as your data

getwd()                            # is this the directory we want to be in?
setwd("~/Desktop/IntroToR")         # if not let's change directories
seq.sum <- read.table(file="stool.trim.fasta.summary", header=T)

#  The read.table command will read in your data table and keep the column
#  headings.  These data represent the length of sequece in the Costello stool
#  dataset after removing primers, barcodes, and low quality bases.  To make
#  the data easier to manipulate you can do the following:

colnames(seq.sum)        # see the column heading names
seq.sum$nbases[1:10]    # get the column "nbases" out of the table
attach(seq.sum)          # "attach" the table so that you don't need the "$"
nbases[1:10]

#  Don't worry if you don't understand everything in the last several lines.
#  The important thing is to see that there are different ways to get the same
#  output
#
#  Ok, that's enough preliminaries for now.  Let's get going...


#===============================================================================
#
#  Variables
#
#===============================================================================
#
#  Technically speaking everything in R is an object.  That may be a bit into
#  the weeds for most people.  But try to keep up.  There are many types of
#  objects in R.  We'll start with these:
#      * Numeric values
#      * Character values
#      * Logical values
#      * Vectors
#      * Matrices
#  
#
#  --Numeric values--
#  The variable we used above, x, is a numeric value.  Here are some other
#  examples:

x <- 2/3; x
y <- 31.7
z <- 2.1e7                         # same as 2.1*10^7
longAndSilly.variable.name <- pi/3 # an example of a long variable name

#  Although it is a matter of personal style, ideally, your variable names
#  will have some meaning to you.  For example:

genomeSize <- 4.5e6
#16SCopyNumber <- 6           # illegal variable name
rrnCopyNumber <- 6

#  A wide variety of arithmetic functions can be applied to these types of
#  variables:

log(x)           # natural logarithm
log2(x)          # base 2 logarithm
log10(x)         # base 10 logarithm
exp(x)           # exponent
sqrt(x)          # square root
abs(x)           # absolute value
floor(x)         # the largest integers not greater than x
ceiling(x)       # the smallest integers not less than x
x %% 2           # remainder or modulo operator


#  --Character values--
#  Strings of character values are useful for storing text information

A <- "Alanine"; A
R <- "Argenine"; R
N <- "Asparagine"; N

#  The paste function is useful for converting numerical information into a
#  string

x<-3.14159
text<-paste("x=", x, " y=", y, sep="")
text

#  or

text<-paste("x=", format(x, digits=3), " y=", y, sep="")
text



#  --Logical values--
#  Logical variables can have one of two values - TRUE or FALSE

x <- TRUE
y <- FALSE

!x # NOT operator x && y # AND operator x & y # bitwise AND operator (mainly for vectors) x || y # OR operator x | y # bitwise OR operator (mainly for vectors) x == y # is equal operator x != y # is not equal operator

#  We can also perform logical operators on numerical variables:

x <- 5
y <- 3

x > y          # greater than operator
x >= y         # greater than or equal to operator
x < y          # less than operator
x <= y         # less than  or equal tooperator
x == y         # is equal to operator
x != y         # is not equal to operator

#  We can also convert numerical values to logical valeus

x <- 0
as.logical(x)
as.logical(y)

#  We can also use logical operators on strings

x <- "ATG"
y <- "CCC"

x > y          # greater than operator
x >= y         # greater than or equal to operator
x < y          # less than operator
x <= y         # less than  or equal tooperator
x == y         # is equal to operator
x != y         # is not equal to operator


#  --Vectors--
#  Vectors are a one-dimensional set of values of the same type.  You can read
#  in vectors from a file or create them on the fly.  Here are several 
#  examples:

19:55                   # list the values from 19 to 55 by ones
c(1,2,3,4)              # concatenate 1, 2, 3, 4, 5 into a vector
rep("red", 5)           # repeat "red" five times
seq(1,10,by=3)          # list the values from 1 to 10 by 3's
seq(1,10,length.out=20) # list 20 evenly spaced elements from 1 to 10
seq(1,10,len=20)        # same thing; arguments of any function can be 
                        # abbreviated as long as the abbreviation is unique

#  These can be combined as well

c(rep("red", 5), rep("white", 5), rep("blue", 5))
rep(c(0,1), 10)

#  And they can be assigned to variables

x <- seq(1,100,by=0.5)
x

code <- c("A", "T", "G", "C")

#  Note that in contrast to many programming languages, vectors in R are 
#  indexed such that the first value is 1 NOT 0.

code[2]
code[0]
code[-1]
code[c(1,2)]

#  Recall that we've already seen vectors in the preamble to this tutorial...

nbases

#  You can easily determine the length of a vector

length(code)
code[length(code)]
length(nbases)

#  Logical operators can also be used on vectors...

tf <- x > 50
isLong <- nbases > 200    # do you know what this is doing?

#  And used to select portions of vectors

x[tf]

#  One of the awesome things about vectors is that you can perform algebraic
#  manipulations on them.  These types of operations are "elementwise" meaning
#  that the operation is applied to each operation

2 * x
x + x
log(x)

#  To define a vector without giving it any values

z <- numeric(5)             #  This creates a numerical vector with 5 zeros
z[3] <- 10
z
z[1:3] <- 5
z
z[10] <- pi

t <- character(5)
t[4] <- "DNA rocks!"


#  You can also create vectors that are indexed by character strings.  In some
#  programming languages these are called hash-maps or look-up tables.

v <- numeric(0)
v["a"] <- 1.23498
v["t"] <- 2.2342
v["c"] <- 3
v["g"] <- 4
v

#  You can get the name of each cell in the vector:

names(v)
names(v) <- NULL  # this removes names attribute
v

names(nbases) <- seqname
nbases[1:10]

#  Alternatively, we could define v2 as

v2 <- c(A=1.23498,T=2.2342,C=3,G=4)
v2
v2 * 2
as.vector(v2)  # strips out names
is.vector(v2)  # checks if v is a vector


#  We can access elements of 'v' by their labels

v2["a"]
v2[["a"]]      # strips the name associated with value 1


#  There are many ways to get a value out of a vector.  We've already seen
#  some of these

v <- floor(runif(10, 1,10))# create a vector with 10 values randomly drawn from 
v                   # the range of 1 to 10
n <- 3
v[n]                # n-th element
v[-n]               # all but the n-th element
v[1:n]              # first n elements
v[-(1:n)]           # elements from n+1 to the end
v[c(2,4)]           # 2-nd and 4-th elements
v["name"]           # element named "name"
v[v > 6]            # elements greater than 6
v[v > 4 & v < 6]    # elements between 4 and 6
v[v %in% c(1,3,7)]  # elements in a given set
v[!is.na(v)]        # subsequence of v consisting of non-missing values of v
v[is.na(v)] <- 0    # sets all missing values to 0


#  There are a variety of operators that take vectors as input

length(v)          # length of vector v
mean(v)            # mean of v
median(v)          # median of v
sd(v)              # standard deviation of v
var(v)             # variance of v
mad(v)             # median absolute deviation of v
min(v)             # min of v
max(v)             # maximal element of v
which.min(v)       # returns the index of the smallest element of v
which.max(v)       # returns the index of the greatest element of v
summary(v)         # return descriptive statistics of v
sum(v)             # sum of elements of v
prod(v)            # product of all elements of v
sort(v)            # ascending sort the values of v
sort(v, decreasing=T)# descending sort the values of v
order(v)           # the order of the sorted values of v
v[order(v)]        # gives the same order as sort(v)
rev(v)             # reverse the order of v
unique(v)          # give the unique values of v
all(v)             # returns TRUE if all values of a logical vector v are TRUE,
                   # otherwise returns FALSE
any(v)             # returns TRUE if at least one value of a logical vector v is
                   # TRUE, otherwise returns FALSE

#  Exercises:  Calculate the following on the data we read in from the seq.sum
#  file and stored as the "nbases" vector...
#  
#      number of sequences


#      number of sequences longer than 200 bp


#      mean seqeunce length


#      median sequence length


#      mean sequence length for sequences longer than 200 bp


#      stanard deviation of sequence length for sequences longer than 200 bp


#      get seq.sum statisitics for the length of sequences



#  Note that if a vector contains missing values then all above numerical 
#  routines are going to return NA value. In order to skip 
#  missing values of 'v' while computing some seq.sum statistics of 'v', one 
#  can set na.rm to TRUE as in the following example


v[11] = NA
mean(v)
mean(v, na.rm=T)


#  Say you want to search a character vector for a feature of the strings.  To
#  do this you can use the grep function and its relatives to get the indices
#  of values in the vector that match the search

shades.of.red <- grep("red", colors())
colors()[shades.of.red]


#  --Tables & Matrices--
#  Tables and matrices are multi-dimensional sets of values of multiple or the
#  same type.  We already read in a table in the preamble of this tutorial...

getwd()                            # is this the directory we want to be in?
setwd("~/Desktop/IntroToR")         # if not let's change directories
seq.sum <- read.table(file="stool.trim.fasta.summary", header=T)

#  Here "seq.sum" is a table of values describing some characteristics of the
#  sequences after removing the barcodes, primers, and low quality base calls.
#  Tables and matrices are indexed much like vectors...

seq.sum[1,1]             # get the name of the first sequence in the table
seq.sum[1,c(1,4)]        # get the name and length of the first sequence 
seq.sum[1,]              # get all of the data in the first row
seq.sum[,4]              # get all of the data in the column labelled "nbases"
seq.sum[,"nbases"]       # get all of the data in the column labelled "nbases"
seq.sum$nbases           # get all of the data in the column labelled "nbases"
attach(seq.sum);nbases   # get all of the data in the column labelled "nbases"
detach(seq.sum)

#  It should be clear taht the last four commands above gave the same data.
#  With the exception of the first command, each of these commands returns a
#  vector.
#
#  When reading in a table it is always nice to get a handle of what you're
#  working with...

dim(seq.sum)             # number of rows and columns in "seq.sum"
nrow(seq.sum)            # number of rows "seq.sum"
ncol(seq.sum)            # number of columns in "seq.sum"
colnames(seq.sum)        # names of the columns in "seq.sum"
rownames(seq.sum)        # names of the rows in "seq.sum"
seq.sum[1:10,]           # get the first 10 rows of data from the "seq.sum" file

#  Looking at the output from the 'rownames' command we know we can do better
#  than those row names.  Let's use the sequence names as the row names...

rownames(seq.sum)<-seq.sum$seqname
seq.sum<-seq.sum[,-1]
seq.sum[1:10,]


#  Let's look at how we could get the row indices corresponding to sequences
#  longer than 200 bp

(1:length(nbases))[nbases > 200]

#  To get the sequence names we could then do...

rownames(seq.sum)[(1:length(nbases))[nbases > 200]]  #wonky
rownames(seq.sum)[seq.sum$nbases > 200]              #better

#  Earlier we showed how to sort vectors.  Let's sort seqeunce legnth by the
#  length of the longest homopolymer in the sequence...

polymerOrder <- order(seq.sum$polymer)
seq.sum[polymerorder,c("polymer", "nbases")]

#  You can see that's not exactly what we were hoping for.  We can actually 
#  use the order command to sort on multiple columns...

polymerLengthOrder <- order(seq.sum$polymer, seq.sum$nbases)
polymerLength <- seq.sum[polymerlengthorder,c("polymer", "nbases")]
polymerLength

#  Let's write the sorted lengths to a new file
write(polymerLength$nbases,file="lengths.txt")

#  To read a vector in from a file...

lengths <- scan("lengths.txt")

#  Those two commands are for writing and reading vectors.  If we wanted to 
#  write and read a table we'd do the following...

write.table(polymerLength,file="polymerLengths.txt", quote=F)

#  If you open polymerLengths.txt in a text editor you will see that there are
#  three columns of data, but only two headings.  If you have one fewer 
#  headings than columns, R will use the first column of data as the rownames 
#  when you try to read it back in...

table<-read.table(file="polymerLengths.txt", header=T)


#  There are some functions that take matrices/tables as input

m <- matrix(floor(runif(100, 1, 10)), nrow=10, ncol=10)    #create a 10 x 10 matrix
t(m)             # transpose the matrix
1/m              # take each value of m and find it's reciprocal
m * m            # calculate the square of each value in m
m %*% m          # performs matrix multiplication
crossprod(m,m)   # performs the cross product
rowSums(m)       # calculate the sum for each row
colSums(m)       # calculate the sum for each column
lower.tri(m)     # find the indices that are below the diagonal
m[lower.tri(m)]  # give the lower triangle of m
diag(m)          # the values on the diagonal of m
det(m)           # the determinent of m

#  If you try to get the mean or standard deviation of a row or column you'll
#  struggle mightilly with these commands.  Instead you need to use the "apply"
#  command

mean(m)
apply(m, 1, mean)  # get the mean for each row (that's the 1)
apply(m, 2, mean)  # get the mean for each column (that's the 2)
apply(m, 1, sum)   # get the sum for each row - same as rowSums(m)
apply(m, 2, sum)   # get the sum for each column - same as colSums(m)


#  --Factors--
#  Factors are categorical variables.  The "seq.sum" table doesn't exactly
#  contain any categorical data.  For the purposes of discussion, let's use
#  the polymer column as a categorical data type.  The important thing is that
#  factors have discrete levels.  For example, in microbial ecology, we might
#  think of soil types, body sites, a person's sex, or whether a site is
#  polluted as categorical variables.  
#  Factors can be created using factor() routine.

seq.sum$polymer <-factor(seq.sum$polymer)
levels(seq.sum$polymer)

#  If we wanted to convert our factors from factors to strings or to numbers
#  we could do the following...
#
#      seq.sum$polymer <- as.character(seq.sum$polymer)
#      seq.sum$polymer <- as.numeric(seq.sum$polymer)
#
#
#  We might be interested to see if sequence length varies with the length
#  of the homopolymer in the sequence.  We can do this with the aggregate
#  command and treating polymer as a factor...

aggregate(seq.sum$nbases, list(polymer), mean)
aggregate(seq.sum$nbases, list(polymer), median)
aggregate(seq.sum$nbases, list(polymer), sd)

#  Similar to the aggregate command, the "by" command will allow you to take
#  all of the columns and perform an operation on the...

by(seq.sum, seq.sum["polymer"], summary)


#===============================================================================
#
#  Programming basics
#
#===============================================================================
#
#  R is a pretty high-level programming language with extensive functionality
#  built in.  The strength of R is that as a user you can add to this to suit
#  your needs
#
#  --Customized functions--
#  Want to make your own function or package?  It's relatively simple.  The
#  general syntax is as follows
#
#      functionName <- function(x){
#          #the stuff goes here
#      }
#
#  So we might write a trivial script to calculate the square root of a value.

my.sqrt<-function(x){
   sqrt(x)
}

my.values <- 1:10
my.sqrt(my.values)

#  The output of "my.sqrt(my.values)" should be the same as the following:

sqrt(1:10)


#  --for loops--
#  If you want to do something to each value in a vector or to carry out some
#  procedure a specific number of times, you can do that with a for loop.
#  Let's sum the square of all the numbers between 1 and 10

for.sum <- 0;
for(i in 1:10){
   for.sum = for.sum + i^2
}
for.sum


#  --If-then-else statements--
#  When you meet a fork in the road, take it.  Well, ok, maybe not, but in R
#  you can use logic statements to make decisions.  For example, let's create
#  a vector of factors to indicate if a sequence is short or long:

lengthCat <- numeric(nrow(seq.sum))

for ( i in 1:length(seq.sum$nbases) ){

  if ( seq.sum$nbases[i] < 150 ){
    lengthCat[i] <- "short"
  }else if ( seq.sum$nbases[i] < 200 ){
    lengthCat[i] <- "medium"
  }else {
    lengthCat[i] <- "long"
  }
}

lengthCat <- factor(lengthCat)


#===============================================================================
#
#  Graphics
#
#===============================================================================
#
#  R has an amazing array of options for graphics.  When someone gives a talk
#  you can tell that they used Excel because the plots look uniformly awful.  
#  In contrast you can tell that someone used R because they look awesome.  We 
#  will just scratch the surface of its graphics capabilities here.  For more 
#  details about these functions remember to use ?`<function name>`.  For more
#  information about the more complicated and sophisticated graphics libraries
#  consult:
#
#      Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.
#
#
#  --Histogram--
#  Let's look at the distribution of sequence lengths in our seq.sum table

hist(seq.sum$nbases)

#  Let's gussy it up a bit...

hist(seq.sum$nbases, col="skyblue", freq=T, xlab="Sequence Length",
   main="Distribution of Sequence Lengths")
box()

#  The hist command actually returns a value other than the graph...

histData <- hist(seq.sum$nbases, col="skyblue", freq=T, xlab="Sequence Length",
   main="Distribution of Sequence Lengths")
histData


#  --Stripcharts--
#  We can also create a strip chart of the sequence lengths for each 
#  homopolymer class...

stripchart(seq.sum$nbases~seq.sum$polymer, ylab="Sequence Length", 
   xlab="Homopolymer Length")

stripchart(seq.sum$nbases~seq.sum$polymer, method="jitter", 
   ylab="Sequence Length", xlab="Homopolymer Length")

stripchart(seq.sum$nbases~seq.sum$polymer, method="jitter", vertical=T, 
   ylab="Sequence Length", xlab="Homopolymer Length")

stripchart(seq.sum$nbases~seq.sum$polymer, method="jitter", vertical=T,
   jitter=0.4, pch=1, cex=.2, col=rainbow(5), ylab="Sequence Length", 
   xlab="Homopolymer Length")


#  --Boxplot--
#  Those strip charts had a lot of data and may be too difficult to interpret.
#  Instead, let's try to represent the data as box plots

boxplot(seq.sum$nbases~seq.sum$polymer, ylab="Sequence Length", 
   xlab="Homopolymer Length")



#  --Pie charts--
#  Pie charts are a generally bad way to represent data.  For a good chuckle
#  check out the "Note" section of...

?pie

polymerCount <- aggregate(seq.sum$polymer, list(polymer), length)
polymerFreq <- polymerCount$x / length(seq.sum$polymer)
pie(polymerFreq)


#  --Bar plots--

barplot(polymerCount$x, names.arg=polymerCount$Group, xlab="Homopolymer length",
   ylab="Number of Sequences")
barplot(polymerCount$x, names.arg=polymerCount$Group, ylab="Homopolymer length",
   xlab="Number of Sequences", horiz=T)



shared<-read.table(file="stool.final.tx.shared", header=T)
rownames(shared)<-shared$V2
shared<-as.matrix(shared[,-c(1,2,3)])
colnames(shared)<-paste("Phylotype", 1:ncol(shared), sep="")
shared.ra<-shared/apply(shared, 1, sum) # calculate the relative abundance of 
                                       # each phylotype

bar.col<-c(rep("pink", 12), rep("blue", 12))

barplot(shared.ra[,1:5], col=bar.col)
barplot(shared.ra[,1:5], beside=T, col=bar.col)
barplot(t(shared.ra), col=rainbow(200))



#  --Dot/Line graphs--
#  We might be interested in plotting the first two dimensions of a PCoA or 
#  NMDS plot.  Let's do this with data generated in the Costello stool analysis
#  tutorial.  The necessary file is in your folder.

nmds<-read.table(file="stool.final.an.thetayc.0.03.lt.nmds.axes", header=T)
plot(nmds$axis1, nmds$axis2)

#  or

plot(nmds$axis2~nmds$axis1)

#  Looking at the group names in the nmds table we see that the first 12 sample
#  names are from women ("F") and the last 12 are from men ("M").  There are
#  more elegant ways to do this, but for beginners, this will work...

nmds.col<-c(rep("pink", 12), rep("blue", 12))

plot(nmds$axis2~nmds$axis1, col=nmds.col, xlab="Axis 1", ylab="Axis 2")
legend(x=0.3, y=0.6, legend=c("Female", "Male"), pch=1, col=c("pink", "blue"))

plot(nmds$axis2~nmds$axis1, col=nmds.col, xlab="Axis 1", ylab="Axis 2", pch=18,
   cex=2)
legend(x=0.3, y=0.6, legend=c("Female", "Male"), pch=18, cex=1, col=c("pink", 
   "blue"))

#  Although these points aren't linked you could connect them...

plot(nmds$axis2~nmds$axis1, col=nmds.col, xlab="Axis 1", ylab="Axis 2", pch=18, 
   cex=2, type="b")
legend(x=0.3, y=0.6, legend=c("Female", "Male"), pch=18, cex=1, lty=1, 
   col=c("pink", "blue"))

#  You can also overlay two graphs on top of each other using the points
#  command.  Here we'll put the cumulative number of sequences that have that 
#  sequence length or higher.

hist(seq.sum$nbases, col="skyblue", freq=T, xlab="Sequence Length",
   main="Distribution of Sequence Lengths", ylim=c(0,length(seq.sum$nbases)))
points(sort(seq.sum$nbases), length(seq.sum$nbases):1, type="l")
box()


#  --Composite graphics--
#  You can make graphics in R using multiple panes.  Let's say we wanted to 
#  create a figure that had two panes representing histograpms for sequences
#  a maximum homopolymer length of 5 and one for the 6's

par(mfrow=c(2,1))
hist(seq.sum[seq.sum$polymer==5,]$nbases, col="skyblue", freq=T, 
   xlab="Length of sequences with a maximum polymer length of 5", main="", 
   xlim=c(0,300), breaks=seq(0,260, 20))
hist(seq.sum[seq.sum$polymer==6,]$nbases, col="red", freq=T,
   xlab="Length of sequences with a maximum polymer length of 6", main="", 
   xlim=c(0,300), breaks=seq(0,260, 20))
par(mfrow=c(1,1))

#===============================================================================
#
#  Libraries
#
#===============================================================================
#
#  One of the amazing aspects of R is that there is a broad and diverse array
#  of scientists engaged in adding features to the program by contributing
#  packages.  You can often find packages you might be interested in by 
#  googling "r `<insert type of function>`" that you want.  For example, if you'd
#  like to visualize PCoA or NMDS data in 3-D, you will likely stumble up on 
#  the rgl package.  You will need to install and load the package to use the 
#  plot3d command with your data

install.packages("rgl")
library(rgl)

pcoa<-read.table(file="stool.final.an.thetayc.0.03.lt.pcoa.axes", header=T)
rownames(pcoa)<-pcoa$V1
pcoa<-pcoa[,-1]
plot3d(pcoa, col=nmds.col, type="s")

#  Although the functionality is already in mothur, other packages that may
#  interest you include vegan and ecodist

#===============================================================================
#
#  Statistics
#
#===============================================================================
#
#  The R package has a wide array of statistical tools built in and created by
#  other developers.  If you can imagine the test, it is probably already
#  implemented in R

mf <- c(rep("F", 12), rep("M", 12))
t.test(shared.ra[,"phylotype1"]~mf)
t.test(shared.ra[,"phylotype5"]~mf)

wilcox.test(shared.ra[,"phylotype1"]~mf)
wilcox.test(shared.ra[,"phylotype5"]~mf)

#  Check back for more later.  This section is under development

#===============================================================================
#
#  Lists and data frames
#
#===============================================================================

#  Check back for more later.  This section is under development

#===============================================================================

#===============================================================================
#
#  More resources
#
#===============================================================================
#
#  Websites:
#      RTFM `#      `[`https://www.r-project.org/`](https://www.r-project.org/)
#
#      R tips `#      `[`https://pj.freefaculty.org/r/rtips.html`](https://pj.freefaculty.org/R/Rtips.html)
#  
#      R Graph Gallery `#      `[`https://addictedtor.free.fr/graphiques/rgraphgallery.php`](https://addictedtor.free.fr/graphiques/RGraphGallery.php)
#
#      Quick-R
#      A website for experienced users of popular statistical packages such as
#      SAS, SPSS, Stata, and Systat (although current R users should also find
#      it useful). `#      `[`https://www.statmethods.net/index.html`](https://www.statmethods.net/index.html)
#
#  Books:
#      Dalgard, P (2008) _Introductory Statistics with R_.  Second edition.  
#      Springer.
#
#      Adler, J (2009) _R in a Nutshell_.  O'Reilly.
#
#      Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S 
#      Language_.  Wadsworth & Brooks/Cole.
#
#      Venables, W. N. and Ripley, B. D. (2002) _Modern Applied Statistics with
#       S._  Fourth edition.  Springer.
#
#      Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.
#
#===============================================================================