Given a standard FASTA-formatted file, read_fasta() will read in the contents of the file and create a three-column data frame with columns for the sequence id, the sequence itself, and any comments found in the header line for each sequence.
Arguments
- file
Either a path to a file, a connection, or literal data (either a single string or a raw vector) containing DNA sequences in the standard FASTA format. There are no checks to determine whether the data are DNA or amino acid sequences.
Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed.
- degap
Logical value (default = TRUE). If TRUE, gap characters ("." or "-") are removed from the sequences.
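To make the effect of degap concrete, here is a minimal base R sketch of what removing gap characters amounts to. The helper strip_gaps() is hypothetical and exists only for illustration; it is not the package's implementation, which may differ in detail.

```r
# Hypothetical helper illustrating gap removal; not the package's own code.
strip_gaps <- function(sequences) {
  # Inside a character class, "." is literal and a trailing "-" is literal,
  # so this removes every "." and "-" from each sequence.
  gsub("[.-]", "", sequences)
}

aligned <- c("AT-GC..ATGC", "TACG--TACG")
degapped <- strip_gaps(aligned)
print(degapped)
```

Presumably, calling read_fasta() with degap = FALSE would leave such characters in place, which is what you would want for aligned sequences.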
Value
A data frame object with three columns. The id column will contain the non-space characters following the > in the header line of each sequence; the sequence column will contain the sequence; and the comment column will contain any text found after the first whitespace character on the header line.
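The split between id and comment described above can be sketched with base R regular expressions. The helper split_header() is hypothetical and only illustrates the rule as stated here; the package may parse headers differently internally.

```r
# Hypothetical helper; base R only, not the package's implementation.
split_header <- function(header) {
  # id: the run of non-whitespace characters immediately after ">"
  id <- sub("^>(\\S+).*$", "\\1", header, perl = TRUE)
  # comment: everything after the first whitespace character
  # (an empty string when the header has no whitespace)
  comment <- sub("^>\\S+\\s?", "", header, perl = TRUE)
  data.frame(id = id, comment = comment, stringsAsFactors = FALSE)
}

headers <- c(
  ">seqA",
  ">seqD B.ceresus UW85",
  ">seq4\tE. coli K12\tBacteria;Proteobacteria;"
)
header_df <- split_header(headers)
print(header_df)
```

Note that a tab counts as whitespace, so in the third header only seq4 is treated as the id and the remaining tab-separated fields stay together in the comment.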
Note
The sequences in the FASTA file can have line breaks within them, and read_fasta() will join those separate lines into a single sequence.
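As a rough illustration of how wrapped sequence lines can be joined, the following base R sketch groups lines by the header that precedes them and pastes each group together. It is one possible approach under the assumption that every record starts with a ">" line; it is not taken from the package source.

```r
# Lines as they might appear in a FASTA file; the second record is wrapped.
fasta_lines <- c(
  ">seqA",
  "ATGCATGC",
  ">seqE B.ceresus UW123",
  "TCCGATGC",
  "TCCGATGC"
)

# Each header starts a new record, so a running count of headers
# labels every line with the record it belongs to.
is_header <- startsWith(fasta_lines, ">")
record <- cumsum(is_header)

# Collect the non-header lines of each record and concatenate them.
pieces <- split(fasta_lines[!is_header], record[!is_header])
sequences <- unname(vapply(pieces, paste, character(1), collapse = ""))
print(sequences)
```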
Examples
# Write a few FASTA records to a temporary file
temp <- tempfile()
write(">seqA\nATGCATGC\n>seqB\nTACGTACG", file = temp)
write(">seqC\nTCCGATGC", file = temp, append = TRUE)

# Header with a space-separated comment
write(">seqD B.ceresus UW85\nTCCGATGC", file = temp, append = TRUE)

# Headers with tab-separated comments
write(">seq4\tE. coli K12\tBacteria;Proteobacteria;\nTCCGATGC",
  file = temp,
  append = TRUE
)
write(">seq_4\tSalmonella LT2\tBacteria;Proteobacteria;\nTCCGATGC",
  file = temp, append = TRUE
)

# Sequence split across two lines
write(">seqE B.ceresus UW123\nTCCGATGC\nTCCGATGC",
  file = temp,
  append = TRUE
)

# Read the file into a three-column data frame
sequence_df <- read_fasta(temp)