The classify_sequence() function implements the Wang et al. naive Bayesian classification algorithm for 16S rRNA gene sequences.
Arguments
- unknown_sequence
A character object representing a DNA sequence that needs to be classified
- database
A kmer database generated using build_kmer_database()
- kmer_size
An integer value (default of 8) indicating the size of kmers to use for classifying sequences. Higher values use more RAM with potentially more specificity. Lower values use less RAM with potentially less specificity. Benchmarking has found that the default of 8 provides the best specificity with the lowest possible memory requirement and fastest execution time.
- num_bootstraps
An integer value (default of 100). The value of num_bootstraps is the number of randomizations to perform, where 1/kmer_size of all kmers are sampled (without replacement) from unknown_sequence. Higher values will provide greater precision on the confidence score.
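The subsampling step described for num_bootstraps can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's implementation: get_kmers() and bootstrap_kmers() are hypothetical helper names, and rounding the sample size down (with a minimum of one kmer) is an assumption, since the documentation does not say how 1/kmer_size of the kmers is rounded.

```r
# Hypothetical helper: split a sequence into its overlapping kmers.
get_kmers <- function(sequence, kmer_size = 8) {
  n_kmers <- nchar(sequence) - kmer_size + 1
  stopifnot(n_kmers >= 1)
  starts <- seq_len(n_kmers)
  substring(sequence, starts, starts + kmer_size - 1)
}

# Hypothetical helper: draw one bootstrap replicate by sampling
# 1/kmer_size of the kmers without replacement.
# Assumption: the sample size is rounded down, with a minimum of 1.
bootstrap_kmers <- function(kmers, kmer_size = 8) {
  n_sample <- max(1, length(kmers) %/% kmer_size)
  sample(kmers, size = n_sample, replace = FALSE)
}

kmers <- get_kmers("ATGCGCTC", kmer_size = 3)
kmers
#> [1] "ATG" "TGC" "GCG" "CGC" "GCT" "CTC"

# Each call returns a random subset of 2 of the 6 kmers (6 %/% 3 = 2).
bootstrap_kmers(kmers, kmer_size = 3)
```

Each replicate would then be classified against the database, and repeating this num_bootstraps times gives the set of classifications from which the confidence score is derived.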
Value
A list object of two vectors. One vector (taxonomy) is the taxonomic assignment for each level. The second vector (confidence) is the percentage of the num_bootstraps randomizations in which the classifier gave the same classification at that level.
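As a rough illustration of how a confidence value of this kind can be computed at a single taxonomic level, the sketch below takes the classification returned by each bootstrap replicate and reports the percentage that agree with the reported assignment. The function name confidence_score() and the example calls are hypothetical and are not taken from the package.

```r
# Hypothetical helper: percentage of bootstrap replicates whose
# classification matches the reported assignment at one level.
confidence_score <- function(bootstrap_calls, assigned) {
  100 * mean(bootstrap_calls == assigned)
}

# Made-up classifications from 10 bootstrap replicates
bootstrap_calls <- c("B", "B", "A", "B", "B", "B", "B", "A", "B", "B")

confidence_score(bootstrap_calls, assigned = "B")
#> [1] 80
```

Applying the same calculation at each taxonomic level would yield one confidence value per entry in the taxonomy vector.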
References
Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007 Aug;73(16):5261-7. doi:10.1128/AEM.00062-07 PMID: 17586664; PMCID: PMC1950982.
Examples
kmer_size <- 3
sequences <- c("ATGCGCTA", "ATGCGCTC", "ATGCGCTC")
genera <- c("A", "B", "B")
db <- build_kmer_database(sequences, genera, kmer_size)
unknown_sequence <- "ATGCGCTC"
classify_sequence(
  unknown_sequence = unknown_sequence,
  database = db,
  kmer_size = kmer_size
)
#> $taxonomy
#> [1] "B"
#>
#> $confidence
#> [1] 100
#>