# Introduction to the bioseq package

The purpose of the bioseq package is to provide a collection of classes and functions for biological sequence manipulation in R. This vignette will introduce you to the basics of the package so you can get an overview of its functionnalities and start to use it rapidly.

It is assumed that you already installed the package either from CRAN using the function install.packages("bioseq) or from GitHub using remotes::install_github("fkeck/bioseq"). Now, let’s get started by loading the package.

library(bioseq)

## First steps

One of the core functionnality of bioseq is to provide vector classes to store DNA, RNA and amino acid sequences. We create our first DNA sequence vector using the function dna():

x <- dna(Seq_1 = "ACCTAG", Seq_2 = "GGTATATACC", Seq_3 = "AGTC")
is_dna(x)
#> [1] TRUE
x
#> DNA vector of 3 sequences
#> Seq_1  ACCTAG
#> Seq_2  GGTATATACC
#> Seq_3  AGTC

The function is_dna() is useful to test if an object is a DNA vector. The print method nicely indicates that x is a DNA vector of 3 sequences. Note that contrary to the standard print methods for R vectors, each element is printed on its own line, next to its (optional) name. Apart from this, a DNA vector behave very like a character vector. For example we can select and reorder elements by name, logical or index:

x[c("Seq_3", "Seq_1")]
#> DNA vector of 2 sequences
#> Seq_3  AGTC
#> Seq_1  ACCTAG
x[2]
#> DNA vector of 1 sequences
#> Seq_2  GGTATATACC
x[c(FALSE, FALSE, TRUE)]
#> DNA vector of 1 sequences
#> Seq_3  AGTC

However, the key difference between a DNA vector and a character vector is that DNA uses a restricted alphabet. For DNA this alphabet is A, C, G, T, W, S, M, K, R, Y, B, D, H, V, N and -, which correpond to the IUPAC symbols for DNA nucleotides. What happens if you include a forbidden character in a sequence?

y <- dna("?AcGF")
#> Warning: Non-standard IUPAC symbols detected for DNA: 2 characters were
#> converted to N.
y
#> DNA vector of 1 sequences
#> >   NACGN

Here we included two forbidden characters (? and F). Both are automatically converted to a N (which stands for any nucleotide). Forbidden characters often mean that something went wrong somewhere, so the function warns the user. Additionally we included a lowercase symbol (c) which is automatically and silently converted to uppercase. This mechanism guarantees that DNA objects contain only uppercase characters.

RNA and amino acid sequences can be constructed just like DNA using rna() and aa() functions. It is possible to check the allowed alphabet for each type of sequences by typing ?alphabets in the console.

## Operations on sequences

### Biological conversion among classes

In living organisms, DNA is typically transcribed to RNA which is translated to a proteic sequence. Similarly, conversion among sequence classes can be achieved using the seq_transcribe() and seq_translate() functions.

x_dna <- dna("ATGTCACCACAAACAGAGACT")
x_dna
#> DNA vector of 1 sequences
#> >   ATGTCACCACAAACAGAGACT

x_rna <- seq_transcribe(x_dna)
x_rna
#> RNA vector of 1 sequences
#> >   AUGUCACCACAAACAGAGACU

x_aa <- seq_translate(x_rna)
x_aa
#> AA vector of 1 sequences
#> >   MSPQTET

During transcription thymine is simply replaced by uracil. The translation decodes the RNA sequence into amino acids using the standard genetic code. Non standard genetic codes are also available for translation (see the help ?seq_translate). The reverse transcription can be achieved using the function seq_rev_transcribe(). The reverse translation is biologically not possible but is implemented in the function seq_rev_translate(). Because of the degeneracy of the genetic code, the reverse translation typically produces many ambiguous bases.

dna_from_rna <- seq_rev_transcribe(x_rna)
dna_from_rna
#> DNA vector of 1 sequences
#> >   ATGTCACCACAAACAGAGACT

dna_from_aa <- seq_rev_translate(x_aa)
dna_from_aa
#> DNA vector of 1 sequences
#> >   ATGWSNCCNCARACNGARACN

Finally, it is often useful to compute the complement and the reverse complement of DNA and RNA sequences. this can be achieved using the functions

x_dna_comp <- seq_complement(x_dna)
x_dna_comp_rev <- seq_reverse(x_dna_comp)

dna(x_dna, x_dna_comp, x_dna_comp_rev)
#> DNA vector of 3 sequences
#> >   ATGTCACCACAAACAGAGACT
#> >   TACAGTGGTGTTTGTCTCTGA
#> >   AGTCTCTGTTTGTGGTGACAT

### String operations

The bioseq package comes with numerous functions to perform string operations at the sequence level. We will not review the complete list of functions provided by the package, but we will see below how use some of them.

We will take a simple example with 3 sequences. The first two sequences have four A repeated. We will focus on this particular pattern.

x <- dna("CTGAAAACTG", "ATGAAAACTG", "CTGCTG")

#### Detection and selection

Let’s start by selecting only the sequences that match the pattern. This can be easily achieved by combining seq_detect_pattern with the [] operator.

x[seq_detect_pattern(x, "AAAA")]
#> DNA vector of 2 sequences
#> >   CTGAAAACTG
#> >   ATGAAAACTG

When using a simple character vector as pattern, the pattern is evaluated as a regular expression. This means that you can perform very complex queries using the regular expression syntax. Regular expressions are beyond the scope of this vignette but if you are interested to learn more the String chapter from the book R for Data Science by Grolemund and Wickham is a good place to get started.

As an example to illustrate regex support, the same pattern (AAAA) could be also formulated:

x[seq_detect_pattern(x, "A{4}")]
#> DNA vector of 2 sequences
#> >   CTGAAAACTG
#> >   ATGAAAACTG

Alternatively, a biological sequence (i.e a DNA, RNA or AA vector) can be used as pattern. This is less flexible than regular expression but can present several advantages. First it is safer and clearer because it forces the user to be more specific. Second, it allows to deal with ambiguous characters.

# This works
x[seq_detect_pattern(x, dna("AAAA"))]
#> DNA vector of 2 sequences
#> >   CTGAAAACTG
#> >   ATGAAAACTG
# This fails because x is a DNA vector and pattern is an amino acid vector
x[seq_detect_pattern(x, aa("AAAA"))]
# This works because W can be A or T.
x[seq_detect_pattern(x, dna("WAWA"))]
#> DNA vector of 2 sequences
#> >   CTGAAAACTG
#> >   ATGAAAACTG

However it is important to remember that a pattern which contains ambiguous characters is less specific and can capture several strings. How many and which ones? This can be answered using the function seq_disambiguate_IUPAC:

seq_disambiguate_IUPAC(dna("WAWA"))
#> [[1]]
#> DNA vector of 4 sequences
#> >   AAAA
#> >   TAAA
#> >   AATA
#> >   TATA

#### Remove and replace

If the AAAA pattern is an incorrect insertion, we may want to remove it from the sequences. This can be done with the function seq_remove_pattern().

seq_remove_pattern(x, "A{4}")
#> DNA vector of 3 sequences
#> >   CTGCTG
#> >   ATGCTG
#> >   CTGCTG

We can also replace a specific pattern with another sequence.

seq_replace_pattern(x,
pattern = dna("AAAA"),
replacement = dna("----"))
#> DNA vector of 3 sequences
#> >   CTG----CTG
#> >   ATG----CTG
#> >   CTGCTG

So far we performed operations using pattern recognition (functions with prefix seq_ and suffix _pattern). Several operations (remove, replace, extract and crop) can also be applied to a specific region delimited by to positions (in and out). This is typically more useful with aligned sequences.

Instead of removing a pattern it is possible to remove specific region by providing to positions.

For example if we want to replace the last 3 nucleotides with CCC:

x <- seq_remove_pattern(x, "A{4}")
seq_replace_position(x, 4, 6,
replacement = dna("CCC"))
#> DNA vector of 3 sequences
#> >   CTGCCC
#> >   ATGCCC
#> >   CTGCCC

It is important to know that patterns, positions and replacements are recycled along the sequences (usually the x argument). This means that if a pattern (vector or list), a position or a replacement is of length > 1, it will be replicated until it is the same length as x. This is powerful but it must be used with caution. The exemple below show an exemple with a vector of position (in) and a vector of replacement.

x
#> DNA vector of 3 sequences
#> >   CTGCTG
#> >   ATGCTG
#> >   CTGCTG
seq_replace_position(x, 1:3, 6,
replacement = dna("-", "--", "---"))
#> DNA vector of 3 sequences
#> >   -
#> >   A--
#> >   CT---