# Input From Scratch

To generate a pagoo object, the only mandatory data structure is a data.frame containing 3 basic columns. Let’s illustrate it with a toy dataset.

tgz <- system.file('extdata', 'toy_data.tar.gz', package = 'pagoo')
untar(tarfile = tgz, exdir = tempdir()) # Decompress example dataset
files <- list.files(path = tempdir(), full.names = TRUE, pattern = 'tsv$|fasta$') # List files
files
## [1] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/case_clusters_meta.tsv"
## [2] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/case_df.tsv"
## [3] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/case_orgs_meta.tsv"
## [4] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/organismA.fasta"
## [5] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/organismB.fasta"
## [6] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/organismC.fasta"
## [7] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/organismD.fasta"
## [8] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/organismE.fasta"

The file we need now is case_df.tsv. Let’s load it and look at its structure:

data_file <- grep("case_df.tsv", files, value = TRUE)
data <- read.table(data_file, header = TRUE, sep = '\t', quote = '')
head(data) # Show the first rows
##      gene       org cluster                             annot
## 1 gene081 organismA   OG001  Thioesterase superfamily protein
## 2 gene122 organismB   OG001          Thioesterase superfamily
## 3 gene299 organismC   OG001  Thioesterase superfamily protein
## 4 gene186 organismD   OG001  Thioesterase superfamily protein
## 5 gene076 organismE   OG001          Thioesterase superfamily
## 6 gene352 organismA   OG002 Inherit from proNOG: Thioesterase

So it is a data.frame with 4 columns: the first holds the name of each gene, the second the organism to which each gene belongs, the third the cluster to which each gene was assigned in the pangenome reconstruction, and the last annotation metadata for each gene. Of the 4 columns, the first 3 are required, and pagoo will look for columns named “gene”, “org”, and “cluster”. Additional columns are optional, and you can add as many as you want (or none) as metadata for each gene.
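Before building the object, it can be useful to confirm that a table has the three expected column names. The following is a minimal base R sketch, not part of pagoo itself; the toy values and the names `required`, `missing_cols`, and `metadata_cols` are illustrative assumptions.

```r
# Toy table mirroring the structure shown above (values are illustrative).
data <- data.frame(
  gene    = c("gene081", "gene122", "gene352"),
  org     = c("organismA", "organismB", "organismA"),
  cluster = c("OG001", "OG001", "OG002"),
  annot   = c("Thioesterase superfamily protein",
              "Thioesterase superfamily",
              "Inherit from proNOG: Thioesterase")
)

# Column names the text says pagoo looks for.
required <- c("gene", "org", "cluster")

# Required columns that are absent (empty if the table is well formed).
missing_cols <- setdiff(required, colnames(data))

# Any remaining columns would be treated as optional per-gene metadata.
metadata_cols <- setdiff(colnames(data), required)

if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}
```

In this toy table nothing is missing, and `annot` is the only metadata column.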

With only this data (even ignoring the fourth column, which is metadata), you can start working with pagoo:

pg <- pagoo(data = data)

The next 2 .tsv files contain metadata for each cluster and for each organism, respectively, and are optional arguments. The organism metadata looks like this:

##         org sero  country
## 1 organismA    a Westeros
## 2 organismB    b Westeros
## 3 organismC    c Westeros
## 4 organismD    a    Essos
## 5 organismE    b    Essos

This data.frame has a column named org, which is mandatory if you provide this argument. The other columns are metadata associated with each organism. Beware that the organisms provided in this table (orgs_meta$org) must coincide with the names provided in the data$org field, so that each variable is mapped correctly.

The last file contains metadata associated with each cluster of orthologs:

##   cluster   kegg  cog
## 1   OG001   <NA>    S
## 2   OG002   <NA>    S
## 3   OG003   <NA> <NA>
## 4   OG004   <NA>    D
## 5   OG005 K01990    V
## 6   OG006   <NA>    V

Again, the clust_meta$cluster column must contain the same elements as the data$cluster column so that one can be mapped onto the other.
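The matching requirement for both metadata tables can be checked up front. Here is a small base R sketch under stated assumptions: `check_keys` is a hypothetical helper (not a pagoo function), and the toy tables deliberately include one cluster (OG003) that has metadata but no genes, to show what a mismatch looks like.

```r
# Hypothetical helper: report keys present in one table but not in the other.
check_keys <- function(data_keys, meta_keys) {
  data_keys <- unique(as.character(data_keys))
  meta_keys <- unique(as.character(meta_keys))
  list(
    missing_in_meta = setdiff(data_keys, meta_keys),  # in data, no metadata
    unknown_in_meta = setdiff(meta_keys, data_keys)   # metadata, not in data
  )
}

# Toy tables (illustrative values only).
data <- data.frame(
  gene    = c("gene081", "gene122", "gene352"),
  org     = c("organismA", "organismB", "organismA"),
  cluster = c("OG001", "OG001", "OG002")
)
orgs_meta  <- data.frame(org = c("organismA", "organismB"),
                         sero = c("a", "b"))
clust_meta <- data.frame(cluster = c("OG001", "OG002", "OG003"),
                         cog = c("S", "S", NA))

org_check <- check_keys(data$org, orgs_meta$org)
clu_check <- check_keys(data$cluster, clust_meta$cluster)
```

Here `org_check` reports no discrepancies, while `clu_check$unknown_in_meta` flags OG003, which would need to be resolved before the two tables can be mapped onto each other.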

With all this data the pagoo object will look much more complete. But you can still add sequence information to the pangenome, which makes it much more useful and interesting to work with.
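Sequences are supplied per organism, so FASTA file names are a convenient source of organism names. As a rough base R sketch, assuming one file per organism named after it, a named vector of paths could be prepared as follows. The paths, `fasta_files`, and `orgs_in_data` are hypothetical stand-ins.

```r
# Hypothetical file paths; in practice these would come from list.files().
fasta_files <- c("/some/dir/organismA.fasta", "/some/dir/organismB.fasta")

# Name each path after its organism by stripping directory and extension.
names(fasta_files) <- sub("[.]fasta$", "", basename(fasta_files))

# Stand-in for unique(data$org).
orgs_in_data <- c("organismA", "organismB")

# Every organism in the table needs a matching set of sequences.
all_covered <- all(orgs_in_data %in% names(fasta_files))

# Assumption: with the Biostrings package installed, each file could then be
# read with Biostrings::readDNAStringSet(), for example via
# lapply(fasta_files, Biostrings::readDNAStringSet), which keeps the names.
```

Because `lapply()` preserves the names of its input, the resulting list would be named by organism.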

In this made-up dataset we have 5 organisms, so if you decide to add sequences to the pangenome you must provide them for all 5 organisms. The type of data needed is a DNA multifasta file for each organism, in which each sequence is a gene whose name can be mapped to the data$gene column. You must first load the sequences into a list, and name each list element after the organism provided in data$org (as well as orgs_meta$org). The list would look something like:

1. organism1
   • gene1
   • gene2
   • geneN
2. organism2
   • gene1
   • gene2
   • geneM
3. organism3
   • gene1
   • gene2
   • geneP

In the case of the example we are working on:

## [1] "list"
## [1] 5
## [1] "organismA" "organismB" "organismC" "organismD" "organismE"
## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"

So we have a list of DNAStringSet objects (Biostrings package). Now we can load a quite complete pagoo object (you could still add more metadata to genes, clusters, or organisms):

# Input From Pangenome Reconstruction Software

All of the above data preparation and class loading may seem difficult and time-consuming, but with real-life datasets it will rarely be needed. We explain it here to give full details about how the software works, but this package also provides functions that automatically read output files from pangenome reconstruction software into pagoo, avoiding any formatting or manipulation of data.

Currently pagoo supports input from roary (Page et al., 2015), which has been the standard and most cited software for pangenome reconstruction. To work with roary’s output, please refer to the ?roary_2_pagoo documentation. You will only need the .gff files and the gene_presence_absence.csv file.

Also, we have created our own pangenome reconstruction software called pewit (Ferrés et al., still unpublished), which automatically generates a pagoo-like object to perform downstream analyses. This object contains all the methods and fields pagoo provides, plus a set of methods and fields exclusive to this software.

More recently, other good software has emerged, such as PanX (Ding et al., 2018), PIRATE (Bayliss et al., 2019), micropan (Snipen & Liland, 2015), PEPPA (Zhou & Achtman, 2020), and panaroo (Tonkin-Hill et al., 2020), among others. We plan to provide support for most of them in the future, although some of them already provide options to output their results in roary’s format, so in theory you could use them as input to pagoo.

# Adding Metadata After Object Creation

After object creation, you may want to add new metadata based on new information or as a result of later analyses. pagoo objects include a function to add columns of metadata to each gene, each cluster, or each organism. To illustrate it, we will add a new column named host to the $organisms field, containing made-up information about the host from which each genome was isolated.

host_df <- data.frame(org = p$organisms$org, host = c("Cow", "Dog", "Cat", "Cow", "Sheep"))
p$add_metadata(map = "org", host_df)
p$organisms
## DataFrame with 5 rows and 4 columns
##         org        sero     country        host
##    <factor> <character> <character> <character>
## 1 organismA           a    Westeros         Cow
## 2 organismB           b    Westeros         Dog
## 3 organismC           c    Westeros         Cat
## 4 organismD           a       Essos         Cow
## 5 organismE           b       Essos       Sheep

For pagoo to map the data correctly, the values in the first column of the metadata table should be the same as p$organisms$org, and its column header must also be named org.

As said, you can add gene or cluster metadata following the same idea.

tmp <- paste(tempdir(), "pangenome", sep = "/") # Output directory
p$write_pangenome(dir = tmp)
list.files(tmp, full.names = TRUE)
## [1] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/pangenome/clusters.tsv"
## [2] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/pangenome/data.tsv"
## [3] "/var/folders/cq/v60_yqxs0qjg01c2n2f6yc2w0000gn/T//RtmpbriZnk/pangenome/organisms.tsv"

This creates a directory with 3 text files. The advantage of this approach is that you can analyze the data outside R. The disadvantage, from a reproducibility point of view, is that reading text can be less stable, since classes or numeric precision can be lost, and you can’t save the state of the object at any given time. Only available organisms/genes/clusters are saved, and if you reload the class using the tsv files, any previously dropped organisms/genes/clusters won’t be available any more. (For information about dropping/recovering organisms, see the “Subset” tutorial.)

If you want to save the object and continue working with it in another R session, we recommend saving it as an R object with the RDS methods provided:

rds <- paste(tempdir(), "pangenome.RDS", sep = "/")
p$save_pangenomeRDS(file = rds)
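The loss of class information mentioned above can be illustrated with a plain data.frame rather than a pagoo object. This is a base R sketch only: it uses `write.table()`/`read.table()` and `saveRDS()`/`readRDS()` as stand-ins, and makes no claim about how pagoo's own methods are implemented. The table and file names are hypothetical.

```r
# A small table with a factor whose levels are in a custom order.
df <- data.frame(
  org  = c("organismA", "organismB", "organismC"),
  host = factor(c("Cow", "Dog", "Cat"), levels = c("Dog", "Cow", "Cat")),
  stringsAsFactors = FALSE
)

tsv_file <- tempfile(fileext = ".tsv")
rds_file <- tempfile(fileext = ".rds")

# Text round trip: the factor class (or at least its level order) is lost.
write.table(df, tsv_file, sep = "\t", quote = FALSE, row.names = FALSE)
from_tsv <- read.table(tsv_file, header = TRUE, sep = "\t", quote = "")

# RDS round trip: the object is serialized and restored as it was.
saveRDS(df, rds_file)
from_rds <- readRDS(rds_file)

same_after_tsv <- identical(df, from_tsv)
same_after_rds <- identical(df, from_rds)
```

The text round trip returns a table that is no longer identical to the original, whereas the RDS round trip restores it exactly, which is the reason for preferring RDS when the goal is to resume work in R.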