Querying Data

In this section we will start exploring what is stored inside the pagoo object and how we can access this information. Keep in mind that this object has its own associated data and methods that can be easily queried with the $ operator. These methods allow for the rapid subsetting, extraction and visualization of pangenome data.

First of all, we will load a pangenome using a toy dataset included in the package. This is a preloaded set of 10 Campylobacter spp. genomes, with metadata associated.

library(pagoo) # Load package
rds <- system.file('extdata', 'campylobacter.RDS', package = 'pagoo')
p <- load_pangenomeRDS(rds)

Summary statistics

A pangenome can be stratified in different gene subsets according to their frequency in the dataset. The core genes can be defined as those present in all or almost every genome (typically 95-100%). The remaining genes are defined as the accessory genome, that can be subdivided in cloud genes or singletons (present in one genome or in genomes that are identical) and shell genes which are those in the middle. Let’s see this using pagoo:

p$summary_stats
## DataFrame with 4 rows and 2 columns
##      Category    Number
##   <character> <integer>
## 1       Total      2588
## 2        Core      1627
## 3       Shell       413
## 4       Cloud       548

Core level

The core level defines the minimum number of genomes (as a percentage) in which a certain gene should be present to be considered a core gene. By default, pagoo considers as core all genes present in at least 95% of organisms. The core level can be modified to be more or less stringent defining the core genome. This feature exemplifies R6’s reference semantics, since modifying the core level will affect the pangenome object state resulting in different core, shell and cloud sets. Have a look:

## [1] 95
## DataFrame with 4 rows and 2 columns
##      Category    Number
##   <character> <integer>
## 1       Total      2588
## 2        Core      1554
## 3       Shell       486
## 4       Cloud       548

As you can see, changing the core level from 95% to a more stringent 100% cause in decrease in the number of core genes from 1627 to 1554, and a concomitant increase in shell genes from 413 to 486. This means that 73 genes migrated from core to shell when increasing the threshold to consider a cluster as “core” to 100%. Now this changes remain in the object for subsequent analysis, or can be reverted by setting the core level again at the original value.

Pangenome matrix

The pangenome matrix is one of the most useful things when analyzing pangenomes. Typically, it represents organisms in rows and clusters of orthologous in columns informing about gene abundance (considering paralogues). The pangenome matrix looks like this (printing only first 5 columns):

p$pan_matrix[, 1:5]
##            group0001 group0002 group0003 group0004 group0005
## 16244_6_6          1         1         1         1         1
## 16244_6_18         1         1         1         1         1
## 17059_2_16         1         1         1         1         1
## 17059_2_23         1         1         1         1         0
## 17059_2_27         1         1         1         1         1
## 17150_1_73         1         1         1         1         1
## 17059_2_42         1         1         1         1         0

Gene metadata

Individual gene metadata can be accessed by using the $genes suffix. It always contains the gene name, the organism to which it belongs, the cluster to where it was assigned, and a gene identifier (gid) that is mainly used internally to organize the data. Also, it may typically include annotation data, genomic coordinates, etc, but this other metadata is optional. Gene metadata is spitted by cluster, so it consist in a List of DataFrames.

p$genes
## SplitDataFrameList of length 2588
## $group0001
## DataFrame with 7 rows and 10 columns
##     cluster        org             gene                          gid
##    <factor>   <factor>         <factor>                  <character>
## 1 group0001 16244_6_6  16244_6_6_00150    16244_6_6__16244_6_6_00150
## 2 group0001 16244_6_18 16244_6_18_00172 16244_6_18__16244_6_18_00172
## 3 group0001 17059_2_16 17059_2_16_01012 17059_2_16__17059_2_16_01012
## 4 group0001 17059_2_23 17059_2_23_01040 17059_2_23__17059_2_23_01040
## 5 group0001 17059_2_27 17059_2_27_00492 17059_2_27__17059_2_27_00492
## 6 group0001 17150_1_73 17150_1_73_00221 17150_1_73__17150_1_73_00221
## 7 group0001 17059_2_42 17059_2_42_00176 17059_2_42__17059_2_42_00176
##      geneName
##   <character>
## 1        ilvC
## 2        ilvC
## 3        ilvC
## 4        ilvC
## 5        ilvC
## 6        ilvC
## 7        ilvC
##                                                                                                                                                                  product
##                                                                                                                                                              <character>
## 1 ketol-acid reductoisomerase,Ketol-acid reductoisomerase,ketol-acid reductoisomerase,ketol-acid reductoisomerase,Acetohydroxy acid isomeroreductase%2C catalytic domain
## 2 ketol-acid reductoisomerase,Ketol-acid reductoisomerase,ketol-acid reductoisomerase,ketol-acid reductoisomerase,Acetohydroxy acid isomeroreductase%2C catalytic domain
## 3 ketol-acid reductoisomerase,Ketol-acid reductoisomerase,ketol-acid reductoisomerase,ketol-acid reductoisomerase,Acetohydroxy acid isomeroreductase%2C catalytic domain
## 4 ketol-acid reductoisomerase,Ketol-acid reductoisomerase,ketol-acid reductoisomerase,ketol-acid reductoisomerase,Acetohydroxy acid isomeroreductase%2C catalytic domain
## 5 ketol-acid reductoisomerase,Ketol-acid reductoisomerase,ketol-acid reductoisomerase,ketol-acid reductoisomerase,Acetohydroxy acid isomeroreductase%2C catalytic domain
## 6 ketol-acid reductoisomerase,Ketol-acid reductoisomerase,ketol-acid reductoisomerase,ketol-acid reductoisomerase,Acetohydroxy acid isomeroreductase%2C catalytic domain
## 7 ketol-acid reductoisomerase,Ketol-acid reductoisomerase,ketol-acid reductoisomerase,ketol-acid reductoisomerase,Acetohydroxy acid isomeroreductase%2C catalytic domain
##                      contig      from        to      strand
##                 <character> <integer> <integer> <character>
## 1 ERS672247|SC|contig000001    137333    138355           +
## 2 ERS672259|SC|contig000001    173541    174563           +
## 3 ERS739235|SC|contig000005     43645     44667           -
## 4 ERS739242|SC|contig000004    173578    174600           +
## 5 ERS739246|SC|contig000002    184035    185057           -
## 6 ERS686652|SC|contig000001    207206    208228           +
## 7 ERS739261|SC|contig000001    173569    174591           +
## 
## $group0002
## DataFrame with 7 rows and 10 columns
##     cluster        org             gene                          gid
##    <factor>   <factor>         <factor>                  <character>
## 1 group0002 16244_6_6  16244_6_6_01290    16244_6_6__16244_6_6_01290
## 2 group0002 16244_6_18 16244_6_18_01307 16244_6_18__16244_6_18_01307
## 3 group0002 17059_2_16 17059_2_16_01627 17059_2_16__17059_2_16_01627
## 4 group0002 17059_2_23 17059_2_23_00348 17059_2_23__17059_2_23_00348
## 5 group0002 17059_2_27 17059_2_27_01208 17059_2_27__17059_2_27_01208
## 6 group0002 17150_1_73 17150_1_73_01398 17150_1_73__17150_1_73_01398
## 7 group0002 17059_2_42 17059_2_42_00751 17059_2_42__17059_2_42_00751
##      geneName
##   <character>
## 1        hprA
## 2        hprA
## 3        hprA
## 4        hprA
## 5        hprA
## 6        hprA
## 7        hprA
##                                                                                                                                                                                                                                  product
##                                                                                                                                                                                                                              <character>
## 1 2-hydroxyacid dehydrogenase,Putative 2-hydroxyacid dehydrogenase HI_1556,2-hydroxyacid dehydrogenase,Phosphoserine aminotransferase,phosphoglycerate dehydrogenase,D-isomer specific 2-hydroxyacid dehydrogenase%2C NAD binding domain
## 2 2-hydroxyacid dehydrogenase,Putative 2-hydroxyacid dehydrogenase HI_1556,2-hydroxyacid dehydrogenase,Phosphoserine aminotransferase,phosphoglycerate dehydrogenase,D-isomer specific 2-hydroxyacid dehydrogenase%2C NAD binding domain
## 3 2-hydroxyacid dehydrogenase,Putative 2-hydroxyacid dehydrogenase HI_1556,2-hydroxyacid dehydrogenase,Phosphoserine aminotransferase,phosphoglycerate dehydrogenase,D-isomer specific 2-hydroxyacid dehydrogenase%2C NAD binding domain
## 4 2-hydroxyacid dehydrogenase,Putative 2-hydroxyacid dehydrogenase HI_1556,2-hydroxyacid dehydrogenase,Phosphoserine aminotransferase,phosphoglycerate dehydrogenase,D-isomer specific 2-hydroxyacid dehydrogenase%2C NAD binding domain
## 5 2-hydroxyacid dehydrogenase,Putative 2-hydroxyacid dehydrogenase HI_1556,2-hydroxyacid dehydrogenase,Phosphoserine aminotransferase,phosphoglycerate dehydrogenase,D-isomer specific 2-hydroxyacid dehydrogenase%2C NAD binding domain
## 6 2-hydroxyacid dehydrogenase,Putative 2-hydroxyacid dehydrogenase HI_1556,2-hydroxyacid dehydrogenase,Phosphoserine aminotransferase,phosphoglycerate dehydrogenase,D-isomer specific 2-hydroxyacid dehydrogenase%2C NAD binding domain
## 7 2-hydroxyacid dehydrogenase,Putative 2-hydroxyacid dehydrogenase HI_1556,2-hydroxyacid dehydrogenase,Phosphoserine aminotransferase,phosphoglycerate dehydrogenase,D-isomer specific 2-hydroxyacid dehydrogenase%2C NAD binding domain
##                      contig      from        to      strand
##                 <character> <integer> <integer> <character>
## 1 ERS672247|SC|contig000003    274585    275517           -
## 2 ERS672259|SC|contig000003    272748    273680           -
## 3 ERS739235|SC|contig000012     11617     12549           +
## 4 ERS739242|SC|contig000001    308107    309039           -
## 5 ERS739246|SC|contig000005    114555    115487           -
## 6 ERS686652|SC|contig000003    273866    274798           -
## 7 ERS739261|SC|contig000003     15409     16341           +
## 
## ...
## <2586 more elements>

If you want to work with this data as a single DataFrame, just unlist it:

unlist(p$genes, use.names = FALSE)

pagoo also includes predefined subsets fields to list only certain pangenome category, these are queried by adding a prefix with the desired category followed by an underscore: $core_genes, $shell_genes, and $cloud_genes. These kind of subsets are better explained in the ‘4 - Subets’ tutorial, and also apply to other pangenome data described below.

Clusters metadata

Groups of orthologues (clusters) are also stored in pagoo objects as a table with a cluster identifier per row, and optional metadata associated as additional columns.

p$clusters
## DataFrame with 2588 rows and 2 columns
##        cluster                   Pfam_Arch
##       <factor>                 <character>
## 1    group0001                2-Hacid_dh_C
## 2    group0002     2-Hacid_dh_C;2-Hacid_dh
## 3    group0003 2-Hacid_dh_C;ACT;2-Hacid_dh
## 4    group0004             2Fe-2S_thioredx
## 5    group0005         4HB_MCP_1;MCPsignal
## ...        ...                         ...
## 2584 group2584                   zf-RING_7
## 2585 group2585                    zf-TFIIB
## 2586 group2586                        ZinT
## 2587 group2587                        ZnuA
## 2588 group2588                    ZT_dimer

Subsets also exists for this field: $core_clusters, $shell_clusters, and $cloud_clusters.

Sequences

Although is an optional field (it exists only if user provide this data as an argument when object is created), $sequences gives access to sequence data. Sequences are stored as a List of DNAStringSet (a.k.a DNAStringSetList, Biostrings package), grouped by cluster.

p$sequences             # List all sequences grouped by cluster
## DNAStringSetList of length 2588
## [["group0001"]] 16244_6_6__16244_6_6_00150=ATGGCGATAACAGTTTATTACGACAAAGATTGCG...
## [["group0002"]] 16244_6_6__16244_6_6_01290=ATGAAAATAGTATGCTTAGATGCCGACACGCTTG...
## [["group0003"]] 16244_6_6__16244_6_6_01710=ATGAAAACAGTTATAGTTTGCGATGCAATACATC...
## [["group0004"]] 16244_6_6__16244_6_6_01754=ATGAAATTCGAATTTACTCATGAGCAATTATCGG...
## [["group0005"]] 16244_6_6__16244_6_6_00049=ATGTCAAATTTAACTACTAACTTAACTACCAAAA...
## [["group0006"]] 16244_6_6__16244_6_6_01069=ATGAATTATTTTGAGAATTTAAAAGTTTCAACAA...
## [["group0007"]] 16244_6_6__16244_6_6_01612=ATGCGAATTAGAATTTATTATGAAGATACCGATG...
## [["group0008"]] 16244_6_6__16244_6_6_01679=ATGATGAAAGATATGGGCGAGCCACGTATAAAAA...
## [["group0009"]] 16244_6_18__16244_6_18_01216=ATGGGGCTTACTACGAGTACGACAAAGTATAT...
## [["group0010"]] 16244_6_6__16244_6_6_00758=ATGAAAAGAGTGGTTATAAAAGTAGGCTCTCACG...
## ...
## <2578 more elements>
p$sequences[["group0001"]]  # List first cluster
## DNAStringSet object of length 7:
##     width seq                                               names               
## [1]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 16244_6_6__16244_...
## [2]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 16244_6_18__16244...
## [3]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_16__17059...
## [4]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_23__17059...
## [5]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_27__17059...
## [6]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17150_1_73__17150...
## [7]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_42__17059...

Note that sequence names are created by pasting organism names and gene names, separated by a string that by default is sep = '__' (two underscores). This are the same as the gid column in the $genes field, and are initially set when pagoo object is created. If you think your dataset contain names with this separator, then you should set this parameter to other string to avoid conflicts. $sequences field also has predefined subsets: $core_sequences, $shell_sequences, and $cloud_sequences.

Organism metadata

The $organisms field contain a table with organisms and metadata as additional columns if provided.

p$organisms
## DataFrame with 7 rows and 8 columns
##          org          id      strain      year     country        host
##     <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6         FR15   2008/170h      2008      France       Human
## 2 16244_6_18        FR27   2012/185h      2012      France       Human
## 3 17059_2_16         AR1      99/801      1999   Argentina      Bovine
## 4 17059_2_23         AR8      04/875      2004   Argentina      Bovine
## 5 17059_2_27        AR12      06/195      2006   Argentina      Bovine
## 6 17150_1_73         CA1   001A-0374      2005      Canada       Human
## 7 17059_2_42         TW6        1830      2008      Taiwan       Human
##        source   accession
##   <character> <character>
## 1       Feces   ERS672247
## 2       Blood   ERS672259
## 3     Prepuce   ERS739235
## 4       Fetus   ERS739242
## 5          VM   ERS739246
## 6       Blood   ERS686652
## 7       Blood   ERS739261