# Bayesian Testing of Equal Genotype Proportions between Multiple Populations

#### 2017-01-15

This tutorial shows how to use the MADPop package to test for genetic differences between two populations, of which the individuals of a same species contain a variable number of alleles.

## rstan (Version 2.18.1, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)

## Pre-Processing

Our data consists of $$N = 215$$ recordings of Major Histocompatibility Complex (MHC) genotypes of lake trout from $$K = 11$$ lakes in Ontario, Canada. For each of the fish, between 1-4 alleles in the MHC genotype are recorded. This is partially because duplicate genes are undetectable by current instrumentation, and possibly because the fish possess a variable number of alleles of a given MHC gene.

Our dataset fish215 is included with MADPop. A random sample from it looks like this:

##        Lake  A1  A2  A3  A4
## 114   Slate             r.4
## 51  Crystal r.8     r.4 r.3
## 11    Hogan r.1 r.4
## 18    Hogan     r.4
## 27   Simcoe     r.1 r.5
## 156 Manitou r.4 r.5     v.8

The first column is the lake name (or population ID) for each sample, the remaining four columns are for potentially recorded allele codes (A1-A4). Here the code to identify a unique allele is a small letter followed by a number, but it could have been the sequence of integers $$1, 2,\ldots, A$$, which for the fish215 data is $$A = 57$$ unique alleles.

It is relatively straightforward to import a CSV file into the format above. An example of this is given along with our raw data in the extdata directory of the local copy of the MADPop package.

## Two-Population Comparisons

Suppose that we wish to compare two lakes, say Dickey and Simcoe. The allele counts in these lakes are in the table below. It is a subset of the full contingency table on all $$K = 11$$ lakes, which is produced by the MADPop function UM.suff:

popId <- c("Dickey", "Simcoe")          # lakes to compare
Xsuff <- UM.suff(fish215)             # summary statistics for dataset
ctab <- Xsuff$tab[popId,] # contingency table ctab <- ctab[,colSums(ctab) > 0] # remove alleles with no counts #ctab rbind(ctab, Total = colSums(ctab)) ## 1.20 1.4.5 1.4.9.45 1.5 10 20 20.54 27.28.29 3 4 4.10 4.11 4.20.27 ## Dickey 1 1 0 2 0 2 1 1 0 0 1 0 1 ## Simcoe 0 0 1 1 2 0 0 0 1 1 2 1 0 ## Total 1 1 1 3 2 2 1 1 1 1 3 1 1 ## 4.20.31 4.20.45 4.27 4.33 4.5 4.53.55 4.56 4.7 4.7.10 5 5.20 5.30 ## Dickey 1 1 1 0 1 1 1 0 0 0 1 1 ## Simcoe 0 0 0 1 1 0 0 3 1 1 0 0 ## Total 1 1 1 1 2 1 1 3 1 1 1 1 ## 5.32 5.7 7 9.11 ## Dickey 1 1 0 0 ## Simcoe 0 1 2 1 ## Total 1 2 2 1 The unique allele identifiers are encoded as integers between $$1$$ and $$A$$ and separated by dots. The original allele names are stored in Xsuff$A, such that the genotype of the first column 1.20 is

gtype <- colnames(ctab)[1]
gtype <- as.numeric(strsplit(gtype, "[.]")[[1]])
gtype
## [1]  1 20
names(gtype) <- paste0("A", gtype)
sapply(gtype, function(ii) Xsuff\$A[ii])
##    A1   A20
## "r.1" "u.4"

There are $$C = 29$$ genotype combinations observed in these two lakes, corresponding to each column of the table.

### Multinomial Model

In the two-population problem we have $$K = 2$$ lakes with $$N_1$$ and $$N_2$$ fish sampled from each. Let $$\boldsymbol Y_k = (Y_{k1}, \ldots Y_{kC})$$ denote the counts for each genotype observed in lake $$k$$, such that $$\sum_{i=1}^C Y_{ki} = N_k$$. The sampling model for these data is $\boldsymbol Y_k \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_k), \quad k = 1,2,$ where $$\boldsymbol \rho_k = (\rho_{k1},...,\rho_{kC})$$ are the population proportions of each genotype, and $$\sum_{i=1}^C \rho_{ki} = 1$$.

### Hypothesis Testing

Our objective is to test $\begin{split} H_0 &: \textrm{The two populations have the same genotype proportions} \\ & \phantom{: } \iff \boldsymbol \rho_1 = \boldsymbol \rho_2. \end{split}$ The classical test statistics for assessing $$H_0$$ are Pearson’s Chi-Square statistic $$\mathcal X$$ and the Likelihood Ratio statistic $$\Lambda$$, $\mathcal X = \sum_{k=1}^2 \sum_{i=1}^C \frac{(N_k\hat\rho_i - Y_{ki})^2}{N_k\hat\rho_i}, \qquad \Lambda = 2 \sum_{k=1}^2 \sum_{i=1}^C Y_{ki} \log\left(\frac{Y_{ki}}{N_k\hat\rho_i}\right), \qquad \hat \rho_i = \frac{Y_{1i} + Y_{2i}}{N_1 + N_2}.$ Under $$H_0$$, the asymptotic distribution of either of these test statistics $$T = \mathcal X$$ or $$\Lambda$$ is $$\chi^2_{(C-1)}$$, such that the $$p$$-value $\textrm{p}_\textrm{v}= \textrm{Pr}(T > T_\textrm{obs} \mid H_0)$ for an observed value of $$T_\textrm{obs}$$ can be estimated as follows:

##  chi2   LRT
## 0.330 0.041

The Chi-Square and LR tests are asymptotically equivalent and so should give roughly the same $$p$$-values. The huge discrepancy observed here indicates that the sample sizes are too small for asymptotics to kick in. A more reliable $$p$$-value estimate can be obtained by the Bootstrap method, which in this case consists of simulating $$M$$ contigency tables with $$\boldsymbol Y_k^{(m)} \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \hat{\boldsymbol \rho})$$, $$m = 1, \ldots, M$$, where $$\hat{\boldsymbol \rho}$$ is the estimate of the common probability vector $$\boldsymbol \rho_1 = \boldsymbol \rho_2$$ under $$H_0$$. For each contingency table $$(\boldsymbol Y_1^{(m)}, \boldsymbol Y_2^{(m)})$$, we calculate the test statistic $$T^{(m)}$$, and the bootstrapped p-value is defined as $\textrm{p}_\textrm{v}^{\textrm{boot}}= \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_\textrm{obs}).$ $$\textrm{p}_\textrm{v}^{\textrm{boot}}$$ is calculated with MADPop as follows:

##    user  system elapsed
##   0.728   0.034   0.761
##   chi2    LRT
## 0.0062 0.0058

Note that the bootstrap $$p$$-values for both tests are roughly the same and decisively reject $$H_0$$, whereas the less reliable asymptotic $$p$$-values both failed to reject (at quite different significance levels).

## Pairwise Comparisons between Multiple Populations

Bootstrapping overcomes many deficiencies of the asymptotic $$p$$-value calculation. However, bootstrapping has a tendency to reject $$H_0$$ when sample sizes are small. To see why this is, consider all columns of ctab which have only one genotype count between the two lakes:

itab1 <- colSums(ctab) == 1             # single count genotypes
cbind(ctab[,itab1],
Other = rowSums(ctab[,!itab1]),
Total = rowSums(ctab))
##        1.20 1.4.5 1.4.9.45 20.54 27.28.29 3 4 4.11 4.20.27 4.20.31 4.20.45
## Dickey    1     1        0     1        1 0 0    0       1       1       1
## Simcoe    0     0        1     0        0 1 1    1       0       0       0
##        4.27 4.33 4.53.55 4.56 4.7.10 5 5.20 5.30 5.32 9.11 Other Total
## Dickey    1    0       1    1      0 0    1    1    1    0     7    20
## Simcoe    0    1       0    0      1 1    0    0    0    1    12    20

There are $$c_1 = 21$$ such columns, accounting for $$\hat p_1 = 0.525$$ of the common genotype distribution under $$H_0$$, as estimated from the two-lake sample. For each of these columns, observing counts in one lake but not the other provides evidence against $$H_0$$. Moreover, under the estimated common distribution $$\hat {\boldsymbol \rho}$$, it is very unlikely to have counts in only one of the lakes for each of these $$c_1 = 21$$ genotypes. Therefore, the data appear to provide very strong evidence against $$H_0$$. However, it is not so unlikely to have $$c_1 = 21$$ one-count genotypes if the true number of unique genotypes in these two lakes is much larger than the observed value of $$C = 29$$. With $$C = 29$$ unique genotypes in only $$N = N_1 + N_2 = 40$$ fish samples, it is quite plausible that a new sample of fish would yield several genotypes which are not present in the original two-lake sample ctab.

One way to obtain information about the unobserved genotypes in lakes Dickey and Simcoe is to consider the genotypes in all $$K = 11$$ Ontario lakes for which we have collected data. A natural way to do this is through a hierarchical model: $\begin{split} \boldsymbol Y_k \mid \boldsymbol \rho_k & \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_k), \quad k = 1,\ldots,K \\ \boldsymbol \rho_k & \stackrel{\textrm{iid}}{\sim}\textrm{Dirichlet}(\boldsymbol \alpha), \qquad \boldsymbol \alpha= (\alpha_1, \ldots, \alpha_C). \end{split}$ That is, each population is allowed to have its own probability vector $$\boldsymbol \rho_k$$. However, these probability vectors are drawn from a common Dirichlet distribution. The common distribution specifies that before any data is drawn, we have $E[\boldsymbol \rho] = \bar{\boldsymbol \alpha}, \qquad \textrm{var}(\rho_i) = \frac{(\bar \alpha_i)(1-\bar \alpha_i)}{\alpha_0 + 1},$ where $$\alpha_0 = \sum_{i=1}^C \alpha_i$$ and $$\bar{\boldsymbol \alpha} = \boldsymbol \alpha/ \alpha_0$$. Moreover, the posterior distribution of $$\boldsymbol \rho_k$$ given $$\boldsymbol \alpha$$ and the data is also Dirichlet: $\boldsymbol \rho_k \mid \boldsymbol Y\stackrel{\textrm{ind}}{\sim}\textrm{Dirichlet}(\boldsymbol \alpha+ \boldsymbol Y_k).$ In this sense, the posterior estimate of the proportion of genotype $$i$$ in population $$k$$, $$\rho_{ki}$$, can be non-zero even if $$Y*{ki} = 0$$. In practice, $$\boldsymbol \alpha$$ is estimated from $$\boldsymbol Y$$, the data from all $$K$$ populations (details in the following section). The extent to which the genotypes observed in one lake affect inference in the other lakes is determined by $$\alpha*0$$ (the larger this value, the larger the effect).

### Parameter Estimation

The likelihood function for this model corresponds to a Dirichlet-Multinomial distribution, $\begin{split} \mathcal L(\boldsymbol \alpha\mid \boldsymbol Y) = \prod_{k=1}^K p(\boldsymbol Y_k \mid \boldsymbol \alpha) & = \prod_{k=1}^K \int p(\boldsymbol Y_k \mid \boldsymbol \rho_k) p(\boldsymbol \rho_k \mid \boldsymbol \alpha) \,\mathrm{d}\boldsymbol \rho_k \\ & = \prod_{k=1}^K \left[\frac{N_k!\cdot\Gamma(\alpha_0)}{\Gamma(N_k+\alpha_0)} \prod_{i=1}^C \frac{\Gamma(Y_{ki} + \alpha_i)}{Y_{ki}!\cdot\Gamma(\alpha_i)}\right], \end{split}$ from which $$\boldsymbol \alpha$$ can be estimated by maximum likelihood (Minka 2000). Alternatively we estimate $$\boldsymbol \alpha$$ using a Bayesian approach, which is more readily extensible to more complex genotype models, for which $$\mathcal L(\boldsymbol \alpha\mid \boldsymbol Y)$$ has no closed form (see Future Work).

First, we specify a prior distribution $\pi(\alpha_0, \bar{\boldsymbol \alpha}) = \frac{1}{(1+ \alpha_0)^2},$ which is a uniform distribution on the prior probability vector $$\bar{\boldsymbol \alpha}$$, and a uniform distribution on the variance-like parameter $$A = (1+\alpha_0)^{-1}$$, in the sense that the prior mean and variance of a given genotype population probability are $E[\rho_{kj}] = \tilde \alpha_j, \qquad \textrm{var}(\rho_{kj}) = A \cdot \tilde \alpha_j(1-\tilde \alpha_j).$ Then, we sample from the posterior distribution $$p(\boldsymbol \alpha\mid \boldsymbol Y) \propto \mathcal L(\boldsymbol \alpha\mid \boldsymbol Y) \pi(\boldsymbol \alpha)$$ using a Markov chain Monte Carlo (MCMC) algorithm provided by the stan programming language (Stan Development Team 2016), as detailed below.

## Bayesian Hypothesis Testing

Under the hierarchical model, the null hypothesis of equal genetic proportions between two populations, say $$k = 1,2$$, is $$H_0: \boldsymbol \rho_1 = \boldsymbol \rho_2 = \boldsymbol \rho_{12}$$. Our Bayesian hypothesis-testing strategy is implemented in two steps:

1. Sample from the posterior distribution $$p(\boldsymbol \alpha\mid \boldsymbol Y, H_0)$$.
2. Sample from the posterior distribution of either classical test statistic, $$T = \mathcal X$$ or $$\Lambda$$, thus obtaining a posterior p-value $\textrm{p}_\textrm{v}^{\textrm{post}}= \textrm{Pr}(T > T_\textrm{obs} \mid \boldsymbol Y, H_0).$

### MCMC Sampling from the Posterior Distribution

In order to estimate $$\boldsymbol \alpha$$ under $$H_0$$, we apply the Dirichlet-Multinomial distribution to $$K-1$$ lakes, with the first two being collapsed into a common count vector $$\boldsymbol Y_{12} = \boldsymbol Y_1 + \boldsymbol Y_2$$ with sample size $$N_{12} = N_1 + N_2$$. MCMC sampling is implemented via a Hybrid Monte Carlo (HMC) variant provided by the R package rstan. In MADPop, sampling from $$p(\boldsymbol \alpha\mid \boldsymbol Y, H_0)$$ is accomplished with the function hUM.post, where “hUM” stands for hierarchical Unconstrained Multinomial model. We start by merging the two samples from equal distributions under $$H_0$$:

## [1] "Dickey" "Simcoe"
##       Crystal Dickey.Simcoe         Hogan     Kingscote     Macdonald
##            20            40            21            18            19
##       Manitou  Michipicoten       Opeongo        Seneca         Slate
##            20            20            20            18            19

Next, we sample from $$p(\boldsymbol \alpha\mid \boldsymbol Y, H_0)$$ using rstan:

##
## SAMPLING FOR MODEL 'DirichletMultinomial' NOW (CHAIN 1).
## Chain 1: Gradient evaluation took 0.000261 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.61 seconds.
## Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
## Chain 1: Iteration: 1001 / 10000 [ 10%]  (Sampling)
## Chain 1: Iteration: 2000 / 10000 [ 20%]  (Sampling)
## Chain 1: Iteration: 3000 / 10000 [ 30%]  (Sampling)
## Chain 1: Iteration: 4000 / 10000 [ 40%]  (Sampling)
## Chain 1: Iteration: 5000 / 10000 [ 50%]  (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
## Chain 1:  Elapsed Time: 4.01173 seconds (Warm-up)
## Chain 1:                28.5347 seconds (Sampling)
## Chain 1:                32.5464 seconds (Total)

rstan typically throws a number of exceptions which are usually harmless. The MADPop package lists rstan as a dependency, such that its sophisticated tuning mechanism is exposed. Optionally, hUM.post returns the entire rstan output (full.stan.output = TRUE), which can be run through rstan’s MCMC convergence diagnostics.

By setting the rhoId argument of hUM.post, the fit0 object contains samples from $$p(\boldsymbol \alpha, \boldsymbol \rho_{12} \mid \boldsymbol Y, H_0)$$. Boxplots of the 50 highest posterior probability genotypes in the common distribution $$\boldsymbol \rho_{12}$$ are displayed below:

### Calculating the Posterior P-Value

This is calculated analogously to the bootstrapped p-value, by generating $$M$$ contingency tables with $$\boldsymbol Y_k^{(m)} \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_{12}^{(m)}$$, $$m = 1, \ldots, M$$. However, for each contingency table, the common probability vector $$\boldsymbol \rho_{12}^{(m)}$$ is different, corresponding to a random draw from $$p(\boldsymbol \rho_{12} \mid \boldsymbol Y, H_0)$$. The test statistic $$T^{(m)}$$ is then calculated and we have $\textrm{p}_\textrm{v}^{\textrm{post}}= \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_{\textrm{obs}}).$ $$\textrm{p}_\textrm{v}^{\textrm{post}}$$ is calculated with MADPop as follows:

##    user  system elapsed
##   1.352   0.081   1.436

We can now compare the three types of p-values (asymptotic, bootstrapped, posterior) for each test statistic ($$\mathcal X$$ and $$\Lambda$$):

p-value ($$\times100\%$$)
Asymptotic Bootstrap Posterior
$$\mathcal X$$ 33.0 0.62 14
$$\Lambda$$ 4.1 0.58 13

We see that $$\textrm{p}_\textrm{v}^{\textrm{post}}$$ is much more conservative than $$\textrm{p}_\textrm{v}^{\textrm{boot}}$$.

## Future Work

Here we used a completely unconstrained Multinomial model for the genotype counts, in the sense that the only restriction on $$\boldsymbol \rho_k$$ is that $$\rho_{ki} \ge 0$$ and $$\sum_{i=1}^C \rho_{ki} = 1$$. However, it is possible to impose genetic constraints such as Hardy-Weinberg equilibrium, preferential mating, etc., which effectively reduce the degrees of freedom of $$\boldsymbol \rho_k$$ below $$C-1$$. In this case, the closed-form likelihood $$\mathcal L(\boldsymbol \alpha\mid \boldsymbol Y)$$ is typically unavailable, but the stan code can easily be modified to sample from $$p(\boldsymbol \alpha, \boldsymbol{\mathcal R}\mid \boldsymbol Y)$$, where $$\boldsymbol{\mathcal R}= (\boldsymbol \rho_1, \ldots, \boldsymbol \rho_K)$$. We hope to feature some of these extensions to the basic Dirichlet-Multinomial model in the next version of MADPop.

## References

Minka, T.P. 2000. “Estimating a Dirichlet Distribution.” Technical Report. Massachusetts Institute of Technology. https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf.

Stan Development Team. 2016. “Rstan: The R Interface to Stan.” 2016. http://mc-stan.org.