Stochastic Models, Statistics, and Computation: December 2011

Friday, December 30, 2011

Phylogenetic likelihood with Rcpp+RcppArmadillo

In this post, I want to show how to implement Felsenstein's algorithm for calculating phylogenetic likelihood with the help of Rcpp and RcppArmadillo. I am going to use ape's representation of a phylogenetic tree. If you are not familiar with this data structure, look at this document. My implementation is very similar to the likelihood calculator of ape's ace function.

Problem formulation: given a rooted binary phylogenetic tree with branch length and a continuous-time Markov chain model of discrete character evolution (e.g. DNA, amino acids, morphological traits), compute the probability of observing certain characters at the tips of the tree. I am going to stop short of actually computing this probability. Instead, I will compute what's called partial likelihoods for each node in the tree. The actual likelihood is a weighted sum of the partial likelihoods at the root of the phylogeny.

C++ code with inline compilation:

require(inline)
require(RcppArmadillo)
 
src <- '
  /* Arguments: 
        tE: edge matrix of the ape object phylo
        nIN: number of internal nodes (not necessary, but convinient to get it from phylo)
        tS: integer vector of tip states (-1=missing value)
        nS: state space size (e.g. 2 for a binary trait)
        pM: array of probability matrices for each edge of the tree
 
     Two important assumptions:
        1. edges in the edge matrix and probability matrices are in the "pruningwise" order;

        see ?reorder.phylo for more details

        2. tip state vector is ordered according to the tip numbering in the edge matrix
  */
 
  using namespace Rcpp;
 
  IntegerMatrix treeEdges(tE);
 
  // get the number of edges
  int numEdges = treeEdges.nrow();
  int numIntNodes = as<int>(nIN);
  IntegerVector tipStates(tS);
  int numStates = as<int>(nS);
  NumericVector vecProbMat(pM);
  arma::cube cubeProbMat(vecProbMat.begin(), numStates, numStates, numEdges, false);
 
  // get the number of tips in the tree
  int numTips = tipStates.size();
 
  // prepare a matrix for storing regular (backward) partial likelihoods
  arma::mat partialLike = arma::zeros<arma::mat>(numTips + numIntNodes, numStates);
 
  for (int i=0; i < numTips; i++){
    if (tipStates[i] == -1){// -1 denotes a missing value
      partialLike.row(i) = arma::ones<arma::rowvec>(numStates);
    }else{
      partialLike(i, tipStates[i]) = 1.0;
    }
  }
 
  // compute regular partial likelihoods for all internal nodes
  for (int i=0; i < numEdges; i+=2){      
    // parent=treeEdges(i,0) or treeEdges(i+1,0); treeEdges indices should be shifted by one
    partialLike.row(treeEdges(i,0)-1) = (partialLike.row(treeEdges(i,1)-1)*cubeProbMat.slice(i).t())%(partialLike.row(treeEdges(i+1,1)-1)*cubeProbMat.slice(i+1).t());            
  }
 
  return wrap(partialLike);
'  
 
 
partLike = cxxfunction(signature(tE="integer", nIN="integer", tS="integer", nS="integer", pM="numeric"),body=src, plugin="RcppArmadillo")

Created by Pretty R at inside-R.org

I am going to test the code with a simple two state model, for which transition probabilities are available in closed form:

two.state.trans.prob = function(forward.rate, backward.rate, elapsed.time){
  total.rate = forward.rate + backward.rate               
 
  return((matrix(c(rep(backward.rate,2),rep(forward.rate,2)),2,2) +
    matrix(c(forward.rate, -backward.rate, -forward.rate, backward.rate),2,2)*
    exp(-total.rate*elapsed.time))/total.rate)    
}

Created by Pretty R at inside-R.org

R wrapper function that takes care of the prerequisites:

two.state.part.like = function(my.tree, my.data, forward.rate, backward.rate){
 
  ## reorder the edges in the "pruningwise" order
  my.tree = reorder(my.tree, order = "pr")
 
  if (!("phylo" %in% class(my.tree)))
    stop("Error: object \"my.tree\" is not of class \"phylo\"")
 
  if (is.null(my.tree$edge.length))
    stop("Error: tree \" my.tree\" must have branch lengths.")
 
  ## reorder data on tips to match the order of the my.tree phylo object
  if (!is.null(names(my.data))) {
    if(!any(is.na(match(names(my.data), my.tree$tip.label)))){
      my.data = my.data[my.tree$tip.label]
    }else{
      warning('the names of argument "my.data" and the names of the tip labels
did not match: the former were ignored in the analysis.')
    }
  }
 
  ## prepare transition probability matrices (this of course can and should be done in C++ as well)
  prob.array = array(0, dim=c(2,2,length(my.tree$edge.length)))            
  for (i in 1:length(my.tree$edge.length)){
    prob.array[,,i] = two.state.trans.prob(forward.rate, backward.rate, my.tree$edge.length[i])          
  }            
 
  return(partLike(my.tree$edge, my.tree$Nnode, my.data, 2, prob.array))
}

Created by Pretty R at inside-R.org

Testing:

First, let's simulate a tree with tip states. This can be done by several packages in R. I am going to use diversitree package:

require(diversitree)
 
set.seed(34344)
test.tree = tree.bisse(c(0.1, 0.1, 0.03, 0.03, 0.01, 0.07), x0=0, max.taxa = 50)

Created by Pretty R at inside-R.org

This produces the following tree:

Now, my function returns partial likelihoods for all nodes in the tree. I am going to extract only partial likelihoods corresponding to the root (recall that a weighted sum of these numbers gives the actual probability of observing the data). I am going to do the same using diversitree's likelhood calculator:

> test.like = two.state.part.like(test.tree, test.tree$tip.state, 0.02, 0.05)
> test.like[51,]
[1] 1.564918e-08 1.195940e-08
> 
> mk2.lik = make.mk2(test.tree, test.tree$tip.state)
> exp(mk2.lik(c(0.02,0.05),root=ROOT.BOTH))
[1] 1.564918e-08 1.195940e-08

Created by Pretty R at inside-R.org

The above agreement is encouraging.

Saturday, December 24, 2011

R array to RcppArmadillo cube (updated)

A quick post here on how to convert a 3D array from R to the RcppArmadillo cube.

Rcpp does not support arrays yet (only vectors and matrices). However, RcppArmadillo has a proper 3D array class: arma::cube. The conversion can be done via Rcpp::NumericVector. The only aspect of this exercise that surprised me was the fact that Rcpp::NumericVector stores dimensions of an array, but does not allow one to extract this info. Hence, if you don't know in advance the dimensions of your array, the only solution I see is to pass the vector of dimensions to Rcpp together with the array. Not too ugly, but a bit awkward. Thanks to the commenter Eli below, who pointed out that dimensions of the array can be extracted via the .attr() method -- much prettier now. My code to illustrate this concept:

require(inline)
require(RcppArmadillo)
 
src <- '
  using namespace Rcpp;
 
  NumericVector vecArray(myArray);
  IntegerVector arrayDims = vecArray.attr("dim");

 
  arma::cube cubeArray(vecArray.begin(), arrayDims[0], arrayDims[1], arrayDims[2], false);
 
  //change one element in the array/cube
  cubeArray(0,0,0) = 518;  
 
  return(wrap(cubeArray));  
'
 
readCube = cxxfunction(signature(myArray="numeric"),body=src, plugin="RcppArmadillo")

Created by Pretty R at inside-R.org

A numerical test:

> set.seed(345)
> testArray = array(rnorm(18), dim=c(3,3,2))
> print(testArray)
, , 1
 
           [,1]        [,2]      [,3]
[1,] -0.7849082 -0.29059656 -0.927724
[2,] -0.2795144 -0.06753159  1.710771
[3,] -0.1614579 -0.63352041  1.654769
 
, , 2
 
          [,1]       [,2]       [,3]
[1,]  1.810483 -0.8496292 -1.4029422
[2,]  1.866772  0.3184496  0.5682982
[3,] -1.399833  0.9035913  1.0457561
 
> readCube(testArray)
, , 1
 
            [,1]        [,2]      [,3]
[1,] 518.0000000 -0.29059656 -0.927724
[2,]  -0.2795144 -0.06753159  1.710771
[3,]  -0.1614579 -0.63352041  1.654769
 
, , 2
 
          [,1]       [,2]       [,3]
[1,]  1.810483 -0.8496292 -1.4029422
[2,]  1.866772  0.3184496  0.5682982
[3,] -1.399833  0.9035913  1.0457561

Created by Pretty R at inside-R.org

Monday, December 12, 2011

Uniformization-based CTMC sampling using Rcpp+RcppArmadillo

Suppose you have a continuous-time Markov chain on a finite state space and you want to draw realizations of this chain conditional on starting and ending states of the chain during some finite time interval. For a good review of available algorithms to accomplish this task see (Hobolth and Stone, 2009). Notice that supplementary materials to the the above paper contain R code, written by Asger Hobolth, that implements all algorithms described in the main text.

Here, I am going to implement a uniformization method (Fearnhead and Sherlock, 2006; Lartillot, 2006; Mateiu and Rannala, 2006) using Rcpp and RcppArmadillo. In essence, I just Rcpp-ed Asger's code. I want to point out that my motivation is not necessarily to speed up Asger's R implementation. Instead, I wanted to have a standalone C++ code that I can use in my packages internally. For portability, I am going to use the inline package, so that one can copy and paste my code into an R console.

I am fairly new to Rcpp and especially RcppArmadillo, so here are a couple of lessons that I've learned during this coding exercise:

Rcpp sugar doesn't have "sample()" implemented, so I had to write my own. Hence, the awkward "sampleOnce" function below.
The only way to convert between "colvec" and "rovec" in Armadillo is to use a transpose. Thanks to Jane Lange for pointing this possibility.
If you want to grow a vector, using Rcpp NumericVector class is not computationally efficient. According to Rcpp developers it is better to create a std::vector instance, use .push_back() to grow it and then create NumericVector from std:vector.

So here is the C++ code to use with inline:

inc <-
'int sampleOnce(arma::colvec weights, double rUnif) {
 
  double total = sum(weights);
  double cumProb=0;
 
  int i;
 
  for (i = 0; i < weights.n_elem; i++) {
    cumProb += weights(i) / total;
    if (rUnif < cumProb) break;
  }
  return(i);
}
'
 
src <- '
  using namespace Rcpp;
 
  NumericMatrix rM(rm);
  int numStates = rM.nrow();
  arma::mat rateMatrix(rM.begin(), numStates, numStates, false); // reuses memory and avoids extra copy
 
  int startState = as<int>(bS);
  int endState = as<int>(eS);
  double elapsedTime = as<double>(eT);
  double transProb = as<double>(tP);  
 
  RNGScope scope;
 
  // Set the rate of the dominating Poisson process
 
  double poissonRate = -1.0*arma::min(rateMatrix.diag());
  arma::mat dominTransProb(numStates,numStates);
  dominTransProb.eye();
  arma::cube powerTransProb = arma::cube(numStates, numStates,1);
  powerTransProb.slice(0) = dominTransProb;
  dominTransProb = dominTransProb + rateMatrix/poissonRate;
 
 
  // simulate the number of jumps, including fictitious ones
 
  double rU = as<double>(runif(1));
  double cum = 0;  
 
  if (startState==endState){
    cum = ::Rf_dpois(0,poissonRate*elapsedTime,0)/transProb;
  }
 
  bool notExceed = true;
 
  if (cum > rU){ 
    notExceed = false; 
  }
 
  int numJumps = 0;
  double nextProb;  
 
  while(notExceed){
 
    numJumps++;
    powerTransProb.insert_slices(numJumps,1,false);
    powerTransProb.slice(numJumps) = powerTransProb.slice(numJumps-1)*dominTransProb;
    nextProb = ::Rf_dpois(numJumps,poissonRate*elapsedTime,0)*powerTransProb(startState,endState,numJumps)/transProb;
    cum += nextProb;
 
    if (cum > rU){
      notExceed = false;
    }
  }
 
  List simPath = List::create(Named("states") = NULL,
       Named("times") = NULL);
 
  // if numJumps = 0: done
  if (numJumps == 0 || ((numJumps == 1) && (startState == endState))){
 
    simPath["states"] = NumericVector::create(startState, endState);
    simPath["times"] = NumericVector::create(0, elapsedTime);
 
  }else{ // if one true jump: done
    if ((numJumps == 1) && (startState != endState)){
 
      simPath["states"] = NumericVector::create(startState,endState,endState);
      simPath["times"] = NumericVector::create(0,elapsedTime*(as<double>(runif(1))),elapsedTime);
 
    }else{ // Case (nJmp >= 2) 
 
      // Simulate jumping times
      NumericVector dominJumpTimes = elapsedTime*runif(numJumps);
      std::sort(dominJumpTimes.begin(),dominJumpTimes.end());
      NumericVector dominStates(numJumps+1);
      dominStates[0] = startState;
      dominStates[numJumps] = endState;
 
      // Simulate states of the dominating Markov chain
      for (int i = 1; i < numJumps; i++){
        dominStates[i] = sampleOnce(powerTransProb.slice(1).row(dominStates[i-1]).t()%powerTransProb.slice(numJumps-i).col(endState), as<double>(runif(1)));
      }
 
      // Remove virtual substitutions
      std::vector<int> trueStates(1);
      std::vector<double> trueTimes(1);
      trueStates[0] = startState;
      trueTimes[0] = 0.0;
 
      for (int i = 1; i < dominStates.size(); i++){
        if (dominStates[i-1] != dominStates[i]){
          trueStates.push_back(dominStates[i]);
          trueTimes.push_back(dominJumpTimes[i-1]);
        }
      }
 
      trueStates.push_back(endState);
      trueTimes.push_back(elapsedTime);
 
      simPath["states"] = NumericVector(trueStates.begin(), trueStates.end());
      simPath["times"] = NumericVector(trueTimes.begin(), trueTimes.end());
    }
  }
 
 return simPath;
'

Created by Pretty R at inside-R.org

Now, we compile it with inline:

require(inline)
require(RcppArmadillo)
 
 
unifSample = cxxfunction(signature(rm="numeric", bS="numeric", eS="numeric", eT="numeric", tP="numeric"), includes=inc,body=src, plugin="RcppArmadillo")

Created by Pretty R at inside-R.org

The arguments of unifSample are the CTMC rate matrix (rm), beginning state (bS), ending state (es), time interval length (eT), and transition probability corresponding to the beginning and ending state (tP). The last argument is not strictly required, but I am operating under the assumption that by the time I will want to call this function, I will have transition probabilities pre-computed.

I am going to use markovjumps package to create a simple nucleotide substitution model and simulate from it:

require(markovjumps)
 
my.start = 1
my.end = 2
elapsed.time = 10.0
 
 
my.stat = c(0.1,0.3,0.25,0.35)
 
## define a K80 markov chain (HKY reduces to K80 when stationary distribution is uniform)
my.model = hky.mc(c(0.1,0.3),my.stat,F)
 
## compute eigen-decomposition of the generator
my.model.eigen = as.eigen(my.model)
 
## get transition probabilities
my.prob = matexp(my.model.eigen,elapsed.time)
 
## simulate CTMC path
unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])

Created by Pretty R at inside-R.org

The output looks something like this:

> unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
$states
[1] 0 3 1 1
 
$times
[1]  0.000000  4.430341  9.295083 10.000000

Created by Pretty R at inside-R.org

Notice that in R my state space is {1,2,3,4}, while in C++ it is {0,1,2,3}. To see more transitions, we can increase the time interval:

> elapsed.time = 100.0
> my.prob = matexp(my.model.eigen,elapsed.time)
> unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
$states
 [1] 0 3 1 3 0 3 1 3 0 1 3 1 2 3 1 1
 
$times
 [1]   0.000000   4.299258   5.135858   7.616836  14.274938  19.912332  27.649715  35.538612
 [9]  52.687339  53.404439  83.587054  86.185838  88.686555  91.954041  98.642818 100.000000

Created by Pretty R at inside-R.org

To test the simulator, I am going to use analytic formulae for CTMC path statistics, implemented in markovjumps. Namely, I am going to look at the total number of transitions and for total time spent in state 1 (2 in the R state space):

> sim.size = 10000
> num.jumps = array(dim=c(4,4,sim.size))
> dwell.in.one = array(dim=c(4,4,sim.size))
> 
> set.seed(1111)
> for (i in 1:sim.size){
+   for (my.start in 1:4){
+     for (my.end in 1:4){
+       temp1 = unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
+   
+       num.jumps[my.start,my.end,i] = length(temp1$states)-2
+       dwell.in.one[my.start,my.end,i] = sum(diff(temp1$times)[temp1$states[1:(length(temp1$states)-1)]==1])
+     }
+   }
+ }
> 
> apply(num.jumps,MARGIN=c(1,2),FUN=mean)
        [,1]    [,2]    [,3]    [,4]
[1,] 17.0541 16.9925 16.8286 16.8386
[2,] 16.9071 16.8813 16.7615 16.7366
[3,] 16.9179 16.7869 16.6631 16.6248
[4,] 16.8299 16.7299 16.5572 16.5783
> cond.mean.markov.jumps(my.model.eigen, matrix(1,4,4)-diag(rep(1,4)),elapsed.time)
         [,1]     [,2]     [,3]     [,4]
[1,] 17.06970 16.97879 16.88114 16.82559
[2,] 16.97879 16.88788 16.79024 16.73468
[3,] 16.88114 16.79024 16.69259 16.63704
[4,] 16.82559 16.73468 16.63704 16.58148
> 
> apply(dwell.in.one, MARGIN=c(1,2),FUN=mean)
         [,1]     [,2]     [,3]     [,4]
[1,] 26.20608 30.72730 27.08764 27.11346
[2,] 30.64288 35.35969 31.76749 31.54262
[3,] 27.00934 31.44582 28.03639 27.87922
[4,] 27.16154 31.64142 28.04596 28.07696
> cond.mean.markov.rewards(my.model.eigen, c(0,1,0,0), elapsed.time)
         [,1]     [,2]     [,3]     [,4]
[1,] 26.18182 30.72727 27.09091 27.09091
[2,] 30.72727 35.27273 31.63636 31.63636
[3,] 27.09091 31.63636 28.00000 28.00000
[4,] 27.09091 31.63636 28.00000 28.00000
>

Created by Pretty R at inside-R.org

Seems to work. I noticed that I can break the program by jacking up the time interval length. I think the problem is in that Poisson density returns non-numbers in some cases and confuses Rcpp. I am not sure if this is worth fixing though, because in practice such parameter configuration shouldn't occur...