Monday, December 12, 2011

Uniformization-based CTMC sampling using Rcpp+RcppArmadillo

Suppose you have a continuous-time Markov chain on a finite state space and you want to draw realizations of this chain conditional on starting and ending states of the chain during some finite time interval. For a good review of available algorithms to accomplish this task see (Hobolth and Stone, 2009). Notice that supplementary materials to the the above paper contain R code, written by Asger Hobolth, that implements all algorithms described in the main text.

Here, I am going to implement a uniformization method (Fearnhead and Sherlock, 2006Lartillot, 2006; Mateiu and Rannala, 2006) using Rcpp and RcppArmadillo. In essence, I just Rcpp-ed Asger's code. I want to point out that my motivation is not necessarily to speed up Asger's R implementation. Instead, I wanted to have a standalone C++ code that I can use in my packages internally. For portability, I am going to use the inline package, so that one can copy and paste my code into an R console. 

I am fairly new to Rcpp and especially RcppArmadillo, so here are a couple of lessons that I've learned during this coding exercise:


  1. Rcpp sugar doesn't have "sample()" implemented, so I had to write my own. Hence, the awkward "sampleOnce" function below.
  2. The only way to convert between "colvec" and "rovec" in Armadillo is to use a transpose. Thanks to Jane Lange for pointing this possibility.
  3. If you want to grow a vector, using Rcpp NumericVector class is not computationally efficient. According to Rcpp developers it is better to create a std::vector instance, use .push_back() to grow it and then create NumericVector from std:vector.
So here is the C++ code to use with inline:
inc <-
'int sampleOnce(arma::colvec weights, double rUnif) {
 
  double total = sum(weights);
  double cumProb=0;
 
  int i;
 
  for (i = 0; i < weights.n_elem; i++) {
    cumProb += weights(i) / total;
    if (rUnif < cumProb) break;
  }
  return(i);
}
'
 
src <- '
  using namespace Rcpp;
 
  NumericMatrix rM(rm);
  int numStates = rM.nrow();
  arma::mat rateMatrix(rM.begin(), numStates, numStates, false); // reuses memory and avoids extra copy
 
  int startState = as<int>(bS);
  int endState = as<int>(eS);
  double elapsedTime = as<double>(eT);
  double transProb = as<double>(tP);  
 
  RNGScope scope;
 
  // Set the rate of the dominating Poisson process
 
  double poissonRate = -1.0*arma::min(rateMatrix.diag());
  arma::mat dominTransProb(numStates,numStates);
  dominTransProb.eye();
  arma::cube powerTransProb = arma::cube(numStates, numStates,1);
  powerTransProb.slice(0) = dominTransProb;
  dominTransProb = dominTransProb + rateMatrix/poissonRate;
 
 
  // simulate the number of jumps, including fictitious ones
 
  double rU = as<double>(runif(1));
  double cum = 0;  
 
  if (startState==endState){
    cum = ::Rf_dpois(0,poissonRate*elapsedTime,0)/transProb;
  }
 
  bool notExceed = true;
 
  if (cum > rU){ 
    notExceed = false; 
  }
 
  int numJumps = 0;
  double nextProb;  
 
  while(notExceed){
 
    numJumps++;
    powerTransProb.insert_slices(numJumps,1,false);
    powerTransProb.slice(numJumps) = powerTransProb.slice(numJumps-1)*dominTransProb;
    nextProb = ::Rf_dpois(numJumps,poissonRate*elapsedTime,0)*powerTransProb(startState,endState,numJumps)/transProb;
    cum += nextProb;
 
    if (cum > rU){
      notExceed = false;
    }
  }
 
  List simPath = List::create(Named("states") = NULL,
       Named("times") = NULL);
 
  // if numJumps = 0: done
  if (numJumps == 0 || ((numJumps == 1) && (startState == endState))){
 
    simPath["states"] = NumericVector::create(startState, endState);
    simPath["times"] = NumericVector::create(0, elapsedTime);
 
  }else{ // if one true jump: done
    if ((numJumps == 1) && (startState != endState)){
 
      simPath["states"] = NumericVector::create(startState,endState,endState);
      simPath["times"] = NumericVector::create(0,elapsedTime*(as<double>(runif(1))),elapsedTime);
 
    }else{ // Case (nJmp >= 2) 
 
      // Simulate jumping times
      NumericVector dominJumpTimes = elapsedTime*runif(numJumps);
      std::sort(dominJumpTimes.begin(),dominJumpTimes.end());
      NumericVector dominStates(numJumps+1);
      dominStates[0] = startState;
      dominStates[numJumps] = endState;
 
      // Simulate states of the dominating Markov chain
      for (int i = 1; i < numJumps; i++){
        dominStates[i] = sampleOnce(powerTransProb.slice(1).row(dominStates[i-1]).t()%powerTransProb.slice(numJumps-i).col(endState), as<double>(runif(1)));
      }
 
      // Remove virtual substitutions
      std::vector<int> trueStates(1);
      std::vector<double> trueTimes(1);
      trueStates[0] = startState;
      trueTimes[0] = 0.0;
 
      for (int i = 1; i < dominStates.size(); i++){
        if (dominStates[i-1] != dominStates[i]){
          trueStates.push_back(dominStates[i]);
          trueTimes.push_back(dominJumpTimes[i-1]);
        }
      }
 
      trueStates.push_back(endState);
      trueTimes.push_back(elapsedTime);
 
      simPath["states"] = NumericVector(trueStates.begin(), trueStates.end());
      simPath["times"] = NumericVector(trueTimes.begin(), trueTimes.end());
    }
  }
 
 return simPath;
'
Created by Pretty R at inside-R.org

Now, we compile it with inline:
require(inline)
require(RcppArmadillo)
 
 
unifSample = cxxfunction(signature(rm="numeric", bS="numeric", eS="numeric", eT="numeric", tP="numeric"), includes=inc,body=src, plugin="RcppArmadillo")
Created by Pretty R at inside-R.org

The arguments of unifSample are the CTMC rate matrix (rm), beginning state (bS), ending state (es), time interval length (eT), and transition probability corresponding to the beginning and ending state (tP). The last argument is not strictly required, but I am operating under the assumption that by the time I will want to call this function, I will have transition probabilities pre-computed.

I am going to use markovjumps package to create a simple nucleotide substitution model and simulate from it:

require(markovjumps)
 
my.start = 1
my.end = 2
elapsed.time = 10.0
 
 
my.stat = c(0.1,0.3,0.25,0.35)
 
## define a K80 markov chain (HKY reduces to K80 when stationary distribution is uniform)
my.model = hky.mc(c(0.1,0.3),my.stat,F)
 
## compute eigen-decomposition of the generator
my.model.eigen = as.eigen(my.model)
 
## get transition probabilities
my.prob = matexp(my.model.eigen,elapsed.time)
 
## simulate CTMC path
unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
Created by Pretty R at inside-R.org

The output looks something like this:

> unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
$states
[1] 0 3 1 1
 
$times
[1]  0.000000  4.430341  9.295083 10.000000
Created by Pretty R at inside-R.org

Notice that in R my state space is {1,2,3,4}, while in C++ it is {0,1,2,3}. To see more transitions, we can increase the time interval:

> elapsed.time = 100.0
> my.prob = matexp(my.model.eigen,elapsed.time)
> unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
$states
 [1] 0 3 1 3 0 3 1 3 0 1 3 1 2 3 1 1
 
$times
 [1]   0.000000   4.299258   5.135858   7.616836  14.274938  19.912332  27.649715  35.538612
 [9]  52.687339  53.404439  83.587054  86.185838  88.686555  91.954041  98.642818 100.000000
Created by Pretty R at inside-R.org

To test the simulator, I am going to use analytic formulae for CTMC path statistics, implemented in markovjumps. Namely, I am going to look at the total number of transitions and for total time spent in state 1 (2 in the R state space):
> sim.size = 10000
> num.jumps = array(dim=c(4,4,sim.size))
> dwell.in.one = array(dim=c(4,4,sim.size))
> 
> set.seed(1111)
> for (i in 1:sim.size){
+   for (my.start in 1:4){
+     for (my.end in 1:4){
+       temp1 = unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
+   
+       num.jumps[my.start,my.end,i] = length(temp1$states)-2
+       dwell.in.one[my.start,my.end,i] = sum(diff(temp1$times)[temp1$states[1:(length(temp1$states)-1)]==1])
+     }
+   }
+ }
> 
> apply(num.jumps,MARGIN=c(1,2),FUN=mean)
        [,1]    [,2]    [,3]    [,4]
[1,] 17.0541 16.9925 16.8286 16.8386
[2,] 16.9071 16.8813 16.7615 16.7366
[3,] 16.9179 16.7869 16.6631 16.6248
[4,] 16.8299 16.7299 16.5572 16.5783
> cond.mean.markov.jumps(my.model.eigen, matrix(1,4,4)-diag(rep(1,4)),elapsed.time)
         [,1]     [,2]     [,3]     [,4]
[1,] 17.06970 16.97879 16.88114 16.82559
[2,] 16.97879 16.88788 16.79024 16.73468
[3,] 16.88114 16.79024 16.69259 16.63704
[4,] 16.82559 16.73468 16.63704 16.58148
> 
> apply(dwell.in.one, MARGIN=c(1,2),FUN=mean)
         [,1]     [,2]     [,3]     [,4]
[1,] 26.20608 30.72730 27.08764 27.11346
[2,] 30.64288 35.35969 31.76749 31.54262
[3,] 27.00934 31.44582 28.03639 27.87922
[4,] 27.16154 31.64142 28.04596 28.07696
> cond.mean.markov.rewards(my.model.eigen, c(0,1,0,0), elapsed.time)
         [,1]     [,2]     [,3]     [,4]
[1,] 26.18182 30.72727 27.09091 27.09091
[2,] 30.72727 35.27273 31.63636 31.63636
[3,] 27.09091 31.63636 28.00000 28.00000
[4,] 27.09091 31.63636 28.00000 28.00000
> 
Created by Pretty R at inside-R.org

Seems to work. I noticed that I can break the program by jacking up the time interval length. I think the problem is in that Poisson density returns non-numbers in some cases and confuses Rcpp. I am not sure if this is worth fixing though, because in practice such parameter configuration shouldn't occur...

No comments:

Post a Comment