Suppose you have a continuous-time Markov chain on a finite state space and you want to draw realizations of this chain conditional on starting and ending states of the chain during some finite time interval. For a good review of available algorithms to accomplish this task see (
Hobolth and Stone, 2009). Notice that supplementary materials to the the above paper contain R code, written by Asger Hobolth, that implements all algorithms described in the main text.
Here, I am going to implement a uniformization method (
Fearnhead and Sherlock, 2006;
Lartillot, 2006;
Mateiu and Rannala, 2006) using
Rcpp and
RcppArmadillo. In essence, I just Rcpp-ed Asger's code. I want to point out that my motivation is not necessarily to speed up Asger's R implementation. Instead, I wanted to have a standalone C++ code that I can use in my packages internally. For portability, I am going to use the
inline package, so that one can copy and paste my code into an R console.
I am fairly new to Rcpp and especially RcppArmadillo, so here are a couple of lessons that I've learned during this coding exercise:
- Rcpp sugar doesn't have "sample()" implemented, so I had to write my own. Hence, the awkward "sampleOnce" function below.
- The only way to convert between "colvec" and "rovec" in Armadillo is to use a transpose. Thanks to Jane Lange for pointing this possibility.
- If you want to grow a vector, using Rcpp NumericVector class is not computationally efficient. According to Rcpp developers it is better to create a std::vector instance, use .push_back() to grow it and then create NumericVector from std:vector.
So here is the C++ code to use with inline:
inc <-
'int sampleOnce(arma::colvec weights, double rUnif) {
double total = sum(weights);
double cumProb=0;
int i;
for (i = 0; i < weights.n_elem; i++) {
cumProb += weights(i) / total;
if (rUnif < cumProb) break;
}
return(i);
}
'
src <- '
using namespace Rcpp;
NumericMatrix rM(rm);
int numStates = rM.nrow();
arma::mat rateMatrix(rM.begin(), numStates, numStates, false); // reuses memory and avoids extra copy
int startState = as<int>(bS);
int endState = as<int>(eS);
double elapsedTime = as<double>(eT);
double transProb = as<double>(tP);
RNGScope scope;
// Set the rate of the dominating Poisson process
double poissonRate = -1.0*arma::min(rateMatrix.diag());
arma::mat dominTransProb(numStates,numStates);
dominTransProb.eye();
arma::cube powerTransProb = arma::cube(numStates, numStates,1);
powerTransProb.slice(0) = dominTransProb;
dominTransProb = dominTransProb + rateMatrix/poissonRate;
// simulate the number of jumps, including fictitious ones
double rU = as<double>(runif(1));
double cum = 0;
if (startState==endState){
cum = ::Rf_dpois(0,poissonRate*elapsedTime,0)/transProb;
}
bool notExceed = true;
if (cum > rU){
notExceed = false;
}
int numJumps = 0;
double nextProb;
while(notExceed){
numJumps++;
powerTransProb.insert_slices(numJumps,1,false);
powerTransProb.slice(numJumps) = powerTransProb.slice(numJumps-1)*dominTransProb;
nextProb = ::Rf_dpois(numJumps,poissonRate*elapsedTime,0)*powerTransProb(startState,endState,numJumps)/transProb;
cum += nextProb;
if (cum > rU){
notExceed = false;
}
}
List simPath = List::create(Named("states") = NULL,
Named("times") = NULL);
// if numJumps = 0: done
if (numJumps == 0 || ((numJumps == 1) && (startState == endState))){
simPath["states"] = NumericVector::create(startState, endState);
simPath["times"] = NumericVector::create(0, elapsedTime);
}else{ // if one true jump: done
if ((numJumps == 1) && (startState != endState)){
simPath["states"] = NumericVector::create(startState,endState,endState);
simPath["times"] = NumericVector::create(0,elapsedTime*(as<double>(runif(1))),elapsedTime);
}else{ // Case (nJmp >= 2)
// Simulate jumping times
NumericVector dominJumpTimes = elapsedTime*runif(numJumps);
std::sort(dominJumpTimes.begin(),dominJumpTimes.end());
NumericVector dominStates(numJumps+1);
dominStates[0] = startState;
dominStates[numJumps] = endState;
// Simulate states of the dominating Markov chain
for (int i = 1; i < numJumps; i++){
dominStates[i] = sampleOnce(powerTransProb.slice(1).row(dominStates[i-1]).t()%powerTransProb.slice(numJumps-i).col(endState), as<double>(runif(1)));
}
// Remove virtual substitutions
std::vector<int> trueStates(1);
std::vector<double> trueTimes(1);
trueStates[0] = startState;
trueTimes[0] = 0.0;
for (int i = 1; i < dominStates.size(); i++){
if (dominStates[i-1] != dominStates[i]){
trueStates.push_back(dominStates[i]);
trueTimes.push_back(dominJumpTimes[i-1]);
}
}
trueStates.push_back(endState);
trueTimes.push_back(elapsedTime);
simPath["states"] = NumericVector(trueStates.begin(), trueStates.end());
simPath["times"] = NumericVector(trueTimes.begin(), trueTimes.end());
}
}
return simPath;
'
Created by Pretty R at inside-R.org
Now, we compile it with inline:
Created by Pretty R at inside-R.org
The arguments of unifSample are the CTMC rate matrix (rm), beginning state (bS), ending state (es), time interval length (eT), and transition probability corresponding to the beginning and ending state (tP). The last argument is not strictly required, but I am operating under the assumption that by the time I will want to call this function, I will have transition probabilities pre-computed.
I am going to use
markovjumps package to create a simple nucleotide substitution model and simulate from it:
require(markovjumps)
my.start = 1
my.end = 2
elapsed.time = 10.0
my.stat = c(0.1,0.3,0.25,0.35)
## define a K80 markov chain (HKY reduces to K80 when stationary distribution is uniform)
my.model = hky.mc(c(0.1,0.3),my.stat,F)
## compute eigen-decomposition of the generator
my.model.eigen = as.eigen(my.model)
## get transition probabilities
my.prob = matexp(my.model.eigen,elapsed.time)
## simulate CTMC path
unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
Created by Pretty R at inside-R.org
The output looks something like this:
> unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
$states
[1] 0 3 1 1
$times
[1] 0.000000 4.430341 9.295083 10.000000
Created by Pretty R at inside-R.org
Notice that in R my state space is {1,2,3,4}, while in C++ it is {0,1,2,3}. To see more transitions, we can increase the time interval:
> elapsed.time = 100.0
> my.prob = matexp(my.model.eigen,elapsed.time)
> unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
$states
[1] 0 3 1 3 0 3 1 3 0 1 3 1 2 3 1 1
$times
[1] 0.000000 4.299258 5.135858 7.616836 14.274938 19.912332 27.649715 35.538612
[9] 52.687339 53.404439 83.587054 86.185838 88.686555 91.954041 98.642818 100.000000
Created by Pretty R at inside-R.org
To test the simulator, I am going to use analytic formulae for CTMC path statistics, implemented in
markovjumps. Namely, I am going to look at the total number of transitions and for total time spent in state 1 (2 in the R state space):
> sim.size = 10000
> num.jumps = array(dim=c(4,4,sim.size))
> dwell.in.one = array(dim=c(4,4,sim.size))
>
> set.seed(1111)
> for (i in 1:sim.size){
+ for (my.start in 1:4){
+ for (my.end in 1:4){
+ temp1 = unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
+
+ num.jumps[my.start,my.end,i] = length(temp1$states)-2
+ dwell.in.one[my.start,my.end,i] = sum(diff(temp1$times)[temp1$states[1:(length(temp1$states)-1)]==1])
+ }
+ }
+ }
>
> apply(num.jumps,MARGIN=c(1,2),FUN=mean)
[,1] [,2] [,3] [,4]
[1,] 17.0541 16.9925 16.8286 16.8386
[2,] 16.9071 16.8813 16.7615 16.7366
[3,] 16.9179 16.7869 16.6631 16.6248
[4,] 16.8299 16.7299 16.5572 16.5783
> cond.mean.markov.jumps(my.model.eigen, matrix(1,4,4)-diag(rep(1,4)),elapsed.time)
[,1] [,2] [,3] [,4]
[1,] 17.06970 16.97879 16.88114 16.82559
[2,] 16.97879 16.88788 16.79024 16.73468
[3,] 16.88114 16.79024 16.69259 16.63704
[4,] 16.82559 16.73468 16.63704 16.58148
>
> apply(dwell.in.one, MARGIN=c(1,2),FUN=mean)
[,1] [,2] [,3] [,4]
[1,] 26.20608 30.72730 27.08764 27.11346
[2,] 30.64288 35.35969 31.76749 31.54262
[3,] 27.00934 31.44582 28.03639 27.87922
[4,] 27.16154 31.64142 28.04596 28.07696
> cond.mean.markov.rewards(my.model.eigen, c(0,1,0,0), elapsed.time)
[,1] [,2] [,3] [,4]
[1,] 26.18182 30.72727 27.09091 27.09091
[2,] 30.72727 35.27273 31.63636 31.63636
[3,] 27.09091 31.63636 28.00000 28.00000
[4,] 27.09091 31.63636 28.00000 28.00000
>
Created by Pretty R at inside-R.org
Seems to work. I noticed that I can break the program by jacking up the time interval length. I think the problem is in that Poisson density returns non-numbers in some cases and confuses Rcpp. I am not sure if this is worth fixing though, because in practice such parameter configuration shouldn't occur...