Here, I am going to implement a uniformization method (Fearnhead and Sherlock, 2006; Lartillot, 2006; Mateiu and Rannala, 2006) using Rcpp and RcppArmadillo. In essence, I just Rcpp-ed Asger's code. I want to point out that my motivation is not necessarily to speed up Asger's R implementation. Instead, I wanted to have a standalone C++ code that I can use in my packages internally. For portability, I am going to use the inline package, so that one can copy and paste my code into an R console.
I am fairly new to Rcpp and especially RcppArmadillo, so here are a couple of lessons that I've learned during this coding exercise:
- Rcpp sugar doesn't have "sample()" implemented, so I had to write my own. Hence, the awkward "sampleOnce" function below.
- The only way to convert between "colvec" and "rovec" in Armadillo is to use a transpose. Thanks to Jane Lange for pointing this possibility.
- If you want to grow a vector, using Rcpp NumericVector class is not computationally efficient. According to Rcpp developers it is better to create a std::vector instance, use .push_back() to grow it and then create NumericVector from std:vector.
So here is the C++ code to use with inline:
inc <- 'int sampleOnce(arma::colvec weights, double rUnif) { double total = sum(weights); double cumProb=0; int i; for (i = 0; i < weights.n_elem; i++) { cumProb += weights(i) / total; if (rUnif < cumProb) break; } return(i); } ' src <- ' using namespace Rcpp; NumericMatrix rM(rm); int numStates = rM.nrow(); arma::mat rateMatrix(rM.begin(), numStates, numStates, false); // reuses memory and avoids extra copy int startState = as<int>(bS); int endState = as<int>(eS); double elapsedTime = as<double>(eT); double transProb = as<double>(tP); RNGScope scope; // Set the rate of the dominating Poisson process double poissonRate = -1.0*arma::min(rateMatrix.diag()); arma::mat dominTransProb(numStates,numStates); dominTransProb.eye(); arma::cube powerTransProb = arma::cube(numStates, numStates,1); powerTransProb.slice(0) = dominTransProb; dominTransProb = dominTransProb + rateMatrix/poissonRate; // simulate the number of jumps, including fictitious ones double rU = as<double>(runif(1)); double cum = 0; if (startState==endState){ cum = ::Rf_dpois(0,poissonRate*elapsedTime,0)/transProb; } bool notExceed = true; if (cum > rU){ notExceed = false; } int numJumps = 0; double nextProb; while(notExceed){ numJumps++; powerTransProb.insert_slices(numJumps,1,false); powerTransProb.slice(numJumps) = powerTransProb.slice(numJumps-1)*dominTransProb; nextProb = ::Rf_dpois(numJumps,poissonRate*elapsedTime,0)*powerTransProb(startState,endState,numJumps)/transProb; cum += nextProb; if (cum > rU){ notExceed = false; } } List simPath = List::create(Named("states") = NULL, Named("times") = NULL); // if numJumps = 0: done if (numJumps == 0 || ((numJumps == 1) && (startState == endState))){ simPath["states"] = NumericVector::create(startState, endState); simPath["times"] = NumericVector::create(0, elapsedTime); }else{ // if one true jump: done if ((numJumps == 1) && (startState != endState)){ simPath["states"] = NumericVector::create(startState,endState,endState); simPath["times"] = NumericVector::create(0,elapsedTime*(as<double>(runif(1))),elapsedTime); }else{ // Case (nJmp >= 2) // Simulate jumping times NumericVector dominJumpTimes = elapsedTime*runif(numJumps); std::sort(dominJumpTimes.begin(),dominJumpTimes.end()); NumericVector dominStates(numJumps+1); dominStates[0] = startState; dominStates[numJumps] = endState; // Simulate states of the dominating Markov chain for (int i = 1; i < numJumps; i++){ dominStates[i] = sampleOnce(powerTransProb.slice(1).row(dominStates[i-1]).t()%powerTransProb.slice(numJumps-i).col(endState), as<double>(runif(1))); } // Remove virtual substitutions std::vector<int> trueStates(1); std::vector<double> trueTimes(1); trueStates[0] = startState; trueTimes[0] = 0.0; for (int i = 1; i < dominStates.size(); i++){ if (dominStates[i-1] != dominStates[i]){ trueStates.push_back(dominStates[i]); trueTimes.push_back(dominJumpTimes[i-1]); } } trueStates.push_back(endState); trueTimes.push_back(elapsedTime); simPath["states"] = NumericVector(trueStates.begin(), trueStates.end()); simPath["times"] = NumericVector(trueTimes.begin(), trueTimes.end()); } } return simPath; '
Now, we compile it with inline:
Created by Pretty R at inside-R.org
The arguments of unifSample are the CTMC rate matrix (rm), beginning state (bS), ending state (es), time interval length (eT), and transition probability corresponding to the beginning and ending state (tP). The last argument is not strictly required, but I am operating under the assumption that by the time I will want to call this function, I will have transition probabilities pre-computed.
I am going to use markovjumps package to create a simple nucleotide substitution model and simulate from it:
Created by Pretty R at inside-R.org
The output looks something like this:
Created by Pretty R at inside-R.org
Notice that in R my state space is {1,2,3,4}, while in C++ it is {0,1,2,3}. To see more transitions, we can increase the time interval:
Created by Pretty R at inside-R.org
To test the simulator, I am going to use analytic formulae for CTMC path statistics, implemented in markovjumps. Namely, I am going to look at the total number of transitions and for total time spent in state 1 (2 in the R state space):
Created by Pretty R at inside-R.org
Seems to work. I noticed that I can break the program by jacking up the time interval length. I think the problem is in that Poisson density returns non-numbers in some cases and confuses Rcpp. I am not sure if this is worth fixing though, because in practice such parameter configuration shouldn't occur...
The arguments of unifSample are the CTMC rate matrix (rm), beginning state (bS), ending state (es), time interval length (eT), and transition probability corresponding to the beginning and ending state (tP). The last argument is not strictly required, but I am operating under the assumption that by the time I will want to call this function, I will have transition probabilities pre-computed.
I am going to use markovjumps package to create a simple nucleotide substitution model and simulate from it:
require(markovjumps) my.start = 1 my.end = 2 elapsed.time = 10.0 my.stat = c(0.1,0.3,0.25,0.35) ## define a K80 markov chain (HKY reduces to K80 when stationary distribution is uniform) my.model = hky.mc(c(0.1,0.3),my.stat,F) ## compute eigen-decomposition of the generator my.model.eigen = as.eigen(my.model) ## get transition probabilities my.prob = matexp(my.model.eigen,elapsed.time) ## simulate CTMC path unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end])
The output looks something like this:
> unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end]) $states [1] 0 3 1 1 $times [1] 0.000000 4.430341 9.295083 10.000000
Notice that in R my state space is {1,2,3,4}, while in C++ it is {0,1,2,3}. To see more transitions, we can increase the time interval:
> elapsed.time = 100.0 > my.prob = matexp(my.model.eigen,elapsed.time) > unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end]) $states [1] 0 3 1 3 0 3 1 3 0 1 3 1 2 3 1 1 $times [1] 0.000000 4.299258 5.135858 7.616836 14.274938 19.912332 27.649715 35.538612 [9] 52.687339 53.404439 83.587054 86.185838 88.686555 91.954041 98.642818 100.000000
To test the simulator, I am going to use analytic formulae for CTMC path statistics, implemented in markovjumps. Namely, I am going to look at the total number of transitions and for total time spent in state 1 (2 in the R state space):
> sim.size = 10000 > num.jumps = array(dim=c(4,4,sim.size)) > dwell.in.one = array(dim=c(4,4,sim.size)) > > set.seed(1111) > for (i in 1:sim.size){ + for (my.start in 1:4){ + for (my.end in 1:4){ + temp1 = unifSample(my.model$rate.matrix, my.start-1, my.end-1, elapsed.time, my.prob[my.start,my.end]) + + num.jumps[my.start,my.end,i] = length(temp1$states)-2 + dwell.in.one[my.start,my.end,i] = sum(diff(temp1$times)[temp1$states[1:(length(temp1$states)-1)]==1]) + } + } + } > > apply(num.jumps,MARGIN=c(1,2),FUN=mean) [,1] [,2] [,3] [,4] [1,] 17.0541 16.9925 16.8286 16.8386 [2,] 16.9071 16.8813 16.7615 16.7366 [3,] 16.9179 16.7869 16.6631 16.6248 [4,] 16.8299 16.7299 16.5572 16.5783 > cond.mean.markov.jumps(my.model.eigen, matrix(1,4,4)-diag(rep(1,4)),elapsed.time) [,1] [,2] [,3] [,4] [1,] 17.06970 16.97879 16.88114 16.82559 [2,] 16.97879 16.88788 16.79024 16.73468 [3,] 16.88114 16.79024 16.69259 16.63704 [4,] 16.82559 16.73468 16.63704 16.58148 > > apply(dwell.in.one, MARGIN=c(1,2),FUN=mean) [,1] [,2] [,3] [,4] [1,] 26.20608 30.72730 27.08764 27.11346 [2,] 30.64288 35.35969 31.76749 31.54262 [3,] 27.00934 31.44582 28.03639 27.87922 [4,] 27.16154 31.64142 28.04596 28.07696 > cond.mean.markov.rewards(my.model.eigen, c(0,1,0,0), elapsed.time) [,1] [,2] [,3] [,4] [1,] 26.18182 30.72727 27.09091 27.09091 [2,] 30.72727 35.27273 31.63636 31.63636 [3,] 27.09091 31.63636 28.00000 28.00000 [4,] 27.09091 31.63636 28.00000 28.00000 >
Seems to work. I noticed that I can break the program by jacking up the time interval length. I think the problem is in that Poisson density returns non-numbers in some cases and confuses Rcpp. I am not sure if this is worth fixing though, because in practice such parameter configuration shouldn't occur...
No comments:
Post a Comment