Stochastic Models, Statistics, and Computation: RcppArmadillo


Monday, October 14, 2013

Passing Armadillo matrices by value and by reference

Rcpp and RcppArmadillo make it relatively easy to integrate R code and C++ code. In fact, the resulting C++ code looks so much like R code that it is easy to forget some basic rules by which a C++ coder must play. This is especially true for someone like me, who wrote relatively little C++ code in the past.

Here is an example of a typical calculation that may get one in trouble. Suppose you have a function that takes two matrices from R, converts them into Armadillo matrices, and passes these matrices to other functions to do some repetitive calculations. For example, imagine passing transition and emission probability matrices to execute the forward-backward algorithm on multiple realizations of the same hidden Markov model.

Below is my C++ code, saved in the file "matrix_code.cpp", which has two sets of functions: a) doStuff1(), twoMatrices1(); b) doStuff2(), twoMatrices2(). The functions twoMatrices1() and twoMatrices2() are exported to R via Rcpp attributes. twoMatrices1() calls doStuff1() 100 times, while twoMatrices2() calls doStuff2() 100 times. The only difference between the two sets of functions is that doStuff1() takes its two Armadillo matrices by value, which requires copying these matrices on every call. In contrast, doStuff2() takes the Armadillo matrices by reference, so no copy constructor is called under the hood.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>  // also pulls in Rcpp.h, so no separate include is needed

using namespace Rcpp;

// First, passing Armadillo matrices by value (each call copies both matrices)
void doStuff1(arma::mat c, arma::mat d){
  double x = c(1,1) + d(1,1);
}

// [[Rcpp::export]]
void twoMatrices1(NumericMatrix A, NumericMatrix B) {
  // reuse R's memory instead of copying it (last argument: copy_aux_mem = false)
  arma::mat M1(A.begin(), A.nrow(), A.ncol(), false);
  arma::mat M2(B.begin(), B.nrow(), B.ncol(), false);

  for (int i = 0; i < 100; i++){
    doStuff1(M1, M2);
  }
}

// Now, passing Armadillo matrices by reference (no copies are made)
void doStuff2(const arma::mat& c, const arma::mat& d){
  double x = c(1,1) + d(1,1);
}

// [[Rcpp::export]]
void twoMatrices2(NumericMatrix A, NumericMatrix B) {
  // reuse R's memory instead of copying it (last argument: copy_aux_mem = false)
  arma::mat M1(A.begin(), A.nrow(), A.ncol(), false);
  arma::mat M2(B.begin(), B.nrow(), B.ncol(), false);

  for (int i = 0; i < 100; i++){
    doStuff2(M1, M2);
  }
}

Now I am going to call twoMatrices1() and twoMatrices2() from R, using the rbenchmark package to compare their speed. I am going to use fairly large matrices (100x100), so that the cost of moving them around is significant.

library(Rcpp)  # provides sourceCpp()
library(RcppArmadillo)
library(rbenchmark)
 
sourceCpp("matrix_code.cpp")
 
x = matrix(1,100,100)
y = matrix(1,100,100)
 
res <- benchmark(twoMatrices1(x, y), twoMatrices2(x, y), 
                 columns = c("test", "replications", "elapsed", "relative"), 
                 order="relative", replications=1000)
print(res)
 
                test replications elapsed relative
2 twoMatrices2(x, y)         1000   0.006    1.000
1 twoMatrices1(x, y)         1000   0.829  138.167

Created by Pretty R at inside-R.org

As you can see, passing two 100x100 Armadillo matrices by value is more than 100 times slower than passing these matrices by reference. Perhaps not a very surprising outcome, but an important one to keep in mind when porting R code to C++.
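
The copying itself can be made visible without Armadillo. Below is a minimal sketch in plain C++; CountingMatrix, addByValue(), addByReference(), and countCopies() are hypothetical names invented for this illustration, and CountingMatrix is only a stand-in for arma::mat that counts how often its copy constructor runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for arma::mat: owns its data and counts deep copies.
struct CountingMatrix {
    static int copies;            // number of copy-constructor calls so far
    std::size_t nrow, ncol;
    std::vector<double> data;     // column-major storage

    CountingMatrix(std::size_t r, std::size_t c, double fill)
        : nrow(r), ncol(c), data(r * c, fill) {}

    // Copy constructor: duplicates all nrow * ncol elements.
    CountingMatrix(const CountingMatrix& other)
        : nrow(other.nrow), ncol(other.ncol), data(other.data) { ++copies; }

    double operator()(std::size_t i, std::size_t j) const {
        assert(i < nrow && j < ncol);
        return data[i + j * nrow];
    }
};
int CountingMatrix::copies = 0;

// By value: every call copies both arguments.
double addByValue(CountingMatrix c, CountingMatrix d) {
    return c(1, 1) + d(1, 1);
}

// By const reference: the caller's matrices are used directly.
double addByReference(const CountingMatrix& c, const CountingMatrix& d) {
    return c(1, 1) + d(1, 1);
}

// Call one of the two functions `calls` times on 100x100 matrices
// and report how many copies were made along the way.
int countCopies(bool byValue, int calls) {
    CountingMatrix m1(100, 100, 1.0), m2(100, 100, 1.0);
    CountingMatrix::copies = 0;
    double total = 0.0;
    for (int i = 0; i < calls; ++i) {
        total += byValue ? addByValue(m1, m2) : addByReference(m1, m2);
    }
    assert(total == 2.0 * calls);
    return CountingMatrix::copies;
}
```

In this sketch countCopies(true, 100) returns 200 (two copies of 10,000 doubles per call), while countCopies(false, 100) returns 0, which is the same asymmetry that the benchmark above measures in time.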

Sunday, July 22, 2012

RcppArmadillo and R-forge

Jane Lange and I have recently finished a paper on continuous-time hidden Markov models and published an accompanying R package, cthmm, on R-forge. Jane's package cthmm relies heavily on the Rcpp and RcppArmadillo packages. Somewhat surprisingly, the latter dependency caused compilation problems on R-forge. Specifically, the package compiled only on Mac OS X, but not on Linux or Windows. Since Jane and I use Macs, we didn't have any problems compiling the package locally on our computers.

It turns out that the root of the problem was in using Armadillo's inv() function to invert complex matrices. The Armadillo library outsources some of its linear algebra calculations to LAPACK, which should be no problem since R also relies on LAPACK (who doesn't, right?). But for some reason the r-project.org systems have only an abbreviated version of the LAPACK library, missing some functions for complex matrix calculations. This problem is discussed here.

Following suggestions from the above link, my solution was to download the missing Fortran routines (zgetri.f, ztrti2.f, ztrtri.f) from http://www.netlib.org/lapack/ and add them to the src/ directory of our package. A bit ugly, but it worked, and the package compiled on R-forge on all architectures.

P.S. As of July 22, 2012, our ~~cthmm failed to compile on R-forge again~~, but we believe that this has to do with dependent packages (looks like survival package was not available on Windows). We hope to resolve this problem shortly.

Update on the P.S.: It looks like the problem with the survival package was resolved by the R-forge compilers. Our cthmm compiles now.

Friday, December 30, 2011

Phylogenetic likelihood with Rcpp+RcppArmadillo

In this post, I want to show how to implement Felsenstein's algorithm for calculating phylogenetic likelihood with the help of Rcpp and RcppArmadillo. I am going to use ape's representation of a phylogenetic tree. If you are not familiar with this data structure, look at this document. My implementation is very similar to the likelihood calculator of ape's ace function.

Problem formulation: given a rooted binary phylogenetic tree with branch lengths and a continuous-time Markov chain model of discrete character evolution (e.g. DNA, amino acids, morphological traits), compute the probability of observing certain characters at the tips of the tree. I am going to stop short of actually computing this probability. Instead, I will compute what are called partial likelihoods for each node in the tree. The actual likelihood is a weighted sum of the partial likelihoods at the root of the phylogeny.
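
For reference, the recursion behind these partial likelihoods can be written out as follows. The notation is my own assumption, not taken from ape: L_u(i) is the partial likelihood of state i at node u, v and w are the two children of u with branch lengths t_v and t_w, P(t) is the transition probability matrix over time t, and \pi_i are the root weights (for example, a prior distribution on the root state).

```latex
L_u(i) = \Bigl[\sum_{j} P_{ij}(t_v)\, L_v(j)\Bigr]
         \Bigl[\sum_{k} P_{ik}(t_w)\, L_w(k)\Bigr],
\qquad
\Pr(\text{data}) = \sum_{i} \pi_i\, L_{\text{root}}(i)
```

At a tip, L_u(i) = 1 if state i was observed there (or if the tip state is missing) and 0 otherwise.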

C++ code with inline compilation:

require(inline)
require(RcppArmadillo)
 
src <- '
  /* Arguments: 
        tE: edge matrix of the ape object phylo
        nIN: number of internal nodes (not necessary, but convenient to get it from phylo)
        tS: integer vector of tip states (-1=missing value)
        nS: state space size (e.g. 2 for a binary trait)
        pM: array of probability matrices for each edge of the tree
 
     Two important assumptions:
        1. edges in the edge matrix and probability matrices are in the "pruningwise" order;
           see ?reorder.phylo for more details
        2. tip state vector is ordered according to the tip numbering in the edge matrix
  */
 
  using namespace Rcpp;
 
  IntegerMatrix treeEdges(tE);
 
  // get the number of edges
  int numEdges = treeEdges.nrow();
  int numIntNodes = as<int>(nIN);
  IntegerVector tipStates(tS);
  int numStates = as<int>(nS);
  NumericVector vecProbMat(pM);
  arma::cube cubeProbMat(vecProbMat.begin(), numStates, numStates, numEdges, false);
 
  // get the number of tips in the tree
  int numTips = tipStates.size();
 
  // prepare a matrix for storing regular (backward) partial likelihoods
  arma::mat partialLike = arma::zeros<arma::mat>(numTips + numIntNodes, numStates);
 
  for (int i=0; i < numTips; i++){
    if (tipStates[i] == -1){// -1 denotes a missing value
      partialLike.row(i) = arma::ones<arma::rowvec>(numStates);
    }else{
      partialLike(i, tipStates[i]) = 1.0;
    }
  }
 
  // compute regular partial likelihoods for all internal nodes
  for (int i=0; i < numEdges; i+=2){      
    // parent=treeEdges(i,0) or treeEdges(i+1,0); treeEdges indices should be shifted by one
    partialLike.row(treeEdges(i,0)-1) = (partialLike.row(treeEdges(i,1)-1)*cubeProbMat.slice(i).t())%(partialLike.row(treeEdges(i+1,1)-1)*cubeProbMat.slice(i+1).t());            
  }
 
  return wrap(partialLike);
'  
 
 
partLike = cxxfunction(signature(tE="integer", nIN="integer", tS="integer", nS="integer", pM="numeric"),body=src, plugin="RcppArmadillo")
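
The update inside the loop over pairs of edges is the heart of the algorithm, so it may help to see it stripped of Armadillo. Here is a minimal sketch in plain C++ for a single parent node; the names Vec, Mat, propagate(), and pruneStep() are hypothetical and exist only in this illustration. P[i][j] is assumed to be the probability of moving from parent state i to child state j along the branch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major: P[i][j] = Pr(child state j | parent state i)

// Probability of the data below one child, conditional on each parent state:
// out[i] = sum_j P[i][j] * childLike[j]
Vec propagate(const Mat& P, const Vec& childLike) {
    assert(P.size() == childLike.size());
    Vec out(P.size(), 0.0);
    for (std::size_t i = 0; i < P.size(); ++i) {
        assert(P[i].size() == childLike.size());
        for (std::size_t j = 0; j < childLike.size(); ++j) {
            out[i] += P[i][j] * childLike[j];
        }
    }
    return out;
}

// Partial likelihood at a parent: element-wise product of the two
// propagated child vectors (the role played by % in the Armadillo code).
Vec pruneStep(const Mat& Pleft, const Vec& leftLike,
              const Mat& Pright, const Vec& rightLike) {
    const Vec a = propagate(Pleft, leftLike);
    const Vec b = propagate(Pright, rightLike);
    assert(a.size() == b.size());
    Vec parent(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        parent[i] = a[i] * b[i];
    }
    return parent;
}
```

Multiplying a row vector of child likelihoods by the transposed probability matrix, as the Armadillo code does, produces the same sums as propagate() here.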


I am going to test the code with a simple two state model, for which transition probabilities are available in closed form:

two.state.trans.prob = function(forward.rate, backward.rate, elapsed.time){
  total.rate = forward.rate + backward.rate               
 
  return((matrix(c(rep(backward.rate,2),rep(forward.rate,2)),2,2) +
    matrix(c(forward.rate, -backward.rate, -forward.rate, backward.rate),2,2)*
    exp(-total.rate*elapsed.time))/total.rate)    
}
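
Since these probability matrices can be computed on the C++ side as well, here is a minimal sketch of the same closed form in plain C++. The function name twoStateTransProb() and the type alias Mat2 are hypothetical; state 0 is assumed to leave at the forward rate and state 1 at the backward rate, which is how I read the R function above.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat2 = std::array<std::array<double, 2>, 2>;

// Transition probabilities of a two-state continuous-time Markov chain.
// P[i][j] = Pr(state j at time t | state i at time 0).
Mat2 twoStateTransProb(double forwardRate, double backwardRate, double elapsedTime) {
    const double totalRate = forwardRate + backwardRate;
    assert(totalRate > 0.0 && elapsedTime >= 0.0);
    const double decay = std::exp(-totalRate * elapsedTime);

    Mat2 P;
    P[0][0] = (backwardRate + forwardRate * decay) / totalRate;
    P[0][1] = (forwardRate - forwardRate * decay) / totalRate;
    P[1][0] = (backwardRate - backwardRate * decay) / totalRate;
    P[1][1] = (forwardRate + backwardRate * decay) / totalRate;
    return P;
}
```

Two quick sanity checks apply to any such matrix: each row should sum to one, and at elapsed time zero the matrix should be the identity.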


R wrapper function that takes care of the prerequisites:

two.state.part.like = function(my.tree, my.data, forward.rate, backward.rate){
 
  if (!inherits(my.tree, "phylo"))
    stop("Error: object \"my.tree\" is not of class \"phylo\"")
 
  if (is.null(my.tree$edge.length))
    stop("Error: tree \"my.tree\" must have branch lengths.")
 
  ## reorder the edges in the "pruningwise" order (only after the checks above)
  my.tree = reorder(my.tree, order = "pruningwise")
 
  ## reorder data on tips to match the order of the my.tree phylo object
  if (!is.null(names(my.data))) {
    if(!any(is.na(match(names(my.data), my.tree$tip.label)))){
      my.data = my.data[my.tree$tip.label]
    }else{
      warning('the names of argument "my.data" and the names of the tip labels
did not match: the former were ignored in the analysis.')
    }
  }
 
  ## prepare transition probability matrices (this of course can and should be done in C++ as well)
  prob.array = array(0, dim=c(2,2,length(my.tree$edge.length)))            
  for (i in 1:length(my.tree$edge.length)){
    prob.array[,,i] = two.state.trans.prob(forward.rate, backward.rate, my.tree$edge.length[i])          
  }            
 
  return(partLike(my.tree$edge, my.tree$Nnode, my.data, 2, prob.array))
}


Testing:

First, let's simulate a tree with tip states. This can be done with several R packages. I am going to use the diversitree package:

require(diversitree)
 
set.seed(34344)
test.tree = tree.bisse(c(0.1, 0.1, 0.03, 0.03, 0.01, 0.07), x0=0, max.taxa = 50)


This produces the following tree:

Now, my function returns partial likelihoods for all nodes in the tree. I am going to extract only the partial likelihoods corresponding to the root (recall that a weighted sum of these numbers gives the actual probability of observing the data). I am going to do the same using diversitree's likelihood calculator:

> test.like = two.state.part.like(test.tree, test.tree$tip.state, 0.02, 0.05)
> test.like[51,]
[1] 1.564918e-08 1.195940e-08
> 
> mk2.lik = make.mk2(test.tree, test.tree$tip.state)
> exp(mk2.lik(c(0.02,0.05),root=ROOT.BOTH))
[1] 1.564918e-08 1.195940e-08


The above agreement is encouraging.
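
To go from the root partial likelihoods to a single likelihood value, one only needs the weighted sum mentioned earlier. A minimal sketch in plain C++ follows; rootLikelihood() is a hypothetical helper, and the choice of weights (a prior on the root state) is left to the caller because nothing in this post fixes it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted sum of the root partial likelihoods.
// `weights` is an assumed distribution over root states, supplied by the caller.
double rootLikelihood(const std::vector<double>& rootPartials,
                      const std::vector<double>& weights) {
    assert(rootPartials.size() == weights.size());
    double like = 0.0;
    for (std::size_t i = 0; i < rootPartials.size(); ++i) {
        like += weights[i] * rootPartials[i];
    }
    return like;
}
```

For example, with the two root values printed above and purely illustrative equal weights of 0.5 each, the call rootLikelihood({1.564918e-08, 1.195940e-08}, {0.5, 0.5}) gives about 1.380429e-08.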

Saturday, December 24, 2011

R array to RcppArmadillo cube (updated)

A quick post here on how to convert a 3D array from R to the RcppArmadillo cube.

Rcpp does not support arrays yet (only vectors and matrices). However, RcppArmadillo has a proper 3D array class: arma::cube. The conversion can be done via Rcpp::NumericVector. The only aspect of this exercise that surprised me was that Rcpp::NumericVector stores the dimensions of an array but does not seem to offer an obvious way to extract them. My original solution, for arrays whose dimensions are not known in advance, was to pass the vector of dimensions to Rcpp together with the array: not too ugly, but a bit awkward. Update: thanks to the commenter Eli, who pointed out that the dimensions of the array can be extracted via the .attr() method; much prettier now. My code to illustrate this concept:

require(inline)
require(RcppArmadillo)
 
src <- '
  using namespace Rcpp;
 
  NumericVector vecArray(myArray);
  IntegerVector arrayDims = vecArray.attr("dim");

 
  arma::cube cubeArray(vecArray.begin(), arrayDims[0], arrayDims[1], arrayDims[2], false);
 
  //change one element in the array/cube
  cubeArray(0,0,0) = 518;  
 
  return(wrap(cubeArray));  
'
 
readCube = cxxfunction(signature(myArray="numeric"),body=src, plugin="RcppArmadillo")


A numerical test:

> set.seed(345)
> testArray = array(rnorm(18), dim=c(3,3,2))
> print(testArray)
, , 1
 
           [,1]        [,2]      [,3]
[1,] -0.7849082 -0.29059656 -0.927724
[2,] -0.2795144 -0.06753159  1.710771
[3,] -0.1614579 -0.63352041  1.654769
 
, , 2
 
          [,1]       [,2]       [,3]
[1,]  1.810483 -0.8496292 -1.4029422
[2,]  1.866772  0.3184496  0.5682982
[3,] -1.399833  0.9035913  1.0457561
 
> readCube(testArray)
, , 1
 
            [,1]        [,2]      [,3]
[1,] 518.0000000 -0.29059656 -0.927724
[2,]  -0.2795144 -0.06753159  1.710771
[3,]  -0.1614579 -0.63352041  1.654769
 
, , 2
 
          [,1]       [,2]       [,3]
[1,]  1.810483 -0.8496292 -1.4029422
[2,]  1.866772  0.3184496  0.5682982
[3,] -1.399833  0.9035913  1.0457561
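
The conversion works because an R array and an Armadillo cube lay out their elements in the same order: column by column within a slice, one slice after another. Below is a minimal sketch of that layout in plain C++; flatIndex() and CubeView are hypothetical names, and CubeView is only a non-owning view meant to mimic what happens when the last constructor argument (copy_aux_mem) is false.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Zero-based position of element (i, j, k) in the flat vector behind an
// nrow x ncol x nslice array stored column by column, slice by slice.
std::size_t flatIndex(std::size_t i, std::size_t j, std::size_t k,
                      std::size_t nrow, std::size_t ncol) {
    return i + nrow * (j + ncol * k);
}

// Hypothetical non-owning 3D view over existing memory: nothing is copied,
// so writes through the view change the underlying buffer.
struct CubeView {
    double* mem;
    std::size_t nrow, ncol, nslice;

    double& at(std::size_t i, std::size_t j, std::size_t k) {
        assert(i < nrow && j < ncol && k < nslice);
        return mem[flatIndex(i, j, k, nrow, ncol)];
    }
};
```

For the 3x3x2 test array, the element printed at [2,3] of slice 1 sits at zero-based position flatIndex(1, 2, 0, 3, 3) = 7 of the flat vector. Note also that a write through a view like this one lands in the original buffer, so it is worth checking whether readCube() leaves testArray itself modified.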
