Stochastic Models, Statistics, and Computation: Phylogenetic likelihood with Rcpp+RcppArmadillo

In this post, I want to show how to implement Felsenstein's algorithm for calculating phylogenetic likelihood with the help of Rcpp and RcppArmadillo. I am going to use ape's representation of a phylogenetic tree. If you are not familiar with this data structure, look at this document. My implementation is very similar to the likelihood calculator of ape's ace function.

Problem formulation: given a rooted binary phylogenetic tree with branch length and a continuous-time Markov chain model of discrete character evolution (e.g. DNA, amino acids, morphological traits), compute the probability of observing certain characters at the tips of the tree. I am going to stop short of actually computing this probability. Instead, I will compute what's called partial likelihoods for each node in the tree. The actual likelihood is a weighted sum of the partial likelihoods at the root of the phylogeny.

C++ code with inline compilation:

require(inline)
require(RcppArmadillo)
 
src <- '
  /* Arguments: 
        tE: edge matrix of the ape object phylo
        nIN: number of internal nodes (not necessary, but convinient to get it from phylo)
        tS: integer vector of tip states (-1=missing value)
        nS: state space size (e.g. 2 for a binary trait)
        pM: array of probability matrices for each edge of the tree
 
     Two important assumptions:
        1. edges in the edge matrix and probability matrices are in the "pruningwise" order;

        see ?reorder.phylo for more details

        2. tip state vector is ordered according to the tip numbering in the edge matrix
  */
 
  using namespace Rcpp;
 
  IntegerMatrix treeEdges(tE);
 
  // get the number of edges
  int numEdges = treeEdges.nrow();
  int numIntNodes = as<int>(nIN);
  IntegerVector tipStates(tS);
  int numStates = as<int>(nS);
  NumericVector vecProbMat(pM);
  arma::cube cubeProbMat(vecProbMat.begin(), numStates, numStates, numEdges, false);
 
  // get the number of tips in the tree
  int numTips = tipStates.size();
 
  // prepare a matrix for storing regular (backward) partial likelihoods
  arma::mat partialLike = arma::zeros<arma::mat>(numTips + numIntNodes, numStates);
 
  for (int i=0; i < numTips; i++){
    if (tipStates[i] == -1){// -1 denotes a missing value
      partialLike.row(i) = arma::ones<arma::rowvec>(numStates);
    }else{
      partialLike(i, tipStates[i]) = 1.0;
    }
  }
 
  // compute regular partial likelihoods for all internal nodes
  for (int i=0; i < numEdges; i+=2){      
    // parent=treeEdges(i,0) or treeEdges(i+1,0); treeEdges indices should be shifted by one
    partialLike.row(treeEdges(i,0)-1) = (partialLike.row(treeEdges(i,1)-1)*cubeProbMat.slice(i).t())%(partialLike.row(treeEdges(i+1,1)-1)*cubeProbMat.slice(i+1).t());            
  }
 
  return wrap(partialLike);
'  
 
 
partLike = cxxfunction(signature(tE="integer", nIN="integer", tS="integer", nS="integer", pM="numeric"),body=src, plugin="RcppArmadillo")

Created by Pretty R at inside-R.org

I am going to test the code with a simple two state model, for which transition probabilities are available in closed form:

two.state.trans.prob = function(forward.rate, backward.rate, elapsed.time){
  total.rate = forward.rate + backward.rate               
 
  return((matrix(c(rep(backward.rate,2),rep(forward.rate,2)),2,2) +
    matrix(c(forward.rate, -backward.rate, -forward.rate, backward.rate),2,2)*
    exp(-total.rate*elapsed.time))/total.rate)    
}

Created by Pretty R at inside-R.org

R wrapper function that takes care of the prerequisites:

two.state.part.like = function(my.tree, my.data, forward.rate, backward.rate){
 
  ## reorder the edges in the "pruningwise" order
  my.tree = reorder(my.tree, order = "pr")
 
  if (!("phylo" %in% class(my.tree)))
    stop("Error: object \"my.tree\" is not of class \"phylo\"")
 
  if (is.null(my.tree$edge.length))
    stop("Error: tree \" my.tree\" must have branch lengths.")
 
  ## reorder data on tips to match the order of the my.tree phylo object
  if (!is.null(names(my.data))) {
    if(!any(is.na(match(names(my.data), my.tree$tip.label)))){
      my.data = my.data[my.tree$tip.label]
    }else{
      warning('the names of argument "my.data" and the names of the tip labels
did not match: the former were ignored in the analysis.')
    }
  }
 
  ## prepare transition probability matrices (this of course can and should be done in C++ as well)
  prob.array = array(0, dim=c(2,2,length(my.tree$edge.length)))            
  for (i in 1:length(my.tree$edge.length)){
    prob.array[,,i] = two.state.trans.prob(forward.rate, backward.rate, my.tree$edge.length[i])          
  }            
 
  return(partLike(my.tree$edge, my.tree$Nnode, my.data, 2, prob.array))
}

Created by Pretty R at inside-R.org

Testing:

First, let's simulate a tree with tip states. This can be done by several packages in R. I am going to use diversitree package:

require(diversitree)
 
set.seed(34344)
test.tree = tree.bisse(c(0.1, 0.1, 0.03, 0.03, 0.01, 0.07), x0=0, max.taxa = 50)

Created by Pretty R at inside-R.org

This produces the following tree:

Now, my function returns partial likelihoods for all nodes in the tree. I am going to extract only partial likelihoods corresponding to the root (recall that a weighted sum of these numbers gives the actual probability of observing the data). I am going to do the same using diversitree's likelhood calculator:

> test.like = two.state.part.like(test.tree, test.tree$tip.state, 0.02, 0.05)
> test.like[51,]
[1] 1.564918e-08 1.195940e-08
> 
> mk2.lik = make.mk2(test.tree, test.tree$tip.state)
> exp(mk2.lik(c(0.02,0.05),root=ROOT.BOTH))
[1] 1.564918e-08 1.195940e-08

Created by Pretty R at inside-R.org

The above agreement is encouraging.

Stochastic Models, Statistics, and Computation

Friday, December 30, 2011

Phylogenetic likelihood with Rcpp+RcppArmadillo

No comments:

Post a Comment