Saturday, November 3, 2012

Guest post: Functional Programming in R with an HMM example

Below is a guest post by Chris Glazner, who works on his Ph.D. in Statistics with Elizabeth Thompson here at UW. I especially recommend checking out Chris' slick example of using Reduce for HMM forward-backward algorithm.

Chris writes:

I spent some time over the summer looking at ways of using functional programming to implement the statistical methods in my dissertation research. Functional programming is a style which treats functions as the basic building block of computations. The flow of a program is expressed as the composition of functions, rather than a sequence of steps which manipulate variables. Programming this way tends to feel a little more like doing math, so I've found it to be a good fit for statistical programming. If you can describe your program as a function between sets (say, from a data vector to a confidence interval) you've already figured out the overall structure of a functional program to implement it. 
R is a functional language, meaning that functions can be given names, created anonymously, and passed to other functions. Functions whose inputs or outputs are themselves functions are called "higher order functions". A number of higher order functions are included in the R base library, including the popular apply family: apply, sapply, lapply, and mapply. These functions take a function and a list or vector and apply the function to every element of the list or vector. They replace for loops that look like this:  
for (i in 1:length(x)) f(x[i])  
Created by Pretty R at inside-R.org
where f is any suitable function. Replacing a loop like this with a function call simplifies code and makes it easier to understand.
Other higher order functions express other common patterns. Reduce captures the behavior of an accumulating loop like: 
y <- startValue
for (i in 1:length(x)) y <- g(y,x[i]) 
Created by Pretty R at inside-R.org 
This pattern may not seem as general as apply, but it encompasses many common calculations. If g is `+` and startValue is 0, then the call to Reduce is equivalent to sum(x). If g is `*` and startValue is 1, we have product(x). Reduce is known by many names in other languages and is a powerful abstraction for expressing computations. 
A nice application of Reduce arises in the Hidden Markov Models (HMMs). An HMM is an unobserved Markov chain along with data which are generated independently according to the hidden Markov state at each step. The goal is usually to infer something about the hidden process from the data. The standard method for calculating the marginal probabilities of the hidden states at each step is known as the forward-backward algorithm, which is explained in a nice tutorial by LR Rabiner. We can use Reduce to express this algorithm concisely. 
As an example we will implement the simple Occasionally Dishonest Casino model described in a classic Biologial Sequence Analysis book by Durbin, Eddy, Krogh, and Mitchison. This casino has a fair die and a biased die which rolls three 50% of the time. We observe repeated rolls of a single die, but we never know which one. The casino switches the dice according to a two-state Markov chain. We want to infer which die is in use at a given time. 
The model is fully specified by the Markov transition matrix, the emission distribution of the data conditional on the hidden state, and the initial distribution. Since this is R, we will specify these using vectors and matrices, but we could use functions if we wanted.
tMat <- t(matrix(c(0.95, 0.05 # fair -> biased
                 , 0.10, 0.90 # biased -> fair
                 ),2,2))
 
eMat <- cbind( rep(1/6,6)            # fair
              ,c(.1,.1,.5,.1,.1,.1)) # biased
initDist <- c(0.5,0.5)
Created by Pretty R at inside-R.org 
Now we need two functions to execute the forward and backward passes. These can be concisely expressed in matrix notation: 

fwd <- function(distr,obs) normalize(( t( tMat ) %*% distr ) * eMat[obs,] )
backwd <- function(obs,distr) normalize( tMat %*% ( distr * eMat[obs,] ))
normalize <- function(x) x/sum(x)
Created by Pretty R at inside-R.org 
The fwd function takes two inputs, a distribution on the hidden state space and the observation at the current node, and returns another distribution. Reduce uses the output of one call as the distr argument to the next call, with the obs argument taken from the list of data observations. We set the accum option to Reduce so that it collects the intermediate results from each call. Assuming the data are in a vector called dieRolls, we write 
dieRolls <- c(3,3,3,4,4,6,2,3,4)
forwardProbs <- Reduce(fwd,dieRolls,init=initDist,acc=T)
backwardsProbs <- Reduce(backwd,dieRolls,init=normalize(c(1,1)),acc=T,right=T)
marginalProbs <- mapply(`*`,forwardProbs,backwardsProbs)
Created by Pretty R at inside-R.org 
Since the backwards algorithm starts at the end of the data vector, we use the right argument (and switch the order of the arguments to backwd). The initial values for both passes are as described in the Rabiner tutorial. The higher order function mapply is used to combine the resulting lists with pointwise multiplication. 
The Rabiner tutorial presents this algorithm in terms of loops over arrays. This style is easy to translate into computer code, but it doesn't provide any intuition as to what is happening; I have a hard time connecting the abstract statistical ideas to the implementation. In contrast, Reduce expresses the recursive nature of the dynamic programming problem. Coding the algorithm becomes part of the research process, rather than a chore. Of course, some people can grasp the loops and arrays just fine, but if you are like me and have spent more time taking math and statistics classes than programming, you might want to check out functional programming.


Sunday, July 22, 2012

RcppArmadillo and R-forge

Jane Lange and I have recently finished a paper on continuous-time hidden Markov models and published an accompanying R package cthmm on R-forge. Jane's package cthmm relies heavily on Rcpp and RcppArmadillo packages. Somewhat surprisingly, the last dependency caused compilation problems on R-forge. Specifically, the package compiled only on Mac OS X, but not on Linux or Windows. Since Jane and I use Macs, we didn't have any problems compiling the package locally on our computers.

It turns out that the root of the problem was in using Armadillo's .inv() function to invert complex matrices. Armadillo library outsources some of its linear algebra calculations to lapack, which should be no problem since R also relies on lapack (who doesn't, right?). But for some reason r-project.org systems have only an abbreviated version of the lapack library, missing some functions for complex matrix calculations. This problem is discussed here.

Following suggestions from the above link, my solution was to download missing fortran routines (zgetri.f, ztrti2.f, ztrtri.f) from http://www.netlib.org/lapack/ and add them to /src directory of our package. A bit ugly, but worked and the package compiled on R-forge on all architectures.

P.S. As of July 22, 2012, our cthmm failed to compile on R-forge again, but we believe that this has to do with dependent packages (looks like survival package was not available on Windows). We hope to resolve this problem shortly.

Update on the P.S.: It looks like the problem with the survival package was resolved by the R-forge compilers. Our cthmm compiles now.

Saturday, July 21, 2012

Blog name change

As I find myself working on problems outside of evolutionary biology more often, I decided that a blog name change would be in place. This doesn't mean that I am planning to stop working on phylogenetics and population genetics though.

Sunday, January 15, 2012

Packaging and exposing Rcpp functions

In my previous posts, I used the inline package to test my Rcpp functions. This strategy becomes very limited as your software project grows even modestly beyond a couple of functions. So a natural solution is to create an R package with your C++ functions and R code. Creating R packages that depend on the Rcpp package is very well documented in the corresponding Rcpp vignette.

I've created a package phyloRcppExamples. It is hosted on R-forge, so you can browse the source code online or download the source via svn. I hope that through this package I will be able to illustrate more elaborate examples of using Rcpp for phylogenetic computations. I am not sure if this package will ever make its way to CRAN, but the good people at R-forge provide binaries for the most recent version of R (not tested yet on phyloRcppExamples).

Now, since we are not using inline anymore, every time I add a new C++ function to my package, I need to expose it to R (e.g. via .Call()). This usually requires writing a small C++ wrapper, which becomes cumbersome. Rcpp modules come to rescue. Below is an illustration of this concept.

In the /src directory of the phyloRcppExamples packages, I have a file two_state_model.cpp. One of the functions in this file calculates transition probabilities of a two-state continuous-time Markov chain:


//Computes finite time transition probabilities using analytic formulae.

arma::mat trans_prob(double lambda_01, double lambda_10, double time){

  double total_rate = lambda_01 + lambda_10;               

  arma::mat prob_matrix = arma::mat(2,2);
  
  prob_matrix(0,0) = (lambda_10 + lambda_01*exp(-total_rate*time))/total_rate;
  prob_matrix(0,1) = (lambda_01 - lambda_01*exp(-total_rate*time))/total_rate;
  prob_matrix(1,0) = (lambda_10 - lambda_10*exp(-total_rate*time))/total_rate;
  prob_matrix(1,1) = (lambda_01 + lambda_10*exp(-total_rate*time))/total_rate;

  return prob_matrix;
}

One way to expose this function R is to write a wrapper function

RcppExport SEXP trans_prob_to_R(SEXP lambda_01, SEXP lambda_10, SEXP time){
  return wrap(trans_prob(as<double>(lambda_01), as<double>(lambda_10), as<double>(time)));
}



Another method uses Rcpp modules:

RCPP_MODULE(two_state){ 
  function( "trans_prob", &trans_prob );
}


Here is how the two methods are used within R:

> require(phyloRcppExamples)
> 
> forward.rate = 2.3
> backward.rate = 3.1
> elapsed.time = 0.03
> 
> ## via wrapper
> .Call("trans_prob_to_R", forward.rate, backward.rate, elapsed.time, PACKAGE="phyloRcppExamples")
           [,1]       [,2]
[1,] 0.93629903 0.06370097
[2,] 0.08585783 0.91414217
> 
> ## same as above, but using Rcpp modules
> 
> two.state.Rcpp = Module("two_state", PACKAGE="phyloRcppExamples")
> 
> two.state.Rcpp$trans_prob(forward.rate, backward.rate, elapsed.time)
           [,1]       [,2]
[1,] 0.93629903 0.06370097
[2,] 0.08585783 0.91414217
> 
Created by Pretty R at inside-R.org

Keep in mind that you can add multiple functions to the same module.