Wednesday, June 25, 2014

TexShop engines for Knitr w/ and w/o pstricks

Two years ago, I posted a link explaining how to set up painless compilation of Sweave documents in TexShop: http://cameron.bracken.bz/sweave-for-texshop

Since I switched to Knitr last year, I wanted to share a couple of tricks for using the TexShop+Knitr combination.

First, setting up a Knitr engine in TexShop is analogous to the recipe from the above link: create a file Knitr.engine in your ~/Library/TexShop/Engines/ directory with the following content:

#!/bin/bash

export PATH=$PATH:/usr/texbin:/usr/local/bin
Rscript -e "library(knitr); knit('$1')"
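
One more detail: TexShop may not pick up the engine unless the file is executable, so you may need something like

chmod +x ~/Library/TexShop/Engines/Knitr.engine

(the same applies to the second engine described below).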

Now if you (re-)open TexShop, you should see Knitr in the list of engines next to the Typeset button at the top. This worked well for me until today, when I needed to combine Knitr code with some old LaTeX code of mine that used the "pstricks" package. Since pdflatex does not work with "pstricks" (there is something called PDFTricks, but I haven't tried it), I needed to create another engine in my ~/Library/TexShop/Engines/ directory, which I called KnitrDviPs:

#!/bin/bash

export PATH=$PATH:/usr/texbin:/usr/local/bin
Rscript -e "library(knitr); knit('$1')"
latex "${1%.*}"
bibtex "${1%.*}"
latex "${1%.*}"
latex "${1%.*}"
dvips -t letter -o "${1%.*}.ps" "${1%.*}"
pstopdf "${1%.*}.ps"

This engine goes through the less efficient "dvi->ps->pdf" route. However, this alone doesn't do the trick, because by default Knitr does not produce postscript versions of the embedded figures. So you have to add a "postscript" device to your Knitr document using Knitr chunk options. For example, somewhere near the beginning of your Knitr document you can put the following command:

opts_chunk$set(fig.path='figures/', fig.align='center', dev=c('pdf', 'postscript'))
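
For instance, in a .Rnw document this command can live in an invisible setup chunk near the top (a minimal sketch; the chunk label "setup" is arbitrary):

<<setup, include=FALSE>>=
opts_chunk$set(fig.path='figures/', fig.align='center', dev=c('pdf', 'postscript'))
@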

Now you have two Knitr engines: one that goes through pdflatex and another that goes through "dvi->ps->pdf".

Monday, May 5, 2014

Fundraising for victims of May 2nd, 2014 events in Odessa, Ukraine (updated on May 19, 2014)

Many of you have seen the news about the violent clashes that occurred in Odessa, Ukraine on May 2nd, 2014. During these clashes, my very close childhood friend was shot with a machine gun, and he is now fighting for his life in one of the hospitals in Odessa. I don't know all the details yet, but it appears that my friend was wounded by a pro-Russian activist during the initial stage of the unrest, when heavily armed pro-Russian activists attacked a peaceful demonstration for the unity of Ukraine. The crossed-out information hasn't been confirmed. I still don't know all the details, but if you feel that you were misled by the crossed-out sentence, send me a note by email.

My friend's family has enough financial resources to provide him with medical care, but my friend's father says that he sees many families of victims of the clashes that don't have money to pay for medical expenses. The Ukrainian government promised to help all victims, but their help may arrive too late, if it ever arrives. Since I have a unique opportunity to reach out to the victims' families in need through my relatives and friends in Odessa, I've decided to try to raise funds to provide medical care to those wounded in the clashes on May 2nd, 2014. The funds will be distributed to the wounded without any regard to their political convictions.

If you would like to help me in this fundraising campaign, please donate to my PayPal account using a link on my web page:

http://www.stat.washington.edu/vminin/

-Vladimir

Update 1 (05/06/2014): funds raised so far - ~$1,500; wholehearted thanks to everyone, you are amazing! I've made the first transfer of $1,000 to Odessa today.

Update 2 (05/06/2014): I was on the phone with Odessa today. Hospitals are in dire need of many pieces of medical equipment to treat hundreds of wounded people. There is a local fundraising effort going on to buy this equipment, and we were invited to join it. This would be a slight deviation from the initial plan of distributing funds to the wounded directly. However, I was told that this is where the help is needed most at the moment, so I support this suggestion. Still, I will ask my contacts in Odessa to give money directly to the families of the wounded that need assistance first, and to use the rest for medical equipment. Important: if you are not comfortable with this plan or have any hesitation about how your donation will be spent, please contact me by email so I can give you more information and/or refund you.

Update 3 (05/09/2014): funds raised - ~$2,000; again, thank you!

Update 4 (05/09/2014): I am happy to report that I got in touch with a local volunteer in Odessa, whom I personally know and who is involved in distributing financial assistance to those wounded on May 2nd in Odessa. So we are back to the original plan -- providing money directly to the families of the wounded!

Update 5 (05/09/2014): Since we have reached a plateau in the total funds raised, I've decided to take a pause, to stop soliciting/receiving donations, and to concentrate on distributing the money we have collected so far. If I have time and energy, I may try for a second round of fundraising. I will be updating this post with details of how the donations are being spent in Odessa.

Update 6 (05/19/2014): I've got an update on how the first transfer (~$1,000) to Odessa was spent! First, medical bills amounting to ~$485 were paid for a person in need who was hurt on May 2, 2014 in Odessa and remains in critical condition. Next, our local volunteer used ~$460 to purchase painkillers for the wounded who still need medical attention. The small remainder of the transferred sum was used to help offset the cost of medical treatment for the children of three military personnel relocated from Crimea after the Russian invasion.

Tuesday, December 10, 2013

Guest post: Where did the Chapman-Kolmogorov name come from?

This is a guest post by Peter Guttorp. Peter and I are working on updating Peter's book Stochastic Modeling of Scientific Data. To provide the reader with productive distractions, we are sprinkling the book with "historical boxes." Also, let's be honest, it keeps us entertained as well, because we are both math history junkies. Well, Peter does real research in the history of mathematics (e.g., here), so he is a real historian, but I am definitely a junkie. While working on one of the background chapters, we put a historical box for Chapman and Kolmogorov next to the first appearance of the Chapman-Kolmogorov equation, but then wondered why we did this, because Peter knew that neither Chapman nor Kolmogorov was responsible for its discovery. This is, of course, not surprising in light of Stigler's law, but then who gave this equation its modern name? Peter started digging, and this is what he has found so far:
The origin of the Chapman-Kolmogorov equation dates back at least to Bachelier's 1900 dissertation on the theory of speculation (but it may be even earlier). Bachelier did not give it a name.
In physics it is sometimes called Smoluchovsky's equation, after Smoluchovsky's 1906 rediscovery of Brownian motion (preceded by Bachelier and by Einstein (1905)). Smoluchovsky's work was extended by Chapman (1928) to the case of nonstationary transition kernels. It was independently developed by Kolmogorov in his fundamental 1931 Mathematische Annalen paper.
But where did the name come from? The first uses of the Chapman-Kolmogorov name are in Feller's 1940 Transactions of the AMS paper, where he says that it is "known as the equation of Chapman and Kolmogoroff". The 1940 thesis of Lundberg also uses this term, referring to Frechet (1938) and Kolmogorov (1931). Going to Frechet's book, he does not give the equation a name, but designates it "as a mnemotechnical tool" with the letter I.
In 1936, Feller writes in his Mathematische Annalen paper “Finally, from the rule of combination of probabilities, what Chapman and von Smoluchowski called the fundamental equality follows”. 
So what happened between 1936 and 1940? Did the name get invented in the corridors of the University of Stockholm (where Feller was a lecturer, Lundberg a PhD student, and Cramer the professor)?
If you know anything about the origin of the name Chapman-Kolmogorov equation, please get in touch with Peter and/or me. We are very curious!

Monday, October 14, 2013

Passing Armadillo matrices by value and by reference

Rcpp and RcppArmadillo make it relatively easy to integrate R code and C++ code. In fact, the resulting C++ code looks so much like R code that it is easy to forget some basic rules by which a C++ coder must play. This is especially true for someone like me, who has written relatively little C++ code in the past.

Here is an example of a typical calculation that may get one in trouble. Suppose you have a function that takes two matrices from R, converts them into Armadillo matrices, and passes these matrices to other functions to do some repetitive calculations. For example, imagine passing transition and emission probability matrices to execute the forward-backward algorithm on multiple realizations of the same hidden Markov model.

Below you see my C++ code, saved in the file "matrix_code.cpp", which has two sets of functions: a) doStuff1(), twoMatrices1(); b) doStuff2(), twoMatrices2(). The functions twoMatrices1() and twoMatrices2() are exported to R via Rcpp attributes. twoMatrices1() calls doStuff1() 100 times, while twoMatrices2() calls doStuff2() 100 times. The only difference between the two sets of functions is that doStuff1() takes its two Armadillo matrices by value, which requires copying them, while doStuff2() takes its Armadillo matrices by reference, without invoking the copy constructor under the hood.


// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h> // also pulls in Rcpp.h, so no separate include is needed
using namespace Rcpp;

// First, passing Armadillo matrices by value
void doStuff1(arma::mat c, arma::mat d){
  double x = c(1,1) + d(1,1); // a trivial use of both matrices
}

// [[Rcpp::export]]
void twoMatrices1(NumericMatrix A, NumericMatrix B) {
    // the 'false' flag tells Armadillo to reuse R's memory instead of copying it
    arma::mat M1(A.begin(), A.nrow(), A.ncol(), false);
    arma::mat M2(B.begin(), B.nrow(), B.ncol(), false);

    for (int i = 0; i < 100; i++){
      doStuff1(M1, M2);
    }
}

// Now passing Armadillo matrices by reference
void doStuff2(const arma::mat& c, const arma::mat& d){
  double x = c(1,1) + d(1,1); // same trivial use, but no copies were made
}

// [[Rcpp::export]]
void twoMatrices2(NumericMatrix A, NumericMatrix B) {
    arma::mat M1(A.begin(), A.nrow(), A.ncol(), false);
    arma::mat M2(B.begin(), B.nrow(), B.ncol(), false);

    for (int i = 0; i < 100; i++){
      doStuff2(M1, M2);
    }
}


Now I am going to call twoMatrices1() and twoMatrices2() from R, using the rbenchmark package to compare their speed. I am going to use fairly large matrices (100x100), so the cost of copying them is significant.

library(RcppArmadillo)
library(rbenchmark)
 
sourceCpp("matrix_code.cpp")
 
x = matrix(1,100,100)
y = matrix(1,100,100)
 
res <- benchmark(twoMatrices1(x, y), twoMatrices2(x, y), 
                 columns = c("test", "replications", "elapsed", "relative"), 
                 order="relative", replications=1000)
print(res)
 
                test replications elapsed relative
2 twoMatrices2(x, y)         1000   0.006    1.000
1 twoMatrices1(x, y)         1000   0.829  138.167


As you can see, passing two 100x100 Armadillo matrices by value is more than 100 times slower than passing these matrices by reference. Perhaps not a very surprising outcome, but an important one to keep in mind when porting R code to C++.
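
Of course, passing by value is occasionally exactly what you want, namely when a function needs its own modifiable copy. Here is a minimal sketch (the function scaledCopy is hypothetical, not part of the code above):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// taking the argument by value gives the function its own working copy,
// so the caller's matrix is left untouched
arma::mat scaledCopy(arma::mat m) {
  m *= 2.0; // modifies only the local copy
  return m;
}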

Friday, October 11, 2013

Dissertation Acknowledgments

Jonathan Eisen posted his dissertation acknowledgments. I think this is a great idea. Maybe it is just me, but when I read someone's dissertation, not necessarily as a committee member, I really enjoy reading the acknowledgments. For me, these personal comments place the work in the context of "real life" and personal relationships.

Here are my acknowledgments:

First and foremost, I would like to thank my mom and dad for creating many opportunities for my intellectual growth. Feeling my parents' constant love and support makes my scientific journey so much more meaningful! I am very thankful to my middle and high school mathematics teachers, Maya Radionovna Matihina and Svetlana Efimovna Gaydamak, for introducing me to the beautiful world of mathematical reasoning and abstract thinking. I am very grateful to my undergraduate advisor, Galina Evgenievna Samkova, for patiently listening to my naive ideas and taking seriously my modest attempts to produce new mathematical results. Special thanks go to Stephen Krone, who got me hooked on mathematical biology by regularly taking me to the bioinformatics seminars at the University of Idaho. My first scientific collaborators, Zaid Abdo and Paul Joyce, became my very good friends and influenced my scientific career more than they probably realize.
UCLA Department of Biomathematics has been my second home for the last four years. I would like to thank the departmental staff, David Tomita, Martha Riemer, Wendy Hagar, and Fred Hughes, for taking care of administrative hurdles so well. During my first years at UCLA, I received a lot of advice and guidance from my fellow Biomathematics students. I am very grateful to Robert Rovetti for being a wonderful colleague and a good friend. Robert thoroughly read and constructively criticized Chapters 3 and 4 of this dissertation. Intellectual discussions with Benjamin Redelings, John O'Brien, and Alex Alekseyenko were certainly worth inhaling cigarette smoke and drinking really bad coffee from vending machines. 
All members of my dissertation committee played important roles in my graduate school development. I enjoyed collaborating with Christina Kitchen, who is very passionate about both statistics and biology. I am grateful to Janet Sinsheimer for giving me many opportunities to assist her in graduate and undergraduate level teaching. Elliot Landaw provided me with excellent academic guidance at the beginning of my studies and was extremely supportive throughout all these years. Ken Lange has frequently given me timely and wise advice. Chapter 6 would not have been possible without his enthusiasm about Markov chain induced counting processes and his encouragement to pursue this line of research.
The chair of my dissertation committee, Marc Suchard, is the greatest advisor one could wish for! I am very grateful to Marc for introducing me to interesting problems and for not revealing solutions to these problems, giving me an opportunity to become an independent researcher. It is very difficult to summarize everything I have learned from Marc in one paragraph. Perhaps the most important lesson that Marc taught me is how not to be afraid: not to be afraid to attack challenging problems, not to be afraid to seek rigorous solutions for these problems, and not to be afraid to go somewhere no one has gone before.    
This dissertation would never have been finished without the support of my wife, Yuliya, who patiently tolerated my "writing weekends" and many other strange habits of a finishing Ph.D. student.

Saturday, November 3, 2012

Guest post: Functional Programming in R with an HMM example

Below is a guest post by Chris Glazner, who is working on his Ph.D. in Statistics with Elizabeth Thompson here at UW. I especially recommend checking out Chris' slick example of using Reduce for the HMM forward-backward algorithm.

Chris writes:

I spent some time over the summer looking at ways of using functional programming to implement the statistical methods in my dissertation research. Functional programming is a style which treats functions as the basic building block of computations. The flow of a program is expressed as the composition of functions, rather than a sequence of steps which manipulate variables. Programming this way tends to feel a little more like doing math, so I've found it to be a good fit for statistical programming. If you can describe your program as a function between sets (say, from a data vector to a confidence interval) you've already figured out the overall structure of a functional program to implement it. 
R is a functional language, meaning that functions can be given names, created anonymously, and passed to other functions. Functions whose inputs or outputs are themselves functions are called "higher order functions". A number of higher order functions are included in the R base library, including the popular apply family: apply, sapply, lapply, and mapply. These functions take a function and a list or vector and apply the function to every element of the list or vector. They replace for loops that look like this:  
for (i in 1:length(x)) f(x[i])  
where f is any suitable function. Replacing a loop like this with a function call simplifies code and makes it easier to understand.
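For example, the loop above collapses to a single call (a quick sketch, assuming f operates on one element at a time):
sapply(x, f)  # returns a vector of results
lapply(x, f)  # the same, but keeps the results in a list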
Other higher order functions express other common patterns. Reduce captures the behavior of an accumulating loop like: 
y <- startValue
for (i in 1:length(x)) y <- g(y,x[i]) 
This pattern may not seem as general as apply, but it encompasses many common calculations. If g is `+` and startValue is 0, then the call to Reduce is equivalent to sum(x). If g is `*` and startValue is 1, we have prod(x). Reduce is known by many names in other languages and is a powerful abstraction for expressing computations.
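A quick sketch makes this concrete:
x <- c(1, 2, 3, 4)
Reduce(`+`, x, 0)                     # 10, same as sum(x)
Reduce(`*`, x, 1)                     # 24, same as prod(x)
Reduce(`+`, x, 0, accumulate = TRUE)  # 0 1 3 6 10, a running sum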
A nice application of Reduce arises in hidden Markov models (HMMs). An HMM is an unobserved Markov chain along with data that are generated independently according to the hidden Markov state at each step. The goal is usually to infer something about the hidden process from the data. The standard method for calculating the marginal probabilities of the hidden states at each step is known as the forward-backward algorithm, which is explained in a nice tutorial by LR Rabiner. We can use Reduce to express this algorithm concisely.
As an example we will implement the simple Occasionally Dishonest Casino model described in the classic Biological Sequence Analysis book by Durbin, Eddy, Krogh, and Mitchison. This casino has a fair die and a biased die which rolls a three 50% of the time. We observe repeated rolls of a single die, but we never know which one. The casino switches the dice according to a two-state Markov chain. We want to infer which die is in use at a given time.
The model is fully specified by the Markov transition matrix, the emission distribution of the data conditional on the hidden state, and the initial distribution. Since this is R, we will specify these using vectors and matrices, but we could use functions if we wanted.
tMat <- t(matrix(c(0.95, 0.05 # fair -> biased
                 , 0.10, 0.90 # biased -> fair
                 ),2,2))
 
eMat <- cbind( rep(1/6,6)            # fair
              ,c(.1,.1,.5,.1,.1,.1)) # biased
initDist <- c(0.5,0.5)
Now we need two functions to execute the forward and backward passes. These can be concisely expressed in matrix notation: 

fwd <- function(distr,obs) normalize(( t( tMat ) %*% distr ) * eMat[obs,] )
backwd <- function(obs,distr) normalize( tMat %*% ( distr * eMat[obs,] ))
normalize <- function(x) x/sum(x)
The fwd function takes two inputs, a distribution on the hidden state space and the observation at the current node, and returns another distribution. Reduce uses the output of one call as the distr argument to the next call, with the obs argument taken from the list of data observations. We set the accumulate option of Reduce (abbreviated acc below) so that it collects the intermediate results from each call. Assuming the data are in a vector called dieRolls, we write
dieRolls <- c(3,3,3,4,4,6,2,3,4)
forwardProbs <- Reduce(fwd,dieRolls,init=initDist,acc=T)
backwardsProbs <- Reduce(backwd,dieRolls,init=normalize(c(1,1)),acc=T,right=T)
marginalProbs <- mapply(`*`,forwardProbs,backwardsProbs)
Since the backwards algorithm starts at the end of the data vector, we use the right argument (and switch the order of the arguments to backwd). The initial values for both passes are as described in the Rabiner tutorial. The higher order function mapply is used to combine the resulting lists with pointwise multiplication. 
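Note that the pointwise products are unnormalized, so to read off posterior probabilities each column still needs to be rescaled. A small sketch, reusing the normalize helper from above (under the alignment described here, the first column corresponds to the initial step and the remaining columns to the rolls):
posteriorProbs <- apply(marginalProbs, 2, normalize)  # rescale each column to sum to one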
The Rabiner tutorial presents this algorithm in terms of loops over arrays. This style is easy to translate into computer code, but it doesn't provide any intuition as to what is happening; I have a hard time connecting the abstract statistical ideas to the implementation. In contrast, Reduce expresses the recursive nature of the dynamic programming problem. Coding the algorithm becomes part of the research process, rather than a chore. Of course, some people can grasp the loops and arrays just fine, but if you are like me and have spent more time taking math and statistics classes than programming, you might want to check out functional programming.


Sunday, July 22, 2012

RcppArmadillo and R-forge

Jane Lange and I have recently finished a paper on continuous-time hidden Markov models and published an accompanying R package, cthmm, on R-forge. Jane's package relies heavily on the Rcpp and RcppArmadillo packages. Somewhat surprisingly, the latter dependency caused compilation problems on R-forge. Specifically, the package compiled only on Mac OS X, but not on Linux or Windows. Since Jane and I use Macs, we didn't have any problems compiling the package locally on our computers.

It turns out that the root of the problem was in using Armadillo's inv() function to invert complex matrices. The Armadillo library outsources some of its linear algebra calculations to LAPACK, which should be no problem since R also relies on LAPACK (who doesn't, right?). But for some reason the r-project.org systems have only an abbreviated version of the LAPACK library, missing some functions for complex matrix calculations. This problem is discussed here.
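
For concreteness, the kind of call that exposed the problem looks something like this (a minimal sketch, not the actual cthmm code; the function name invertComplex is made up):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::cx_mat invertComplex(const arma::cx_mat& M) {
  // inverting a complex matrix sends Armadillo to LAPACK's complex
  // routines (zgetrf/zgetri), which were missing on the build machines
  return arma::inv(M);
}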

Following suggestions from the above link, my solution was to download the missing Fortran routines (zgetri.f, ztrti2.f, ztrtri.f) from http://www.netlib.org/lapack/ and add them to the src/ directory of our package. A bit ugly, but it worked, and the package compiled on R-forge on all architectures.

P.S. As of July 22, 2012, our cthmm package failed to compile on R-forge again, but we believe that this has to do with dependent packages (it looks like the survival package was not available on Windows). We hope to resolve this problem shortly.

Update on the P.S.: It looks like the problem with the survival package was resolved by the R-forge compilers. Our cthmm compiles now.