Hadley Wickham’s ggplot is a very interesting package. It makes beautiful graphics and integrates well with other packages, so that you can superimpose various types of estimates on plots of the data. In particular, it uses colour very well. The default colour schemes are aesthetically pleasing, and colour can be used flexibly, both to separate categorical data and to shade continuous variables, which helps in understanding the data.

I was recently working on household income data from a stratified sample survey and wanted to look at boxplots of some variables. The problem, however, was that the plot functions in base R do not allow the observations to be weighted. As a result, I could only have used the unweighted median, Q1 and Q3 to draw the plots, which would obviously be misleading for a stratified sample. The ggplot package for R can draw weighted boxplots.
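To see why the weights matter, here is a minimal sketch in base R. The function weighted_quantile and the data are my own illustration, not what ggplot does internally, and the definition used (the smallest value at which the cumulative share of the weights reaches the required probability) is only one of several in common use.

```r
# Weighted quantile: smallest value at which the cumulative share of the
# weights reaches probability p (one common definition among several).
weighted_quantile <- function(x, w, probs = c(0.25, 0.5, 0.75)) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  sapply(probs, function(p) x[which(cum_share >= p)[1]])
}

# Hypothetical incomes and survey multipliers: the poorest household
# stands for five households in the population, the others for one each.
income     <- c(10, 20, 30, 40, 100)
multiplier <- c(5, 1, 1, 1, 1)

unweighted <- quantile(income, c(0.25, 0.5, 0.75), names = FALSE)
weighted   <- weighted_quantile(income, multiplier)

print(unweighted)  # Q1, median, Q3 ignoring the multipliers
print(weighted)    # Q1, median, Q3 using the multipliers
```

With these made-up numbers the weighted median falls on the poorest household, while the unweighted median sits in the middle of the sample. That is the kind of distortion an unweighted boxplot would have introduced.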

The problem, however, is that the ggplot documentation, as of today, is rather incomplete. There are a lot of interesting features that are either not documented or hidden away in details. Hadley is working on a new version of ggplot, and a ggplot book. That should take care of these gaps!

What saved me was that Hadley himself has been very active on the mailing list, and has been personally answering most of the queries on ggplot. He answered my queries as well, and we had a very useful exchange of e-mails. The result was a script that made the weighted boxplots as I wanted, and also put the values of the boxplot statistics on the plot. I was keen to see the numbers on the plot to be able to compare boxplots for different categories of observations. Thank you Hadley for being so accessible!! I subsequently extended the script Hadley suggested so that it also plots the weighted means on the boxplots. The script was then converted into a function, which can be called from any R program to make the boxplots.

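library(ggplot)

# Weighted boxplot with jittered points, labelled with the weighted boxplot
# statistics (blue) and the weighted mean (magenta).
# dataset must have the columns x (category), y (value) and Multiplier (weight).
# v1 and v2 set the range of the y axis.
vjitter <- function(dataset, col, xlab, ylab, v1, v2) {
  ggopt(axis.colour = "black")
  p <- ggplot(dataset, aesthetics = list(x = x, y = y, weight = Multiplier, colour = col))
  p$xlabel <- xlab
  p$ylabel <- ylab

  # The outer parentheses print the plot at each step
  (p <- ggjitter(ggboxplot(p, colour = "black", orientation = "vertical")))

  # Weighted boxplot statistics for each category of x
  cl <- split(dataset, dataset$x)
  dots <- do.call(rbind, lapply(cl, function(df) {
    data.frame(
      x = df[1, ]$x,
      dots = boxplot_stats_weighted(df$y, weights = df$Multiplier)$stats
    )
  }))
  (p <- ggtext(p, data = dots, aes = list(x = x, y = dots, label = format(dots, digits = 2)), justification = "left", colour = "blue"))

  # Weighted mean for each category of x
  means <- do.call(rbind, lapply(cl, function(df) {
    data.frame(
      x = df[1, ]$x,
      mean = weighted.mean(df$y, df$Multiplier)
    )
  }))
  (p <- ggpoint(p, data = means, aes = list(x = x, y = mean), colour = "magenta"))
  (p <- ggtext(p, data = means, aes = list(x = x, y = mean, label = format(mean, digits = 2)), justification = "right", colour = "magenta"))

  # Restrict the y axis to the range v1 to v2
  pscontinuous(p, variable = "y", range = c(v1, v2))
}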

Here is a sample boxplot created by the above function.

This is, of course, a touched up version. I subsequently decided to display only the median (in blue) and the mean (in magenta) values. But that only required a minor modification in the code.

The sticky issue, which I have not yet been able to resolve, is that the code puts the numbers right on the vertical axis of each box. The x-axis is a categorical axis, and it recognises only discrete values, so I don’t know how to write the numbers to the right of the boxes. What I did was to print the plots as postscript files, and then edit the postscript to push the numbers outside the boundaries of the boxes. I know it is not a neat way of doing things at all. But I had a deadline and I just had to get it done!! I do really hope I will be able to find a better way of doing that. (Hadley, help please!!)
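For comparison, this is how the offset idea looks in base graphics, where the boxes sit at the numeric positions 1, 2, ... and text can be placed anywhere in between. It is only a sketch with made-up data and unweighted medians, and I have not checked whether ggtext accepts numeric positions on a categorical axis in the same way.

```r
# Hypothetical data: a categorical x and a numeric y
d <- data.frame(
  x = factor(c("rural", "rural", "rural", "urban", "urban", "urban")),
  y = c(10, 20, 60, 30, 50, 70)
)

# Base boxplot() draws the boxes at 1, 2, ... in the order of the factor levels
medians <- tapply(d$y, d$x, median)
pos     <- seq_along(levels(d$x))
offset  <- 0.45  # slightly more than half the default box width (0.8 / 2)

# Leave extra room on the right so that the last label is not clipped
boxplot(y ~ x, data = d, xlim = c(0.5, length(pos) + 1))

# adj = 0 left-justifies the labels, so they start just right of each box
text(x = pos + offset, y = medians, labels = format(medians, digits = 2),
     adj = 0, col = "blue")
```

The point is simply that once the categories are treated as the integers 1, 2, ..., a label can be shifted by a fraction of a unit instead of being edited by hand in the postscript file.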

As I became more comfortable with the ggplot way of doing things, I started playing around more. This graph shows a scatterplot in which two continuous and two categorical variables, that is four variables in all, have been plotted simultaneously. I found it such an insightful way of looking at data!!

There is one more feature with which I am still having trouble. ggquantile plots quantile regression lines. Why is it that it plots curves rather than straight lines? See, for example, example(ggquantile). Rescale the y axis using pscontinuous and you’ll see that what we have are curves and not linear regression lines. I don’t know why that is so. Of course, I could estimate the quantile regressions using rq and then insert the lines using ggabline. But ggquantile is supposed to automate that, isn’t it? Or have I not understood how to operate it?
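The manual route would look roughly like the sketch below. It assumes the quantreg package is installed, uses made-up data, and draws the lines with abline from base graphics rather than ggabline, so treat it as an illustration of the idea and not as a ggplot recipe.

```r
# Assumes the quantreg package is installed; the data are made up.
if (requireNamespace("quantreg", quietly = TRUE)) {
  set.seed(1)
  d <- data.frame(x = runif(200, 0, 10))
  d$y <- 2 + 0.5 * d$x + rnorm(200, sd = 1 + 0.2 * d$x)

  # Linear quantile regressions for the quartiles
  taus <- c(0.25, 0.5, 0.75)
  fit  <- quantreg::rq(y ~ x, tau = taus, data = d)
  cf   <- coef(fit)  # rows: intercept and slope; one column per quantile

  # Scatterplot with one straight line per quantile
  plot(d$x, d$y, xlab = "x", ylab = "y")
  for (j in seq_along(taus)) {
    abline(a = cf[1, j], b = cf[2, j], lty = j)
  }
}
```

Because the formula is y ~ x, each fitted quantile is a straight line by construction, which is what I expected ggquantile to draw.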

In any case, with all its powerful and aesthetically pleasing graphics, I am sure ggplot will become more popular with users of R, and we will see more discussion of it on the mailing list. I for one will surely be using it a lot in my work.

V.
