Climate Charts & Graphs

Boxplots: R Does Them Right

November 18, 2008 · 7 Comments

In this post, I define a Tukey boxplot, review the history of boxplots in Excel and walk through an R script for making a proper Tukey boxplot. A link to the source data and R script files is provided.

Tukey Boxplot

“… a boxplot (also known as a box-and-whisker diagram or plot) is a convenient way of graphically depicting groups of numerical data through their five-number summaries  (the smallest observation,   lower  quartile  (Q1), median (Q2), upper quartile (Q3), and largest observation). A boxplot may also indicate which observations, if any, might be considered outliers. The boxplot was invented in 1977 by the American statistician John Tukey.”  Wikipedia

Naomi Robbins’  in her book,  Creating More Effective Graphs  explains the proper layout of a boxplot:

“The rectangle shows the interquartile range (IQR); it goes from the first quartile (the 25th percentile) to the third quartile (the 75th percentile). The whiskers go from the minimum value to the maximum value unless the distance from the minimum value to the first quartile is more than 1.5 times the IQR. In that case the whisker extends out to the smallest value within 1.5 times the IQR from the first quartile. A similar rule is used for values larger than 1.5 times IQR from the third quartile. A special symbol shows the values, called outliers, which are smaller or larger than the whiskers.” pg.91

Boxplots and Excel

Since Excel does not provide a boxplot option, hundreds of Excel users have asked for boxplot capabilities over the years.  In October, 2002, Duncan Williamson posted a tutorial on how to make box and whisker plots in Excel.  Duncan’s method was a modification of the Tukey boxplot.

Duncan Williamson’s  July, 2003 follow-up post on boxplots describes how he found out that his October, 2002 boxplot page “ is, unbeknownst to me, inadequate. ” Duncan learned from a reader, Nikolai Graae, that he was not correctly calculating the whisker values and needed to show outliers for a proper Tukey box and whisker plot.

John Walkenbach, in his 2003 book, Excel Charts, also showed how to make a modified boxplot.  I say modified, because John’s boxplot, like Duncan’s,  used the minimum and maximum values to calculate the whisker values and did not plot outliers.  In April of 2006, I showed how to make a modified boxplot in Excel on my ProcessTrends.com website. 

Jon Peltier created a downloadable Excel workbook of a modified boxplot in October, 2002.  Jon placed the following note in his workbook: 

“Note: there are several different ways that box and whisker plot depict sample data.  The whiskers may alternatively show a more rigorously defined upper control limit, and special markers may be used to indicate outliers beyond these.  Any such alternatives are beyond the scope of this exercise.”

In July, 2007, Jon Peltier updated his boxplot utility to include an option to construct a proper Tukey boxplot that uses the 1.5 IQR definition for the whiskers and outliers. 

The significance of the differences between the Tukey boxplot and modified boxplot methods depends on the distribution of the data being analyzed.  If the minimum and maximum values lie outside the 1.5 IQR, then the modified boxplot becomes problematic because it misses the presence of outliers in the data.

Excel users who want to make a proper Tukey boxplot in Excel should use Jon Peltier’s add-in that includes an option to use the proper whisker definition and outliers. 

R Boxplot

Here’s a comparison of my original Excel ProcessTrends.com modified boxplot and a proper Tukey boxplot that I made with R. 

ptcom_boxplotr_boxplot

 

 

 

The R boxplot has the 1.5 IQR Tukey whiskers as well the outliers for points outside the whiskers. 

R Script Overview

Here’s an image of the R script for this R boxplot. 

box_plot_script

The 26 lines of R script include 4 lines of comments, 2 library and  attach statements,  and 4 function calls. This script can be reused over and over to make proper Tukey box plots by adjusting the source data file link, variable names,  titles and color selections.  

Let’s walk through the script to see how R handles boxplots. The script is set up in the 4 steps I have described before. I have deliberately arranged the script to highlight the arguments for each function to help me be able to reuse the script from chart to chart and help both you and I understand the options for each function. 

Step 1 – Setup 

library(lattice)
link<- c("C:\\R_Home\\Charts & Graphs Blog\\Box_Whisker
   \\fluoride.csv")

The first line loads the lattice library.
The second line establishes the variable link that I use to specify the source data file path and file name. 

Step 2 – Read Data 

my_Data <- read.table(link,
     skip = 0, sep = ",", dec=".",
     row.names = NULL, header = FALSE,
     colClasses=c("numeric","factor","NULL","numeric"),
     comment.char="#",na.strings=c(-99.9,-999.9),
     col.names=c("week","week_dt","ignore","fluoride"))
attach(my_Data)

This first function reads the data file specified by link. I have added arguments to indicate how many lines to skip in data file (skip = 0), decimal indicator (dec=”.”), specify that there are no row.names, indicate that there is no header, specify the colClasses (character, factor, logical, NULL,  or numeric) for the each input variable, specify the comment.char =”#”, specify the missing data strings (na.strings = “-99.9, -999.9), and specify the col.names.

The second function allows me to use the variable names directly without having to include the data.frame name (my_Data).

Step 3 – Data Manipulation 

# reorder week_dt factor to sort by week number
by_week <- with(my_Data,(reorder(week_dt, week, mean)))
trellis.par.set(box.rectangle = list(col="grey"),
    box.dot =list(col = "grey", pch = 16, cex = 0.5),  
    box.umbrella= list(col = "grey", lty= "solid"), 
    plot.symbol=list(col="red", pch= 6 , cex=.5))

I want to make a plot by week, so I have specified the week_dt as a factor. Since factors will appear in alphabetical order be default, I have reordered the data.frame by week. I have also set several trellis par values for box and symbol specifications.

Step 4 – Make Boxplot

xyplot(fluoride ~ by_week, my_Data,
         main = "Box & Whisker Plot: Weekly Results \n 
           Fluoride Concentrations \n
           (10/1/02 - 1/20/03)",
    par.settings=list(axis.text=list(cex=0.9),
    fontsize=list(text=9)),
    panel = panel.bwplot, horizontal = FALSE,
    ylab = "Fluoride Concentration - mg/l",
    xlab = "Week Begining " ,
    box.ratio =0.5)

This function makes the boxplot, one for each of the 16 weeks. The title and axis labels are specified as well as text font size and the ratio of box width to height.

Note that I use the lattice library xyplot function. I specify the boxplot with the panel = panel.bwplot argument.

Here’s a link to the source data and R script file so that you can try it for yourself. Be sure to correct the link in the R script so that it points to the path and file name that you give your copy of the source data file.

Categories: Chart Principles · R Learning Curve
Tagged:

7 responses so far ↓

  • Oldi // July 22, 2009 at 5:33 AM | Reply

    great! I just found the way to optimize my boxplot by using your R-code. I was looking for a way to sort my BPs by a values but it did not work out well so far… until I found your neat little page.

    by the way, using library(gplots) one can use boxplot.n which plots the number of cases per boxplot. could be a nice combination but it seems that boxplot.n does not work with your trellis style.

    Any idea?

    cheers
    Oldi

  • Ed Ferrero // January 31, 2009 at 6:01 PM | Reply

    It has been a long time since I used box-plots in Excel. More recently, I have revisited these using a scatter chart approach.

    See the ‘Box Whisker Charts’ download at http://www.edferrero.com/ExcelCharts/tabid/102/Default.aspx

    This does use outlier markers.

  • Fabrice // November 21, 2008 at 4:01 PM | Reply

    There is another free alternative for Excel :

    http://sparklines-excel.blogspot.com/

    Nixnut implemented a UDF to easily create a Tukey Boxplot inside a cell.

    His Excel 2000 version of the add-in can display several other type of boxplots, including outliers.

  • nixnut // November 21, 2008 at 12:06 PM | Reply

    I can sort of make those in Excel with UDF’s. Example here: http://dev.gentoo.org/~nixnut/images/tukeyboxplots.png. I only have horizontal boxplots, not vertical ones. And extreme outliers are display differently from mild outliers. I used the definition on http://en.wikipedia.org/wiki/Box_plot

  • william jones // November 19, 2008 at 1:51 PM | Reply

    You fail to mention Astute (now called Analyse-it) which first offered box (including Tukey outlier box plots) in 1993. They were the first to offer box plots back when I was using Excel 5, and now offer both skeletal and outlier box plots now in Analyse-it

  • dkodpe // November 18, 2008 at 3:59 PM | Reply

    Jon:

    Welcome to my blog. I’m glad to hear that you will be looking into R. It’s very powerful and not that hard for VBA programmers like yourself.

    My approach is to use a mix of Excel and R, depending on the situation. Just like a good woodworker with a proper set of tools, I want to use the right tool for the task. There are times when Excel is easier and faster and there are times when R and the family of libraries is the way to go.

    Kelly

  • jonpeltier // November 18, 2008 at 3:20 PM | Reply

    Kelly -

    This is very relevant to me now. I’m close to marketing an Excel box plot generator which improves on my previous attempts to handle outliers.

    In addition my utility tries to account for the many ways of defining the 25th and 75th percentiles. There’s Excel’s way, there’s Minitab’s way, there are a number of other ways (including one called CDF which is usually in between Excel’s and Minitab’s), and there is Tukey’s way. Tukey’s method doesn’t even use percentiles, but rather, easily determined “hinges”, made for quick manual computation back in the days before one could use a computer program to spit out quartile measurements. Using different measures of “quartile” doesn’t seem to make much difference, but it’s often enough to move an observation from within the fences to outlier land.

    Don’t construe this comment as a plug for my utility: I’ll do that on my own blog. Your recent activities have led me to download R-for-Windows, and one day soon I want to give it a spin. Your R tutorials have convinced me that it should not be too difficult to program R.

Leave a Comment