Charts & Graphs

R Lattice Plot Beats Excel Stacked Column Chart – Update 3

October 9, 2008 · 24 Comments

What’s Wrong With Excel’s Stacked Column Chart  This is my 2nd post on BP Oil Statistical Review of World Energy – June 2008, post 1 is here. In this post, I discuss Excel’s stacked column charts, using BP’s Regional Consumption Pattern  2007 as an example.           First, the good news. While I’m not sure what chart package BP used, this chart looks a lot like an Excel stacked column chart with some color, Y axis position and label enhancements.  The bad news is that BP used a stacked column chart which  is the Excel way of showing three variables on a 2D display. In this case, the variables are: Region of World, Fuel type and percent of energy use by fuel in that region. It is difficult to interpret the values for the internal fuels on the stacked column chart because they do not have a common baseline. Here’s the same data in an R Lattice chart. To me, the R Lattice – trellis chart helps me to see the patterns more clearly than the stacked column chart. Notice how Asia Pacific use of coal sticks out! S & C America have the largest portion of hydroelectric use. Natural gas is used widely, with Middle East having the greatest portion. Nuclear is relatively small, with Europe leading and N America  close behind. Finally, look at oil use, Middle East is greatest user, followed by S & C America and Africa.  Could you see these details in BP’s stacked column chart? Why Are Excel Chart Users Still Using Excel for Multivariate Charts If R is so good and free, then why are Excel charters still using Excel for multivariate charts?  For me, there are two reasons: 

  1. At first, I didn’t know any better. I knew about small multiples from Tufte’s writings, however, I didn’t know there was a free tool that could do trellis – lattice type small multiples.
  2. R Learning Curve  - Naomi Robbins book, Creating More Effective Graphs, introduced me to R. I got excited about R’s capabilities, however, I found the learning curve daunting. It was easier to slip back into comfortable Excel charts rather than learn a new – better way to make charts. 

R Lattice Flexibility In a comment to this post, Tony said ” .. I  typically like to see all of the charts either in a row or column so it’s easier to compare.” Here the lattice plot the way Tony prefers. The original 2 x 3 matrix was the default. by adding a simple layout control, I changed it to a 1x 6 display.  

Update Hadley Wickham, an R heavyweight (that’s a good thing), suggested in his comment that “.. you might also want to use the reorder function to reorder the factor levels in terms of the highest use”. Since it sounded like a good idea from a really experienced R programmer, I decided to give it a try. Here’s my revised chart based on Hadley’s suggested reorder of both the fuel and region factors.

I like it! The fuel panels are now sorted by median percent energy use, with nuclear the lowest and oil the highest. The regions are sorted by magnitude of oil use, with Africa at low end and N America at high end.

Update 2 In my 1st update, I sorted the panels by magnitude of energy use, however, I was not able to sort by regional energy use within panels. I tried several sorting, ordering and reordering approaches to no avail. I finally asked the R Graphics expert, Paul Murrell, the author of R Graphics for help. Thanks again Paul.

Here’s the plot the way I really wanted it.

 

Update 3

Reader Jaanus had an interesting comment:

“Whilst the charts look nice isn’t the fact that the x-axis doesn’t begin at 0 confusing the message?

Percentage of nuclear energy usage in Middle-East is not that intuitive…I would recommend forcing the x-axis to begin at 0.” 

Here’s the chart with the x-axis starting at 0.

 

Categories: Multivariate Charts
Tagged: , ,

24 responses so far ↓

  • R ggplot2 Plot Beats Excel Stacked Column Chart « Charts & Graphs // December 19, 2008 at 12:45 PM | Reply

    [...] my October 9, 2008 post, I showed the regional energy use by fuel type for 2007, based on BP Oil Statistical Review of [...]

  • Wayne // December 16, 2008 at 11:27 AM | Reply

    I’ve sent my script to your processtrends.com email address.

    You’ll notice that ggplot2 leaves less margin around the edges and gray’s the axis tick labels, to focus attention on the data itself.

    At first, the gray background (which can be changed easily) reminded me a bit of Excel’s gray background, but the difference is that Excel’s is dark enough to cut overall contrast — making it harder to read — while ggplot’s is lighter.

    (P.S. The graph works fine in 0.8 of ggplot2, but a bug fix was released in the last couple of days: 0.8.1.)

  • dkodpe // December 15, 2008 at 5:47 PM | Reply

    Wayne:

    Thanks for the tip on ggplot2. I have it on my list to try out. Your comment has convinced me to move it up to the top.

    Your script is being chewed up by WordPress, when it sees the “” it thinks you are sending html tags.

    You can get around this by usign tag or tag.

    Since I’d like to put your ggplot2 in the post as update 4, please send me a simple text file of your script and I’ll add the plot to the post.

    Kelly

  • Wayne // December 15, 2008 at 4:06 PM | Reply

    One last try, then I have to kick myself and quit. The less-than and greater-thans are messing it up…

    #########
    library (ggplot2)
    energy$Fuel.median <- with(energy, rev()reorder(Fuel, Percent, mean))
    qplot(Region, Percent, data=energy, geom=”bar”, main=”Regional Energy Consumption Patterns – 2007″, xlab=”Percent of Regional Energy Use”, ylab=”Region”) + coord_flip() + facet_grid (Fuel.median ~ .)
    #########

  • Wayne // December 15, 2008 at 4:03 PM | Reply

    Hmmm… got mis-formatted. Let me try again…

    > library (ggplot2)

    > energy$Fuel.median qplot(Region, Percent, data=energy, geom=”bar”, main=”Regional Energy Consumption Patterns – 2007″, xlab=”Percent of Regional Energy Use”, ylab=”Region”) + coord_flip() + facet_grid (Fuel.median ~ .)

  • Wayne // December 15, 2008 at 4:02 PM | Reply

    I think you will like ggplot2. It takes a different mindset from other R graphing alternatives, but it really makes sense as you proceed to charts that are more and more complex. A quick start on your Update 3 trellis chart:

    > library (ggplot2)

    > energy energy$Fuel.median qplot (Region, Percent, data=energy, geom=”bar”, main=”Regional Energy Consumption Patterns – 2007″, xlab=”Percent of Regional Energy Use”, ylab=”Region”) + coord_flip() + facet_grid (Fuel.median ~ .)

  • Is it “..too hard” to change R plot defaults? Update 1 « Charts & Graphs // November 25, 2008 at 2:38 PM | Reply

    [...] ran into Derek’s barplot situation ( bar graph not starting at 0) in my lattice plot versus stacked bar chart post. This “problem” is easily solved with the addition of  the xlim=c(0,55) argument to my [...]

  • dkodpe // November 18, 2008 at 3:54 PM | Reply

    Andreas

    Your line of script may have been disrupted by WordPress. It looks like you are calling the function xyplot() rather than plot(). Can you send me the actual r script and data files?

    Send them to koday**at**processtrends**dot*com with appropriate corrections to remove characters that I’ve added to fool the robotic e-mail scalpers.

    Kelly

  • Andreas // November 17, 2008 at 7:02 AM | Reply

    Hi! thanks for your posts. I’ trying to learn R and will come back ofte :-)

    Right now I wonder about this: When doing a normal (not lattice) plot, xlim doesent really force the axis to start at 0

    > x y plot(y~x, xlim=c(0,10),ylim=c(0,10))

    there is still some kind of inner margin. Any idea how to remove it?

    sincerely

  • Hadley Wickham // October 26, 2008 at 9:56 AM | Reply

    I didn’t notice the fact that you are reordering regional energy use _within_ each panel. I’m not convinced this is a good idea because it’s very easy to miss (I did!) if you just quickly skim the axis labels and assume that they are all the same.

    I’m disturbed that the problem with the bars not starting at 0 occurs in all of the bar charts. This is documented in panel.barchart: “Otherwise, defaults to ‘NULL’, in which case bars start at
    the left (or bottom) end of a panel. This choice is somewhat unfortunate, as it can be misleading, but is the default for historical reasons.” So the moral is to always set origin to 0.

  • dkodpe // October 24, 2008 at 10:20 AM | Reply

    Hadley:

    I’ve started looking at your ggplot2 documentation. I am impressed and will give it a try very soon.

    Here’s the code I used to produce my lattice plot.

    #######
    library(lattice)
    my_data <- read.table(”http://processtrends.com/Files/2007_energy_use_by_region_type.csv”,
    , sep =”,”, header=TRUE,
    skip=”",
    colClasses=c(”factor”, “factor”, “numeric”))
    # Reorder fuel factor by median % use
    fuel_median <- with(my_data, reorder(Fuel, Percent, mean))
    barchart(my_data$Region ~ Percent | fuel_median, data = my_data,
    main=”Regional Energy Consumption Patterns – 2007 “,
    sub = “Data Source: BP Statistical Review of World Energy 2008″,
    xlab=”Percent of Regional Energy Use “,
    # xlim=c(0,55),
    ylab=”Region”,
    col=”grey”,
    border=”NA”,
    par.settings=list(axis.text=list(cex=0.85), fontsize=list(text=10)),
    par.strip.text=list(cex=0.9),
    par.strip.col=”white” ,
    layout = c(1,6),
    aspect = (0.3),
    # Code provided by Paul Murrell to order by fuel% within each fuel panel
    scales=list(y=list(relation=”free”)),
    prepanel=function(x, y, …) {
    list(ylim=levels(reorder(y, x, sum)))},
    panel=function(x, y, …) {
    panel.barchart(sort(x), reorder(y, x, mean), …) })
    #########

    Once I got Jaanus’s comment, I added the the xlim(0,55) argument which solved the problem.

    When I run the script with the xlim() active, the chart plots strating at 0.

    I suspect the problem is with the prepanel() call.

  • Hadley Wickham // October 24, 2008 at 9:37 AM | Reply

    What lattice code did you use to produce the graph? I’m surprised that the bars didn’t start at zero already (they would if you used ggplot2 ;)

  • dkodpe // October 23, 2008 at 12:31 PM | Reply

    Jaanus

    I’ve adjusted the x-axis scale to start at 0.

  • Jaanus // October 22, 2008 at 4:30 PM | Reply

    Whilst the charts look nice isn’t the fact that the x-axis doesn’t begin at 0 confusing the message?

    Percentage of nuclear energy usage in Middle-East is not that intuitive…

    I would recommend forcing the x-axis to begin at 0.

  • dkodpe // October 22, 2008 at 8:17 AM | Reply

    I’ve updated the lattice plot to sort by both fuel and region within fuel, as suggested by Tony.

  • Hadley Wickham // October 13, 2008 at 6:57 PM | Reply

    One approach would be to use the plyr package and do something like:

    ddply(mydf, .(fuel), transform, region = reorder(region, perc))

    Hadley

  • dkodpe // October 13, 2008 at 11:33 AM | Reply

    Hadley:

    Thanks for the tip on the reorder function. I was able to reorder based on fuel; however, I have been stumped on how to reorder on both fuel and region within fuel.

    Tony:

    Thanks for feedback. As my note to Hadley shows, I’m still trying to figure our how to do a 2-way reorder based on fuel then region within fuel.

    Stay tunned. I’ll get it eventually.

  • Tony Rose // October 13, 2008 at 10:46 AM | Reply

    I like the most recent version with the reordering. One change that may help even more is reordering each energy type so the bars go from highest to lowest.

    I think this change would make it easier to read and provide just a little more value that what’s already available.

    I’m looking forward to learning R.

  • Hadley Wickham // October 9, 2008 at 6:55 PM | Reply

    For those graphics, you might also want to use the reorder function to reorder the factor levels in terms of the highest use.

    And it’s nice to see someone selling R to excel users! It really is much easier to do data analysis in R, once you get the hang of it.

  • Hadley Wickham // October 9, 2008 at 6:53 PM | Reply

    You might also want to have a look at my ggplot2 package. It offers similar capabilities to lattice, but based around Wilkinson’s grammar of graphics, which means it’s generally easier to build up complex graphics a piece at a time. There some nice side effects of this approach, including automatic legends that get it right 99% of the time. See http://had.co.nz/ggplot2 for more info and examples.

  • Tony Rose // October 9, 2008 at 5:11 PM | Reply

    Ahh, I like that version better. It is not that easy to make that same change in Excel.

    You could also add another series next to this vertical chart that shows energy type on the y-axis and segmented chart by region. This will allow you to see the region detail vs. energy type.

    Nice job Kelly!

  • dkodpe // October 9, 2008 at 4:43 PM | Reply

    Tony

    Good point. I’ve added a vertical version to show how easy it is to adjust lattice chart layouts in R.

  • Tony Rose // October 9, 2008 at 3:40 PM | Reply

    I also agree and have written about the ineffectiveness of stacked column charts. If I was trying to create this in Excel, I would go with a panel chart, which is similar to your R Lattice chart. I typically like to see all of the charts either in a row or column so it’s easier to compare.

  • Matt // October 9, 2008 at 3:32 PM | Reply

    I’m not familiar with R but Origin can make the lattice chart also (but not automatically). You can create several individual charts and then use the layout function to combine them. Origin is not free though. I’m going to check out R.

Leave a Comment