Tag Archives: R and Excel

Don’t Try This With Excel

In this post I show 4 charts of the same data to demonstrate  what Excel chart users are missing by not having a more powerful charting tool. This post,  building on my previous discussion of using factors for conditional formatting,  shows the potential advantages of plotting summary values and bounding area polygons . These analytical displays are not readily available to even advanced Excel users. Continue reading

Learn R Toolkit Videos (Updated 6/14/14)

My list of free Learn R Toolkit videos is growing. Here’s the latest list of free videos:

  • Get Familiar with R
  • Download & Install R
  • Configure Working Directory – Set Start-in Folder
  • Example R Session
  • Getting R Help
  • R Packages
  • Compare R and Excel Worlds
  • Missing Data – R and Excel

So there’s no reason for you to put off learning R. You now have a risk free way to start using R  for your advanced charts.

Excel’s Missing Factor

In this post I show how R’s factor data type allows users to produce effective charts much more quickly and simply than Excel. Since Excel does not include a “factor” data type, users need to use tricks and workarounds to categorize and display factor type data.

Excel Example Data Set

In  his “VBA to Split Data Range into Multiple Chart Series”  (May, 2008) , Jon Peltier writes:

A common problem is to make a chart from a list like this one. You want separate series for each item in one column (e.g., the cities in the first column), but your list is different every time, and you have to spend an hour assigning data to each series in the chart.

This is a job for VBA. It’s possible to write a simple loop that reads the first column of the range, grouping rows together by item.”

Continue reading

Learn R Toolkit (Updated 6/14/14)

Learn R Toolkit: I have developed this toolkit to help Excel users quickly learn R so that you can make those advanced charts that you really want to make. I use over 100 PowerPoint slides, 19 videos 40 R scripts, 12 data files and 3 Excel workbooks to lead you up the R learning curve as quickly and comfortably as possible.

As an Excel user, I explain essential R concepts using Excel examples and terminology so that you can learn R fundamentals using your Excel knowledge. The videos demonstrate the R scripts and then you use the same R scripts in a series of assignments. The R scripts are fully commented and organized for reuse so that you can use them as starter scripts for you real world projects. Finally, the PowerPoint slides can be printed out to give you a valuable memory jogger resource. This toolkit will save you many hours learning R.

Visit my web page to get more information about Learn R Toolkit.

Video To Help Excel Users Get Familiar With R

I’ve made a short (9 minute) video to help Excel users get familiar with R.  This video gives a quick overview of the R user interface, demonstrates an R session and walks through a short R chart script. This video is for those Excel users who have heard about R and would like to get a better feel for how it works and what its like.

Here’s  a link to my Getting Familiar video.

Let me know what you think.

Step Charts: R is Easier Than Excel

In this post, I show how to make a Step Chart with R.  The chart also includes a lowess smoother and annotation.  Readers can visit my ProcessTrends.com site to see how to make a step chart in Excel. Continue reading

Using R to Enhance an Excel Chart

In this post I show how to add a loess fit from R to an Excel chart. Continue reading

R Lattice Plot Beats Excel Stacked Column Chart – Update 3

What’s Wrong With Excel’s Stacked Column Chart  This is my 2nd post on BP Oil Statistical Review of World Energy – June 2008, post 1 is here. In this post, I discuss Excel’s stacked column charts, using BP’s Regional Consumption Pattern  2007 as an example.           First, the good news. While I’m not sure what chart package BP used, this chart looks a lot like an Excel stacked column chart with some color, Y axis position and label enhancements.  The bad news is that BP used a stacked column chart which  is the Excel way of showing three variables on a 2D display. In this case, the variables are: Region of World, Fuel type and percent of energy use by fuel in that region. It is difficult to interpret the values for the internal fuels on the stacked column chart because they do not have a common baseline. Here’s the same data in an R Lattice chart. To me, the R Lattice – trellis chart helps me to see the patterns more clearly than the stacked column chart. Notice how Asia Pacific use of coal sticks out! S & C America have the largest portion of hydroelectric use. Natural gas is used widely, with Middle East having the greatest portion. Nuclear is relatively small, with Europe leading and N America  close behind. Finally, look at oil use, Middle East is greatest user, followed by S & C America and Africa.  Could you see these details in BP’s stacked column chart? Why Are Excel Chart Users Still Using Excel for Multivariate Charts If R is so good and free, then why are Excel charters still using Excel for multivariate charts?  For me, there are two reasons: 

  1. At first, I didn’t know any better. I knew about small multiples from Tufte’s writings, however, I didn’t know there was a free tool that could do trellis – lattice type small multiples.
  2. R Learning Curve  - Naomi Robbins book, Creating More Effective Graphs, introduced me to R. I got excited about R’s capabilities, however, I found the learning curve daunting. It was easier to slip back into comfortable Excel charts rather than learn a new – better way to make charts. 

R Lattice Flexibility In a comment to this post, Tony said ” .. I  typically like to see all of the charts either in a row or column so it’s easier to compare.” Here the lattice plot the way Tony prefers. The original 2 x 3 matrix was the default. by adding a simple layout control, I changed it to a 1x 6 display.  

Update Hadley Wickham, an R heavyweight (that’s a good thing), suggested in his comment that “.. you might also want to use the reorder function to reorder the factor levels in terms of the highest use”. Since it sounded like a good idea from a really experienced R programmer, I decided to give it a try. Here’s my revised chart based on Hadley’s suggested reorder of both the fuel and region factors.

I like it! The fuel panels are now sorted by median percent energy use, with nuclear the lowest and oil the highest. The regions are sorted by magnitude of oil use, with Africa at low end and N America at high end.

Update 2 In my 1st update, I sorted the panels by magnitude of energy use, however, I was not able to sort by regional energy use within panels. I tried several sorting, ordering and reordering approaches to no avail. I finally asked the R Graphics expert, Paul Murrell, the author of R Graphics for help. Thanks again Paul.

Here’s the plot the way I really wanted it.

 

Update 3

Reader Jaanus had an interesting comment:

“Whilst the charts look nice isn’t the fact that the x-axis doesn’t begin at 0 confusing the message?

Percentage of nuclear energy usage in Middle-East is not that intuitive…I would recommend forcing the x-axis to begin at 0.” 

Here’s the chart with the x-axis starting at 0.

 

R Lattice Plot Beats Excel Stacked Area Trend Chart

Start of Series on Switch to R for Advanced Charts

I have been a long time Excel chart user, with dozens of techniques and tools for advanced Excel charts. As I pointed out in this post,  Excel’s multivariate data visualization limitations are severe. For me, the best strategy is to switch to R for advanced charts and continue to use Excel for my data analysis and simple charts. 

This post starts a series on my transition to R that may be of interest to those Excel chart users who struggle with Excel’s limited multivariate chart capabilities. I will post videos and provide source data files as well as R scripts on my ProcessTrends.com website for those who want to try R for themselves.

Working Example

In learning R graphics, I will be using the BP Statistical Review of World Energy – June 2008 report for data and examples of ineffective Excel like charts. BP does a great service for all of us by publishing their data rich report and workbook on world energy trends, an incredibly important topic from economic, environmental and security standpoints.  The 45 page report includes 8 charts types and a total of 26 charts. From a data visualization standpoint, 4 of these chart types are acceptable and 4 are ineffective.

BP seems to use Excel – PowerPoint or a similarly weak data visualization tool for their charts. The telling point is the reliance on doughnut, stacked area trend charts, stacked column charts and clustered column charts to show multivariate data. I have previously discussed BP’s chart selection in my Chart Doctor page.

Stacked Area Trend Charts

BP’s Statistical Review includes 11 stacked area trend charts. Let’s take a look at one of them.

This multivariate chart shows 3 variables: oil consumption (Y axis) by year (X axis) and region (stacked area). To add the 3rd dimension to this 2D display, BP stacked the regions with a stacked area chart.

We can see the North American and world total values clearly, however, it is quit difficult to interpret the specific trends for the 5 other regions because their baselines change over time.

I’ve made this R lattice chart to show the oil consumption trends for the 6 regions. Notice that I have added the 1965-1982 data and have not included the worldwide total consumption.

This lattice plot adds the 3rd variable by making a small multiple chart for each region. In this example, the small multiples have the same X and Y scales so comparisons are very straightforward.

The lattice plot shows the individual regional trends clearly, we can see that Asia Pacific consumption has increased dramatically, now equivalent to N American consumption. We can also see that Europe – Eurasia consumption peaked in the late 1970s and has been drifting downward since. We see similar local peaks in N America and Asia Pacific consumption in the late 1970s. We can also see the smaller market share and increasing trends for Africa, S & C America and Middle East.  

The R lattice plot lets me see both the trends and market share for each region much more clearly than the Excel like stacked area trend chart.

Excel’s Missing Multivariate Chart Capabilities

Why use stacked area trend charts? Did BP’s charters use stacked area trend charts because their chart tool doesn’t have small multiples, trellis – lattice chart capabilities?   

Excel charters often use stack & cluster techniques on their their multivariate data because Excel doesn’t have built-in small multiple, trellis – lattice tools. To work around Excel’s multivariate chart limitations, I started making panel charts in 2006.  Once I started looking into R, I quickly realized that Excel work-arounds, tricks, neat techniques aren’t enough. The tool does matter. You need real multivariate  graphical tools to effectively visualize multivariate data. Since Excel’s chart engine doesn’t support effective multivariate data displays, it’s time to switch to a tool that does.

After spending so much time mastering Excel, it just makes me wonder, “.. why the h*** doesn’t Excel have trellis – lattice capabilities? ” Tufte wrote about “small multiples” in his 1983 book, “The Visual Display of Quantitative Information“. That’s 25 years ago. Has the Microsofts Excel Team read Tufte, Few, Robbins, Cleveland? 

I’d like to hear what you think about multivariate charting with Excel.

 

 

 

 


Data Loss Aversion

I’m joining an on-going discussion chain by four data – chart blogs that I follow.  Andrew Gellman started the chain with a post on The Monkey Cage about a New York Times article on data visualizations sites like Many Eyes.  Andrew pointed out that the NYT example chart “.. is just horrible”.  “It’s a classic example of a graph that looks cool but is just confusing.”

Kaiser Fung of JunkCharts followed up with a post on Loss Aversion raising concerns about “cramming as much data into the chart as possible“. Kaiser points out that this tendency is “..taking Tuft’s concept of maximizing data-ink ratio to the extreme.” In discussing the original NYT graph, Kaiser says “.. Every piece of data is given equal footing, which results in nothing standing out.”

Jorge Camoes, following up on Kaiser’s post, points to a Tufte corollary “..To clarify, add detail” , which supports the loss aversion tendency. Jorge shows an example  chart with nine time series and asks “does it make any sense to add those nine series to a single chart?

Andreas Lipphardt of XLCubed followed up Jorge’s question on how to best show this chart data by adding an elegant set of grouped colors. 

Does Andreas’s color coding solve the readability issue? No! While it helps, it does not significantly clarify the data.  We need to rethink our chart; what are we trying to show? There are three factors in this data set: year, income class and % of households in the class.  What are we most interested in? Do we really need to show the data for each year, aren’t we more interested in the long term trend?  

To me, the most important information is the long term shift in income distribution by income group, not the year to year changes. Lets use a dot plot and directly compare 1967 and 2005 distributions.


The dot plot clarifies the situation by showing changes in income by class for just 2 years so that we can compare changes by class. The % of households in the top 3 income classes were much higher in 2005, the $50-74,900 class stayed the same and % of households with total income less than $49,900 decreased.

In this case, changing chart type improved the chart more than enhanced color coding. We need to make sure we select the most appropriate chart before we try to optimize the chart format.

Kaiser’s Loss Aversion concerns raise an important charting prinicple, clarity in our chart purpose is critical to making an effective chart.  More data or better colors won’t help a poor chart type selection.

 

Source data file link.