*In this post I show how R’s factor data type lets users produce effective charts much more quickly and simply than Excel. Since Excel does not include a “factor” data type, users must rely on tricks and workarounds to categorize and display factor-type data.*

**Excel Example Data Set**

In his post “VBA to Split Data Range into Multiple Chart Series” (May 2008), Jon Peltier writes:

“A common problem is to make a chart from a list like this one. You want separate series for each item in one column (e.g., the cities in the first column), but your list is different every time, and you have to spend an hour assigning data to each series in the chart.

“This is a job for VBA. It’s possible to write a simple loop that reads the first column of the range, grouping rows together by item.”

Here’s Jon’s finished chart.

A new R user at the Learning R blog wrote a post on March 16, 2009 using Jon’s example. The author used Hadley Wickham’s ggplot2 to reproduce Jon’s chart. Here’s a screenshot of the Learning R version.

They look pretty similar.

**Using R’s “factor” Data Type to subset() the data**

R has a built-in “factor” data type that is great for categorizing data into sub-groups. Male/female, smoker/non-smoker, income ranges (< $50K, $50K – $100K, > $100K), even cities are examples of categorical variables that take on a limited set of values. We often want to subset our data by category and then summarize or plot the results for each subset. Excel provides SUMIF(), SUMPRODUCT(), array formulas and pivot tables as partial solutions to subsetting by category. However, users are on their own when it comes to plotting data conditioned on a factor.
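As a minimal sketch of what a factor gives you, the snippet below builds a small data frame with an income-range factor and summarizes by category. The `survey` data frame and all of its values are invented for illustration; only `factor()`, `levels()`, `nlevels()`, `table()` and `tapply()` are standard base R.

```r
# Hypothetical data: these values are made up purely for illustration.
survey <- data.frame(
  income = factor(c("<50K", "50-100K", ">100K", "<50K", "50-100K", ">100K"),
                  levels = c("<50K", "50-100K", ">100K")),
  spend  = c(10, 20, 30, 14, 26, 40)
)

levels(survey$income)    # the limited set of category values
nlevels(survey$income)   # how many categories there are

# Count the rows in each category
counts <- table(survey$income)

# Mean of 'spend' within each category, with no manual grouping
means <- tapply(survey$spend, survey$income, mean)

print(counts)
print(means)
```

Because the grouping lives in the factor itself, the same two summary lines keep working when categories are added or the rows are reordered.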

Since Excel does not handle factors directly, Jon Peltier developed a custom technique to subset his data set by category name and plot each subset as a separate data series. Jon’s VBA code assumes the categorical variable is in the first column and groups the rows by the unique items in that column. This is a nice workaround to a serious Excel limitation, as long as the categorical variable is in the first column. If it is in column 2, 3 or later, then Jon needs to reorganize his data set or adjust his VBA code. What about problems with two or more factors? Suppose we wanted to look at disease incidence by gender, smoking and weight status. Jon will still be busy slicing his data set with VBA while the R user is already looking at panel charts and interpreting the results.
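To make the multi-factor case concrete, here is a rough sketch using two factors that do not sit in the first column. The `health` data frame, its column names and its values are all hypothetical. `aggregate()` and `coplot()` are base R functions, and `coplot()` is just one of several ways to draw a panel chart.

```r
# Hypothetical data, invented for illustration only.
health <- data.frame(
  age    = c(30, 45, 50, 60, 35, 40, 55, 65),
  cases  = c(5, 9, 3, 7, 6, 12, 2, 8),
  gender = factor(c("F", "F", "F", "F", "M", "M", "M", "M")),
  smoker = factor(c("no", "yes", "no", "yes", "no", "yes", "no", "yes"))
)

# Mean cases for every gender/smoker combination.
# The factors are referenced by name, so column position does not matter.
by_group <- aggregate(cases ~ gender + smoker, data = health, FUN = mean)
print(by_group)

# One scatter panel per gender/smoker combination
coplot(cases ~ age | gender * smoker, data = health)
```

Adding a third factor would mean adding one more term to the `aggregate()` formula rather than rewriting any grouping logic.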

R’s factor data type provides a more universal solution to this data subsetting problem than a VBA workaround. Here’s my R script for Jon’s example.

    # Location of the example data file
    link <- "C:\\R_Home\\Charts & Graphs Blog\\Data_org\\city_data.txt"
    par(las = 1)

    # Read City as a factor and X, Y as numeric
    my_data <- read.table(link, header = TRUE,
                          colClasses = c("factor", rep("numeric", 2)),
                          sep = " ", col.names = c("City", "X", "Y"))
    attach(my_data)

    # Draw empty axes first; the points are added one city at a time
    plot(X, Y, type = "n", xlim = c(0, 20), ylim = c(0, 20),
         xlab = "", ylab = "", bty = "o", xaxs = "i", yaxs = "i")

    # One pass per factor level: subset, plot and label that city's rows
    for (i in 1:nlevels(City)) {
      my_sub <- subset(my_data, City == levels(City)[i])
      points(my_sub$X, my_sub$Y, col = i, pch = 14 + i)
      text(my_sub$X, my_sub$Y, levels(City)[i], cex = 0.85, pos = 4)
    }

Here’s my R chart version.

In this short script, I read in Jon’s example data, specify City as a factor, subset, plot and label the City values with a simple for() loop. Clean, simple and quick!
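For readers who want to try the idea without the original file, here is a self-contained sketch. The inline data is a hypothetical stand-in for city_data.txt, with invented city names and values. It also swaps the for() loop for a vectorized alternative: since a factor is stored as integer codes, those codes can pick the color and plotting symbol for every row in one call. This is a variation on the approach, not the script used for the chart above.

```r
# Hypothetical stand-in for city_data.txt; all values are invented.
txt <- "City X Y
Boston 2 5
Boston 4 9
Chicago 6 3
Chicago 8 12
Denver 10 15
Denver 14 7"

my_data <- read.table(text = txt, header = TRUE,
                      colClasses = c("factor", "numeric", "numeric"))

# Integer code of each row's factor level (1 for the first level, and so on)
codes <- as.integer(my_data$City)

# Color and symbol are chosen per row from the factor codes
plot(my_data$X, my_data$Y, col = codes, pch = 14 + codes,
     xlim = c(0, 20), ylim = c(0, 20), xlab = "", ylab = "")

# Label each point with its city name
text(my_data$X, my_data$Y, labels = as.character(my_data$City),
     cex = 0.85, pos = 4)
```

The loop version is arguably easier to read for newcomers, while the vectorized form is shorter. Both rely on the same thing: City being a factor.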

Excel users who work with multivariate data and find themselves spending time looking for or developing Excel tricks and workarounds should seriously consider adding R to their analytical toolkit. Why reinvent statistical charting techniques in Excel when they are readily available in R?

