R Works With Factors

In this post, I show how to create a factor variable from categorical data and use conditional boxplots to analyze the role of the factor and the decade in time series data.

Introduction

R has a factor data type that lets the user store categorical data and treat the values as category labels, or levels, rather than as characters. Factors allow the user to split data into subsets and compare patterns among those subsets.

There are many categorical data variables that can be treated as factors with specific levels:

  • Gender (male, female)
  • Language (English, French, German, etc.)
  • Energy Source (fossil, renewable)
  • Profession (medical, legal, engineering, etc.)
  • ENSO Phase (LaNina, Neutral, ElNino)

Factor variables can be created from either numeric codes or character strings. They are useful in multivariate plotting, data summarization and statistical modeling.
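To illustrate the point about numeric and character input, here is a minimal sketch that builds the same factor both ways. The vectors phase_chr and phase_num are made-up values for illustration, not the data used later in this post.

```r
# Hypothetical example vectors (not from the GISS data set)
phase_chr <- c("Neutral", "ElNino", "LaNina", "Neutral", "ElNino")
phase_num <- c(2, 3, 1, 2, 3)

# Character input: set the level order explicitly,
# otherwise R sorts the levels alphabetically
f_chr <- factor(phase_chr, levels = c("LaNina", "Neutral", "ElNino"))

# Numeric input: the sorted unique codes (1, 2, 3) are paired
# with the labels in the order given
f_num <- factor(phase_num, labels = c("LaNina", "Neutral", "ElNino"))

levels(f_chr)      # the category labels
table(f_num)       # counts per level
as.integer(f_num)  # the underlying integer codes
```

Both factors end up with the same levels in the same order, which is what matters for plotting and summary functions.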

Example Data Set

In a previous post, I showed the GISS global temperature anomaly trends and ENSO phase. The ENSO phases are LaNina, Neutral and ElNino.

DaveT, in a comment to my post, suggested another way to show the GISS temperature trend and ENSO phase: change the line color to correspond to the ENSO phase. Here’s the GISS trend chart based on DaveT’s suggestion.

[Figure: GISS temperature anomaly trend, with line color by ENSO phase]

The GISS temperature anomaly has been rising since 1950. The ENSO phase seems to have a role, but it is hard to tell from this plot.
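For readers who want to try the line-color idea, here is one possible sketch in base graphics. It is not the script behind the chart above: the data are synthetic, and the names dt, anom, phase and phase_col, as well as the color choices, are assumptions made for this example.

```r
# Synthetic stand-in data; the real chart uses GISS anomalies and ENSO codes
set.seed(1)
n     <- 24
dt    <- seq(as.Date("2006-01-15"), by = "month", length.out = n)
anom  <- round(0.5 + cumsum(rnorm(n, 0, 0.03)), 3)
phase <- factor(rep(c(1, 2, 3), each = 8),
                labels = c("LaNina", "Neutral", "ElNino"))

# One color per factor level (the color choice is arbitrary)
phase_col <- c(LaNina = "blue", Neutral = "grey40", ElNino = "red")

# Empty plot first, then draw each month-to-month segment in the
# color of the phase at the start of that segment
x <- as.numeric(dt)  # dates as day counts, matching the plot's x scale
plot(dt, anom, type = "n", las = 1, xlab = "", ylab = "Anomaly (deg C)")
segments(x[-n], anom[-n], x[-1], anom[-1],
         col = phase_col[as.character(phase[-n])], lwd = 2)
legend("topleft", legend = names(phase_col), col = phase_col,
       lwd = 2, bty = "n")
```

Indexing the named color vector by the factor's labels is what ties each segment to its phase.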

Let’s see how we can use factors to help explore the data.

Creating Factors in R

Here’s a list of the last 6 rows of our data file.  

       GISS_dt AnomC Enso_p
692 2007-08-15 0.560      1
693 2007-09-15 0.500      1
694 2007-10-15 0.550      1
695 2007-11-15 0.490      1
696 2007-12-15 0.400      1
697 2008-01-15 0.216      1

We have 3 variables: month (GISS_dt), temperature anomaly (AnomC), and ENSO phase (Enso_p). We can treat the ENSO phase as a factor to see how temperature anomaly varies by phase.

We can use the factor() function to convert the Enso_p categorical variable into a factor with 3 levels:

# Convert the numeric ENSO phase codes to a factor with three labeled levels
Enso_f <- factor(Enso_p,
                 labels = c("LaNina", "Neutral", "ElNino"))

Pretty simple! Now we can make a box plot of GISS anomalies by ENSO phase to see what role it plays in global temperature trends.

[Figure: box plot of GISS anomaly by ENSO phase]

Here’s the script needed to make the boxplot:

  boxplot(AnomC ~ Enso_f,
          main = "Box Plot: GISS Anomaly By ENSO Phase",
          las = 1, pch = 16, cex = 1,
          ylab = expression(paste("Anomaly - ", degree, "C")),
          ylim = c(-0.4, 1.0), border = "grey", boxwex = 0.3)

Notice that R treated Enso_f as a factor, prepared the subgroups by factor level, and calculated the box plot parameters automatically.
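To see what that grouping looks like behind the scenes, here is a small sketch using synthetic numbers. The vectors anom and phase are made up for illustration and stand in for AnomC and Enso_f.

```r
# Synthetic anomalies and phase codes, for illustration only
set.seed(42)
anom  <- c(rnorm(20, 0.0, 0.1), rnorm(20, 0.2, 0.1), rnorm(20, 0.3, 0.1))
phase <- factor(rep(1:3, each = 20),
                labels = c("LaNina", "Neutral", "ElNino"))

# split() returns one numeric vector per factor level
groups <- split(anom, phase)
sapply(groups, length)

# Five-number summary per level: min, lower hinge, median, upper hinge, max
sapply(groups, fivenum)

# The statistics boxplot() would draw, computed without plotting.
# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge,
# upper whisker; one column per factor level
bp <- boxplot(anom ~ phase, plot = FALSE)
bp$names
bp$stats
```

The whisker ends in bp$stats can differ from the minimum and maximum when a group contains outliers, which boxplot() reports separately in bp$out.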

Creating a Decade Variable

This boxplot shows that the ENSO phase has some impact on the GISS temperature anomaly. Let’s make a decade variable so that we can see how the temperature anomaly changes with both decade and ENSO phase.

Here’s the script to make the decade variable:

# Extract the 4-digit year from each date, then round down to the decade
GISS_yr.d <- format(as.Date(as.character(GISS_dt),
                    format = "%Y-%m-%d"), format = "%Y")
GISS_yr_n <- as.numeric(GISS_yr.d)
GISS_dec  <- ((GISS_yr_n - 1950) %/% 10) * 10 + 1950
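As a quick check of the decade arithmetic, here is a worked example on a handful of dates. The dates vector is a hypothetical sample in the same format as GISS_dt; yr and dec are names chosen for this sketch.

```r
# Hypothetical sample of dates, in the same format as GISS_dt
dates <- c("1950-01-15", "1959-12-15", "1960-01-15",
           "1987-06-15", "2008-01-15")

yr  <- as.numeric(format(as.Date(dates, format = "%Y-%m-%d"), "%Y"))

# Integer division by 10 drops the last digit of the year offset
dec <- ((yr - 1950) %/% 10) * 10 + 1950
dec
# 1950 1950 1960 1980 2000

# Because 1950 is a multiple of 10, this shorter form gives the same result
identical(dec, (yr %/% 10) * 10)
```

The %/% operator is integer division, so every year from 1980 through 1989 maps to 1980.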

Now we can make a boxplot of GISS anomalies by decade.

[Figure: box plot of GISS anomaly by decade]

This boxplot shows how the decadal GISS temperature anomaly has increased from the 1950s to today.

Lattice Plot Conditioned On ENSO & Decade

We can make a lattice chart to show the combined effects of decade and ENSO phase:

[Figure: lattice plot of GISS anomaly by decade, conditioned on ENSO phase]

Separating the ENSO phase tells us that GISS anomalies are rising for all three phases. It also tells us that the LaNina phase is colder than the Neutral and ElNino phases. It’s hard to tell the difference between the Neutral and ElNino phases from this plot.
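The script for the lattice chart is not listed in this post, so here is one way such a conditioned plot might be produced with bwplot() from the lattice package. This is a sketch under assumptions: the data frame d holds synthetic values, and the panel layout and axis labels are my guesses rather than the settings used for the chart above.

```r
library(lattice)  # ships with standard R distributions

# Synthetic stand-in for the GISS data (values are made up)
set.seed(7)
d <- expand.grid(rep = 1:10,
                 GISS_dec = c(1950, 1960, 1970),
                 Enso_p   = 1:3)
d$AnomC  <- 0.01 * (d$GISS_dec - 1950) + 0.05 * d$Enso_p +
            rnorm(nrow(d), 0, 0.05)
d$Enso_f <- factor(d$Enso_p, labels = c("LaNina", "Neutral", "ElNino"))

# One panel per ENSO phase, one box per decade within each panel
p <- bwplot(AnomC ~ factor(GISS_dec) | Enso_f, data = d,
            layout = c(3, 1),
            xlab = "Decade", ylab = "Anomaly (deg C)")
print(p)
```

The part of the formula after the | symbol is the conditioning factor: lattice draws one panel for each of its levels.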

We can use R’s tapply() function to calculate the mean GISS anomaly for each ENSO phase and decade subset. Here’s the script.

Enso_mean <- tapply(AnomC, list(Enso_f, GISS_dec),
                    FUN = mean)

Here’s the Enso_mean summary table of GISS anomalies by decade and ENSO phase, produced by tapply() in a single line of R script.

[Table: mean GISS anomaly by ENSO phase and decade]
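To make the shape of the tapply() result concrete, here is a tiny worked example whose means are easy to check by hand. The vectors anom, phase and dec are hypothetical, not the GISS data.

```r
# Tiny hypothetical data set: 2 phases x 2 decades, 2 values in each cell
anom  <- c(0.0, 0.2, 0.1, 0.3, 0.4, 0.6, 0.5, 0.7)
phase <- factor(c(1, 1, 2, 2, 1, 1, 2, 2), labels = c("LaNina", "ElNino"))
dec   <- c(1950, 1950, 1950, 1950, 1960, 1960, 1960, 1960)

# Rows come from the first grouping variable, columns from the second
m <- tapply(anom, list(phase, dec), FUN = mean)
m
#        1950 1960
# LaNina  0.1  0.5
# ElNino  0.2  0.6
```

The result is an ordinary matrix, so a single cell can be read with m["LaNina", "1960"].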

Developing this table in Excel would require Index-Match, Sumproduct, array formula or pivot table techniques, and would take considerably more time to set up than this single line of R script.

R’s factor data structure is a major reason why Excel users should consider R for multivariate data analysis tasks.

 

The data files and R script are available on my ProcessTrends.com site.
