In this post, I show how to create a factor variable from categorical data and analyze the role of the factor and the decade using conditional boxplots on time series data.
R has a factor data type that stores categorical data and treats the values as category labels (levels) rather than as plain character strings. Factors let the user split data into subsets and compare patterns among those subsets.
There are many categorical data variables that can be treated as factors with specific levels:
- Gender (male, female)
- Language (English, French, German, etc.)
- Energy Source (fossil, renewable)
- Profession (medical, legal, engineering, etc.)
- ENSO Phase (LaNina, Neutral, ElNino)
Factors can be created from either numeric values or character strings. They are useful in multivariate plotting, data summarization, and statistical modeling.
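As a quick illustration of the idea, here is a minimal sketch that turns a character vector into a factor. The energy values are made up for this example and are not part of the GISS data set.

# Hypothetical example values, not taken from the GISS data file
energy <- c("fossil", "renewable", "fossil", "fossil", "renewable")

# factor() stores each value as an integer code plus a set of level labels
energy_f <- factor(energy)

levels(energy_f)      # "fossil" "renewable" (sorted alphabetically by default)
as.integer(energy_f)  # 1 2 1 1 2
table(energy_f)       # count of observations per level

The levels are what plotting and summary functions use to form the subgroups.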
Example Data Set
DaveT, in a comment on my post, suggested another way to show the GISS temperature trend and ENSO phase: change the line color to correspond to the ENSO phase. Here’s the GISS trend chart based on DaveT’s suggestion.
The GISS temperature anomaly has been rising since 1950. The ENSO phase seems to play a role, but it is hard to tell from this plot.
Let’s see how we can use factors to help explore the data.
Creating Factors in R
Here’s a list of the last 6 rows of our data file.
       GISS_dt AnomC Enso_p
692 2007-08-15 0.560      1
693 2007-09-15 0.500      1
694 2007-10-15 0.550      1
695 2007-11-15 0.490      1
696 2007-12-15 0.400      1
697 2008-01-15 0.216      1
We have 3 variables: the month (GISS_dt), the temperature anomaly (AnomC), and the ENSO phase (Enso_p). We can treat the ENSO phase as a factor to see how the temperature anomaly varies by phase.
We can use the factor() function to convert the Enso_p categorical variable into a factor with 3 levels:
Enso_f <- factor(Enso_p, labels = c("LaNina", "Neutral", "ElNino"))
Pretty simple! Now we can make a box plot of GISS anomalies by ENSO phase to see what role it plays in global temperature trends.
Here’s the script needed to make the boxplot:
boxplot(AnomC ~ Enso_f,
        main = "Box Plot: GISS Anomaly By ENSO Phase",
        las = 1, pch = 16, cex = 1,
        ylab = expression(paste("Anomaly - ", degree, "C")),
        ylim = c(-0.4, 1.0),
        border = "grey", boxwex = 0.3)
Notice that R treated Enso_f as a factor, split the data into subgroups by factor level, and calculated the box plot parameters for each subgroup automatically.
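To see the grouping that boxplot() does behind the scenes, here is a small sketch on made-up numbers. The 1/2/3 phase coding is an assumption (the rows listed above only show the value 1), and the names phase_code, anom and phase_f are hypothetical.

# Made-up values for illustration only; the 1/2/3 coding is an assumption
phase_code <- c(1, 1, 2, 2, 3, 3, 1, 2, 3)
anom <- c(0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.15, 0.35, 0.55)

# Naming the levels explicitly keeps the label order independent of the data
phase_f <- factor(phase_code, levels = c(1, 2, 3),
                  labels = c("LaNina", "Neutral", "ElNino"))

# plot = FALSE returns the per-level statistics instead of drawing the plot
bp <- boxplot(anom ~ phase_f, plot = FALSE)
bp$names       # one box per factor level
bp$n           # number of observations in each group
bp$stats[3, ]  # row 3 holds the median of each group

Passing levels as well as labels is a defensive choice: with labels alone, R assigns the labels to the sorted unique values it finds, so a phase that is missing from the data would shift the labels.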
Creating a Decade Variable
This boxplot shows that the ENSO phase has some impact on the GISS temperature anomaly. Let’s make a decade variable so that we can see how the temperature anomaly changes with both decade and ENSO phase.
Here’s the script to make the decade variable:
GISS_yr.d <- format(as.Date(as.character(GISS_dt), format = "%Y-%m-%d"), format = "%Y")
GISS_yr_n <- as.numeric(GISS_yr.d)
GISS_dec <- (as.numeric((GISS_yr_n - 1950) %/% 10) * 10) + 1950
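The decade arithmetic is easier to follow on a few sample dates. This is a sketch with illustrative dates (dt, yr, dec and dec_short are hypothetical names), not rows read from the data file.

# Illustrative dates, not rows from the data file
dt <- as.Date(c("1950-01-15", "1959-12-15", "1987-06-15", "2008-01-15"))
yr <- as.numeric(format(dt, "%Y"))  # 1950 1959 1987 2008

# Same arithmetic as the script: whole decades elapsed since 1950, times 10
dec <- ((yr - 1950) %/% 10) * 10 + 1950
dec                                 # 1950 1950 1980 2000

# Because 1950 is a multiple of 10, this shorter form gives the same result
dec_short <- (yr %/% 10) * 10
identical(dec, dec_short)

The %/% operator is integer division, so 37 %/% 10 gives 3 and the year 1987 lands in the 1980 decade.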
Now we can make a boxplot of GISS anomalies by decade.
This boxplot shows how the decadal GISS temperature anomaly has increased from the 1950s to today.
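The script for the decade boxplot is not listed here, so the following is only a sketch of how such a plot could be made. It uses a synthetic monthly series with an invented upward trend in place of the GISS data, and all variable names are hypothetical.

# Synthetic stand-in for the GISS series (all values are invented)
set.seed(1)
dt <- seq(as.Date("1950-01-15"), as.Date("2008-01-15"), by = "month")
yr <- as.numeric(format(dt, "%Y"))
dec <- (yr %/% 10) * 10
anom <- 0.01 * (yr - 1950) + rnorm(length(dt), sd = 0.1)

# Converting the decade to a factor gives one box per decade
dec_f <- factor(dec)
boxplot(anom ~ dec_f, las = 1, border = "grey", boxwex = 0.3,
        main = "Synthetic anomaly by decade",
        xlab = "Decade", ylab = "Anomaly")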
Lattice Plot Conditioned On ENSO & Decade
We can make a lattice chart to show the combined effects of decade and ENSO phase:
Separating the ENSO phase tells us that GISS anomalies are rising for all three phases. It also tells us that the LaNina phase is colder than the Neutral and ElNino phases. It’s hard to tell the difference between the Neutral and ElNino phases from this plot.
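A conditioned plot of this kind can be sketched with bwplot() from the lattice package. The layout below (one panel per ENSO phase, one box per decade) is my guess at a reasonable arrangement rather than the exact script behind the chart, and the data are synthetic with an assumed 1/2/3 phase coding.

library(lattice)  # ships with standard R installations

# Synthetic data; the values and the 1/2/3 phase coding are assumptions
set.seed(42)
n <- 360
yr <- sample(1950:2007, n, replace = TRUE)
dec_f <- factor((yr %/% 10) * 10)
phase_f <- factor(sample(1:3, n, replace = TRUE), levels = 1:3,
                  labels = c("LaNina", "Neutral", "ElNino"))
anom <- 0.01 * (yr - 1950) + 0.05 * as.integer(phase_f) + rnorm(n, sd = 0.1)
d <- data.frame(anom, dec_f, phase_f)

# The formula y ~ x | g draws one panel per level of the conditioning factor g
p <- bwplot(anom ~ dec_f | phase_f, data = d, layout = c(3, 1),
            scales = list(x = list(rot = 45)),
            xlab = "Decade", ylab = "Anomaly")
print(p)

Swapping the two factors in the formula (anom ~ phase_f | dec_f) would instead give one panel per decade with the three phases side by side.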
We can use R’s tapply() function to calculate the mean GISS anomaly for each combination of ENSO phase and decade. Here’s the script.
Enso_mean <- tapply(AnomC, list(Enso_f, GISS_dec), FUN = "mean")
Here’s the Enso_mean summary table of GISS anomalies by decade and ENSO phase, produced by tapply() in a single line of R script.
Developing this table in Excel would require Index-Match, Sumproduct, array formula, or pivot table techniques and would take considerably more time to set up than this single line of R script.
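To make the shape of the result concrete, here is a toy version of the same call. The numbers are made up and only two phases and two decades are used, so this is not the actual GISS table.

# Made-up values; the real table depends on the GISS data file
anom <- c(0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8)
phase_f <- factor(c("LaNina", "LaNina", "ElNino", "ElNino",
                    "LaNina", "LaNina", "ElNino", "ElNino"),
                  levels = c("LaNina", "ElNino"))
dec <- c(1950, 1950, 1950, 1950, 1960, 1960, 1960, 1960)

# A list of two grouping variables yields a two-way table of means
m <- tapply(anom, list(phase_f, dec), FUN = mean)
m                   # rows are phases, columns are decades
m["LaNina", "1950"] # mean of 0.1 and 0.3

The first grouping variable in the list becomes the rows and the second becomes the columns, so cells can be looked up by phase and decade name.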
R’s factor data structure is a major reason why Excel users should consider R for multivariate data analysis tasks.
The data files and R script are available on my ProcessTrends.com site.