Using tidyr to clean messy, wide data

A while ago I had some climate data that I needed to play with, but they were in a “wide” format that made analysis in R difficult. These data were obtained from weather stations across the range of C. xantiana, and their format may make sense to us humans, but for data wrangling / analysis, they’re hard to work with.

                  site easting Pnoja_2006 Pfeju_2006 Pjuoc_2006 Pnoja_2007 Pfeju_2007 Pjuoc_2007
1          Black Gulch  361558     127.70     152.38      50.90      73.18      76.04       8.64
2              Bodfish  369100     114.59     162.21      56.68      67.22      71.59      13.51
3           Borel Road  362540     120.32     151.20      53.77      76.90      75.32       8.72
4               Camp 3  369107     173.81     168.17      27.83      60.04      64.64       8.63
5 Cannell Meadow Trail  372040     156.40     145.40      24.16     144.50      59.30       9.00
6        China Gardens  350537     132.16     171.44      29.61      99.60     139.60      17.52

The table shows a portion of the ten years of climate data (2006 to 2015). “Pnoja_2006” is precipitation received between November 2005 and January 2006, and “Pfeju_2006” is precipitation received between February and June 2006. (We’re working with a winter-germinating annual, so we’re splitting the climate into early, mid, and late portions of its life history.) I would have named things differently, but you work with what you’ve got!

We need to reshape the data into “long” format, where each row represents a single year (“year” being used loosely here, in a more biological than calendar sense: November (germination) to June (seed set)). There are many tutorials on using reshape functions in R, but I found many of them unclear, or addressing problems slightly different from my own. I decided to use tidyr and plyr, but you can do all of this with the reshape function in base R as well.

reshape digression

The tidyr results below can be obtained in one line of code with reshape (which ships with base R in the stats package, not with reshape2), like this:

climateLong <- reshape(climateWide, varying = 3:32, direction = "long", idvar = "site", sep = "_", timevar = "year")

I may just be overlooking some tidyr functionality that would make it this easy; if you know how, please let me know! I described the tidyr methods because the grammar is a bit more intuitive and I’m enjoying working with dplyr / tidyr a lot, but I don’t know if there’s any good reason to use it (for this kind of problem) instead of reshape.
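For what it’s worth, newer versions of tidyr (1.0.0 and later) appear to cover this case in a single call: pivot_longer() with the special ".value" name. The following is only a minimal sketch, not the code used for the analysis in this post. The small data frame is copied from the first two sites and two years of the table above, and the names climateToy and climateToyLong are made up for illustration.

```r
library(tidyr)

# Toy data: first two sites, years 2006 and 2007, copied from the table above
climateToy <- data.frame(
  site       = c("Black Gulch", "Bodfish"),
  easting    = c(361558, 369100),
  Pnoja_2006 = c(127.70, 114.59),
  Pfeju_2006 = c(152.38, 162.21),
  Pjuoc_2006 = c(50.90, 56.68),
  Pnoja_2007 = c(73.18, 67.22),
  Pfeju_2007 = c(76.04, 71.59),
  Pjuoc_2007 = c(8.64, 13.51)
)

# Each column name is split at "_". The first piece (Pnoja, Pfeju, Pjuoc)
# is matched to ".value", so it becomes its own output column; the second
# piece goes into a new "year" column.
climateToyLong <- pivot_longer(
  climateToy,
  cols      = -c(site, easting),
  names_to  = c(".value", "year"),
  names_sep = "_"
)

print(climateToyLong)
```

Note that year comes back as a character column here, so it would need as.integer() (or a names_transform argument) before being used as a number.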

tidyr code

Here’s the tidyr code, with more explanation beneath. You can download a test data set here.

# load packages
library(tidyr)
library(plyr)

# read in messy, wide data file
climateWide <- read.csv("Downloads/Climate.csv", header = TRUE, sep = ",")

# use gather to go from wide to long format, then separate year and season, then spread seasons back across columns
climateLong <- climateWide %>%
  gather(key, value, -site, -easting) %>%
  separate(key, into = c("season", "year"), sep = "_") %>%
  spread(season, value) %>%
  rename(c("Pnoja" = "WinterPrecip", "Pfeju" = "SpringPrecip", "Pjuoc" = "SummerPrecip"))

Stepping through the functions:

gather takes the wide df, climateWide, and flips it to long form. First you choose a column name for the ‘key’, the variable that will hold the old column names you’re flipping long-ways (here ‘Pnoja_2006’, ‘Pfeju_2006’, etc.); I just used ‘key’. The ‘value’ in this case is the observed precipitation; you can name it ‘precip’ or whatever. We also need to specify the variables we don’t want gathered (here, site and easting); these will be copied down in the new rows with their appropriate keys and values. All other columns will be gathered. (The ‘pipe’, %>%, in this code allows us to simply pass the data frame between functions without having to name the df explicitly.)
separate is needed to split the ‘key’ (e.g., Pfeju_2006) into season and year. We just tell it which column to split (‘key’, easy in this case), what we want to name the new variables we’re splitting it into (‘season’ and ‘year’), and what separates the variables in the key (here, ‘_’).
I need to use spread because, for my purposes, this data set is now a bit too long. Each seasonal precipitation observation is on a separate row (i.e., there are three rows for each site in each year), but for the analysis I’m doing, I need each row to refer to a single year. So we use the opposite of gather, spread, to bring the ‘season’ key out of long into wide format, and supply the value associated with it (‘value’).
Then we just use rename (from plyr) to give our seasonal precipitation variables clearer names.
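To make the four steps easier to follow, here is a sketch that runs them one at a time and prints the shape of each intermediate result. It assumes a tiny data frame copied from the first two sites and two years of the table at the top of the post. The names climateToy and step1 to step4 are made up for illustration and are not part of the original analysis.

```r
library(tidyr)

# Toy data: first two sites, years 2006 and 2007, copied from the wide table
climateToy <- data.frame(
  site       = c("Black Gulch", "Bodfish"),
  easting    = c(361558, 369100),
  Pnoja_2006 = c(127.70, 114.59),
  Pfeju_2006 = c(152.38, 162.21),
  Pjuoc_2006 = c(50.90, 56.68),
  Pnoja_2007 = c(73.18, 67.22),
  Pfeju_2007 = c(76.04, 71.59),
  Pjuoc_2007 = c(8.64, 13.51)
)

# Step 1: gather everything except site and easting.
# 2 sites x 6 precipitation columns = 12 rows; columns: site, easting, key, value
step1 <- gather(climateToy, key, value, -site, -easting)
print(dim(step1))

# Step 2: split "Pnoja_2006" into season = "Pnoja" and year = "2006".
# Still 12 rows, but now 5 columns
step2 <- separate(step1, key, into = c("season", "year"), sep = "_")
print(dim(step2))

# Step 3: spread the three seasons back out into columns.
# 2 sites x 2 years = 4 rows; 6 columns
step3 <- spread(step2, season, value)
print(dim(step3))

# Step 4: rename the season columns (plyr::rename takes c(old = "new"))
step4 <- plyr::rename(step3, c("Pnoja" = "WinterPrecip",
                               "Pfeju" = "SpringPrecip",
                               "Pjuoc" = "SummerPrecip"))
print(step4)
```

Calling plyr::rename with the package prefix avoids attaching plyr, which can mask dplyr functions of the same name if both packages are loaded.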

We end up with:

          site easting year WinterPrecip SpringPrecip SummerPrecip
1  Black Gulch  361558 2006       127.70       152.38        50.90
2  Black Gulch  361558 2007        73.18        76.04         8.64
3  Black Gulch  361558 2008       108.97       107.07         4.75
4  Black Gulch  361558 2009       125.88        78.14         7.61
5  Black Gulch  361558 2010       164.73       111.38        30.06
6  Black Gulch  361558 2011       382.49       113.21        25.99
7  Black Gulch  361558 2012        76.50        94.15         9.43
8  Black Gulch  361558 2013        63.90        45.97        32.36
9  Black Gulch  361558 2014        68.49        94.77         8.48
10 Black Gulch  361558 2015       115.13        72.37           NA
11     Bodfish  369100 2006       114.59       162.21        56.68
12     Bodfish  369100 2007        67.22        71.59        13.51
...
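Before moving on to analysis, it can be worth checking that the reshaped table has the expected structure. The sketch below assumes a small stand-in for the result, copied from four rows of the output above. climateLongToy is a made-up name, and on the real data the same checks would be run on climateLong.

```r
# Stand-in for the reshaped result: four rows copied from the output above
climateLongToy <- data.frame(
  site         = c("Black Gulch", "Black Gulch", "Bodfish", "Bodfish"),
  easting      = c(361558, 361558, 369100, 369100),
  year         = c(2006, 2007, 2006, 2007),
  WinterPrecip = c(127.70, 73.18, 114.59, 67.22),
  SpringPrecip = c(152.38, 76.04, 162.21, 71.59),
  SummerPrecip = c(50.90, 8.64, 56.68, 13.51)
)

# Each site/year combination should appear exactly once
dupes <- duplicated(climateLongToy[, c("site", "year")])
stopifnot(!any(dupes))

# Every site should have the same number of years
yearsPerSite <- table(climateLongToy$site)
print(yearsPerSite)

# Count missing values per column, to spot gaps such as an incomplete final year
missingPerColumn <- colSums(is.na(climateLongToy))
print(missingPerColumn)
```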

Hopefully this all makes sense, and can help somebody get their data set ready for analysis.