Learn R

This guide focuses on transformation and cleaning functions in R that are especially useful for working with tabular datasets.

Tidyverse Introduction

The tidyverse is a suite of R tools that follow a tidy philosophy. There are three basic rules for tidy data in data frames:

  • Each type of observation gets a data frame
  • Each variable gets a column
  • Each observation gets a row

The tidyverse is a suite of about 20 packages that provide consistent, user-friendly tools with sensible defaults for most of what most people do in R.

  • Core packages: ggplot2, dplyr, tidyr, readr, purrr, tibble
  • Specialized data manipulation: hms, stringr, lubridate, forcats
  • Data import: DBI, haven, httr, jsonlite, readxl, rvest, xml2
  • Modeling: modelr, broom
install.packages("tidyverse") # installs all of the above packages (the package name must be quoted)
library(tidyverse) # attaches only the core packages

Tibble

Tibbles are a modern take on data frames.

Creating a Tibble

tibble() is a nice way to create data frames. It encapsulates best practices for data frames:

It never changes an input’s type.
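As a small illustration of this point (a sketch with made-up values; the column names x and y are arbitrary), a character vector stays a character vector and an integer vector stays an integer when passed to tibble(). By contrast, data.frame() in R versions before 4.0.0 converted strings to factors by default.

```r
library(tibble)

# Hypothetical example data; the names x and y are arbitrary.
df <- tibble(x = c("a", "b", "c"), y = 1:3)

class(df$x)  # "character": the strings are not converted to a factor
class(df$y)  # "integer"
```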

A tibble can even contain list-columns.
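A minimal sketch of a list-column, using invented values: each element of y is itself a vector, and the vectors may have different lengths.

```r
library(tibble)

# Hypothetical example: y is a list-column holding three integer vectors.
df <- tibble(x = 1:3, y = list(1:5, 1:10, 1:20))

is.list(df$y)   # TRUE
lengths(df$y)   # 5 10 20
```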

Unlike data.frame(), tibble() does not alter variable names.
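The difference can be seen with a column name that contains a space (the name "crazy name" is only an illustration). With its default settings, data.frame() rewrites the name into a syntactically valid one, while tibble() keeps it as given.

```r
library(tibble)

# Hypothetical column name containing a space.
base_names   <- names(data.frame(`crazy name` = 1))
tibble_names <- names(tibble(`crazy name` = 1))

base_names    # "crazy.name"
tibble_names  # "crazy name"
```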

dplyr

What is dplyr?

dplyr is an R package for transforming and summarizing tabular data (data organized in rows and columns).

Why is it useful?

dplyr provides a set of functions for common data manipulation operations such as filtering rows, selecting specific columns, re-ordering rows, adding new columns, and summarizing data.

Pipe Operator (%>%)

dplyr's chaining syntax, built on the pipe operator (%>%), makes it flexible to use. The operator passes the output of one function as the first argument of the next function. Instead of nesting functions (reading from the inside out), piping lets you read the steps from left to right.
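To make the contrast concrete, here is a sketch that writes the same three steps twice, once nested and once piped. It assumes the built-in mtcars dataset; the object names nested and piped are hypothetical.

```r
library(dplyr)

# Nested form: read from the inside out.
nested <- head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)

# Piped form: read from left to right, one step at a time.
piped <- mtcars %>%
  filter(cyl == 4) %>%   # keep 4-cylinder cars
  arrange(desc(mpg)) %>% # sort by fuel economy, highest first
  head(3)                # keep the first three rows

identical(nested, piped)  # TRUE: both forms run the same calls
```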

How do I install it?

install.packages("dplyr")

Functions

dplyr has six major data manipulation functions:

  1. filter
  2. select
  3. arrange
  4. mutate
  5. summarise (with group_by)
  6. rename

  • filter – keeps the rows that meet a condition
    # keep rows where both conditions hold
    filter(mynewdata, cyl > 4 & gear > 4)
  • select – picks the columns of interest from a data set
    # pick columns by name
    select(mynewdata, cyl, mpg, hp)
  • arrange – sorts the rows in ascending or descending order
    # reorder rows by weight, lightest first
    mynewdata %>% select(cyl, wt, gear) %>% arrange(wt)
  • mutate – creates new variables from existing variables
    # add a column computed from two existing columns
    newdata <- mynewdata %>% mutate(newvariable = mpg * cyl)
  • summarise (with group_by) – computes summaries such as min, max, mean, or count, optionally per group
    # average sepal length for each species
    myirisdata %>% group_by(Species) %>% summarise(Average = mean(Sepal.Length, na.rm = TRUE))
  • rename – renames variables (columns in a data frame)
    # the new name goes on the left, the old name on the right
    df <- rename(df, new_name = old_name)
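The six verbs can be tried end to end with the sketch below. It assumes that mynewdata and myirisdata are copies of the built-in mtcars and iris datasets, which matches the column names used above but is not stated in this guide. All other object names are hypothetical.

```r
library(dplyr)

# Assumption: the guide's example objects are copies of built-in datasets.
mynewdata  <- mtcars
myirisdata <- iris

# filter: rows with more than 4 cylinders and more than 4 gears
big_engines <- filter(mynewdata, cyl > 4 & gear > 4)
nrow(big_engines)  # 3

# select: keep only three columns
small <- select(mynewdata, cyl, mpg, hp)

# arrange: lightest cars first
by_weight <- mynewdata %>% select(cyl, wt, gear) %>% arrange(wt)

# mutate: add a column computed from existing ones
with_product <- mynewdata %>% mutate(newvariable = mpg * cyl)

# summarise with group_by: one row per species
species_means <- myirisdata %>%
  group_by(Species) %>%
  summarise(Average = mean(Sepal.Length, na.rm = TRUE))

# rename: new name on the left, old name on the right
renamed <- rename(mynewdata, miles_per_gallon = mpg)
```

Each call returns a new data frame and leaves mynewdata and myirisdata unchanged, so the result must be assigned to keep it.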

Wide to Long

gather() takes multiple columns and gathers them into key-value pairs, making "wide" data longer. Other names for gather include melt (reshape2), pivot (spreadsheets), and fold (databases).

The following dataset, obtained from the OECD, lists countries and their mortality rates over several years, organized in 5-year intervals.

View(migration)

df <- migration %>% gather(Year, Mortality_Rate, -name, -country_code)
View(df)
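Because the migration dataset is not included with this guide, the sketch below applies the same gather() call to a tiny stand-in table. The values are invented for illustration only; the column layout (name, country_code, then one column per year) is an assumption based on the call above. Note that in tidyr 1.0.0 and later gather() is superseded by pivot_longer(), although it still works.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the wide table; the numbers are made up.
wide <- tibble(
  name         = c("Canada", "France"),
  country_code = c("CAN", "FRA"),
  `2000`       = c(7.1, 9.0),
  `2005`       = c(7.2, 8.6)
)

# Gather every column except name and country_code into Year / Mortality_Rate pairs.
long <- wide %>% gather(Year, Mortality_Rate, -name, -country_code)

dim(long)   # 4 rows, 4 columns: one row per country and year
long$Year   # "2000" "2000" "2005" "2005" (the old column names, as text)
```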

Long to Wide

spread() turns a pair of key:value columns into a set of tidy columns. To use spread(), pass it the name of a data frame, then the name of the key column in the data frame, and then the name of the value column.

spread() takes two columns (a key-value pair) and spreads them into multiple columns, making "long" data wider. Spread is known by other names in other places: it's cast in reshape2, unpivot in spreadsheets, and unfold in databases.

Let us continue with the above example.

View(df) # the data frame we converted from wide to long

spread(df, Year, Mortality_Rate) # convert it back to wide
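The same call can be tried on a small stand-in for df. This is a sketch with invented values; the column names follow the ones used in this guide. In tidyr 1.0.0 and later spread() is superseded by pivot_wider(), although it still works.

```r
library(tidyr)
library(tibble)

# Hypothetical long table; the numbers are made up.
long <- tibble(
  name           = c("Canada", "France", "Canada", "France"),
  country_code   = c("CAN", "FRA", "CAN", "FRA"),
  Year           = c("2000", "2000", "2005", "2005"),
  Mortality_Rate = c(7.1, 9.0, 7.2, 8.6)
)

# Year supplies the new column names, Mortality_Rate supplies the cell values.
wide_again <- spread(long, Year, Mortality_Rate)

names(wide_again)  # "name" "country_code" "2000" "2005"
```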

Liaison Librarian

Martin Morris
Contact:
Schulich Library of Physical Sciences, Life Sciences and Engineering
Macdonald-Stewart Library Building
809 rue Sherbrooke Ouest
Montréal, Québec H3A 0C1
(514) 398 8140
Skype: martinatmcgill