Tidyverse Introduction

The tidyverse is a suite of R packages that share a common "tidy data" philosophy. Tidy data in data frames follows three basic rules:

Each type of observation gets a data frame
Each variable gets a column
Each observation gets a row

The tidyverse consists of about 20 packages that provide consistent, user-friendly tools with sensible defaults for most of what most people do in R.

Core packages: ggplot2, dplyr, tidyr, readr, purrr, tibble
Specialized data manipulation: hms, stringr, lubridate, forcats
Data import: DBI, haven, httr, jsonlite, readxl, rvest, xml2
Modeling: modelr, broom

install.packages("tidyverse") # installs all of the above packages (the package name must be quoted)

library(tidyverse) # attaches only the core packages

Tibble

Tibbles are a modern take on data frames.

Creating a Tibble

tibble() is a nice way to create data frames. It encapsulates best practices for data frames:

It never changes an input’s type.

Tibbles can even contain list-columns.

Unlike data.frame(), tibble() doesn't alter variable names.
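
The three behaviours above can be checked with a short sketch. The object names (tb, tb_list, tb_names) and the values are illustrative assumptions, not part of the original guide.

```r
library(tibble)

# 1. Input types are preserved: character vectors stay character
tb <- tibble(x = 1:3, y = c("a", "b", "c"))
class(tb$y)
# "character"

# 2. List-columns: each cell of z can hold a whole object
tb_list <- tibble(x = 1:3, z = list(1:2, letters[1:3], TRUE))
class(tb_list$z)
# "list"

# 3. Non-syntactic names are kept exactly as typed
tb_names <- tibble(`my var` = 1:3)
names(tb_names)
# "my var"

# data.frame() rewrites the same name by default (check.names = TRUE)
names(data.frame(`my var` = 1:3))
# "my.var"
```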

dplyr

What is dplyr?

dplyr is an R package for transforming and summarizing tabular data (data organized in rows and columns).

Why is it useful?

dplyr provides a set of functions for common data manipulation operations: filtering rows, selecting specific columns, re-ordering rows, adding new columns, and summarizing data.

Pipe Operator (%>%)

dplyr functions are designed to be chained with the pipe operator (%>%), which passes the output of one function as the first argument of the next. Instead of nesting function calls (which must be read from the inside out), a pipeline reads from left to right, in the order the steps are carried out.
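
To make the difference concrete, here is a minimal sketch that writes the same three steps both ways. It assumes R's built-in mtcars dataset; the object names nested and piped are illustrative.

```r
library(dplyr)

# Nested form: the first step (filter) is buried in the middle
nested <- arrange(select(filter(mtcars, cyl > 4), mpg, cyl, hp), mpg)

# Piped form: the steps appear in the order they run
piped <- mtcars %>%
  filter(cyl > 4) %>%
  select(mpg, cyl, hp) %>%
  arrange(mpg)

# Both forms make the same calls, so the results match
identical(nested, piped)
```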

How do I install it?

install.packages("dplyr")

Functions

6 major data manipulation functions:

filter
select
arrange
mutate
summarise (with group_by)
rename
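
The examples in this section refer to mynewdata and myirisdata, which the guide never defines. Judging by the column names they use (cyl, gear, mpg, hp, wt, Species, Sepal.Length), a reasonable assumption is that they are copies of R's built-in mtcars and iris datasets. Under that assumption, the following setup makes the examples runnable:

```r
library(dplyr)

# Assumption: the original data is not shown in this guide.
# mtcars and iris ship with R and have the columns the examples use.
mynewdata <- mtcars
myirisdata <- iris

# Quick look at the first rows of each
head(mynewdata, 3)
head(myirisdata, 3)
```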

filter – keeps the rows that satisfy a condition
```
# Keep only the rows where cyl > 4 and gear > 4
filter(mynewdata, cyl > 4 & gear > 4)
```

select – picks the columns of interest from a data set
```
# Pick columns by name
select(mynewdata, cyl, mpg, hp)
```

arrange – reorders the rows of a data set in ascending or descending order
```
# Keep three columns, then sort rows by wt (ascending by default; use desc(wt) for descending)
mynewdata %>% select(cyl, wt, gear) %>% arrange(wt)
```

mutate – creates new variables from existing variables
```
# Add a column, newvariable, computed as mpg * cyl
newvariable <- mynewdata %>% mutate(newvariable = mpg * cyl)
```

summarise (with group_by) – computes summary statistics such as min, max, mean, and count, optionally per group
```
# Average sepal length for each species, ignoring missing values
myirisdata %>% group_by(Species) %>% summarise(Average = mean(Sepal.Length, na.rm = TRUE))
```

rename – renames variables (columns in a data frame)
```
df <- rename(df, new_name = old_name)
```
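
In the snippet above, df, new_name and old_name are placeholders. As a concrete sketch, assuming the built-in mtcars dataset (the new column names are illustrative), note that the new name goes on the left of the equals sign:

```r
library(dplyr)

# rename() takes new_name = old_name pairs; all other columns are untouched
cars_renamed <- rename(mtcars, miles_per_gallon = mpg, cylinders = cyl)

names(cars_renamed)[1:2]
# "miles_per_gallon" "cylinders"
```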

Wide to Long

gather() takes multiple columns and gathers them into key-value pairs: it makes “wide” data longer. Other names for gather include melt (reshape2), pivot (spreadsheets) and fold (databases).

The following dataset, obtained from the OECD, lists countries and their mortality rates over several years, organized in 5-year intervals.

View(migration)

df <- migration %>% gather(Year, Mortality_Rate, -name, -country_code)

View(df)
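
The OECD data itself is not included in this guide, so the sketch below uses a small stand-in with the same column layout (name, country_code, then one column per year). The values are made up purely for illustration. It also shows pivot_longer(), which supersedes gather() in tidyr 1.0.0 and later; gather() still works but is no longer under active development.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for `migration`; the numbers are NOT real OECD data
migration <- tibble::tibble(
  name = c("Canada", "France"),
  country_code = c("CAN", "FRA"),
  `2000` = c(7.1, 9.0),
  `2005` = c(7.2, 8.6)
)

# gather(): every column except name and country_code becomes a Year/Mortality_Rate pair
long_gather <- migration %>%
  gather(Year, Mortality_Rate, -name, -country_code)

# pivot_longer(): the same reshape with the newer interface
long_pivot <- migration %>%
  pivot_longer(
    cols = -c(name, country_code),
    names_to = "Year",
    values_to = "Mortality_Rate"
  )

long_pivot
```

Both calls produce one row per country and year. The rows may come out in a different order, so sort with arrange() before comparing the two results.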

Long to Wide

spread() takes two columns (a key-value pair) and spreads them into multiple columns, making “long” data wider. To use spread(), pass it the name of a data frame, then the name of the key column, and then the name of the value column. Each distinct value in the key column becomes a column of its own.

Spread is known by other names in other places: it’s cast in reshape2, unpivot in spreadsheets and unfold in databases.

Let us continue with the above example.

View(df) # the data frame we converted from wide to long

spread(df, Year, Mortality_Rate) # convert it back to wide
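
As a runnable sketch of this step, the code below builds a small long-format data frame with the same columns as df. The values are made up for illustration and are not real OECD figures. It then widens the data with spread() and with pivot_wider(), the newer function that supersedes spread() in tidyr 1.0.0 and later.

```r
library(tidyr)

# Hypothetical long-format data with the same columns as df above
df <- tibble::tibble(
  name = c("Canada", "Canada", "France", "France"),
  country_code = c("CAN", "CAN", "FRA", "FRA"),
  Year = c("2000", "2005", "2000", "2005"),
  Mortality_Rate = c(7.1, 7.2, 9.0, 8.6)
)

# spread(): Year supplies the new column names, Mortality_Rate the cell values
wide_spread <- spread(df, Year, Mortality_Rate)

# pivot_wider(): the same reshape with the newer interface
wide_pivot <- pivot_wider(df, names_from = Year, values_from = Mortality_Rate)

names(wide_pivot)
# "name" "country_code" "2000" "2005"
```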
