The tidyverse is a suite of roughly 20 R packages that follow a shared tidy philosophy and provide consistent, user-friendly tools with sensible defaults for most of what most people do in R.
Tidy data in a data frame follows three basic rules: each variable has its own column, each observation has its own row, and each value has its own cell.
install.packages("tidyverse") #installs all of the tidyverse packages.
library(tidyverse) #attaches only the core packages.
Tibbles are a modern take on data frames.
Creating a Tibble
tibble() is a nice way to create data frames. It encapsulates best practices for data frames:
It never changes an input’s type.
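As a small illustration of type preservation, the sketch below builds a tibble from a character, an integer and a double vector and checks the column classes. The column names and values are made up for the example.

```r
library(tibble)

# Hypothetical columns: one character, one integer, one double vector.
tb <- tibble(
  x = letters[1:3],
  y = 1:3,
  z = c(0.5, 1.5, 2.5)
)

# Each column keeps the type of the vector it was built from.
sapply(tb, class)
```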
A tibble can even contain list-columns.
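A minimal sketch of a list-column, assuming only the tibble package: column y stores one integer vector per row, and the vectors may differ in length. The names x and y are illustrative.

```r
library(tibble)

# y is a list-column: each cell holds a whole vector.
tb <- tibble(
  x = 1:3,
  y = list(1:5, 1:10, 1:20)
)

tb

# Individual cells are extracted with [[ ]].
length(tb$y[[2]])
```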
Unlike data.frame(), tibble() does not alter variable names.
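The difference shows up with a non-syntactic name. In this sketch the column name `my var` is a made-up example: tibble() keeps it as written, while data.frame() with its default check.names = TRUE rewrites it.

```r
library(tibble)

# tibble() keeps the name exactly as given.
tb <- tibble(`my var` = 1:2)
names(tb)   # "my var"

# data.frame() converts it to a syntactic name by default.
base_df <- data.frame(`my var` = 1:2)
names(base_df)   # "my.var"
```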
dplyr is a powerful R package for transforming and summarizing tabular data.
It provides a set of functions for common data manipulation operations such as filtering rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data.
Pipe Operator (%>%)
dplyr's chaining syntax, built on the pipe operator (%>%), makes it easy to combine operations. The operator passes the output of one function as the first argument of the next. Instead of nesting functions (and reading them from the inside out), you read a piped sequence from left to right.
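To make the contrast concrete, here is a sketch that writes the same three steps both ways. It assumes the built-in mtcars data set; the particular filter and sort columns are arbitrary choices for illustration.

```r
library(dplyr)

# Nested form: read from the innermost call outwards.
nested <- head(arrange(filter(mtcars, cyl > 4), wt), 3)

# Piped form: read left to right, one step per line.
piped <- mtcars %>%
  filter(cyl > 4) %>%
  arrange(wt) %>%
  head(3)

# Both forms run the same calls in the same order.
identical(nested, piped)
```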
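Six major data manipulation functions: filter(), select(), arrange(), mutate(), summarise() (usually with group_by()) and rename().
filter(mynewdata, cyl > 4 & gear > 4) #keeps rows where both conditions hold.
mynewdata %>% select(cyl, wt, gear) %>% arrange(wt) #keeps three columns, then sorts rows by wt.
newvariable <- mynewdata %>% mutate(newvariable = mpg * cyl) #adds a column computed from existing ones.
myirisdata %>% group_by(Species) %>% summarise(Average = mean(Sepal.Length, na.rm = TRUE)) #one mean per Species.
df <- rename(df, new_name = old_name) #renames old_name to new_name.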
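The calls above can be combined into one runnable pipeline. This is only a sketch: it assumes that mynewdata is a copy of the built-in mtcars data set (the column names cyl, gear, wt and mpg suggest so, but the text does not say), and the new column names mpg_x_cyl, weight and avg_mpg are invented for the example.

```r
library(dplyr)

# Assumption: mtcars stands in for mynewdata.
mynewdata <- mtcars

result <- mynewdata %>%
  filter(cyl > 4) %>%                  # keep rows with more than 4 cylinders
  select(mpg, cyl, wt, gear) %>%       # keep four columns
  mutate(mpg_x_cyl = mpg * cyl) %>%    # add a derived column
  rename(weight = wt) %>%              # new_name = old_name
  arrange(weight)                      # sort rows by weight, ascending

# Summarise per group: one average mpg per gear value.
by_gear <- result %>%
  group_by(gear) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

by_gear
```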
gather() takes multiple columns and gathers them into key-value pairs: it makes “wide” data longer. Other names for gather include melt (reshape2), pivot (spreadsheets) and fold (databases).
The example below uses a dataset obtained from the OECD that lists countries and their mortality rates over several years, organized in five-year intervals.
df <- migration %>% gather(Year, Mortality_Rate, -name, -country_code) #gathers every column except name and country_code.
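Since the OECD table itself is not reproduced here, the following sketch runs the same gather() call on a toy stand-in. The column layout (name, country_code, then one column per year) is an assumption inferred from the call, and the countries and values are made up.

```r
library(tidyr)
library(tibble)

# Toy stand-in for the OECD data; countries and values are invented.
migration <- tibble(
  name = c("Country A", "Country B"),
  country_code = c("AAA", "BBB"),
  `1990` = c(10.5, 20.5),
  `1995` = c(9.5, 19.5)
)

# Gather every column except name and country_code into
# a Year (key) column and a Mortality_Rate (value) column.
df <- migration %>% gather(Year, Mortality_Rate, -name, -country_code)

df
```

Each year column becomes a set of rows, so two countries and two years give four rows. In current tidyr releases gather() is superseded by pivot_longer(), although it still works.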
spread() takes two columns (a key-value pair) and spreads them into multiple columns, making “long” data wider. To use spread(), pass it the name of a data frame, then the name of the key column, and then the name of the value column. Spread is known by other names in other places: it’s cast in reshape2, unpivot in spreadsheets and unfold in databases.
Let us continue with the above example.
View(df) #the data frame we converted from wide to long.
spread(df, Year, Mortality_Rate) #converts it back to wide.
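A self-contained sketch of the same call on toy long-format data, assuming the columns name, country_code, Year and Mortality_Rate from the example above. The countries and values are made up.

```r
library(tidyr)
library(tibble)

# Toy long-format data; countries and values are invented.
df <- tibble(
  name = c("Country A", "Country B", "Country A", "Country B"),
  country_code = c("AAA", "BBB", "AAA", "BBB"),
  Year = c("1990", "1990", "1995", "1995"),
  Mortality_Rate = c(10.5, 20.5, 9.5, 19.5)
)

# Year supplies the new column names, Mortality_Rate the cell values.
wide <- spread(df, Year, Mortality_Rate)

wide
```

The result has one row per country and one column per distinct Year value.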