Skip to Main Content

Learn R

This guide focuses on transformation and cleaning functions in R that are especially useful for working with tabular datasets.

Descriptive Statistics

Summary (or descriptive) statistics are the first analyses used to represent most numeric datasets. They form the foundation for much more complicated computations and analyses and data visualizations.  

1. mean

2. sd

3. min

4. max

5. range

6. quantiles

7. summary

 

Mean

In R, a mean can be calculated on an isolated variable via the mean(VAR) command, where VAR is the name of the variable whose mean you wish to compute. Here is an example -

 

 

Standard Deviation

Within R, standard deviations are calculated in the same way as means. The standard deviation of a single variable can be computed with the sd(VAR) command, where VAR is the name of the variable whose standard deviation you wish to retrieve.

 

 

Minimum and Maximum

Keeping with the pattern, a minimum can be computed on a single variable using the min(VAR) command. The maximum, via max(VAR), operates identically.

 

 

 

Range

The range of a particular variable, that is, its maximum and minimum, can be retrieved using the range(VAR) command.

 

 

Percentiles

Given a dataset and a desired percentile, a corresponding value can be found using the quantile(VAR, c(PROB1, PROB2,…)) command. VAR refers to the variable name and PROB1, PROB2, etc., relate to probability values.

The probabilities must be between 0 and 1, therefore making them equivalent to decimal versions of the desired percentiles (i.e. 50% = 0.5). The following example shows how this function can be used to find the data value that corresponds to a desired percentile.

> #calculate desired percentile values using quantile(VAR, c(PROB1, PROB2,…))
> #what are the 25th and 75th percentiles for age in the sample?

 

 

 

 

Note that quantile(VAR) command can also be used. When probabilities are not specified, the function will default to computing the 0, 25, 50, 75, and 100 percentile values, as shown in the following example.

 

 

 

Summary

A very useful multipurpose function in R is summary(X), where X can be one of any number of objects, including datasets, variables, and linear models, just to name a few. When used, the command provides summary data related to the individual object that was fed into it. Thus, the summary function has different outputs depending on what kind of object it takes as an argument. Besides being widely applicable, this method is valuable because it often provides exactly what is needed in terms of summary statistics. 

Liaison Librarian

Profile Photo
Martin Morris
Contact:
Schulich Library of Physical Sciences, Life Sciences and Engineering
Macdonald-Stewart Library Building
809 rue Sherbrooke Ouest
Montréal, Québec H3A 0C1
(514) 398 8140
Website Skype Contact: martinatmcgill
Social: Twitter Page

McGill LibraryQuestions? Ask us!
Privacy notice