Descriptive Statistics

R provides a wide range of functions for obtaining summary statistics. One method of obtaining descriptive statistics is to use the sapply( ) function with a specified summary statistic.

# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)

Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile.

There are also numerous R functions designed to provide a range of descriptive statistics at once. For example

# mean,median,25th and 75th quartiles,min,max
summary(mydata)

# Tukey min,lower-hinge, median,upper-hinge,max
fivenum(x)

Using the Hmisc package

library(Hmisc)
describe(mydata)
# n, nmiss, unique, mean, 5,10,25,50,75,90,95th percentiles
# 5 lowest and 5 highest scores

Using the pastecs package

library(pastecs)
stat.desc(mydata)
# nbr.val, nbr.null, nbr.na, min max, range, sum,
# median, mean, SE.mean, CI.mean, var, std.dev, coef.var

Using the psych package

library(psych)
describe(mydata)
# item name ,item number, nvalid, mean, sd,
# median, mad, min, max, skew, kurtosis, se

Summary Statistics by Group

A simple way of generating summary statistics by grouping variable is available in the psych package.

library(psych)
describe.by(mydata, group,...)

The doBy package provides much of the functionality of SAS PROC SUMMARY. It defines the desired table using a model formula and a function. Here is a simple example.

library(doBy)
summaryBy(mpg + wt ~ cyl + vs, data = mtcars,
  FUN = function(x) { c(m = mean(x), s = sd(x)) } )
# produces mpg.m wt.m mpg.s wt.s for each
# combination of the levels of cyl and vs

See also: aggregating data.