Missing Data

In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data.

For more practice on working with missing data, try this course on cleaning data in R.

Testing for Missing Values

is.na(x) # returns TRUE of x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)

Recoding Values to Missing

# recode 99 to missing for variable v1
# select rows where v1 is 99 and recode column v1
mydata$v1[mydata$v1==99] <- NA

Excluding Missing Values from Analyses

Arithmetic functions on missing values yield missing values.

x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2

The function complete.cases() returns a logical vector indicating which cases are complete.

# list rows of data that have missing values
mydata[!complete.cases(mydata),]

The function na.omit() returns the object with listwise deletion of missing values.

# create new dataset without missing data
newdata <- na.omit(mydata)

Advanced Handling of Missing Data

Most modeling functions in R offer options for dealing with missing values. You can go beyond pairwise of listwise deletion of missing values through methods such as multiple imputation. Good implementations that can be accessed through R include Amelia II, Mice, and mitools.