Factors
A factor is a data type used to store categorical variables. A categorical variable can take only a limited number of values (categories), such as eye colour. In contrast, a continuous variable can take an infinite number of values, such as the heights of different people. It is important to differentiate between the two because statistical models treat them differently. The categories that a factor can hold are referred to as factor levels in R.
Creating a Factor
Use the factor() function to create a factor.
Before creating a factor, first generate a vector that contains all the data that belong to a limited number of categories.
That vector will be an argument in the factor() function.
# create a vector that contains all the data that belong to a limited number of categories
eye_colour <- c("brown", "grey", "blue", "green", "blue", "blue", "brown", "green", "brown", "brown", "brown")
factor_eye_colour <- factor(eye_colour)
factor_eye_colour
This will create the following output:
[1] brown grey blue green blue blue brown green brown brown brown
Levels: blue brown green grey
Notice the last line: Levels: blue brown green grey.
R has identified the categories of the data and has listed them as "levels".
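As a small sketch of what factor() produces, the following inspects a factor more closely. The names ending in _demo are introduced here for illustration only; the data mirror the eye colour example above.

```r
# Illustrative vector with the same values as the eye colour example
eye_colour_demo <- c("brown", "grey", "blue", "green", "blue", "blue",
                     "brown", "green", "brown", "brown", "brown")
factor_eye_colour_demo <- factor(eye_colour_demo)

# Number of distinct categories (levels)
nlevels(factor_eye_colour_demo)

# Integer codes R stores internally; each code is an index into the levels
as.integer(factor_eye_colour_demo)

# How often each category occurs
table(factor_eye_colour_demo)
```

Internally, a factor is stored as integer codes plus the vector of levels those codes refer to, which is why the levels are listed separately from the data.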
Ordering Factors
Categorical variables can be further divided into nominal and ordinal categorical variables. A nominal categorical variable has categories that cannot be put into an order, such as eye colour. An ordinal categorical variable has categories with a natural ordering, such as the intensity of pain on a rating scale.
By default, R treats any factor as a nominal categorical variable, meaning that its levels have no order.
However, if you want to create an ordered factor (one that holds an ordinal categorical variable), you need two additional arguments, ordered and levels:
- The ordered argument is a logical flag. For a character vector, R does not order the levels by default. Setting ordered = TRUE tells R that the levels have a meaningful order; it does not sort the data itself. Because R matches argument names partially, the shorter spelling order = TRUE has the same effect.
- The levels argument is a vector which contains the categories in the correct order, from lowest to highest.
# create a vector that contains all the data that belong to a limited number of categories
pain_vector <- c("strong pain", "little pain", "little pain","very strong pain", "no pain", "strong pain", "strong pain")
# create an ordered factor
factor_pain_vector <- factor(pain_vector, order = TRUE, levels = c("no pain", "little pain", "strong pain", "very strong pain"))
factor_pain_vector
This will give the following output:
[1] strong pain little pain little pain very strong pain no pain strong pain
[7] strong pain
Levels: no pain < little pain < strong pain < very strong pain
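To illustrate the difference between a nominal and an ordered factor, here is a minimal sketch. The names pain_demo, pain_levels, nominal_pain and ordinal_pain are made up for this example.

```r
# Illustrative data; "very strong pain" is a valid level with no observations
pain_demo <- c("strong pain", "little pain", "no pain")
pain_levels <- c("no pain", "little pain", "strong pain", "very strong pain")

# Same data and levels, once without and once with an order
nominal_pain <- factor(pain_demo, levels = pain_levels)
ordinal_pain <- factor(pain_demo, levels = pain_levels, ordered = TRUE)

is.ordered(nominal_pain)
is.ordered(ordinal_pain)

# Sorting follows the order of the levels, not alphabetical order
sort(ordinal_pain)

# Ordered factors also support min() and max()
max(ordinal_pain)
```

Calling max() on nominal_pain would raise an error, because R has no way to rank unordered levels.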
The levels() Function
The levels() function can also be used standalone to print out the categories of a factor.
levels(factor_pain_vector)
R will only output the categories and not the entire data set:
[1] "no pain" "little pain" "strong pain" "very strong pain"
In addition, the levels() function can be used to rename the categories to make the data clearer.
You are conducting an experiment to find out the lethal dose of a new pharmaceutical drug. As part of your observation, you note down whether the experimental mice died from the given dose. To save time, you abbreviate your findings with "D" (Dead) or "ND" (Not dead) and save them into a vector in R:
lethal_dose_vector <- c("D", "ND", "ND", "ND","D", "ND", "D", "D", "D", "D", "D", "D", "ND", "ND", "D")
factor_lethal_dose_vector <- factor(lethal_dose_vector)
The abbreviations may lead to confusion when working with the data in R. You can change the factor levels to different names using the levels() function:
levels(factor_lethal_dose_vector) <- c("dead", "not dead")
factor_lethal_dose_vector
The output will now show the factor with the data converted from "D" and "ND" to "dead" and "not dead":
[1] dead not dead not dead not dead dead not dead dead dead dead dead dead dead
[13] not dead not dead dead
Levels: dead not dead
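The order in which you assign the new factor levels is important: R matches the new names to the existing levels by position, and the existing levels are sorted alphabetically by default ("D", then "ND"). Supplying the names in a different order would silently swap the labels!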
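One way to avoid relying on the implicit alphabetical order is to state the mapping when the factor is created, using the levels and labels arguments of factor(). This is only a sketch; dose_demo and factor_dose_demo are illustrative names.

```r
# Illustrative abbreviated observations
dose_demo <- c("D", "ND", "ND", "D")

# levels lists the raw values, labels gives the name for each one in the same position
factor_dose_demo <- factor(dose_demo,
                           levels = c("D", "ND"),
                           labels = c("dead", "not dead"))

factor_dose_demo
levels(factor_dose_demo)
```

Because the raw values and their new names are written side by side, the pairing is visible in the code rather than depending on how R happened to sort the levels.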
Selecting Factors
You can select elements from a factor using square brackets. If the factor is ordered, you can also compare elements using comparison operators such as < and >:
factor_pain_vector[2] < factor_pain_vector[5]
[1] FALSE
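Comparisons on an ordered factor can also be used to filter it. The sketch below uses an illustrative factor, pain_factor_sel, and compares every element against a level name.

```r
# Illustrative ordered factor
pain_factor_sel <- factor(c("strong pain", "little pain", "no pain", "very strong pain"),
                          levels = c("no pain", "little pain", "strong pain", "very strong pain"),
                          ordered = TRUE)

# Element-wise comparison against a level name gives a logical vector
is_severe <- pain_factor_sel >= "strong pain"
is_severe

# Use the logical vector inside square brackets to keep matching elements
severe <- pain_factor_sel[is_severe]
severe
```

The selected subset keeps all four levels of the original factor, even those that no longer occur in the data.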
To change the value of a specific item in a factor, use the index and assign the new value using <-.
factor_pain_vector[5] <- "little pain"
This will change the value in position 5 of the factor_pain_vector from "no pain" to "little pain":
[1] strong pain little pain little pain very strong pain little pain strong pain
[7] strong pain
Levels: no pain < little pain < strong pain < very strong pain
💡 You cannot assign a value that is not one of the factor's defined levels.
factor_pain_vector[5] <- "excruciating pain"
You will receive a warning message, and the element is set to NA:
Warning message:
In `[<-.factor`(`*tmp*`, 5, value = "excruciating pain") :
invalid factor level, NA generated
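If a new category is genuinely needed, it has to be added to the levels before it can be assigned. A minimal sketch, assuming the new level belongs at the top of the scale (pain_levels_demo and pain_factor_demo are illustrative names):

```r
# Illustrative ordered factor
pain_levels_demo <- c("no pain", "little pain", "strong pain", "very strong pain")
pain_factor_demo <- factor(c("strong pain", "no pain"),
                           levels = pain_levels_demo, ordered = TRUE)

# Rebuild the factor with one extra level appended as the highest category
pain_factor_demo <- factor(pain_factor_demo,
                           levels = c(pain_levels_demo, "excruciating pain"),
                           ordered = TRUE)

# The assignment now succeeds without a warning
pain_factor_demo[2] <- "excruciating pain"
pain_factor_demo
```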
The summary() Function
The summary() function (written in lower case, as R is case-sensitive) is a convenient tool for getting a quick overview of an object's content.
You can use the function on a vector:
# Summary for a vector
summary(lethal_dose_vector)
R will print out the following output:
Length Class Mode
15 character character
You can also use the summary() function on a factor:
# Summary for a factor
summary(factor_lethal_dose_vector)
The output for a factor will be a little different:
dead not dead
9 6
The summary() function is a generic function which can be used across various data structures!
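As a brief sketch of that generic behaviour, the same call returns a different kind of overview depending on what it is given. The objects scores_demo and group_demo are illustrative.

```r
# Numeric vector: summary() reports quartiles and the mean
scores_demo <- c(2, 4, 4, 5, 10)
numeric_summary <- summary(scores_demo)
numeric_summary

# Factor: summary() reports the count per level
group_demo <- factor(c("a", "b", "a"))
factor_summary <- summary(group_demo)
factor_summary
```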