# Factors

A factor is a data type used to store categorical variables. A categorical variable can only take one of a limited number of categories, such as eye colour. In contrast, a continuous variable can take any of an infinite number of values, such as a person's height. It is important to differentiate between the two because statistical models treat them differently. In R, the categories that a factor can hold are referred to as factor levels.

## Creating a Factor

Use the `factor()` function to create a factor. Before creating a factor, first generate a vector that contains all the data that belong to a limited number of categories. That vector will be an argument in the `factor()` function.

```r
# create a vector that contains all the data that belong to a limited number of categories
eye_colour <- c("brown", "grey", "blue", "green", "blue", "blue", "brown", "green", "brown", "brown", "brown")
factor_eye_colour <- factor(eye_colour)
factor_eye_colour
```

This will create the following output:

```
 [1] brown grey  blue  green blue  blue  brown green brown brown brown
Levels: blue brown green grey
```

Notice the last line: `Levels: blue brown green grey`. R has identified the categories of the data and has listed them as "levels".
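To make the idea of levels more concrete, the following sketch inspects the factor created above. It only uses base R functions (`levels()`, `nlevels()`, `as.integer()`); the variable names are the ones from the example. It shows that the levels are the sorted unique values and that each element is stored as an integer index into those levels.

```r
# Same data as in the example above
eye_colour <- c("brown", "grey", "blue", "green", "blue", "blue",
                "brown", "green", "brown", "brown", "brown")
factor_eye_colour <- factor(eye_colour)

# The levels are the sorted unique values of the input vector
levels(factor_eye_colour)      # "blue" "brown" "green" "grey"
nlevels(factor_eye_colour)     # 4

# Each element is stored as an integer index into the levels
as.integer(factor_eye_colour)  # 2 4 1 3 1 1 2 3 2 2 2
```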

## Ordering Factors

Categorical variables can be further divided into nominal and ordinal variables. A nominal categorical variable has categories that cannot be put into an order, such as eye colour. An ordinal categorical variable has categories with a natural ordering, such as the intensity of pain on a rating scale.

By default, R treats every factor as nominal, meaning that its levels have no order. To create an ordered factor (for an ordinal categorical variable), pass two additional arguments, `ordered` and `levels`:

• The `ordered` argument is a logical flag. For an ordinary input vector it defaults to `FALSE`, so R does not treat the levels as ordered. Setting `ordered = TRUE` marks the levels as ordered; it does not sort the data itself. The shorter spelling `order = TRUE` also works, because R partially matches argument names.
• The `levels` argument is a vector which contains the categories in the correct order, from lowest to highest.

```r
# create a vector that contains all the data that belong to a limited number of categories
pain_vector <- c("strong pain", "little pain", "little pain", "very strong pain", "no pain", "strong pain", "strong pain")

# create an ordered factor
factor_pain_vector <- factor(pain_vector, ordered = TRUE, levels = c("no pain", "little pain", "strong pain", "very strong pain"))
factor_pain_vector
```

This will give the following output:

```
[1] strong pain      little pain      little pain      very strong pain no pain          strong pain
[7] strong pain
Levels: no pain < little pain < strong pain < very strong pain
```
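As a rough illustration of what the ordering changes, the sketch below builds the same data once as a nominal and once as an ordered factor. The names `pain_levels`, `nominal_pain` and `ordinal_pain` are made up for this example; `ordered` is the full name of the argument in `factor()`. Order-aware functions such as `min()` and `max()` are only meaningful for the ordered version.

```r
pain_vector <- c("strong pain", "little pain", "little pain",
                 "very strong pain", "no pain", "strong pain", "strong pain")
# Hypothetical helper variable holding the levels from lowest to highest
pain_levels <- c("no pain", "little pain", "strong pain", "very strong pain")

nominal_pain <- factor(pain_vector, levels = pain_levels)
ordinal_pain <- factor(pain_vector, levels = pain_levels, ordered = TRUE)

is.ordered(nominal_pain)  # FALSE
is.ordered(ordinal_pain)  # TRUE

# Order-aware functions work on the ordered factor
min(ordinal_pain)  # no pain
max(ordinal_pain)  # very strong pain
```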

### The `levels()` Function

The `levels()` function can also be used standalone to print out the categories of a factor.

```r
levels(factor_pain_vector)
```

R will only output the categories and not the entire data set:

```
[1] "no pain"          "little pain"      "strong pain"      "very strong pain"
```

In addition, the `levels()` function can be used to rename the categories, which can make the data easier to read.

Example

You are conducting an experiment to find out the lethal dose of a new pharmaceutical drug. As part of your observation, you note down whether the experimental mice died from the given dose. To save time, you abbreviate your findings with "D" (Dead) or "ND" (Not dead) and save them into a vector in R:

```r
lethal_dose_vector <- c("D", "ND", "ND", "ND", "D", "ND", "D", "D", "D", "D", "D", "D", "ND", "ND", "D")
factor_lethal_dose_vector <- factor(lethal_dose_vector)
```

Because the abbreviations may cause confusion when working with the data in R, you can rename the factor levels using the `levels()` function:

```r
levels(factor_lethal_dose_vector) <- c("dead", "not dead")
factor_lethal_dose_vector
```

The output now shows a factor in which `"D"` and `"ND"` have been replaced by `"dead"` and `"not dead"`:

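```
 [1] dead     not dead not dead not dead dead     not dead dead     dead     dead     dead     dead     dead
[13] not dead not dead dead
Levels: dead not dead
```

Tip

The order in which you assign the new factor levels is important. R matches the new names to the existing levels by position, and the existing levels are sorted alphabetically by default (here `"D"`, then `"ND"`). If you list the new names in a different order, the data will be mapped to the wrong categories!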
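One way to reduce the risk of a wrong mapping is to state the old codes and the new names side by side when the factor is created. The sketch below uses the `levels` and `labels` arguments of `factor()` for this; it is an alternative to renaming with `levels()`, not the approach used in the example above.

```r
lethal_dose_vector <- c("D", "ND", "ND", "ND", "D", "ND", "D", "D",
                        "D", "D", "D", "D", "ND", "ND", "D")

# Pair each original code with its new name explicitly:
# "D" becomes "dead", "ND" becomes "not dead"
factor_lethal_dose_vector <- factor(lethal_dose_vector,
                                    levels = c("D", "ND"),
                                    labels = c("dead", "not dead"))

levels(factor_lethal_dose_vector)  # "dead" "not dead"
table(factor_lethal_dose_vector)   # dead: 9, not dead: 6
```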

## Selecting Factors

You can select elements from a factor using square brackets. If the factor is ordered, you can also compare elements using comparison operators such as `<` and `>`:

```r
factor_pain_vector[2] < factor_pain_vector[5]
# [1] FALSE
```
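Selection and comparison can also be combined. The following sketch rebuilds the ordered pain factor so that it runs on its own, then selects elements by position and by a comparison against a level name. The variable `severe` is a name chosen only for this illustration.

```r
factor_pain_vector <- factor(
  c("strong pain", "little pain", "little pain",
    "very strong pain", "no pain", "strong pain", "strong pain"),
  levels = c("no pain", "little pain", "strong pain", "very strong pain"),
  ordered = TRUE
)

# Select by position: elements 1 and 4
factor_pain_vector[c(1, 4)]

# Select every element at or above "strong pain"
severe <- factor_pain_vector[factor_pain_vector >= "strong pain"]
length(severe)  # 4
```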

To change the value of a specific item in a factor, use the index and assign the new value using `<-`.

```r
factor_pain_vector[5] <- "little pain"
```

This will change the value in position 5 of the `factor_pain_vector` from `"no pain"` to `"little pain"`:

```
[1] strong pain      little pain      little pain      very strong pain little pain      strong pain
[7] strong pain
Levels: no pain < little pain < strong pain < very strong pain
```

Caution

💡 You cannot assign a value that is not one of the factor's defined levels:

```r
factor_pain_vector[5] <- "excruciating pain"
```

You will receive a warning message:

```
Warning message:
In `[<-.factor`(`*tmp*`, 5, value = "excruciating pain") :
  invalid factor level, NA generated
```
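If a new category is genuinely needed, one possible approach is to extend the levels before assigning the value. The sketch below appends the new level to the end of the existing ones, which, for an ordered factor, ranks it highest. It rebuilds the pain factor so that it runs on its own.

```r
factor_pain_vector <- factor(
  c("strong pain", "little pain", "little pain",
    "very strong pain", "no pain", "strong pain", "strong pain"),
  levels = c("no pain", "little pain", "strong pain", "very strong pain"),
  ordered = TRUE
)

# Append the new level; the existing levels keep their order
levels(factor_pain_vector) <- c(levels(factor_pain_vector), "excruciating pain")

# The assignment now succeeds without generating NA
factor_pain_vector[5] <- "excruciating pain"
anyNA(factor_pain_vector)                      # FALSE
factor_pain_vector[5] > factor_pain_vector[4]  # TRUE
```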

## `summary()` Function

The `summary()` function is a convenient way to get a quick overview of an object's contents. You can use the function on a vector:

```r
# Summary for a vector
summary(lethal_dose_vector)
```

R will print out the following output:

```
   Length     Class      Mode 
       15 character character 
```

You can also use the `summary()` function on a factor:

```r
# Summary for a factor
summary(factor_lethal_dose_vector)
```

The output for a factor will be a little different:

```
    dead not dead 
       9        6 
```

Info

The `summary()` function is a generic function which can be used across various data structures!
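To illustrate the generic behaviour, the sketch below calls `summary()` on a factor and on a numeric vector. The factor is rebuilt here so that the example runs on its own; the names `counts` and `numeric_summary` and the numeric values are made up for illustration. For a factor the result is a named vector of counts per level, while for numbers it is a set of descriptive statistics.

```r
lethal_dose_vector <- c("D", "ND", "ND", "ND", "D", "ND", "D", "D",
                        "D", "D", "D", "D", "ND", "ND", "D")
factor_lethal_dose_vector <- factor(lethal_dose_vector,
                                    levels = c("D", "ND"),
                                    labels = c("dead", "not dead"))

# For a factor, summary() returns the number of elements per level
counts <- summary(factor_lethal_dose_vector)
counts[["dead"]]      # 9
counts[["not dead"]]  # 6

# For a numeric vector (illustrative values), summary() returns
# the minimum, quartiles, mean and maximum
numeric_summary <- summary(c(1, 2, 3, 4, 10))
numeric_summary[["Mean"]]  # 4
```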