9 Categorical Vectors
9.1 Warm-Up
Use this line of code to create a categorical variable object, y.
y <- factor(c("a", "b", "c", "b", "d", "a", "a"))
Use the functions str() and plot() to explore the object.
What do you notice about the order of the letters we gave to factor() and the order returned by str() and plot()?
Change the order of the letters in the code. Then use str() and plot() again. What do you notice now?
9.2 Outcomes
Objective: To create factor vectors from other vectors and manipulate factors by changing their orders, labels, and levels.
Why it matters: Factors provide a way to include character data in modeling and plotting, and manipulating factors can change things like intercept and interaction terms in models, and axis orders in plots.
Learning outcomes:
Fundamental Skills | Extended Skills |
|
|
Key functions:
factor()
levels()
fct_relevel()
fct_recode()
fct_relabel()
fct_collapse()
9.3 Factors
We can divide data into two general categories: continuous and categorical. Continuous data is numeric, has a natural order, and can potentially take on an infinite number of values. Examples include age, income, and health care expenditures. In contrast, categorical data takes on a limited number of values and may or may not have a natural order. Examples without a natural order include race, state of residence, and political affiliation. Examples with a natural order include Likert scale items (e.g., disagree, neutral, agree), socioeconomic status, and educational attainment.
The distinction between continuous and categorical variables is fundamental to how we use them in an analysis. For example, in a regression model, continuous variables give us slopes, while categorical variables give us group-specific shifts in the intercept. If we want to use a variable as an outcome, we will have to decide whether to treat it as continuous or categorical (based on its distribution and number of observed values). If it is categorical, we need to decide whether it is ordered (for an ordinal logistic regression) or unordered (for a multinomial logistic regression).
In R, categorical data is managed as factors. We specify which variables are factors when we create and store them, and then they are treated as categorical variables in a model without any additional specification. This contrasts with other software like Stata, SAS, and SPSS, where we specify which variables are categorical in our model syntax.
9.3.1 Structure
Understanding how R represents factors will help us understand how to manipulate them.
Begin with a simple character vector, x, that contains the letters a, b, c, and d.
x <- c("a", "b", "c", "b", "d", "a", "a")
When we print x, R quotes the values:
x
[1] "a" "b" "c" "b" "d" "a" "a"
Now, make a factor called y from x with the factor() function:
y <- factor(x)
y
[1] a b c b d a a
Levels: a b c d
y prints differently than x. It does not have quotes around its values, and it contains a line about something called “levels,” which contains the unique values of y.
A factor is composed of two parts: an integer vector and an ordered label vector. To see the underlying integer vector of y, we can coerce it to numeric with as.numeric():
as.numeric(y)
[1] 1 2 3 2 4 1 1
And we can see the ordered labels with the levels() function:
levels(y)
[1] "a" "b" "c" "d"
The label “a” is the first level, “b” the second, “c” the third, and “d” the fourth. Factor levels are ordered alphabetically by default. We could make a key as follows:
Integer | Label |
---|---|
1 | a |
2 | b |
3 | c |
4 | d |
Now, our y factor is actually an integer vector, but when we print it, R shows the corresponding labels.
Compare the output from y and as.numeric(y) once again, and refer to the table above.
y
[1] a b c b d a a
Levels: a b c d
as.numeric(y)
[1] 1 2 3 2 4 1 1
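The integer-plus-labels structure can also be checked directly. The following is a minimal sketch in base R (the name rebuilt is introduced here for illustration only): it strips the factor class to expose the stored codes, then uses those codes to index the labels and recover the original character values.

```r
# Recreate y from the text
y <- factor(c("a", "b", "c", "b", "d", "a", "a"))

# Remove the factor class to see the stored integer codes
# together with the "levels" attribute
unclass(y)

# Index the labels by the integer codes to rebuild the character values
rebuilt <- levels(y)[as.integer(y)]
rebuilt

# as.character() performs the same lookup
identical(rebuilt, as.character(y))
```

This lookup is what R does when it prints a factor: the integers are stored, and the labels are substituted for display.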
9.3.2 Defaults
When we include a character variable when plotting or modeling in R, R treats it as a factor, and its defaults are to
- sort levels alphabetically by their label
- leave the existing labels as-is
- include all unique values
Why are these problems?
To see why this is not ideal, simulate some survey response data, where individuals could indicate whether they agreed, neither agreed nor disagreed, or disagreed with a statement, or if the statement did not apply to them (“not applicable”). Individuals could also choose to not answer the question (“no answer”).
answers <- c("AGREE",
             "NEITHER",
             "DISAGREE",
             "NOT APPLICABLE",
             "NO ANSWER")
responses <- factor(sample(answers, 100, replace = TRUE))
Now, make a plot of counts by response.
library(ggplot2)
ggplot() +
  geom_bar(aes(x = responses))
(To learn more about how to use ggplot, see Data Visualization in R.)
We can see R’s defaults in action:
- The levels were ordered alphabetically, but it would make more sense to have “neither” between “agree” and “disagree.”
- The labels were used as-is, but we might choose to not use all-uppercase labels to improve the appearance of the plot (e.g., “Agree” rather than “AGREE”).
- All values in our data were included, but we might consider combining the two categories of “no answer” and “not applicable” to a single “missing.”
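Because sample() draws at random, the counts behind the plot will differ each time the code is run. As a minimal sketch, assuming an arbitrary seed of 123 (the seed is not part of the original example), the draw can be fixed with set.seed() and the default level order and counts inspected without a plot:

```r
# Fix the random number generator so the simulated responses are reproducible.
# The seed value 123 is an arbitrary choice for illustration.
set.seed(123)

answers <- c("AGREE",
             "NEITHER",
             "DISAGREE",
             "NOT APPLICABLE",
             "NO ANSWER")
responses <- factor(sample(answers, 100, replace = TRUE))

# The default level order, which the plot's x-axis follows
levels(responses)

# The counts that the bar chart displays
table(responses)
```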
We will learn how to manipulate these three aspects of factors in R: order (releveling), labels (recoding), and number of categories (collapsing). For all of these operations, we will be making use of the forcats package from the tidyverse, which makes it easy to manipulate factors.
9.4 Releveling
When we relevel a factor, we change the order of its levels. If we give fct_relevel() the name of a factor and the name of a level, it moves that level to the first position and leaves the others in their current order.
First, load the tidyverse.
library(tidyverse)
Using y from earlier, create y_relevel, which has “b” instead of “a” as its first level.
y_relevel <- fct_relevel(y, "b")
The levels of y_relevel are as follows:
Integer | Label |
---|---|
1 | b |
2 | a |
3 | c |
4 | d |
Compare the output from y and y_relevel when printing them as is, printing their underlying integers, and printing their levels.
y
[1] a b c b d a a
Levels: a b c d
y_relevel
[1] a b c b d a a
Levels: b a c d
as.numeric(y)
[1] 1 2 3 2 4 1 1
as.numeric(y_relevel)
[1] 2 1 3 1 4 2 2
levels(y)
[1] "a" "b" "c" "d"
levels(y_relevel)
[1] "b" "a" "c" "d"
The data still prints the same, but the order of the levels has changed, and this is reflected in the integer output. In y_relevel, “b” is 1 and “a” is 2.
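Two related tools are worth knowing, sketched here under the assumption that the forcats package is installed. Base R's relevel() also moves a single level to the first position, and the after argument of fct_relevel() places the named level somewhere other than the front.

```r
library(forcats)

y <- factor(c("a", "b", "c", "b", "d", "a", "a"))

# Base R: relevel() moves one level to the first position
levels(relevel(y, ref = "b"))

# forcats: `after` controls where the named level lands;
# after = Inf sends it to the last position
levels(fct_relevel(y, "a", after = Inf))
```

The first call gives the same level order as y_relevel, and the second moves “a” behind “d”.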
9.4.1 Changing the Reference Level
We can use R’s built-in penguins dataset (included in the datasets package as of R 4.5.0), which has data on penguins from Antarctica. It has 3 factor variables and 5 numeric variables.
str(penguins)
'data.frame': 344 obs. of 8 variables:
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_len : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_dep : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_len: int 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
The first, or reference, level of island is “Biscoe”:
levels(penguins$island)
[1] "Biscoe" "Dream" "Torgersen"
If we make box plots of flipper length (flipper_len) by island, we see that “Biscoe” is the first value on the x-axis:
plot(flipper_len ~ island, penguins)
And if we predict flipper_len by island in a linear model, we will get this output:
summary(lm(flipper_len ~ island, penguins))
Call:
lm(formula = flipper_len ~ island, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-37.707 -5.196 1.804 6.927 21.293
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.7066 0.8621 243.25 <2e-16 ***
islandDream -16.6340 1.3207 -12.60 <2e-16 ***
islandTorgersen -18.5105 1.7824 -10.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.14 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.376, Adjusted R-squared: 0.3723
F-statistic: 102.1 on 2 and 339 DF, p-value: < 2.2e-16
Our reference level, Biscoe, is omitted since it is represented by the intercept. The other islands’ coefficients are comparisons to the reference category, or adjustments to the intercept. In R model output, the coefficients for the levels of a factor variable are labeled with the variable name followed by the level label, so the coefficient for the Dream level of the island variable is called islandDream.
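The naming rule and the omitted reference level can be seen without fitting a model by inspecting the design matrix that R builds from a factor. This is a minimal sketch using the small y factor from earlier rather than penguins; the names mm and y_b are introduced here for illustration.

```r
y <- factor(c("a", "b", "c", "b", "d", "a", "a"))

# model.matrix() builds the columns a model would use:
# an intercept plus one indicator column per non-reference level
mm <- model.matrix(~ y)
colnames(mm)

# With "b" as the first level, "b" is absorbed into the intercept instead
y_b <- relevel(y, ref = "b")
colnames(model.matrix(~ y_b))
```

Each column name is the variable name pasted to a level label, and the first level never gets its own column.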
If we want to change the ordering of a categorical variable in a plot or change the reference level in a statistical model, we need to relevel our factor. We can make another variable in penguins called island_relevel where “Dream” is the reference level:
penguins$island_relevel <- fct_relevel(penguins$island, "Dream")
Now, if we plot flipper_len by island_relevel, “Dream” is the first value on the x-axis, followed by “Biscoe” and “Torgersen.”
plot(flipper_len ~ island_relevel, penguins)
And if we fit our model again with island_relevel, we see that the intercept has changed since it now represents the expected value of flipper_len when the island is set to Dream.
summary(lm(flipper_len ~ island_relevel, penguins))
Call:
lm(formula = flipper_len ~ island_relevel, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-37.707 -5.196 1.804 6.927 21.293
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 193.073 1.000 192.983 <2e-16 ***
island_relevelBiscoe 16.634 1.321 12.595 <2e-16 ***
island_relevelTorgersen -1.877 1.853 -1.013 0.312
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.14 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.376, Adjusted R-squared: 0.3723
F-statistic: 102.1 on 2 and 339 DF, p-value: < 2.2e-16
If we want to calculate the expected value for Biscoe, we would add its coefficient to the intercept, and the result would match the intercept in our previous model.
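As a quick check of that arithmetic, the sketch below copies the rounded estimates from the two model summaries above (the variable names are introduced here for illustration) and adds each coefficient to the Dream intercept.

```r
# Rounded estimates copied from the island_relevel model output
intercept_dream <- 193.073   # (Intercept): expected flipper_len on Dream
coef_biscoe     <- 16.634    # island_relevelBiscoe
coef_torgersen  <- -1.877    # island_relevelTorgersen

# Expected value for Biscoe; compare with the first model's intercept, 209.7066
expected_biscoe <- intercept_dream + coef_biscoe
expected_biscoe

# Expected value for Torgersen; compare with 209.7066 - 18.5105 from the first model
expected_torgersen <- intercept_dream + coef_torgersen
expected_torgersen
```

The small differences from the first model's values come only from rounding in the printed output.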
For more on interpreting and visualizing regression models, see Visualizing Regression Results in R.
9.4.2 Changing All the Levels
So far we have only specified one level with fct_relevel(), but we can specify as many levels as we want. Anything we do not name will follow those we do name.
This is useful when our factor levels have a natural order, like responses on a Likert scale:
survey <-
  factor(c("Strongly agree",
           "Agree",
           "Neither agree nor disagree",
           "Disagree",
           "Strongly disagree"))
levels(survey)
[1] "Agree" "Disagree"
[3] "Neither agree nor disagree" "Strongly agree"
[5] "Strongly disagree"
survey <-
  fct_relevel(survey,
              "Strongly disagree",
              "Disagree",
              "Neither agree nor disagree",
              "Agree",
              "Strongly agree")
levels(survey)
[1] "Strongly disagree" "Disagree"
[3] "Neither agree nor disagree" "Agree"
[5] "Strongly agree"
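When the desired order is known before the factor is created, it can also be supplied up front. This is a minimal sketch using the levels argument of base R's factor(); the name likert_levels is introduced here for illustration.

```r
# The desired order, from most negative to most positive
likert_levels <- c("Strongly disagree",
                   "Disagree",
                   "Neither agree nor disagree",
                   "Agree",
                   "Strongly agree")

# Supplying `levels` sets the order at creation, so no releveling is needed
survey <- factor(c("Strongly agree",
                   "Agree",
                   "Neither agree nor disagree",
                   "Disagree",
                   "Strongly disagree"),
                 levels = likert_levels)

levels(survey)
as.integer(survey)
```

Here “Strongly agree” is stored as 5 because it is the fifth entry in likert_levels, not because of alphabetical sorting.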
9.5 Recoding
Recoding is changing the labels on our factors. To do this, we can supply fct_recode() with our factor and a series of "new_label" = "current_label" pairs. Anything we do not name will be left in its existing state.
Using y again, create y_recode, which has full fruit names instead of just their first letters:
y_recode <-
  fct_recode(y,
             "apple" = "a",
             "banana" = "b",
             "cantaloupe" = "c",
             "durian" = "d")
The levels of y_recode are as follows:
Integer | Label |
---|---|
1 | apple |
2 | banana |
3 | cantaloupe |
4 | durian |
Only the labels of the data are changed when recoding. The order and the underlying numeric values remain the same.
y
[1] a b c b d a a
Levels: a b c d
y_recode
[1] apple banana cantaloupe banana durian apple apple
Levels: apple banana cantaloupe durian
as.numeric(y)
[1] 1 2 3 2 4 1 1
as.numeric(y_recode)
[1] 1 2 3 2 4 1 1
levels(y)
[1] "a" "b" "c" "d"
levels(y_recode)
[1] "apple" "banana" "cantaloupe" "durian"
With penguins, we can change how one or more levels are labeled. Abbreviate Torgersen to “Torg”:
penguins$island_recode <-
  fct_recode(penguins$island,
             "Torg" = "Torgersen")
levels(penguins$island)
[1] "Biscoe" "Dream" "Torgersen"
levels(penguins$island_recode)
[1] "Biscoe" "Dream" "Torg"
Use the table() function after recoding to verify that the recoding worked as intended:
table(penguins$island, penguins$island_recode)
Biscoe Dream Torg
Biscoe 168 0 0
Dream 0 124 0
Torgersen 0 0 52
The first argument of table() (island) is the rows, and the second (island_recode) is the columns. To verify recoding, we only need to care whether the cells are zero or nonzero. We can see that wherever we had “Torgersen” in island, we have “Torg” in island_recode.
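When the new labels already live in a lookup vector, typing each pair by hand is unnecessary. The sketch below assumes forcats is installed and uses a hypothetical named vector, fruit_lookup, whose names are the new labels and whose values are the current ones; the !!! operator splices it into fct_recode().

```r
library(forcats)

y <- factor(c("a", "b", "c", "b", "d", "a", "a"))

# Names are the new labels, values are the current labels
# (fruit_lookup is a name made up for this example)
fruit_lookup <- c(apple = "a", banana = "b", cantaloupe = "c", durian = "d")

# !!! splices the named vector into the "new" = "current" pairs
y_recode <- fct_recode(y, !!!fruit_lookup)
levels(y_recode)

# Cross-tabulate old against new to verify the recoding
table(y, y_recode)
```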
9.5.1 Make Levels Missing
Missing values can be coded with a variety of labels. To turn a level of a factor into missing, recode it as NULL.
All the corresponding values are converted to NA, and the level we made missing is removed from the levels() output.
Remove the level “b” from y:
y_miss <- fct_recode(y, NULL = "b")
y_miss
[1] a <NA> c <NA> d a a
Levels: a c d
as.numeric(y_miss)
[1] 1 NA 2 NA 3 1 1
levels(y_miss)
[1] "a" "c" "d"
The levels of y_miss are as follows:
Integer | Label |
---|---|
1 | a |
2 | c |
3 | d |
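As with other recoding, it is worth confirming that only the intended values became missing. A minimal sketch, assuming forcats is installed: count the NA values, then cross-tabulate the original and new factors with useNA = "ifany" so the missing values get their own column.

```r
library(forcats)

y <- factor(c("a", "b", "c", "b", "d", "a", "a"))
y_miss <- fct_recode(y, NULL = "b")

# Number of values that became NA (y contains two "b" values)
sum(is.na(y_miss))

# By default table() drops NA; useNA = "ifany" adds an <NA> column
table(y, y_miss, useNA = "ifany")
```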
9.5.2 Numeric Categorical Variables
Sometimes, variables appear to be continuous, numeric variables, but they are actually categorical variables.
Likert scale data are one example. Responses may range from 1 to 5 and represent level of agreement. Some may argue that we can treat such a variable as continuous, but for now we will force it to be categorical.
survey <- sample(1:5, 100, replace = TRUE)
If we pass survey to fct_recode(), we get an error:
survey <-
  fct_recode(survey,
             "Strongly agree" = 1,
             "Agree" = 2,
             "Neither agree nor disagree" = 3,
             "Disagree" = 4,
             "Strongly disagree" = 5)
Error in `fct_recode()`:
! `.f` must be a factor or character vector, not an integer vector.
This is because fct_recode() (as well as all the other fct_*() functions) requires a character or factor vector as the first argument (called .f). If we want to manipulate a numeric vector, we need to first coerce it to a character or factor before recoding it. Because its values/labels will then be characters, we need to quote the names of the existing levels (e.g., "1" rather than 1).
survey <-
  fct_recode(as.character(survey),
             "Strongly agree" = "1",
             "Agree" = "2",
             "Neither agree nor disagree" = "3",
             "Disagree" = "4",
             "Strongly disagree" = "5")
9.6 Collapsing
Another factor manipulation is reducing the number of levels, called collapsing. Sometimes we have a factor with many levels, but very few observations exist at some levels. This can cause problems in estimation, especially in logistic regression models. We can combine levels with few observations together.
We can do this with fct_collapse(), and our arguments follow the pattern "new_level" = c("current_level_1", "current_level_2", ...).
We can combine levels from y so that “a” is in a new level called “vowel”, and the other three letters (“b”, “c”, and “d”) are combined into a group called “consonant.”
y_collapse <-
  fct_collapse(y,
               "vowel" = "a",
               "consonant" = c("b", "c", "d"))
Our levels for y_collapse are now as follows:
Integer | Label |
---|---|
1 | vowel |
2 | consonant |
This process is irreversible since we have lost data granularity. We could change “vowel” back into “a” with fct_recode() since this level of y_collapse is only composed of one level from y. However, because “consonant” is a composite of multiple levels, it is impossible to separate this level back into “b”, “c”, and “d”. Therefore, it is a good idea to create a new object or variable when collapsing a factor.
Compare the output and structure of y and y_collapse to see how the data has changed:
y
[1] a b c b d a a
Levels: a b c d
y_collapse
[1] vowel consonant consonant consonant consonant vowel vowel
Levels: vowel consonant
as.numeric(y)
[1] 1 2 3 2 4 1 1
as.numeric(y_collapse)
[1] 1 2 2 2 2 1 1
levels(y)
[1] "a" "b" "c" "d"
levels(y_collapse)
[1] "vowel" "consonant"
After collapsing, we should compare the original and new factors to verify the coding worked as intended:
table(y, y_collapse)
y_collapse
y vowel consonant
a 3 0
b 0 2
c 0 1
d 0 1
Again, we only need to check whether the cells are zero or nonzero. The exact count does not matter. Here, all the values of “a” in y correspond to “vowel” in y_collapse, while all the values of “b”, “c”, and “d” correspond to “consonant.”
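The three operations can be combined to tidy the simulated survey responses from the Defaults section. This is one possible sketch, assuming forcats is installed and an arbitrary seed; the name responses_clean and the new labels (“Agree”, “Neither”, “Disagree”, “Missing”) are choices made for illustration.

```r
library(forcats)

set.seed(1)  # arbitrary seed so the sketch is reproducible
answers <- c("AGREE",
             "NEITHER",
             "DISAGREE",
             "NOT APPLICABLE",
             "NO ANSWER")
responses <- factor(sample(answers, 100, replace = TRUE))

# Collapse: combine the two non-response categories into one level
responses_clean <- fct_collapse(responses,
                                "Missing" = c("NOT APPLICABLE", "NO ANSWER"))

# Recode: replace the all-uppercase labels
responses_clean <- fct_recode(responses_clean,
                              "Agree" = "AGREE",
                              "Neither" = "NEITHER",
                              "Disagree" = "DISAGREE")

# Relevel: put "Neither" between "Disagree" and "Agree"
responses_clean <- fct_relevel(responses_clean,
                               "Disagree", "Neither", "Agree", "Missing")

levels(responses_clean)

# Verify: each original level should map to exactly one new level
table(responses, responses_clean)
```

Storing the result in a new object keeps the original responses available, which matters because the collapsing step cannot be undone.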
9.6.1 Empty Levels
If we take a subset of our data, the levels data for factor variables remains unchanged, even if we have excluded all observations at a certain level.
Take a subset of the first six rows (with head()) of penguins and store them as penguins_subset.
penguins_subset <- head(penguins)
The levels data for penguins_subset still includes all three islands, even though we only have “Torgersen” in our subset.
penguins_subset$island
[1] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen
Levels: Biscoe Dream Torgersen
as.numeric(penguins_subset$island)
[1] 3 3 3 3 3 3
levels(penguins_subset$island)
[1] "Biscoe" "Dream" "Torgersen"
This can have some annoying consequences for some plots, where levels with no observations are still included:
plot(flipper_len ~ island, penguins_subset)
To remove these extra levels, simply use factor() to “reset” the factor and drop unused levels:
penguins_subset$island <- factor(penguins_subset$island)
Now verify that these levels were removed:
penguins_subset$island
[1] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen
Levels: Torgersen
as.numeric(penguins_subset$island)
[1] 1 1 1 1 1 1
levels(penguins_subset$island)
[1] "Torgersen"
plot(flipper_len ~ island, penguins_subset)
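Base R also provides droplevels(), which states the intent more explicitly than calling factor() again. The sketch below uses a small stand-alone factor (island and island_subset are names made up for the example) so it runs without the penguins data.

```r
# A small factor with three levels
island <- factor(c("Torgersen", "Biscoe", "Dream", "Torgersen"))

# Subsetting keeps all three levels, even those with no observations left
island_subset <- island[island == "Torgersen"]
levels(island_subset)

# droplevels() removes the levels that no longer occur
levels(droplevels(island_subset))
```

droplevels() also accepts a data frame, in which case it drops unused levels from every factor column at once.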
9.7 Cheat Sheet
A cheat sheet for factor manipulation is given below. It has examples of fct_relevel(), fct_recode(), and fct_collapse() with the y vector, showing the integer vector and integer-label mapping after each operation.

9.8 Exercises
9.8.1 Fundamental
- Creating factors: R has a built-in list of colors accessible with colors(). Create a vector with the first letter of each color (see str_sub()). Convert it to a factor and then plot() it. Which letter is most common?
chickwts is a dataset with the weights of chickens (weight) that ate different feeds (feed). Use it for exercises 2-4.
If we are working with external data files and make a mistake like overwriting the data, all we need to do is re-import the data.
With built-in datasets, our process looks a little different. Built-in datasets exist in the datasets package or in other packages. When we reference one of those objects, like penguins or chickwts, R will first look in our environment and then in the packages we have loaded.
That means, if we have something else called penguins or chickwts in our environment, R will find that one instead of the copy in the datasets package:
chickwts <- 1
str(chickwts)
num 1
The solution is simple. Just remove the copy in the environment with rm(), so that R will find the copy in the datasets package:
rm(chickwts)
str(chickwts)
'data.frame': 71 obs. of 2 variables:
$ weight: num 179 160 136 227 217 168 108 124 143 140 ...
$ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
If you make a mistake with chickwts in the exercises below, just run rm(chickwts)!
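The package copy can also be reached explicitly with the :: operator, which skips the environment lookup altogether. A minimal sketch:

```r
# Mask the built-in dataset with an object of the same name
chickwts <- 1

# datasets::chickwts refers to the package copy, regardless of the mask
str(datasets::chickwts)

# Remove the mask so the bare name finds the package copy again
rm(chickwts)
str(chickwts)
```

This is handy for comparing a modified copy against the original without deleting the modified one first.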
- Releveling: Using the chickwts dataset, create a new column called feed_relevel from feed with code like this:
chickwts$feed_relevel <- fct_relevel(chickwts$feed, ...)
Order the levels so that running plot(chickwts$feed_relevel) returns this exact plot:
- Recoding: Using the chickwts dataset, create a new column called feed_recode from feed so that plot(chickwts$feed_recode) returns this exact plot:
- Collapsing: Using the chickwts dataset, create a new column called feed_collapse from feed so that plot(chickwts$feed_collapse) returns this exact plot:
9.8.2 Extended
- Relabeling with a function: Create a vector of 100 random numbers from 1 to 10. Convert it to a factor. Then, rename its levels to start with “id” and end with “x”, like “id1x”, “id2x”, etc.
- Numeric to factor: In the mtcars data, all the variables are numeric. Convert vs to a factor, where 0 has the label “V-shaped” and 1 has the label “Straight”.