9 Categorical Data
We generally think of data as a collection of “measurements,” in a loose sense of the word “measurement”. In this loose sense, there are two basic types of “measurement”, measurements on continuous scales, and measurements on categorical scales. (In ordinary speech the word “measurement” often implies a continuous scale.)
Continuous measurements can be represented by a point on a number line, are well-ordered, and in principle can take on one value of an infinite set of choices. Think of a variable like age, which varies continuously from 0 to higher numbers, and where there is a unique order to the ages represented by 1, 5.5, and 55. In principle we can measure age with arbitrary precision - 5.5, or 5.5001 or 5.5000001. The scale here might be measured in days or years, but in any case it is continuous.
Categorical measurements can be represented by arbitrary labels (maybe numerals, maybe character strings), have no conceptual order, and take one value from a finite set of choices. Think of a variable like state of residence, which takes one of about 50 values (Washington, D.C.? Territories?), which have no inherent order.
(Interval and ordinal measurements may be thought of, and are often treated, as continuous measurements with limited precision.)
The distinction between continuous and categorical variables is fundamental to how we use them the analysis. In a regression for example, continuous variables give us slopes and curvature terms, where categorical variables give us intercepts.
In R, it is convenient to manage categorical data as factors. In software like Stata, SAS, and SPSS, we specify which variables are categorical when we call an analytical procedure like regression - no special distinction is made when we are managing or storing our data. In R, we specify which variables are factors when we create and store them - in an analytical procedure we need make no additional specification to distinguish levels of measurement.
In R, a factor refers to a class of data stored in numeric form, usually with some sort of values labels. The numbers (integers) merely represent distinct categories, with no meaningful order to the categories.
For example, we might have a data set where ‘1’ means Green Bay, ‘2’ means Madison, and ‘3’ means Milwaukee.
As with Date class data, we will seldom need to manipulate the underlying integers, we will mainly work with the “human-readable” value labels.
The basic constructor function for data with class ‘factor’ is
For example, we can begin with a character vector of city names, and
factor() to construct a factor from this.
city <- c("Madison", "Milwaukee", "Green Bay") city  "Madison" "Milwaukee" "Green Bay" x <- factor(city) x  Madison Milwaukee Green Bay Levels: Green Bay Madison Milwaukee
Notice that factors print differently than character data - no quotes.
9.1 Factors in Generic Functions
In addition to printing slightly differently than character data, in
generic functions that take numeric inputs, factors are treated differently
as well. Three functions that give different output with factors (versus
a numeric vector) are
We can look at the example data set
chickwts, which includes both
a numeric variable and a factor variable. We learn from
that this data set
was created from an experiment testing the effect of different
feeds on chicken weights.
summary function, the factor
feed produces a frequency table,
rather than the six number summary produced by
str(chickwts) # "weight" is numeric, "feed" is categorical 'data.frame': 71 obs. of 2 variables: $ weight: num 179 160 136 227 217 168 108 124 143 140 ... $ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ... head(chickwts) weight feed 1 179 horsebean 2 160 horsebean 3 136 horsebean 4 227 horsebean 5 217 horsebean 6 168 horsebean summary(chickwts) weight feed Min. :108.0 casein :12 1st Qu.:204.5 horsebean:10 Median :258.0 linseed :12 Mean :261.3 meatmeal :11 3rd Qu.:323.5 soybean :14 Max. :423.0 sunflower:12
In plots, a factor produces a categorical x-axis, and a boxplot rather than a scatter plot.
plot(weight ~ feed, data=chickwts)
In modeling, a factor is used as a categorical variables, generating a set of dummy variables and a set of parameters, rather than a single parameter.
summary(lm(weight ~ feed, data=chickwts))
Call: lm(formula = weight ~ feed, data = chickwts) Residuals: Min 1Q Median 3Q Max -123.909 -34.413 1.571 38.170 103.091 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 323.583 15.834 20.436 < 2e-16 *** feedhorsebean -163.383 23.485 -6.957 2.07e-09 *** feedlinseed -104.833 22.393 -4.682 1.49e-05 *** feedmeatmeal -46.674 22.896 -2.039 0.045567 * feedsoybean -77.155 21.578 -3.576 0.000665 *** feedsunflower 5.333 22.393 0.238 0.812495 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 54.85 on 65 degrees of freedom Multiple R-squared: 0.5417, Adjusted R-squared: 0.5064 F-statistic: 15.36 on 5 and 65 DF, p-value: 5.936e-10
Here, all the categorical parameters are named with the prefix “feed”.
9.2 Logical Comparisons and Math Operators
Logical comparisons are made with the value labels (which are character strings), not the underlying integer codes. Only some logical operators are allowed with factors, namely those based on equality.
rs <- sample(chickwts$feed, 7) rs  meatmeal casein sunflower casein linseed horsebean casein Levels: casein horsebean linseed meatmeal soybean sunflower rs == "casein"  FALSE TRUE FALSE TRUE FALSE FALSE TRUE rs == 1 # no error message, but WRONG!  FALSE FALSE FALSE FALSE FALSE FALSE FALSE rs > "casein" # error Warning in Ops.factor(rs, "casein"): '>' not meaningful for factors  NA NA NA NA NA NA NA
Notice that if we try to check for a numeric value, the numeral is treated as if it were a label and not the underlying data! It would be nice if this at least gave us a warning!
In a similar manner, we will not be doing any math at all with categorical data.
rs + 1 Warning in Ops.factor(rs, 1): '+' not meaningful for factors  NA NA NA NA NA NA NA mean(rs) Warning in mean.default(rs): argument is not numeric or logical: returning NA  NA
9.3 Levels and Labels
factor() function has two key arguments:
There is a crucial difference between these two arguments. The
levels parameter sets the order. The
assigns value labels to the “given” order. If no
specified, the order is sorted alphabetically (which is seldom
what we want).
As an example, we can make a new copy of the
with the levels reordered (“re-leveled”, this is also called
chick <- chickwts # first, create a working copy chick$feed2 <- factor(chick$feed, levels= c("horsebean", "linseed", "soybean", "meatmeal", "casein", "sunflower")) plot(weight ~ feed2, data=chick, main="Reordered levels")
If we attempt to use the
labels argument, we merely relabel
the data (which in this case is WRONG!):
chick$feed3 <- factor(chick$feed, labels= c("horsebean", "linseed", "soybean", "meatmeal", "casein", "sunflower")) opar <- par(mfcol=c(2,1)) plot(weight ~ feed, data=chick, main="Original") plot(weight ~ feed3, data=chick, main="Bad labels")
levels as an “input” function, specifying what numbers to use
to encode categories. Then
labels is an
“output” function, specifying how the underlying data should be displayed.
One other useful function here, to slightly reorder the levels
so that we have a new
category in the first position, is the
chick$feed4 <- relevel(chick$feed, "sunflower") plot(weight ~ feed4, data=chick, main="New first level")
In practice, there are many different ways we might want
to reorder (recode) the levels of a factor. The package
from the tidyverse collection of packages has quite a
few useful functions for this.
9.4 Collapsing and Dropping Categories
To combine several categories of an existing factor into one category, we relabel them.
While we use the labels to combine categories, notice that the underlying integers are also recoded.
chick$feed5 <- factor(chick$feed, labels= c("A", "A", "A", "B", "B", "B")) plot(weight ~ feed5, data=chick)
Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
If you have a lot of categories, but only a few to combine,
fct_recode() is very useful.
To drop categories, just do not mention them among the input
levels. The data is encoded
chick$feed6 <- factor(chick$feed, levels= c("horsebean", "casein")) plot(weight ~ feed6, data=chick)
table(chick$feed, chick$feed6, useNA="ifany")
horsebean casein <NA> casein 0 12 0 horsebean 10 0 0 linseed 0 0 12 meatmeal 0 0 11 soybean 0 0 14 sunflower 0 0 12
9.5 A confusing function name
Just to make things confusing, the
levels() function returns
and sets the level labels
chick$feed2 <- chick$feed levels(chick$feed) # what we have
 "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"
# sets WRONG! labels again levels(chick$feed2) <-c("horsebean", "linseed", "soybean", "meatmeal", "casein", "sunflower") plot(weight ~ feed2, data=chick, main="Bad labels again")
9.6 Categorical Vector Exercises
Convert Numeric Codes to Factors
mtcarsdata, all the variables are numeric. Convert
vsto a factor, where 0 has the label “V-shaped” and 1 has the label “Straight”. How are the levels encoded, numerically?
Do the same for
am, giving 0 the label “automatic” and 1 the label “manual”.
Extract and Convert Row Names
mtcars, take the
row.namesand extract the make of each car type (the first word in the row name). Convert this into a factor.
Produce a frequency count of each car maker.
Plot car weight (
wt) over car make.
Re-level so that Maserati is the first category, then replot.
If is quite possible to have a factor defined with extra levels, levels for which you have labels but no actual data. Here is an example:
x <- sample(c("A", "B"), 20, replace=TRUE) xf <- factor(x, levels=c("A", "B", "C"))
Produce a table of frequency counts for
Produce a bar chart of counts,
For most analyses, we would rather drop the extra category. The simplest approach is to refactor, without specifying levels or labels.
xf2 <- factor(xf)
Then retabulate and replot.