3.6 Creating/changing variables
The mutate() function is used to modify a variable or create a new variable.
A variable is changed by using its name as a parameter name and setting the
parameter value to an expression that computes the new values.
The data frame is typically piped in and the data frame name
is not needed when referencing the variable names.
A new variable is created when a parameter name is used that is
not an existing variable of the data frame.
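As a minimal sketch of both uses, the code below applies mutate() to a small made-up data frame. The firms data frame and its sales, costs, and profit columns are hypothetical and are not part of the forbes data.

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
firms <- data.frame(
  name  = c("A", "B", "C"),
  sales = c(10, 20, 30),
  costs = c(4, 25, 12)
)

firms <- firms %>%
  mutate(
    sales  = sales * 1000,   # existing name: the column is overwritten
    costs  = costs * 1000,   # existing name: the column is overwritten
    profit = sales - costs   # new name: a column is added
  )

firms
```

Expressions inside a single mutate() call are evaluated in order, so profit is computed from the rescaled sales and costs.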
The examples in this section also demonstrate a number of useful functions to create and inspect variables.
Examples
Creating a new variable.
This example uses a simple mathematical formula to create a new variable based on the value of two other variables.
forbes <- forbes %>%
  mutate(
    pe = market_value / profits
  )

forbes %>%
  pull(pe) %>%
  summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
-377.00   11.67   18.50     Inf   28.66     Inf       5
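The Inf values in the mean and maximum are what R produces when a positive number is divided by zero. One plausible cause, which is an assumption rather than something the source confirms, is that some companies have profits of exactly zero. The made-up vectors below sketch how such a ratio behaves.

```r
# Made-up values standing in for market_value and profits.
market_value <- c(100, 50, 80, 30)
profits      <- c(5, 0, -2, NA)

pe <- market_value / profits
pe

# A single Inf is enough to make the mean and the maximum Inf.
summary(pe)
```

A zero profit gives Inf, a negative profit gives a negative ratio, and a missing profit gives NA.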
The pull() function returns a variable of a data frame as a vector. The base R summary() function calculates the five number summary, along with the mean and the number of NA values, of a numeric vector.

Conditionally changing the values of a variable.
This example demonstrates two functions that can be used to change the values of a variable based on a condition on that variable (or possibly on another variable).
The following code checks whether profits is a (strictly) positive number. If it is, the code calculates the price to earnings ratio. Otherwise it sets the price to earnings ratio to zero.
The if_else() function takes three parameters. The first is the condition. For each row where the condition is TRUE, the true value (the second parameter) is used. Where the condition is FALSE, the false value (the third parameter) is used. The true and false parameters can each be either a column or a scalar value.

forbes <- forbes %>%
  mutate(
    pe = if_else(profits > 0, market_value / profits, 0)
  )

forbes %>%
  pull(pe) %>%
  summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   11.53   18.28   26.08   28.42  758.00       5
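The behaviour of if_else() can be tried on a few made-up values. This is only a sketch; the vectors below are hypothetical stand-ins for the forbes columns.

```r
library(dplyr)

# Made-up values standing in for market_value and profits.
market_value <- c(100, 50, 80, 30)
profits      <- c(5, 0, -2, NA)

# Rows with non-positive profits get 0; a missing condition stays missing.
pe <- if_else(profits > 0, market_value / profits, 0)
pe
```

Because the condition is NA wherever profits is missing, those rows remain NA, which is consistent with the NA's count being unchanged in the summary.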
The following code uses the recode() function to change values of United States to USA. The recode() function tests each value of the variable for equality with the parameter names. If the value of the variable equals a parameter name, that value is replaced with the corresponding parameter value. United States is not a valid variable or parameter name, since it contains a space. To get R to accept United States as a parameter name it is enclosed in backticks, `.

forbes <- forbes %>%
  mutate(
    country_abrv = recode(country, `United States` = "USA")
  )

forbes %>%
  pull(country_abrv) %>%
  summary()
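A short sketch of recode() on a made-up character vector may make the matching rule concrete. The vector and the CAN abbreviation are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Made-up country values.
country <- c("United States", "Canada", "United States", "Mexico")

# Names on the left are the old values; strings on the right replace them.
# Backticks are only needed for names that are not syntactically valid.
country_abrv <- recode(country, `United States` = "USA", Canada = "CAN")
country_abrv
```

Values that match none of the parameter names, such as Mexico here, are left unchanged.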
Creating indicator variables.
The %in% operator is used to determine if each value on the left is in the set of values on the right. It returns a logical value, TRUE or FALSE, for each value on the left.
In this example the %in% operator is used to determine if each of the countries is one of the NAFTA countries.

forbes <- forbes %>%
  mutate(
    nafta = country %in% c("United States", "Canada", "Mexico")
  )

forbes %>%
  pull(nafta) %>%
  summary()

   Mode   FALSE    TRUE
logical    1176     824
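The element-by-element behaviour of %in% can be sketched with a made-up vector that includes a missing value.

```r
# Made-up country values, including a missing one.
country <- c("Canada", "Japan", NA, "Mexico")

# One TRUE or FALSE per element on the left; a missing value gives FALSE.
nafta <- country %in% c("United States", "Canada", "Mexico")
nafta

# By contrast, == returns NA for the missing value.
country == "Canada"

summary(nafta)
```

Since %in% never returns NA, the resulting indicator has no missing values even when the original variable does.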
The base R summary() function counts the number of TRUE and FALSE values in a logical variable.

Creating a factor variable.
Factor variables are used to represent categorical variables, variables that take on a fixed set of values. Indicator variables are a special case of categorical variable, having two values, TRUE and FALSE.
The following code uses the base R factor() function to change the country variable from character to a factor variable.

forbes <- forbes %>%
  mutate(
    country = factor(country)
  )

forbes %>%
  pull(country) %>%
  summary()
                      Africa                    Australia
                           2                           37
   Australia/ United Kingdom                      Austria
                           2                            8
                     Bahamas                      Belgium
                           1                            9
                     Bermuda                       Brazil
                          20                           15
                      Canada               Cayman Islands
                          56                            5
                       Chile                        China
                           4                           25
              Czech Republic                      Denmark
                           2                           10
                     Finland                       France
                          11                           63
      France/ United Kingdom                      Germany
                           1                           65
                      Greece              Hong Kong/China
                          12                           20
                     Hungary                        India
                           2                           27
                   Indonesia                      Ireland
                           7                            8
                     Islands                       Israel
                           1                            8
                       Italy                        Japan
                          41                          316
                      Jordan                   Kong/China
                           1                            4
                       Korea                      Liberia
                           4                            1
                  Luxembourg                     Malaysia
                           2                           16
                      Mexico                  Netherlands
                          17                           28
 Netherlands/ United Kingdom                  New Zealand
                           2                            1
                      Norway                     Pakistan
                           8                            1
      Panama/ United Kingdom                         Peru
                           1                            1
                 Philippines                       Poland
                           2                            1
                    Portugal                       Russia
                           7                           12
                   Singapore                 South Africa
                          16                           15
                 South Korea                        Spain
                          45                           29
                      Sweden                  Switzerland
                          26                           34
                      Taiwan                     Thailand
                          35                            9
                      Turkey               United Kingdom
                          12                          137
   United Kingdom/ Australia  United Kingdom/ Netherlands
                           1                            1
United Kingdom/ South Africa                United States
                           1                          751
                   Venezuela
                           1
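The conversion done by factor() can be sketched on a short made-up character vector.

```r
# Made-up country values.
country <- c("Japan", "Canada", "Japan", "Mexico")

country_f <- factor(country)

# By default the levels are the sorted unique values.
levels(country_f)

# Each value is stored as an integer code pointing at its level.
as.integer(country_f)

summary(country_f)
```

Here the levels are Canada, Japan, and Mexico, and Japan is counted twice.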
The base R summary() function counts the number of occurrences of each level for factor variables.
The following code uses the base R cut() function to convert a numeric variable to a factor variable. You define a set of bins to sort the values into. This is done using the breaks parameter. These breaks form the lower and upper limits of each bin. The names given to the bins are set by the labels parameter.
The following code uses cut() to divide profits into low, mid, high, and very high bins.

forbes <- forbes %>%
  mutate(
    profit_lev = cut(profits,
      breaks = c(-Inf, .08, .44, 10, Inf),
      labels = c("low", "mid", "high", "very high")
    )
  )

forbes %>%
  pull(profit_lev) %>%
  summary()

      low       mid      high very high      NA's
      501       999       489         6         5
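How values that fall exactly on a break are binned can be checked with a few made-up profits. This sketch reuses the breaks and labels from the example and relies on the cut() default of right = TRUE, under which each bin includes its upper limit but not its lower limit.

```r
# Made-up profit values, several sitting exactly on a break.
profits <- c(-1, 0.08, 0.09, 0.44, 5, 10, 12, NA)

profit_lev <- cut(
  profits,
  breaks = c(-Inf, .08, .44, 10, Inf),
  labels = c("low", "mid", "high", "very high")
)

# Show each value next to the bin it was assigned to.
data.frame(profits, profit_lev)
```

With these defaults 0.08 is placed in low, 0.44 in mid, and 10 in high, while the missing value stays NA.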
Combining categories
Categories of string variables can be combined using recode() and if_else(). This can result in a fair amount of repetition in code when there are a number of values that need to be grouped. This example uses case_when() to recode the categories of the category variable into four new categories. The conditions are checked in order. The TRUE condition serves as the else condition. It is what will be done if none of the prior conditions are TRUE.

forbes <- forbes %>%
  mutate(
    industry = case_when(
      category %in% c("Banking", "Insurance", "Diversified financials") ~ "finance",
      category == "Oil & gas operations" ~ "oil_gas",
      category %in% c("Technology hardware & equipment", "Semiconductors",
                      "Drugs & biotechnology", "Software & services") ~ "tech",
      TRUE ~ "other"
    )
  )

forbes %>%
  select(name, category, industry) %>%
  head()
# A tibble: 6 x 3
  name                category             industry
  <chr>               <chr>                <chr>
1 Citigroup           Banking              finance
2 General Electric    Conglomerates        other
3 American Intl Group Insurance            finance
4 ExxonMobil          Oil & gas operations oil_gas
5 BP                  Oil & gas operations oil_gas
6 Bank of America     Banking              finance
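Because the conditions are checked in order, where a condition sits in case_when() matters. The sketch below uses made-up profit values and hypothetical size labels to show this, including what happens to missing values.

```r
library(dplyr)

# Made-up profit values, including a missing one.
profits <- c(-3, 0.2, 4, NA)

size <- case_when(
  is.na(profits) ~ "unknown",  # checked first, so NA never reaches TRUE
  profits < 0    ~ "loss",
  profits < 1    ~ "small",    # only reached when profits >= 0
  TRUE           ~ "large"     # the else branch
)
size
```

If the is.na() line were removed, the missing value would fall through to the TRUE branch and be labelled large, since a comparison with NA is never TRUE.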