 Supporting Statistical Analysis for Research

## 3.6 Creating/changing variables

The mutate() function is used to modify an existing variable or create a new one. A variable is changed by using its name as a parameter name and setting the parameter value to the new values. The data frame is typically piped in, so the data frame name is not needed when referencing variable names. A new variable is created when the parameter name is not the name of an existing variable in the data frame.
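The difference between changing and creating a variable can be seen on a small scale. The following is a minimal sketch using a made-up tibble (`firms` and its values are hypothetical and are not part of the forbes data).

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
firms <- tibble(
  name    = c("A", "B", "C"),
  sales   = c(10, 20, 30),
  profits = c(1, 4, 3)
)

firms_new <-
  firms %>%
  mutate(
    sales  = sales * 2,       # sales already exists, so it is overwritten
    margin = profits / sales  # margin does not exist, so it is added
  )

firms_new
```

Parameters of mutate() are evaluated in order, so margin is calculated from the updated sales values.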

The examples in this section also demonstrate a number of useful functions to create and inspect variables.

Examples

1. Creating a new variable.

This example uses a simple mathematical formula to create a new variable based on the value of two other variables.

forbes <-
  forbes %>%
  mutate(
    pe = market_value / profits
  )

forbes %>%
  pull(pe) %>%
  summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
-377.00   11.67   18.50     Inf   28.66     Inf       5 

The pull() function returns a variable of a data frame as a vector. For a numeric vector, the base R summary() function calculates the five-number summary along with the mean and, when present, the number of NA values.
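The Inf values reported for Mean and Max in the output above are what R produces when a positive number is divided by zero, and the negative minimum is what results when profits is negative. The sketch below reproduces both on a hypothetical toy tibble (`toy` and its values are made up for illustration).

```r
library(dplyr)

# Hypothetical toy data: one row with zero profits, one with a loss,
# and one with a missing market value
toy <- tibble(
  market_value = c(10, 5, 8, NA),
  profits      = c(2, 0, -4, 1)
)

toy_pe <-
  toy %>%
  mutate(pe = market_value / profits) %>%
  pull(pe)

toy_pe           # 5 Inf -2 NA
summary(toy_pe)  # the mean and max are Inf, and one NA is counted
```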

2. Conditionally changing the values of a variable.

This example demonstrates two functions that can be used to change the values of a variable based on a condition on that variable (or possibly on another variable).

The following code checks whether profits is a (strictly) positive number. If it is, the price to earnings ratio is calculated. Otherwise the price to earnings ratio is set to zero.

The if_else() function takes three parameters. The first is the condition, which is evaluated for each row. The second is the value used for rows where the condition is TRUE, and the third is the value used for rows where it is FALSE. The true and false parameters can each be either a column or a scalar value.

forbes <-
  forbes %>%
  mutate(
    pe = if_else(profits > 0, market_value / profits, 0)
  )

forbes %>%
  pull(pe) %>%
  summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   11.53   18.28   26.08   28.42  758.00       5 
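The element-by-element behavior of if_else() is easier to see on a few values. This is a small sketch with hypothetical vectors. It also shows why NA values remain in the summary above: when the condition itself is NA, the result is NA rather than the false value.

```r
library(dplyr)

# Hypothetical values: a profit, zero profits, a loss, and a missing value
profits      <- c(2, 0, -4, NA)
market_value <- c(10, 5, 8, 3)

pe <- if_else(profits > 0, market_value / profits, 0)
pe  # 5 0 0 NA
```

The true and false values need to be of compatible types. Mixing a numeric true value with a character false value, for example, results in an error.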

The following code uses the recode() function to change values of United States to USA. The recode() function tests each value of the variable for equality with the parameter names. If the value equals a parameter name, it is replaced with that parameter's value.

United States is not a valid variable or parameter name, since it contains a space. To get R to accept United States as a parameter name, it is enclosed in backticks (`).

forbes <-
  forbes %>%
  mutate(
    country_abrv = recode(country, `United States` = "USA")
  )

forbes %>%
  pull(country_abrv) %>%
  summary()
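
The backtick quoting can be tried on a short character vector. This is a sketch on hypothetical values. The second replacement, Mexico to MEX, is added only to show that names without spaces need no backticks.

```r
library(dplyr)

# Hypothetical values for illustration
country <- c("United States", "Canada", "United States", "Mexico")

country_abrv <- recode(country, `United States` = "USA", Mexico = "MEX")
country_abrv  # "USA" "Canada" "USA" "MEX"
```

Values that do not match any parameter name, such as Canada here, are left unchanged.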

3. Creating indicator variables.

The %in% operator is used to determine whether each value on the left is in the set of values on the right. It returns a logical (boolean) value, TRUE or FALSE, for each value on the left.

In this example the %in% operator is used to determine if each of the countries is one of the NAFTA countries.

forbes <-
  forbes %>%
  mutate(
    nafta = country %in% c("United States", "Canada", "Mexico")
  )

forbes %>%
  pull(nafta) %>%
  summary()
   Mode   FALSE    TRUE
logical    1176     824 

The base R summary() function counts the number of TRUE and FALSE values in the variable.
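
A short base R sketch of %in% on hypothetical values follows. Note that a missing value on the left gives FALSE rather than NA, so an indicator created this way has no missing values.

```r
# Hypothetical values for illustration
nafta_countries <- c("United States", "Canada", "Mexico")
country <- c("Canada", "Japan", "Mexico", NA)

is_nafta <- country %in% nafta_countries
is_nafta              # TRUE FALSE TRUE FALSE
sum(is_nafta)         # 2, since TRUE counts as 1 and FALSE as 0
as.integer(is_nafta)  # 1 0 1 0, a 0/1 coding of the indicator
```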

4. Creating a factor variable.

Factor variables are used to represent categorical variables, that is, variables that take on a fixed set of values. Indicator variables can be thought of as a special case, having only two values, TRUE and FALSE.

The following code uses the base R factor() function to change the country variable from character to a factor variable.

forbes <-
  forbes %>%
  mutate(
    country = factor(country)
  )

forbes %>%
  pull(country) %>%
  summary()
                      Africa                    Australia
                           2                           37
   Australia/ United Kingdom                      Austria
                           2                            8
                     Bahamas                      Belgium
                           1                            9
                     Bermuda                       Brazil
                          20                           15
                      Canada               Cayman Islands
                          56                            5
                       Chile                        China
                           4                           25
              Czech Republic                      Denmark
                           2                           10
                     Finland                       France
                          11                           63
      France/ United Kingdom                      Germany
                           1                           65
                      Greece              Hong Kong/China
                          12                           20
                     Hungary                        India
                           2                           27
                   Indonesia                      Ireland
                           7                            8
                     Islands                       Israel
                           1                            8
                       Italy                        Japan
                          41                          316
                      Jordan                   Kong/China
                           1                            4
                       Korea                      Liberia
                           4                            1
                  Luxembourg                     Malaysia
                           2                           16
                      Mexico                  Netherlands
                          17                           28
 Netherlands/ United Kingdom                  New Zealand
                           2                            1
                      Norway                     Pakistan
                           8                            1
      Panama/ United Kingdom                         Peru
                           1                            1
                 Philippines                       Poland
                           2                            1
                    Portugal                       Russia
                           7                           12
                   Singapore                 South Africa
                          16                           15
                 South Korea                        Spain
                          45                           29
                      Sweden                  Switzerland
                          26                           34
                      Taiwan                     Thailand
                          35                            9
                      Turkey               United Kingdom
                          12                          137
   United Kingdom/ Australia  United Kingdom/ Netherlands
                           1                            1
United Kingdom/ South Africa                United States
                           1                          751
                   Venezuela
                           1 

The base R summary() function counts the number of occurrences of each level for factor variables.
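
What factor() stores can be inspected on a small vector. The sketch below uses hypothetical values. The levels are the sorted unique values, and each observation is stored as an integer code pointing to its level.

```r
# Hypothetical values for illustration
country <- c("Japan", "Canada", "Japan", "Mexico", "Japan")
country_f <- factor(country)

levels(country_f)      # "Canada" "Japan" "Mexico"
as.integer(country_f)  # 2 1 2 3 2
summary(country_f)     # Canada 1, Japan 3, Mexico 1
```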

The base R cut() function converts a numeric variable to a factor variable. You define a set of bins to sort the values into. This is done using the breaks parameter. These breaks form the lower and upper limits of each bin. By default, a value equal to a break falls in the bin below it. The names given to the bins are set by the labels parameter.

The following code uses cut() to divide profits into low, mid, high, and very high bins.

forbes <-
  forbes %>%
  mutate(
    profit_lev = cut(profits,
      breaks = c(-Inf, .08, .44, 10, Inf),
      labels = c("low", "mid", "high", "very high")
    )
  )

forbes %>%
  pull(profit_lev) %>%
  summary()
      low       mid      high very high      NA's
      501       999       489         6         5 
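
How values that fall exactly on a break are binned can be checked with a few hypothetical values. This sketch uses the same breaks and labels as above and relies on the default behavior of cut(), in which each bin includes its upper limit but not its lower limit.

```r
# Hypothetical values, several chosen to sit exactly on a break
profits <- c(-1, 0.08, 0.09, 0.44, 5, 10, 11, NA)

profit_lev <- cut(
  profits,
  breaks = c(-Inf, .08, .44, 10, Inf),
  labels = c("low", "mid", "high", "very high")
)

# 0.08 is binned as low, 0.44 as mid, and 10 as high; NA stays NA
data.frame(profits, profit_lev)
```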

5. Combining categories.

Categories of string variables can be combined using recode() and if_else(). This can result in a fair amount of repetition in the code when a number of values need to be grouped. This example uses case_when() to recode the categories of the category variable into four new categories. The conditions are checked in order, and the first one that is TRUE determines the result. The TRUE condition serves as the else condition. It is what will be used if none of the prior conditions are TRUE.

forbes <-
  forbes %>%
  mutate(
    industry =
      case_when(
        category %in% c("Banking", "Insurance", "Diversified financials") ~
          "finance",
        category == "Oil & gas operations" ~
          "oil_gas",
        category %in% c("Technology hardware & equipment", "Semiconductors",
                        "Drugs & biotechnology", "Software & services") ~
          "tech",
        TRUE ~
          "other"
      )
  )

forbes %>%
  select(name, category, industry) %>%
  head()
# A tibble: 6 x 3
  name                category             industry
  <chr>               <chr>                <chr>
1 Citigroup           Banking              finance
2 General Electric    Conglomerates        other
3 American Intl Group Insurance            finance
4 ExxonMobil          Oil & gas operations oil_gas
5 BP                  Oil & gas operations oil_gas
6 Bank of America     Banking              finance
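
Because the conditions of case_when() are checked in order, overlapping conditions need to be listed from most to least specific. The sketch below uses hypothetical profit values and made-up cutoffs to show this, and to show that NA values end up in the TRUE (else) branch.

```r
library(dplyr)

# Hypothetical values for illustration
profits <- c(-2, 0.05, 0.3, 12, NA)

size <- case_when(
  profits > 10  ~ "very high",  # checked first
  profits > 0.4 ~ "high",       # reached only when profits <= 10
  profits > 0   ~ "positive",
  TRUE          ~ "other"       # everything else, including NA
)
size  # "other" "positive" "positive" "very high" "other"
```

If the profits > 0 condition were listed first, the value 12 would be labeled positive and the later conditions would never be reached for it.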