4.8 Coding missing values - part 2
4.8.1 Programming skills
4.8.1.1 Anonymous functions
An anonymous function is a function that does not have a name. This means that the function is for a single use and has no permanence. It can not be used latter without being created again.
Anonymous functions are a quick and simple approach to creating your own function. (The syntax for their specification is much simpler than a typical user-defined function.) They are a handy tool for data wranglers when a custom function (not already defined) is needed.
4.8.1.2 Repeating an operation on multiple variables
In the prior sections of this chapter we introduced the
concept of vectorization and used it to repeat operations
on all rows of a variable.
There are times when the same (vectorized) operation needs
to be applied to two or more variables of a data frame.
For example, in the prior section, the ?
values where
changed to missing in two variables of the data frame.
The code to make the change was written twice,
once for each of the two variable.
This is not hard if there are only two variables that
the operation is being repeated on.
When there are more, maybe many more, you will want to
use a function/method to apply the operation to a
set of variables.
This will be demonstrated in the examples of this section.
4.8.2 Examples - R
These examples use the airAccs.csv
data set.
We begin by using the same code as in the prior sections of this chapter to load the tidyverse, import the csv file, and rename the variables.
Note, the
lubridate
package is also loaded for use in these examples.library(tidyverse) library(lubridate)
airAccs_path <- file.path("..", "datasets", "airAccs.csv") air_accidents_in <- read_csv(airAccs_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
air_accidents_in <- air_accidents_in %>% rename( obs_num = 1, date = Date, plane_type = planeType, dead = Dead, aboard = Aboard, ground = Ground ) air_accidents <- air_accidents_in glimpse(air_accidents)
Observations: 5,666 Variables: 8 $ obs_num <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ... $ date <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19... $ location <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",... $ operator <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr... $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"... $ dead <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23... $ aboard <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23... $ ground <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Anonymous functions.
The tidyverse (
purrr
package) allows anonymous functions to be specified using what is known asformula
notation. The anonymous function begins with~
and the parameters are named.x
,.y
, and.z
(defined in this order.) Not all three parameters need to be used in the function. The function does not need to use any parameters. If there is only one parameter,.
can be used in place of.x
. For example,~(.x + .y)
This would add the first and second parameter together and return the results. Another example is,
~ if_else(. == -9, NA_character_, .)
This would test the rows of the first parameter for equality with
-9
. IfTRUE
, the row isNA
, otherwise it is unchanged. (This is equivalent to thena_if(x, -9)
function.)The
case_when()
function is a generalization ofif_else()
that can check more than one condition.Each condition is paired with a replacement value. The replacement value is specified as a function, possibly an anonymous function. If the replacement value is the same for all observations, an anonymous function with simply that value can be used.
If a condition is found to match, the remaining conditions are not checked for these rows. Using
TRUE
as the last condition provides a default value to use if none of the other conditions areTRUE
.We will change the
operator
values of?
andPrivate
toNA
using thecase_when()
function.# restore original values of the variables air_accidents <- air_accidents %>% mutate(operator = pull(air_accidents_in, operator)) air_accidents <- mutate( air_accidents, operator = case_when( operator == "?" ~ NA_character_, operator == "Private" ~ NA_character_, TRUE ~ operator ) ) air_accidents %>% filter(operator == "?") %>% select("location", "operator", "plane_type")
# A tibble: 0 x 3 # ... with 3 variables: location <chr>, operator <chr>, plane_type <chr>
In this example the
case_when()
function is used with only two test conditions. Therecode()
orna_if()
methods could have been used here. Thecase_when()
function was used to demonstrate its use.Operating on multiple columns.
The
mutate_at()
function is a variant ofmutate()
. It is used when you want to apply the same function to a set of columns.We will use
mutate_at()
to apply thena_if()
function to theoperator
andplane_type
columns. The columns to apply the function to are given using thevar()
function.The
na_if()
function needs two parameters, the column name and"?"
. We will create an anonymous function to do that for us. The anonymous function will be,~ na_if(., "?")
, where the.
, short for.x
the first parameter, is passed in from themutate_at()
function.# restore original values of the variables air_accidents <- air_accidents %>% mutate(operator = pull(air_accidents_in, operator)) air_accidents <- mutate_at(air_accidents, vars(operator, plane_type), ~ na_if(., "?")) air_accidents %>% filter(operator == "?" | plane_type == "?") %>% select("location", "operator", "plane_type")
# A tibble: 0 x 3 # ... with 3 variables: location <chr>, operator <chr>, plane_type <chr>
The
mutate_at()
method can pass additional parameters to the called function,na_if()
here. The following code uses this approach instead of the anonymous function approach.# restore original values of the variables air_accidents <- air_accidents %>% mutate( operator = pull(air_accidents_in, operator), plane_type = pull(air_accidents_in, plane_type) ) air_accidents <- mutate_at(air_accidents, vars(operator, plane_type), na_if, "?") air_accidents %>% filter(operator == "?" | plane_type == "?") %>% select("location", "operator", "plane_type")
The parameters to# A tibble: 0 x 3 # ... with 3 variables: location <chr>, operator <chr>, plane_type <chr>
mutate_at()
are,.tbl, .vars, .funs, ...
.- The
.tbl
parameter is the data frame being operated on. This parameter is typically piped intomutate_at()
. - The
.vars
parameter is the list of variable in the data frame to be passed to.funs
.
Thevars()
function is one of severalselect()
helper functions that can genereate lists of variables. - The
.funs
parameter is the name of the function to be called for each of the.vars
. It is the name only, with out()
.
- The
...
are a list of any additional parameter to be used by the.funs
function. They can be named or not. The first parameter tofuns.
is the variable. The...
will follow this parameter.
- The
4.8.3 Examples - Python
These examples use the airAccs.csv data set.
We begin by using the same code as in the prior sections of this chapter to load the packages, import the csv file, and rename the variables.
from pathlib import Path import pandas as pd import numpy as np
airAccs_path = Path('..') / 'datasets' / 'airAccs.csv' air_accidents_in = pd.read_csv(airAccs_path) air_accidents_in = ( air_accidents_in .rename( columns={ air_accidents_in.columns[0]: 'obs_num', 'Date': 'date', 'planeType': 'plane_type', 'Dead': 'dead', 'Aboard': 'aboard', 'Ground': 'ground'})) air_accidents = air_accidents_in.copy(deep=True) print(air_accidents.dtypes)
obs_num int64 date object location object operator object plane_type object dead float64 aboard float64 ground float64 dtype: object
Using anonymous functions
Anonymous functions in python are called
lambda
functions. They are specified as,lambda <parameter list>: code
For example,
lambda x, y: (x + y)
This would add the first and second parameters together and return the reslts. The calling function\method would need to supply the two parameters for this function.
A
lambda
function called inside of theassign()
method gets passed the data frame (that theassign()
method is modifying) as its only parameter. An example of alambda
function that could be called inside of anassign()
is,lambda df: df.loc[:, 'x'].mask(df.loc[:, 'x'] == -9, np.NaN)
The data frame parameter of the
lambda
function is nameddf
. Themask()
method is used in this example uses a test condition to determine which rows to replace with a specified value,np.NAN
here. This example tests the rows of thex
variable for equality with-9
. IfTrue
, the row is set tonp.NaN
, otherwise it is unchanged.# restore original values of the operator variable air_accidents.operator = air_accidents_in.operator air_accidents = ( air_accidents .assign(operator=lambda df: df['operator'].mask(df['operator'] == '?', np.NaN))) (air_accidents .query('operator == "?"') .copy() .loc[:, ["location", "operator", "plane_type"]] .head(n=10) .pipe(print))
Empty DataFrame Columns: [location, operator, plane_type] Index: []
Change a variable with multiple conditions
Python and Pandas do not have a switch, also called case when, type of function that can test multiple conditions. (This is a common control mechanism found in many programming languages.) If all the conditions are checks of equality, then
replace()
works. If one or more of the conditions is a range of values, then the solution is to use multiple comprehensions or multiplelambda
fuctions. Sequential comprehensions can be used to achive mostcase when
type functions. Nested comprehensions may sometimes be needed to implementcase when
orelse if
type conditions. Nest the comprehensions should only be used if sequential comprehensions will not accomplish what is needed.We will change the
operator
values of?
andPrivate
toNAN
using sequentialmask()
functions.# restore original values of the operator variable air_accidents['operator'] = air_accidents_in['operator'] air_accidents = ( air_accidents .assign( operator=(lambda df: df ['operator'] .mask(df.operator == '?', np.NaN) .mask(df.operator == 'Private', np.NaN)))) (air_accidents .query('operator == "?"') .copy() .loc[:, ["location", "operator", "plane_type"]] .head(n=10) .pipe(print))
Empty DataFrame Columns: [location, operator, plane_type] Index: []
Operating on multiple columns.
The
apply()
method will iterate over the columns of the data frame, by default operation. (Theapply()
method can also iterate over rows, if needed.) This can be used to do the same operation on multiple columns. There are two parts to coding this.One of the parts is specifying the function to be applied. In this example we will use a
lambda
function to change?
toNaN
. Theapply()
method is operating on columns. So the parameter to thelambda
function is a column of the data frame.The second part is determining the columns that the operation will be applied to. For this we will use a comprehension. The comprehension will compare each column name in the data frame to a set of names. If the name is in the set, the
lambda
function is run for the column. Otherwise the original column is returned unchanged.# restore original values of the variables air_accidents = ( air_accidents .assign( operator=air_accidents_in['operator'], plane_type=air_accidents_in['plane_type'])) air_accidents = ( air_accidents .apply( func=lambda x: x.mask(x == '?', np.NaN) if x.name in ['operator', 'plane_type'] else x)) (air_accidents .query('operator == "?" | plane_type == "?"') .copy() .loc[:, ["location", "operator", "plane_type"]] .head(n=10) .pipe(print))
Empty DataFrame Columns: [location, operator, plane_type] Index: []
4.8.4 Exercises
These exercises use the PSID.csv
data set
that was imported in the prior section.
Import the
PSID.csv
data set.Code
NA
s for theNA/DF
andno histories
values for the marriage status variable.Change the units on the
earnings
andhours
variables to be thousands of dollars or hours. Use a method that operates on multiple columns.Hint, to do the unit change on a variable
x
, one would dox / 1000
.