SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4.4 Dropping unneeded variables

4.4.1 Data Concepts - Removing unneeded variables

Data sets often contain variables that are not of interest for a project. Removing these unneeded variables makes it easier to focus on the variables that we need to wrangle.

There are two approaches to identifying the variables that you want to keep in a data frame. These approaches are inclusion, explicitly identifying which variables to retain, and exclusion, explicitly identifying the variables that are to be removed. Determining which of these two approaches to use depends on how easy it is define each of these two sets of variables. For example, if a data frame has over 15 variables and you want all of them except two, it is easier and clearer coding to list the two variables that are not needed.

Identifying what you want to operate on through inclusion and exclusion will be used in many of the later wrangling tasks.

4.4.2 Programming skills - Chaining/pipes

As a task gets more complex, the number of steps needed to complete the task goes up. These steps can produce a number of intermediate results that need to be passed from one step to the next. When an intermediate result will only be used by the following steps, saving and referencing the intermediate result can make your code more difficult to read. The tidyverse and Python both have operators that allow for intermediate results to be passed to following methods and functions without the need for explicitly naming the intermediate results.

R and Python have different forms for calling methods. Recall that Python uses

    object.method_name(parms)

and R uses

    method_name(object, parms).

Due to this difference in method calling approaches, there are different approaches to connecting method results together.

In the tidyverse two methods are connected using the pipe operator, %>%. (The pipe operator is used in a number of other programming languages, though the symbol for it is different.) This would be done as follows

    result <-
      object %>%
      method_1(parms_1) %>%
      method_2(parms_2)

In python the methods would be chained together as follows

    result = (object
              .method_1(parms_1)
              .method_2(parms_2))

The set of outer parenthesis in the Python example makes it easier to split the code over multiple lines.

In both the tidyverse and Python example, method_1 is run and the result from method_1 is used as the object for method_2 to work on.

Both of these method-chaining approaches place the most important information about what is being operated on and the operations that are being done on the left side of the page. The details of what is being done are found further to the right on the page. This is considered easier to read.

4.4.3 Examples - R

These examples use the airAccs.csv data set.

  1. We begin by using the same code as in the prior section to loading packages, import the csv file, and rename the variables.

    library(tidyverse)
    airAccs_path <- file.path("..", "datasets", "airAccs.csv")
    air_accidents_in <- read_csv(airAccs_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    air_accidents_in <-
      rename(
        air_accidents_in,
        obs_num = 1,
        date = Date,
        plane_type = planeType,
        dead = Dead,
        aboard = Aboard,
        ground = Ground
        )
    
    air_accidents <- air_accidents_in
    
    glimpse(air_accidents)
    Observations: 5,666
    Variables: 8
    $ obs_num    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
    $ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

4.4.3.1 Cleaning - Drop variables

The select() method is used to reduce the number of columns and order the columns. The first parameter to the select() method is the data frame on which to select. The remaining parameters are used to identify the columns to be included or excluded. The negation sign, "-", preceding a variable name indicates that the variable is to be excluded.

  1. We will use select() to keep all the variables except the obs_num.

    air_accidents <-
      select(air_accidents, date, plane_type, dead, aboard, ground)
    
    glimpse(air_accidents)
    Observations: 5,666
    Variables: 5
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

    The variables will be ordered in the data set as they are given in select().

  2. Using exclusion to keep all the variables except the obs_num.

    air_accidents <- air_accidents_in
    air_accidents <- select(air_accidents_in, -obs_num)
    
    glimpse(air_accidents)
    Observations: 5,666
    Variables: 7
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
    $ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

    Typically one uses either inclusion or exclusion to select the variables of a data frame, but not both.

  3. Ordering the variables using select(). The dead column will be made the first variable in the data frame.

    This example makes use of the everything() function. This is one of the helper functions of select(). See the documentation for select for the other helper functions.

    air_accidents <-
      select(air_accidents, dead, everything())
    
    glimpse(air_accidents)
    Observations: 5,666
    Variables: 7
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
    $ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
  4. This example uses pipes to rename and select the variables.

    Note there are comments in this code that provide one possible way to read the code.

    library(tidyverse)
    # The comments in this block of code are to show you how the code might
    # be read.  Theis kind of comments should not be put in real code.
    
    airAccs_path <- file.path("..", "datasets", "airAccs.csv")
    air_accidents_in <- read_csv(airAccs_path, col_types = cols())
    air_accidents_in <-           # air_accidents_in is the result of
      air_accidents_in %>%        # taking the air_accident_in data, then
      rename(                     # renaming the following variables
        obs_num = 1,
        date = Date,
        plane_type = planeType,
        dead = Dead,
        aboard = Aboard,
        ground = Ground
        )
    
    air_accidents <-              # air_accidents is the result of
      air_accidents_in %>%        # taking the air_accident_in data, then
      select(-obs_num)            # selecting all of the variables except obs_num
    
    glimpse(air_accidents)
    Observations: 5,666
    Variables: 7
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
    $ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

The tidyverse includes support for a compound assignment operator, %<>%. Its use would replace the assignment operator. This is not recommend by Hadley Wickham (architect of the tidyverse.) So while you will see examples of this on the web, I recommend avoiding its use.

4.4.4 Examples - Python

These examples use the airAccs.csv data set.

  1. We begin by using the same code as in the prior section to load the packages, import the csv file, and rename the variables.

    import pandas as pd
    import os
    airAccs_path = os.path.join('..', 'datasets', 'airAccs.csv')
    air_accidents_in = pd.read_csv(airAccs_path)
    air_accidents_in.rename(
        columns={
            'Unnamed: 0': 'obs_num',
            'Date': 'date', 
            'planeType': 'plane_type',
            'Dead': 'dead',
            'Aboard': 'aboard',
            'Ground': 'ground'},
        inplace=True)
    
    air_accidents = air_accidents_in.copy(deep=True)
    print(air_accidents.dtypes)
    obs_num         int64
    date           object
    location       object
    operator       object
    plane_type     object
    dead          float64
    aboard        float64
    ground        float64
    dtype: object

4.4.4.1 Cleaning - Drop variables

The loc[] attribute is used to reduce the number of columns and order the columns. (Note the use of square brackets with this attribute.) The loc[] attribute can be used on both rows and columns, loc[<rows>, <columns>]. We are using only columns in these examples. The row index will be set to : to return all rows. The : operator will be explained further in a following section.

  1. We will use loc[] to keep all the variables except the obs_num.

    air_accidents = air_accidents.loc[:, 
        ['date', 'plane_type', 'dead', 'aboard', 'ground']]
    
    print(air_accidents.dtypes)
    date           object
    plane_type     object
    dead          float64
    aboard        float64
    ground        float64
    dtype: object

    Note that square brackets, [ ], are used inside of the loc[] attribute to provide a set of columns.

    The assignment line of code has been split across two physical lines. It is recommended to keep the length of Python lines to 72 character. The line can be split as it is here due to the split being done inside of the square brackets. The open square bracket lets Python know that there is more to this line of code and it then ignores the end of the physical line.

  2. The drop() method will drop variables from a data frame. We will use it to exclude the obs_num variable.

    air_accidents = air_accidents_in.copy(deep=True)
    air_accidents.drop(columns='obs_num', inplace=True)
    
    print(air_accidents.dtypes)
    date           object
    location       object
    operator       object
    plane_type     object
    dead          float64
    aboard        float64
    ground        float64
    dtype: object

    The drop() method can be given a set of columns to drop in the same way as was done above with loc[], by including them in [ ].

    Typically one uses either inclusion to keep variables or exclusion to drop variables, but not both.

  3. Ordering the variables using loc[].

    The dead column will be made the first variable in the data frame.

    air_accidents = air_accidents.loc[:,
        ['dead', 'date', 'plane_type', 'aboard', 'ground']]
    
    print(air_accidents.dtypes)
    dead          float64
    date           object
    plane_type     object
    aboard        float64
    ground        float64
    dtype: object
  4. This example uses chained methods to display the column names.

    In prior sections the column index was saved as air_accidents_columns. Then the value attribute was applied to air_accidents_columns. Both columns and values are attributes and not methods. Chaining applies to attributes as well as methods. Here we retreive the name values without using an intermidiate named object.

    print(air_accidents.columns.values)
    ['dead' 'date' 'plane_type' 'aboard' 'ground']

    Chained methods are often written across several lines of code. This helps keep the line length below the recommended length of 72 characters and allows the eye to quckly see all the methods being used. This improves the readability of the code. Two accepted mthods to split the chained methods across multiple lines are shown below.

    print(
        air_accidents
            .columns
            .values)
    ['dead' 'date' 'plane_type' 'aboard' 'ground']

    Chaining is a useful tool that generally makes code more readable. There are limits to its use. One of them is with the inplace equals True. There is no object returned from a method with inplace=True and as such nothing to chain. Chaining and inplace changes are typically not used together. Another is chaining of subsetting methods (subsetting is discussed in the next section.) This is called chained indexing and is to be avoided.

4.4.5 Exercises

These exercises use the PSID.csv data set that was imported in the prior section.

  1. Import the PSID.csv data set.

  2. Drop the first variable in the data frame. You may have renamed it after it was loaded.

  3. Make the age variable the first variable in the data frame.