Page not found – Social Science Computing Cooperative – UW–Madison

It looks like nothing was found at this location. Maybe try a search?

3 Data Wrangling

This chapter is under construction.

3.1 Overview

This chapter provides an introduction to the following.

  • Data frame structure and how this relates to writing code

  • The pipe operator and how use it effectively

  • How to creating and modifying variables

  • Summarizing data

  • Conditionally and unconditionally subsetting a data frame

  • Conditional operations based on group inclusion

  • Reshaping data frames

3.2 Introduction

Data wrangling is the act of preparing data for further analysis. This definition requires a definition of what kind of data and what needs to be done to the data. We review these two questions before beginning the descriptions.

In this chapter data is structured data in tables, we refer to these as data frames. Date frames structure data in rows and columns This is not the only form data can take. Data can be structure as records. This is how data is structured in SQL. Record formatted data and data frames are similar in a number of ways. Their main differences are in how the data Is stored and modified. Data can also be unstructured. Text data is a common example of unstructured data. The position of a word within a text, paragraph, or sentences tells one little of how the word relates to the other words. This chapter only covers wrangling data frames.

What needs to be done varies from data frame to data frame and analysis to analysis. This chapter covers a set of wrangling tools that are applicable to many wrangling needs. It is by no means a complete set of wrangling tools. It is a quick introduction the structured approached to wrangling that is provided by the tidyverse.

Note, this chapter covers some of the same functions as the Introduction to the tidyverse. The focus in this chapter is using the tidyverse to wrangle data. Where as the focus of the Introduction to the tidyverse chapter is the functions.

Example

  1. Loading the tidyverse package

    library(tidyverse)

3.3 Data frames

A data frame in the tidyverse is called a tibble. A tibble contains a set of columns, that are called variables. The data of a data frame are located in these variables. The variables of a data frame are organized by an index. The index provides the order of the variables and any names that are given to the variables. The preferred method of referencing the variables of a data frame Is by their names.

A tibble can be thought of as a list variables. This list of variables is the column index of the data frame. See the figure below. In this figure, name 1, name 2, and name k are the column names contained in the index.

Column index of a data frame

Figure 3.1: Column index of a data frame

The figure above shows that the variables of a data frame are only connected by the index. An observation (row) is made up of the values in the same row of the variables of the tibble.

A data frame can be imported from a variety of sources, delimited file, excel, SQL, etc. The tidyverse has readers for the common types of data files. These reader all share a common set of parsers, a function that formats the data of a variable. This chapter will focus on reading .csv files, a form of delimited files.

Example

  1. Importing a .csv file.

    forbes_in <-
      read_csv(
        file.path("datasets", "Forbes2000.csv"),
        col_types = cols()
      )
    Warning: Missing column names filled in: 'X1' [1]
    glimpse(forbes_in)
    Rows: 2,000
    Columns: 9
    $ X1          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21~
    $ rank        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21~
    $ name        <chr> "Citigroup", "General Electric", "American Intl Group", "ExxonMobil", "BP~
    $ country     <chr> "United States", "United States", "United States", "United States", "Unit~
    $ category    <chr> "Banking", "Conglomerates", "Insurance", "Oil & gas operations", "Oil & g~
    $ sales       <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.33, 135.82, 53.13, 256.33~
    $ profits     <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7.99, 6.48, 9.05, 5.15, 4.~
    $ assets      <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, 757.60, 171.71, 1019.17,~
    $ marketvalue <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 177.96, 115.40, 76.84, 24~

    The col_type parameter is used to define which parser will be used for each variable in the the data frame. Using cols() as the value for col_type tells the tidyverse to use its best guess for the type. This is useful for explority work with a data set that is not well defined. For production code, you should define the type for each variable in the col_types parameter.

3.4 Pipe operator

The pipe operator, %>%, passes an object to a function as the first parameter. The function call,

    function_name(data_object, other_parameters)

becomes,

    data_object %>% function_name(other_parameters)

With the pipe operator.

The pipe operator reduces the coding load of saving intermediate results that will only be referencing in next line of code. This reduction in managing intermediate results can make your code easier to read.

3.4.1 Examples

  1. Base R

    The following code creates a vector of 15 numeric values. This vector is then rounded to two significant digits, sorted in descending order, and then head() displays a few of the largest values.

    set.seed(749875)
    number_data <- runif(n = 15, min = 0, max = 1000)
    
    head(sort(round(number_data, digit = 2), decreasing = TRUE))
    [1] 997.62 813.26 797.96 733.98 732.67 675.45

    To read the above base R code, one reads from the inner most parenthises to the outer most. This nesting of functions can make reading base R code challenging.

    Another base R approach that avoids deeply nesting functions is to save the intermediate results. The intermediate results are then used in the next function as a separate command.

    number_round <- round(number_data, digit = 2)
    number_sort <- sort(number_round, decreasing = TRUE)
    head(number_sort)
    [1] 997.62 813.26 797.96 733.98 732.67 675.45

    This is also a more natural order of the functions. It does require the intermediate results to be saved. These intermediate results are only used by the function on the next line.

  2. Using the pipe operator

    The pipe operator allow the order of the data and functions to more closely match the order they are evaluated, without needing to save the intermediate results.

    number_data %>%
      round(digits = 2) %>%
      sort(decreasing = TRUE) %>%
      head()
    [1] 997.62 813.26 797.96 733.98 732.67 675.45

    This coding style places the most important information about what is being operated on and the operations that are being done on the left side of the page. The details of what is being done are found further to the right on the page. This is considered easier to read code.

The following are a few caveats on the use of the pipe operator.

  • If the pipe operator does not enhance the clarity of the code, use normal parameter passing. An example of this is the pull() function. Writing df %>% pull(var) may not be considered an improvement on pull(df, var).

  • Saving intermediate values can sometime make your code more meaningful. This occurs if you have a longer chain of piped functions and purpose of the chain is not obvious. Then breaking the chain with an assignment to an intermediate variable with a well chosen name can be helpful.

3.5 Column index

The two common column operations are renaming columns, rename(), and selecting columns, select(). The select() function has a number of helper functions that make it easier to select a set of columns, such as, starts_with(), ends_with(), contains(), everything() and the slice operator.

Examples

  1. Renaming the variables of the forbes data.

    The new names are given as the parameter names and the old names as the parameter variables. Neither of these names needs to be quoted, as long as there are no spaces in a name. Multiple variable renames can be listed in the parameters to rename().

    forbes_in <-
      forbes_in %>%
      rename(
        market_value = marketvalue
        )
  2. Dropping variables.

    Variables are removed from a data frame by using the negation symbol before the variable name.

    forbes <-
      forbes_in %>%
      select(-X1)
  3. Selecting variables.

    The slice operator is used in this example to select the columns starting with the company name through the sales variable.

    forbes %>%
      select(
        name:sales
        ) %>%
      glimpse()
    Rows: 2,000
    Columns: 4
    $ name     <chr> "Citigroup", "General Electric", "American Intl Group", "ExxonMobil", "BP", ~
    $ country  <chr> "United States", "United States", "United States", "United States", "United ~
    $ category <chr> "Banking", "Conglomerates", "Insurance", "Oil & gas operations", "Oil & gas ~
    $ sales    <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.33, 135.82, 53.13, 256.33, 4~

    Here the forbes data frame was not changed. The modified data frame was only diplayed.

  4. Reordering variables.

    The order of the variables of a data frame will match the order of the variables in select().

    This exmaple reorders the variables to put the name, market_value, and country vraibles first in the data frame. Rather than list all the remaining variables, the everything() helper function is used. This list the remaining variable.

    forbes <-
      forbes %>%
      select(
        name,
        market_value,
        country,
        everything()
        )
    
    glimpse(forbes)
    Rows: 2,000
    Columns: 8
    $ name         <chr> "Citigroup", "General Electric", "American Intl Group", "ExxonMobil", "B~
    $ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 177.96, 115.40, 76.84, 2~
    $ country      <chr> "United States", "United States", "United States", "United States", "Uni~
    $ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 2~
    $ category     <chr> "Banking", "Conglomerates", "Insurance", "Oil & gas operations", "Oil & ~
    $ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.33, 135.82, 53.13, 256.3~
    $ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7.99, 6.48, 9.05, 5.15, 4~
    $ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, 757.60, 171.71, 1019.17~

3.6 Creating/changing variables

The mutate() function is used to modify a variable or recreate a new variable. Variable are changed by using the name of variable as a parameter and the parameter value is set to the new variable. The data frame is typically piped in and the data frame name is not needed when referencing the variable names. A new variable is create when a parameter name is used that is not an existing variable of the data frame.

The examples in this section also demonstrate a number of useful functions to create and inspect variables.

Examples

  1. Creating a new variable.

    This example uses a simple mathematical formula to create a new variable based on the value of two other variables.

    forbes <-
      forbes %>%
      mutate(
        pe = market_value / profits
        )
    
    forbes %>%
      pull(pe) %>%
      summary()
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    -377.00   11.67   18.50     Inf   28.66     Inf       5 

    The pull() function returns a vector from a data frame. The base R summary() function calculates the five number summary of a numerical vector.

  2. Conditionally changing the values of a variable.

    This example demonstrates two functions that can be used to change the values of variable based on the condition of the variable (or possibly another variable.)

    The following code checks to see if profits is a (strictly) positive number. It it is it calculates the price to earnings ratio. Otherwise it set the price to earning ratio to zero.

    The if_else() function takes three parameters. The first is the condition. When the row of the condition is TRUE, the true value will be used, the second parameter. Otherwise the false value will be used, the third parameter. The true and false parameters can be either a column or a scalar value.

    forbes <-
      forbes %>%
      mutate(
        pe = if_else(profits > 0, market_value / profits, 0)
        )
    
    forbes %>%
      pull(pe) %>%
      summary()
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
       0.00   11.53   18.28   26.08   28.42  758.00       5 

    The folowing code uses the recode() function to change values of 0 to NA. The recode() function does a test of equality to the parameter name. If the valus of the variable equals the parameter name value, this value is replace with the parameter value.

    Numbers are not valid variable and parameter names. To get R to accept the value 0 as a parameter name it is enclosed in backticks, `.

    The tidyverse has more type checking than base R. We need to use the NA_real_ value to match the type of the pe variable.

    forbes <-
      forbes %>%
      mutate(
        pe = recode(pe, `0` = NA_real_)
        )
    
    forbes %>%
      pull(pe) %>%
      summary()
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
      0.1101  14.6706  20.6875  30.5115  30.9231 758.0000      295 
  3. Creating indicator variables.

    The %in% operator is used to determine if the set of values on the left is in the set of values on the right. It returns a boolean, TRUE or FALSE.

    In this example the %in% operator is used to determine if each of the countries is one of the NAFTA countries.

    forbes <-
      forbes %>%
      mutate(
        nafta = country %in% c("United States", "Canada", "Mexico")
      )
    
    forbes %>%
      pull(nafta) %>%
      summary()
       Mode   FALSE    TRUE 
    logical    1176     824 

    The base R summary() function counts the number of TRUE and FALSE values in the variable.

  4. Creating a factor variable.

    Factor variables are used to represent categorical variables, a variable that takes on a fixed set of values. Indictor variable are a special case of factor variable, having two values, TRUE and FALSE.

    The following code uses the base R factor() function to change the country variable from character to a factor variable.

    forbes <-
      forbes %>%
      mutate(
        country = factor(country)
      )
    
    forbes %>%
      pull(country) %>%
      summary()
                          Africa                    Australia    Australia/ United Kingdom 
                               2                           37                            2 
                         Austria                      Bahamas                      Belgium 
                               8                            1                            9 
                         Bermuda                       Brazil                       Canada 
                              20                           15                           56 
                  Cayman Islands                        Chile                        China 
                               5                            4                           25 
                  Czech Republic                      Denmark                      Finland 
                               2                           10                           11 
                          France       France/ United Kingdom                      Germany 
                              63                            1                           65 
                          Greece              Hong Kong/China                      Hungary 
                              12                           20                            2 
                           India                    Indonesia                      Ireland 
                              27                            7                            8 
                         Islands                       Israel                        Italy 
                               1                            8                           41 
                           Japan                       Jordan                   Kong/China 
                             316                            1                            4 
                           Korea                      Liberia                   Luxembourg 
                               4                            1                            2 
                        Malaysia                       Mexico                  Netherlands 
                              16                           17                           28 
     Netherlands/ United Kingdom                  New Zealand                       Norway 
                               2                            1                            8 
                        Pakistan       Panama/ United Kingdom                         Peru 
                               1                            1                            1 
                     Philippines                       Poland                     Portugal 
                               2                            1                            7 
                          Russia                    Singapore                 South Africa 
                              12                           16                           15 
                     South Korea                        Spain                       Sweden 
                              45                           29                           26 
                     Switzerland                       Taiwan                     Thailand 
                              34                           35                            9 
                          Turkey               United Kingdom    United Kingdom/ Australia 
                              12                          137                            1 
     United Kingdom/ Netherlands United Kingdom/ South Africa                United States 
                               1                            1                          751 
                       Venezuela 
                               1 

    The base R summary() function counts the number of occurances of each level for factor variables.

    The following code uses the base R cut() function to convert a numeric variable to factor variable. You define a set bins to sort the values into. This is done using the break parameter. These breaks form the upper and lower limits of each bin. The names given to each of these bins is set by the labels parameter.

    The following code uses cut() to divide profits into low, mid, high, and very high bins.

    forbes <-
      forbes %>%
      mutate(
        profit_lev = cut(profits, 
                       breaks = c(-Inf, .08, .44, 10, Inf),
                       labels = c("low", "mid", "high", "very high")
                       )
      )
    
    forbes %>%
      pull(profit_lev) %>%
      summary()
          low       mid      high very high      NA's 
          501       999       489         6         5 
  5. Combining categories

    Categories of string variables can be combined using recode() and if_else(). This can result in a fair amount of repitition in code when there are a number of values that need to be grouped. This example used case_when() to recode the categories of the category variable into four new categories. The conditions are checked in order. The TRUE condition serves as the else condition. It is what will be done if none of the prior conditions are TRUE.

    forbes <-
      forbes %>%
      mutate(
        industry = 
          case_when(
            category %in% c("Banking", "Insurance", "Diversified financials") ~
              "finance",
            category == "Oil & gas operations" ~
              "oil_gas",
            category %in% c("Technology hardware & equipment", "Semiconductors",
                            "Drugs & biotechnology", "Software & services") ~
              "tech",
            TRUE ~
              "other"
          )
         )
    
    forbes %>%
      select(name, category, industry) %>%
      head()
    # A tibble: 6 x 3
      name                category             industry
      <chr>               <chr>                <chr>   
    1 Citigroup           Banking              finance 
    2 General Electric    Conglomerates        other   
    3 American Intl Group Insurance            finance 
    4 ExxonMobil          Oil & gas operations oil_gas 
    5 BP                  Oil & gas operations oil_gas 
    6 Bank of America     Banking              finance 

3.7 Summarizing data

There are a couple of ways that summary statics are used when wrangling data. One of them is to generate tables of the summary statistics. Other is to use them to calculate new variables.

Examples

  1. Summary table.

    This example uses summarise() to create a table of summary statistics for the profits variable. The summarise() function returns a tibble with a column for each summary statistic it calculates.

    forbes_summary <-
      forbes %>%
      summarise(
        `profits-mean` = mean(profits, na.rm = TRUE),
        `profits-sd` = sd(profits, na.rm = TRUE),
        `profits-1q` = quantile(profits, prob = .25, na.rm = TRUE),
        `profits-3q` = quantile(profits, prob = .75, na.rm = TRUE)
      )
    
    forbes_summary
    # A tibble: 1 x 4
      `profits-mean` `profits-sd` `profits-1q` `profits-3q`
               <dbl>        <dbl>        <dbl>        <dbl>
    1          0.381         1.77         0.08         0.44
  2. Calculating with summary statistics

    This example calculate ths same mean and standard deviation of profits as the prior example. Rather than use summarise() to create a table with these values, this example calculates a z-score for profits with the summary statistics.

    forbes <-
      forbes %>%
      mutate(
        profits_std = (profits - mean(profits, na.rm = TRUE)) / sd(profits, na.rm = TRUE)
      )
    
    forbes %>%
      pull(profits_std) %>%
      summary()
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
    -14.84668  -0.17057  -0.10260   0.00000   0.03334  11.65642         5 
    outlier_bounds <-
      forbes_summary %>%
      mutate(
        iqr = `profits-3q` - `profits-1q`,
        lower_bounds = `profits-1q` - iqr,
        upper_bounds = `profits-3q` + iqr
        )
    
    forbes <-
      forbes %>%
      mutate(
        outlier =
          profits < pull(outlier_bounds, lower_bounds) |
          profits > pull(outlier_bounds, upper_bounds)
      )
    
    forbes %>%
      pull(outlier) %>%
      summary()
       Mode   FALSE    TRUE    NA's 
    logical    1578     417       5 
  3. Proportion of observations

    The outlier indicator variable (created in the prior example) can be used to determine the proportion of the companies that have pofit values that are outlier to the distribution of profits.

    forbes %>%
      summarise(
        outlier_proportion = mean(outlier, na.rm = TRUE)
      )
    # A tibble: 1 x 1
      outlier_proportion
                   <dbl>
    1              0.209

3.8 Selecting rows

Rows are typically selected based on some condition in the data, as apposed to by name as columns typically are. The filter() function takes a Boolean variable and removes rows (from all columns) that are FALSE, keeping only the rows with TRUE.

Examples

  1. Dropping observations (rows.)

    This example uses filter() to create a subset data frame containing the 1000 largest companies.

    forbes_1000 <-
      forbes %>%
      filter(
        rank >= 1000
        )
    
    forbes_1000 %>%
      head() %>%
      print(10)
    # A
    #   tibble:
    #   6 x
    #   14
      name    
      <chr>   
    1 Northea~
    2 Korea G~
    3 MOL     
    4 First A~
    5 Sumitom~
    6 Hibernia
    # ...
    #   with
    #   13
    #   more
    #   variables:
    #   market_value <dbl>,
    #   country <fct>,
    #   rank <dbl>,
    #   category <chr>,
    #   sales <dbl>,
    #   profits <dbl>,
    #   assets <dbl>,
    #   pe <dbl>,
    #   nafta <lgl>,
    #   profit_lev <fct>,
    #   industry <chr>,
    #   profits_std <dbl>,
    #   outlier <lgl>
  2. Conditional examination of the data.

    forbes %>%
      select(name, country, rank, market_value, nafta) %>%
      filter(nafta)
    # A tibble: 824 x 5
       name                country        rank market_value nafta
       <chr>               <fct>         <dbl>        <dbl> <lgl>
     1 Citigroup           United States     1        255.  TRUE 
     2 General Electric    United States     2        329.  TRUE 
     3 American Intl Group United States     3        195.  TRUE 
     4 ExxonMobil          United States     4        277.  TRUE 
     5 Bank of America     United States     6        118.  TRUE 
     6 Fannie Mae          United States     9         76.8 TRUE 
     7 Wal-Mart Stores     United States    10        244.  TRUE 
     8 Berkshire Hathaway  United States    14        141.  TRUE 
     9 JP Morgan Chase     United States    15         81.9 TRUE 
    10 IBM                 United States    16        172.  TRUE 
    # ... with 814 more rows
  3. Conditional proportion

    The filter() function can be used to calculate a proportion conditional some state of the data.

    Here we will recalculate the proportion of outlier profits conditional on being based in a NAFTA country.

    forbes %>%
      filter(
        nafta
        ) %>%
      summarise(
        outlier_proportion = mean(outlier, na.rm = TRUE)
      )
    # A tibble: 1 x 1
      outlier_proportion
                   <dbl>
    1              0.229

3.9 Working with groups

The group_by() function conceptually divides a data frame into a set of data frames. Function applied to a grouped data frame will operate separately on each group of the data. The ungroup() function return a data from from a grouped data frame. The group_by() and ungroup() functions are typically used in pairs. That is, when the work that relies on grouping is complete there should be an ungroup(). An exception to this is when a summarizing operation is done on the data frame, there typically is no need to ungroup().

Examples

  1. Calculate proportions for each group

    Here we will recalculate the proportion of outlier profits conditional on each level of the nafta variable.

    forbes %>%
      group_by(
        nafta
        ) %>%
      summarise(
        outlier_proportion = mean(outlier, na.rm = TRUE)
      )
    # A tibble: 2 x 2
      nafta outlier_proportion
      <lgl>              <dbl>
    1 FALSE              0.195
    2 TRUE               0.229
  2. Summarising with multiple grouping variables

    nafta_industries <-
      forbes %>%
      group_by(nafta, industry) %>%
      summarise(mean = mean(market_value))
    `summarise()` has grouped output by 'nafta'. You can override using the `.groups` argument.
    nafta_industries
    # A tibble: 8 x 3
    # Groups:   nafta [2]
      nafta industry  mean
      <lgl> <chr>    <dbl>
    1 FALSE finance   8.23
    2 FALSE oil_gas  23.4 
    3 FALSE other     8.80
    4 FALSE tech     19.8 
    5 TRUE  finance  14.1 
    6 TRUE  oil_gas  17.1 
    7 TRUE  other    11.7 
    8 TRUE  tech     30.9 

    The returned data frame in this example has two columns that identify the sub-group that has been summarized.

  3. Find largest values within groups

    Here we the largest profits in each industry.

    This example makes use of the rank() function. This function returns an order number for each observation based on the assending value of the variable being ranked.

    This example also makes use of the arrange() function to sort the data frame.

    forbes <-
      forbes %>%
      group_by(
        industry
        ) %>%
      mutate(
        profit_rank = rank(desc(profits))
      ) %>%
      ungroup
    
    forbes %>%
      filter(
        profit_rank <= 5
        ) %>%
      arrange(industry, desc(profits)) %>%
      select(name, country, industry, profits, market_value)
    # A tibble: 20 x 5
       name                    country                     industry profits market_value
       <chr>                   <fct>                       <chr>      <dbl>        <dbl>
     1 Citigroup               United States               finance    17.8         255. 
     2 Bank of America         United States               finance    10.8         118. 
     3 Freddie Mac             United States               finance    10.1          44.2
     4 Berkshire Hathaway      United States               finance     6.95        141. 
     5 HSBC Group              United Kingdom              finance     6.66        178. 
     6 ExxonMobil              United States               oil_gas    21.0         277. 
     7 BP                      United Kingdom              oil_gas    10.3         174. 
     8 Total                   France                      oil_gas     8.84        117. 
     9 Royal Dutch/Shell Group Netherlands/ United Kingdom oil_gas     8.4         163. 
    10 ChevronTexaco           United States               oil_gas     7.43         92.5
    11 General Electric        United States               other      15.6         329. 
    12 Altria Group            United States               other       9.2         111. 
    13 Wal-Mart Stores         United States               other       9.05        244. 
    14 Toyota Motor            Japan                       other       7.99        115. 
    15 SBC Communications      United States               other       5.97         82.9
    16 Microsoft               United States               tech        8.88        287. 
    17 IBM                     United States               tech        7.58        172. 
    18 Merck & Co              United States               tech        7.33        109. 
    19 Johnson & Johnson       United States               tech        6.74        161. 
    20 GlaxoSmithKline         United Kingdom              tech        6.34        125. 

    The above code creates a rank variable. Then created the display based on this rank variable. If there is no need for a rank variable, the following code could be used. This code uses the top_n() function to filter on the largest values of a variable.

    forbes %>%
      group_by(
        industry
        ) %>%
      top_n(
        n = 5,
        wt = profits
      ) %>%
      arrange(industry, desc(profits)) %>%
      select(name, country, industry, profits, market_value)
    # A tibble: 20 x 5
    # Groups:   industry [4]
       name                    country                     industry profits market_value
       <chr>                   <fct>                       <chr>      <dbl>        <dbl>
     1 Citigroup               United States               finance    17.8         255. 
     2 Bank of America         United States               finance    10.8         118. 
     3 Freddie Mac             United States               finance    10.1          44.2
     4 Berkshire Hathaway      United States               finance     6.95        141. 
     5 HSBC Group              United Kingdom              finance     6.66        178. 
     6 ExxonMobil              United States               oil_gas    21.0         277. 
     7 BP                      United Kingdom              oil_gas    10.3         174. 
     8 Total                   France                      oil_gas     8.84        117. 
     9 Royal Dutch/Shell Group Netherlands/ United Kingdom oil_gas     8.4         163. 
    10 ChevronTexaco           United States               oil_gas     7.43         92.5
    11 General Electric        United States               other      15.6         329. 
    12 Altria Group            United States               other       9.2         111. 
    13 Wal-Mart Stores         United States               other       9.05        244. 
    14 Toyota Motor            Japan                       other       7.99        115. 
    15 SBC Communications      United States               other       5.97         82.9
    16 Microsoft               United States               tech        8.88        287. 
    17 IBM                     United States               tech        7.58        172. 
    18 Merck & Co              United States               tech        7.33        109. 
    19 Johnson & Johnson       United States               tech        6.74        161. 
    20 GlaxoSmithKline         United Kingdom              tech        6.34        125. 

3.10 Reshaping

Data frames may not be shaped as we need them. For example, a data frame could contain one or more variables for each year of a multi-year study. This would result in the information about year being in the variable name. This is called wide form since the repeating of the data is done with multiple columns. Alternatively the information about the year could be in a variable and the data from each of the years would be in a single set of variables. This is called long form since the repeating for the years is done through multiple rows.

Reshaping the data from wide to long or long to wide can be visualized In the following graphic. To go from long to wide, a variable is identified as containing the information what the name of the repeated variable is, here that variable is key. Each unique value of the key variable will result in a variable of that name in wide form. Another variable is identified as containing the values that will be used to construct the new variables, here that variable is val. The key variable has 3 unique values. This results in key and val variables being transformed into three new variables, A, B, and C.

Conceptual representation of reshaping a data frame

Figure 3.2: Conceptual representation of reshaping a data frame

To go from wide to long form the set of wide variables needs to be identified, here that would be, A, B, and C. Two new variables will be created from these wide variables. One of these two new variables will contain the names of the of the wide variables, here this is the key variable. The other new variable will contain the values from the wide variables, here this is the val variable. The values of the key variable tell us which wide variable the val values are from. For example, in the first data row of the long form the A in key tells us that 9 is from the A variable.

Examples

  1. Reshaping to wide

    nafta_industries_tbl <-
      nafta_industries %>%
      spread(key = industry, value = mean)
    
    nafta_industries_tbl
    # A tibble: 2 x 5
    # Groups:   nafta [2]
      nafta finance oil_gas other  tech
      <lgl>   <dbl>   <dbl> <dbl> <dbl>
    1 FALSE    8.23    23.4  8.80  19.8
    2 TRUE    14.1     17.1 11.7   30.9
  2. Reshaping to long

      nafta_industries_tbl %>%
      gather(
        key = industry,
        value = mean,
        finance, oil_gas, other,  tech
        )
    # A tibble: 8 x 3
    # Groups:   nafta [2]
      nafta industry  mean
      <lgl> <chr>    <dbl>
    1 FALSE finance   8.23
    2 TRUE  finance  14.1 
    3 FALSE oil_gas  23.4 
    4 TRUE  oil_gas  17.1 
    5 FALSE other     8.80
    6 TRUE  other    11.7 
    7 FALSE tech     19.8 
    8 TRUE  tech     30.9 

3.11 Getting help

Help is available in a variety of places. What follows is some approaches to looking for help.

You may need help in knowing how to accomplish something. That is, you do not know what functions/methods to use or maybe the steps needed. A good place to start here is the cheat sheets for the tidyverse. These can be scanned quickly to see what is provided by the package. What you need to do may already be directly implemented. You may also notice functions/methods that do part of what is needed and leave you something smaller that you do not know how to do. If you do not find what you need on the cheat sheets, goggling is your best option. We suggest starting the goggle search with tidyverse and follow this with what you are trying to do. This may not get you the help you need if there is a technical name for something that you do not know. In this case you may have to read through several of the initial responses to see how others talk about doing what you are doing. This is a good way to learn more of the lingo of programming and wrangling. Additional goggle searches can be done based on the use of new key words you see in the initial responses. You can also ask a question on help sites such as stackexchange or stackoverflow. You will likely have seen these kind of sites in response to your initial queries.

When you know how you want to accomplish something but do not remember the function/method to use, the cheat sheets would be a good place to start. If you do not find what you need on the cheat sheet, reviewing the table of contents of this book my help you find what you need. You can also goggle tidyverse followed by what you want to do.

When you know the name of the function/method you want to use and need help with what the parameters are or details of how the function/method works, the function documentation is a good place to start. This documentation can be found by goggling tidyverse followed by the name of the function or method you want help with.

If you are an SSCC member, you can also send a question to the help desk or come in and see one of the consultants for help if you are not able to resolve the issue for yourself. Information on getting help from the SSCC can be found at the SSCC website.

Page not found – Social Science Computing Cooperative – UW–Madison

It looks like nothing was found at this location. Maybe try a search?