3.7 Summarizing data

SSCC - Social Science Computing Cooperative

Supporting Statistical Analysis for Research

Summary statistics are used in a couple of ways when wrangling data. One is to generate tables of the summary statistics. The other is to use them to calculate new variables.

Examples

Summary table

This example uses summarise() to create a table of summary statistics for the profits variable. The summarise() function returns a tibble with a column for each summary statistic it calculates.

forbes_summary <-
  forbes %>%
  summarise(
    `profits-mean` = mean(profits, na.rm = TRUE),
    `profits-sd` = sd(profits, na.rm = TRUE),
    `profits-1q` = quantile(profits, probs = .25, na.rm = TRUE),
    `profits-3q` = quantile(profits, probs = .75, na.rm = TRUE)
  )

forbes_summary

# A tibble: 1 x 4
  `profits-mean` `profits-sd` `profits-1q` `profits-3q`
           <dbl>        <dbl>        <dbl>        <dbl>
1          0.381         1.77         0.08         0.44
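The role of na.rm = TRUE in the example above can be seen on a small data set. The following is a minimal sketch, assuming dplyr is installed; the toy tibble and its values are made up for illustration and are not the forbes data.

```r
library(dplyr)

# Hypothetical toy data standing in for forbes; one profit value is missing
toy <- tibble(profits = c(0.1, 0.4, NA, 1.2, -0.3))

toy_summary <-
  toy %>%
  summarise(
    # Without na.rm = TRUE, a single NA makes the whole statistic NA
    mean_with_na = mean(profits),
    # With na.rm = TRUE, the missing value is dropped before calculating
    mean_no_na = mean(profits, na.rm = TRUE),
    n_missing = sum(is.na(profits))
  )

# summarise() returns one row with one column per statistic
toy_summary
```

Counting the missing values alongside the statistics, as n_missing does here, is one way to keep track of how many observations were dropped.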

Calculating with summary statistics

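This example calculates the same mean and standard deviation of profits as the prior example. Rather than using summarise() to create a table of these values, it uses them inside mutate() to calculate a z-score for profits. It then uses the quartiles saved in forbes_summary to flag as outliers any profits that fall more than one interquartile range below the first quartile or above the third quartile.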

forbes <-
  forbes %>%
  mutate(
    profits_std = (profits - mean(profits, na.rm = TRUE)) / sd(profits, na.rm = TRUE)
  )

forbes %>%
  pull(profits_std) %>%
  summary()

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
-14.84668  -0.17057  -0.10260   0.00000   0.03334  11.65642         5
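Inside mutate(), mean() and sd() each return a single value that is reused for every row. A quick way to check a z-score is to confirm that the new variable has a mean of about 0 and a standard deviation of 1. The sketch below does this on made-up toy data (not the forbes data), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for forbes
toy <- tibble(profits = c(0.1, 0.4, NA, 1.2, -0.3))

toy <-
  toy %>%
  mutate(
    # Same z-score calculation as above
    profits_std = (profits - mean(profits, na.rm = TRUE)) / sd(profits, na.rm = TRUE)
  )

# The standardized variable should have mean near 0 and sd of 1
check <-
  toy %>%
  summarise(
    std_mean = mean(profits_std, na.rm = TRUE),
    std_sd = sd(profits_std, na.rm = TRUE)
  )

check
```

The row with a missing profit keeps a missing z-score, which is why summary() in the example above still reports NA's.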

outlier_bounds <-
  forbes_summary %>%
  mutate(
    iqr = `profits-3q` - `profits-1q`,
    lower_bounds = `profits-1q` - iqr,
    upper_bounds = `profits-3q` + iqr
  )

forbes <-
  forbes %>%
  mutate(
    outlier =
      profits < pull(outlier_bounds, lower_bounds) |
      profits > pull(outlier_bounds, upper_bounds)
  )

forbes %>%
  pull(outlier) %>%
  summary()

   Mode   FALSE    TRUE    NA's 
logical    1578     417       5
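The outlier calculation above can also be wrapped in a function so the width of the bounds is easy to change. This is only a sketch: flag_outliers() and its k argument are hypothetical names introduced here, and the toy data is made up. Setting k = 1 reproduces the bounds used above, while k = 1.5 is the multiplier commonly used for boxplot whiskers.

```r
library(dplyr)

# Hypothetical helper: flags values more than k interquartile ranges
# below the first quartile or above the third quartile
flag_outliers <- function(x, k = 1) {
  q <- quantile(x, probs = c(.25, .75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Hypothetical toy data with two extreme values and one missing value
toy <- tibble(profits = c(-5, 0.1, 0.2, 0.3, 0.4, 0.5, 6, NA))

toy <-
  toy %>%
  mutate(outlier = flag_outliers(profits))

toy
```

As in the example above, an observation with a missing profit gets a missing outlier value rather than TRUE or FALSE.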

Proportion of observations

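The outlier indicator variable (created in the prior example) can be used to determine the proportion of companies whose profit values are outliers relative to the distribution of profits. Because TRUE is treated as 1 and FALSE as 0, the mean of a logical variable is the proportion of TRUE values.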
```
forbes %>%
  summarise(
    outlier_proportion = mean(outlier, na.rm = TRUE)
  )
```
```
# A tibble: 1 x 1
  outlier_proportion
               <dbl>
1              0.209
```
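The result above agrees with the earlier counts: 417 / (1578 + 417) is about 0.209, with the 5 missing values excluded by na.rm = TRUE. The sketch below shows the same relationship on a made-up logical variable, assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy indicator: 2 TRUE, 2 FALSE, 1 missing
toy <- tibble(outlier = c(TRUE, FALSE, FALSE, TRUE, NA))

toy_prop <-
  toy %>%
  summarise(
    outlier_count = sum(outlier, na.rm = TRUE),
    non_missing = sum(!is.na(outlier)),
    # mean() of a logical is the count of TRUE over the non-missing count
    outlier_proportion = mean(outlier, na.rm = TRUE)
  )

toy_prop
```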