3.7 Summarizing data
Summary statistics are used in two main ways when wrangling data. One is to generate tables of summary statistics. The other is to use the statistics to calculate new variables.
Examples
Summary table.
This example uses summarise() to create a table of summary statistics for the profits variable. The summarise() function returns a tibble with a column for each summary statistic it calculates.

    forbes_summary <-
      forbes %>%
      summarise(
        `profits-mean` = mean(profits, na.rm = TRUE),
        `profits-sd` = sd(profits, na.rm = TRUE),
        `profits-1q` = quantile(profits, probs = .25, na.rm = TRUE),
        `profits-3q` = quantile(profits, probs = .75, na.rm = TRUE)
      )

    forbes_summary
    # A tibble: 1 x 4
      `profits-mean` `profits-sd` `profits-1q` `profits-3q`
               <dbl>        <dbl>        <dbl>        <dbl>
    1          0.381         1.77         0.08         0.44
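To try the same pattern without the forbes data, summarise() can be run on a small made-up tibble. This is only a sketch: it assumes dplyr is installed, and the toy data frame, its values, and the column names are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for forbes; the values are made up.
toy <- tibble(profits = c(1, 2, 3, 4, NA))

toy_summary <-
  toy %>%
  summarise(
    profits_mean = mean(profits, na.rm = TRUE),
    profits_sd = sd(profits, na.rm = TRUE),
    profits_1q = quantile(profits, probs = .25, na.rm = TRUE),
    profits_3q = quantile(profits, probs = .75, na.rm = TRUE)
  )

# A single row with one column per summary statistic
toy_summary
```

Because na.rm = TRUE is set in every call, the missing value is dropped before each statistic is calculated instead of turning the result into NA.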
Calculating with summary statistics
This example calculates the same mean and standard deviation of profits as the prior example. Rather than using summarise() to create a table with these values, it uses them inside mutate() to calculate a z-score for profits.

    forbes <-
      forbes %>%
      mutate(
        profits_std = (profits - mean(profits, na.rm = TRUE)) / sd(profits, na.rm = TRUE)
      )

    forbes %>%
      pull(profits_std) %>%
      summary()
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
    -14.84668  -0.17057  -0.10260   0.00000   0.03334  11.65642         5
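A quick way to check a z-score calculation is to confirm that the standardized variable has a mean of 0 and a standard deviation of 1. The sketch below does this on a made-up tibble. It assumes dplyr is installed, and the toy data and the name check are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; the values are made up.
toy <- tibble(profits = c(1, 2, 3, 4, NA))

toy <-
  toy %>%
  mutate(
    profits_std = (profits - mean(profits, na.rm = TRUE)) / sd(profits, na.rm = TRUE)
  )

# The standardized values should have mean 0 and standard deviation 1.
# The missing profit stays missing after standardizing.
check <-
  toy %>%
  summarise(
    std_mean = mean(profits_std, na.rm = TRUE),
    std_sd = sd(profits_std, na.rm = TRUE),
    n_missing = sum(is.na(profits_std))
  )

check
```

This matches the summary() output above, where the mean of profits_std is 0 and the missing values are carried through to the new variable.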
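A summary table can also feed later calculations. The code below uses the quartiles stored in forbes_summary to flag companies whose profits fall more than one interquartile range below the first quartile or above the third quartile.

    outlier_bounds <-
      forbes_summary %>%
      mutate(
        iqr = `profits-3q` - `profits-1q`,
        lower_bounds = `profits-1q` - iqr,
        upper_bounds = `profits-3q` + iqr
      )

    forbes <-
      forbes %>%
      mutate(
        outlier =
          profits < pull(outlier_bounds, lower_bounds) |
          profits > pull(outlier_bounds, upper_bounds)
      )

    forbes %>%
      pull(outlier) %>%
      summary()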
       Mode   FALSE    TRUE    NA's
    logical    1578     417       5
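The outlier rule is easier to follow on data small enough to check by hand. The sketch below applies the same steps to a made-up tibble with one obviously extreme value. It assumes dplyr is installed, and the names toy, toy_bounds, q1, q3, lower_bound, and upper_bound are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one extreme value; the values are made up.
toy <- tibble(profits = c(1, 2, 3, 4, 5, 100, NA))

# Quartiles first, then bounds one interquartile range beyond each quartile
toy_bounds <-
  toy %>%
  summarise(
    q1 = quantile(profits, probs = .25, na.rm = TRUE),
    q3 = quantile(profits, probs = .75, na.rm = TRUE)
  ) %>%
  mutate(
    iqr = q3 - q1,
    lower_bound = q1 - iqr,
    upper_bound = q3 + iqr
  )

# Flag values outside the bounds; a missing profit gives a missing flag
toy <-
  toy %>%
  mutate(
    outlier =
      profits < pull(toy_bounds, lower_bound) |
      profits > pull(toy_bounds, upper_bound)
  )

toy
```

The multiplier of one interquartile range mirrors the code above. A multiplier of 1.5 is another common convention, and it would flag fewer observations.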
Proportion of observations
The outlier indicator variable (created in the prior example) can be used to determine the proportion of companies whose profit values are outliers relative to the distribution of profits. Because TRUE counts as 1 and FALSE as 0, the mean of a logical variable is the proportion of TRUE values.

    forbes %>%
      summarise(
        outlier_proportion = mean(outlier, na.rm = TRUE)
      )
    # A tibble: 1 x 1
      outlier_proportion
                   <dbl>
    1              0.209
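The link between the mean of a logical vector and a proportion can be checked with base R alone. In this sketch the vector flags is made up, while the counts in the last line are taken from the summary() output for outlier shown earlier.

```r
# A hypothetical logical vector with one missing value
flags <- c(TRUE, FALSE, FALSE, TRUE, NA)

# The mean of the non-missing entries is the share that are TRUE
prop_mean <- mean(flags, na.rm = TRUE)

# The same proportion computed by counting
prop_count <- sum(flags, na.rm = TRUE) / sum(!is.na(flags))

prop_mean
prop_count

# Counts from the outlier summary: 417 TRUE and 1578 FALSE
round(417 / (1578 + 417), 3)
```

The last line reproduces the 0.209 reported by summarise(), since the 5 missing values are excluded from both the numerator and the denominator.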