5.6 Related observations

SSCC - Social Science Computing Cooperative

Supporting Statistical Analysis for Research

These exercises use the mtcars.csv data set.

Import the mtcars.csv data set.

from pathlib import Path
import pandas as pd
import numpy as np
import plotnine as p9  # needed for the p9.ggplot calls in the plotting exercise

mtcars_path = Path('..') / 'datasets' / 'mtcars.csv'
mtcars_in = pd.read_csv(mtcars_path)
mtcars_in = mtcars_in.rename(columns={'Unnamed: 0': 'make_model'})
mtcars = mtcars_in.copy(deep=True)

print(mtcars.dtypes)

make_model     object
mpg           float64
cyl             int64
disp          float64
hp              int64
drat          float64
wt            float64
qsec          float64
vs              int64
am              int64
gear            int64
carb            int64
dtype: object

Find the most efficient car (mpg) for each number of cylinders.

best_mpg_cyl = (
    mtcars
    .assign(mpg_rank = lambda df: df
        .groupby('cyl')
        ['mpg']
        .rank(method='min', ascending=False)))

(best_mpg_cyl
    .query('mpg_rank == 1')
    .loc[:, ['make_model', 'cyl', 'mpg', 'disp']]
    .head()
    .pipe(print))

          make_model  cyl   mpg   disp
3     Hornet 4 Drive    6  21.4  258.0
19    Toyota Corolla    4  33.9   71.1
24  Pontiac Firebird    8  19.2  400.0
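The ranking approach above keeps every car that ties for first place within a group, because method='min' gives tied rows the same rank. A minimal sketch of that behavior, compared with idxmax (which returns a single row per group), is shown here. The toy data frame and its values are hypothetical and are not taken from mtcars.

```python
import pandas as pd

# Hypothetical toy data (not mtcars): cars A and B tie for best mpg among 4-cylinder cars.
cars = pd.DataFrame({
    'make_model': ['A', 'B', 'C', 'D', 'E'],
    'cyl': [4, 4, 4, 6, 6],
    'mpg': [30.0, 30.0, 25.0, 21.0, 19.0],
})

# Rank within each cyl group; method='min' assigns tied rows the same (lowest) rank.
ranked = cars.assign(
    mpg_rank=lambda df: df
        .groupby('cyl')
        ['mpg']
        .rank(method='min', ascending=False))
best_with_ties = ranked.query('mpg_rank == 1')

# idxmax returns one row label per group (the first occurrence of the maximum),
# so ties are silently reduced to a single row.
best_one_per_group = cars.loc[cars.groupby('cyl')['mpg'].idxmax()]

print(best_with_ties)
print(best_one_per_group)
```

If ties matter for the question being asked, the rank-and-query form is the safer choice; idxmax is shorter when exactly one row per group is wanted.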

The weight of a car is a major contributor to how efficient it is. Create a variable that measures mpg per unit of weight. Plot this new variable against hp and then disp. These plots should consider the relationship with the number of cylinders. From these plots, does hp or disp seem to be more related to the new variable when considering the number of cylinders?

mtcars = (
    mtcars
    .assign(mpg_per_wt = lambda df:
        df['mpg'] / df['wt']))

(mtcars
    .loc[:, ['make_model', 'wt', 'mpg', 'mpg_per_wt']]
    .head()
    .pipe(print))

          make_model     wt   mpg  mpg_per_wt
0          Mazda RX4  2.620  21.0    8.015267
1      Mazda RX4 Wag  2.875  21.0    7.304348
2         Datsun 710  2.320  22.8    9.827586
3     Hornet 4 Drive  3.215  21.4    6.656299
4  Hornet Sportabout  3.440  18.7    5.436047

print(
    p9.ggplot(mtcars, p9.aes(x='disp', y='mpg_per_wt')) + 
    p9.geom_point() +
    p9.facet_wrap('~cyl') +
    p9.theme_bw())

[Figure: scatter plot of mpg_per_wt against disp, faceted by cyl]

print(
    p9.ggplot(mtcars, p9.aes(x='hp', y='mpg_per_wt')) + 
    p9.geom_point() +
    p9.facet_wrap('~cyl') +
    p9.theme_bw())

[Figure: scatter plot of mpg_per_wt against hp, faceted by cyl]

Both hp and disp seem to be related to mpg_per_wt within the cylinder groups. Of the two, disp seems to have the stronger relationship.
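A visual comparison like this can be supplemented with a numeric check. The sketch below computes, for each cylinder group, the Pearson correlation of each candidate variable with mpg_per_wt. The helper name within_group_corr and the toy data are hypothetical; the values are illustrative only and are not mtcars results.

```python
import pandas as pd

# Hypothetical toy data standing in for mtcars; values are illustrative only.
toy = pd.DataFrame({
    'cyl':        [4, 4, 4, 4, 8, 8, 8, 8],
    'hp':         [60, 65, 90, 95, 150, 180, 200, 250],
    'disp':       [70, 80, 110, 120, 300, 350, 400, 460],
    'mpg_per_wt': [18.0, 16.5, 11.0, 10.0, 5.5, 5.0, 4.0, 2.0],
})

def within_group_corr(df, group, target, predictors):
    """Correlate each predictor with target separately within each group.

    Hypothetical helper: returns one row per group and one column per predictor.
    """
    return (
        df
        .groupby(group)
        [predictors + [target]]
        .apply(lambda g: g[predictors].corrwith(g[target])))

corr = within_group_corr(toy, 'cyl', 'mpg_per_wt', ['hp', 'disp'])
print(corr)
```

Applied to the real data, the same call would be within_group_corr(mtcars, 'cyl', 'mpg_per_wt', ['hp', 'disp']). With so few cars in each group, the correlations should be read as a rough guide alongside the plots, not as a replacement for them.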

Find the least efficient car (using the new variable that considers both mpg and weight) for each combination of number of cylinders and number of gears. Exclude any combination that does not have at least two observations.

eff_cyl_gear = (
    mtcars
    .assign(
        num_group_obs = lambda df: df
            .groupby(['cyl', 'gear'])
            ['mpg_per_wt']
            .transform('size'),
        efficiency_rank = lambda df: df
            .groupby(['cyl', 'gear'])
            ['mpg_per_wt']
            .rank(method='min', ascending=True))
    .query('num_group_obs >= 2 & efficiency_rank == 1'))

(eff_cyl_gear
    .loc[:, ['make_model', 'cyl', 'gear', 'mpg_per_wt', 'mpg', 'disp']]
    .sort_values(['cyl', 'gear', 'mpg_per_wt'])
    .head(5)
    .pipe(print))

             make_model  cyl  gear  mpg_per_wt   mpg   disp
8              Merc 230    4     4    7.238095  22.8  140.8
26        Porsche 914-2    4     5   12.149533  26.0  120.3
5               Valiant    6     3    5.231214  18.1  225.0
10            Merc 280C    6     4    5.174419  17.8  167.6
15  Lincoln Continental    8     3    1.917404  10.4  460.0
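The two steps in this solution (dropping small groups, then picking the lowest value in each remaining group) can also be written separately. The sketch below shows how transform('size') broadcasts each group's row count back onto its rows, and uses idxmin as an alternative to ranking. The toy data is hypothetical and is not taken from mtcars.

```python
import pandas as pd

# Hypothetical toy data (not mtcars); the (6, 5) group has a single observation.
toy = pd.DataFrame({
    'make_model': ['A', 'B', 'C', 'D', 'E'],
    'cyl':        [4, 4, 6, 8, 8],
    'gear':       [4, 4, 5, 3, 3],
    'mpg_per_wt': [9.0, 7.0, 6.0, 2.5, 3.0],
})

keys = ['cyl', 'gear']

# transform('size') returns a Series aligned with toy: each row gets its group's row count.
group_sizes = toy.groupby(keys)['mpg_per_wt'].transform('size')

# Keep only rows that belong to groups with at least two observations.
multi_obs = toy[group_sizes >= 2]

# idxmin gives the row label of the smallest mpg_per_wt in each remaining group.
least_efficient = multi_obs.loc[multi_obs.groupby(keys)['mpg_per_wt'].idxmin()]
print(least_efficient)
```

As with idxmax, idxmin returns only one row per group, so the rank-based version is preferable when tied cars should all be reported.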