SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

5.4 Factors and Indicators

These exercises use the mtcars.csv data set.

  1. Import the mtcars.csv data set.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    mtcars_path = Path('..') / 'datasets' / 'mtcars.csv'
    mtcars_in = pd.read_csv(mtcars_path)
    mtcars_in = mtcars_in.rename(columns={'Unnamed: 0': 'make_model'})
    mtcars = mtcars_in.copy(deep=True)
    
    print(mtcars.dtypes)
    make_model     object
    mpg           float64
    cyl             int64
    disp          float64
    hp              int64
    drat          float64
    wt            float64
    qsec          float64
    vs              int64
    am              int64
    gear            int64
    carb            int64
    dtype: object
  2. Factor the cyl, gear and carb variables.

    mtcars = (
        mtcars
            .apply(
                func=lambda x: x.astype('category')
                if x.name in ['cyl', 'gear', 'carb'] else x))
    
    print(mtcars.dtypes)
    make_model      object
    mpg            float64
    cyl           category
    disp           float64
    hp               int64
    drat           float64
    wt             float64
    qsec           float64
    vs               int64
    am               int64
    gear          category
    carb          category
    dtype: object

    or, setting the category levels explicitly:

    mtcars = mtcars_in.copy(deep=True)
    
    cyl_lev = pd.Series(mtcars['cyl'].unique()).sort_values()
    gear_lev = pd.Series(mtcars['gear'].unique()).sort_values()
    carb_lev = pd.Series(mtcars['carb'].unique()).sort_values()
    mtcars = (
        mtcars
            .assign(
                cyl = lambda df:
                    pd.Categorical(df['cyl'], categories=cyl_lev),
                gear = lambda df:
                    pd.Categorical(df['gear'], categories=gear_lev),
                carb = lambda df:
                    pd.Categorical(df['carb'], categories=carb_lev)))
    
    print(mtcars.dtypes)
    make_model      object
    mpg            float64
    cyl           category
    disp           float64
    hp               int64
    drat           float64
    wt             float64
    qsec           float64
    vs               int64
    am               int64
    gear          category
    carb          category
    dtype: object
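
A more compact route to the same result is to pass a column-to-dtype mapping to astype. The sketch below uses a small made-up frame (toy, with illustrative values) in place of the full mtcars data, so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for mtcars; the values are illustrative only.
toy = pd.DataFrame({
    'mpg': [21.0, 22.8, 18.7, 14.3],
    'cyl': [6, 4, 8, 8],
    'gear': [4, 4, 3, 3],
    'carb': [4, 1, 2, 4],
})

# astype accepts a mapping of column name to dtype, so only the
# listed columns are converted; mpg keeps its float64 dtype.
toy = toy.astype({'cyl': 'category', 'gear': 'category', 'carb': 'category'})

print(toy.dtypes)
# The categories are inferred from the data and sorted.
print(toy['cyl'].cat.categories.tolist())
```

As with the apply version, the categories are inferred from the observed values. Use pd.Categorical with an explicit categories argument when the levels or their order need to be controlled.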
  3. Create a variable that identifies the observations that are in the top 25 percent of miles per gallon. Display a few of these vehicles.

    Hint: you will need a function that returns the percentiles (quantiles) of a variable.

    Note that the quantile function returns a Series when it is passed a list of quantiles, so the single value has to be extracted from it.

    mtcars = (
        mtcars
            .assign(
                efficient = lambda df:
                    np.where(
                        df['mpg'] >= df['mpg'].quantile([0.75]).at[0.75],
                        True,
                        False)))
    (mtcars
        .loc[:, ['make_model', 'mpg', 'efficient']]
        .head()
        .pipe(print))
              make_model   mpg  efficient
    0          Mazda RX4  21.0      False
    1      Mazda RX4 Wag  21.0      False
    2         Datsun 710  22.8       True
    3     Hornet 4 Drive  21.4      False
    4  Hornet Sportabout  18.7      False

    or, selecting the value by position with .iloc instead of by label with .at:

    mtcars = (
        mtcars
            .assign(
                efficient = lambda df:
                    np.where(
                        df['mpg'] >= df['mpg'].quantile([0.75]).iloc[0],
                        True,
                        False)))
    (mtcars
        .loc[:, ['make_model', 'mpg', 'efficient']]
        .head()
        .pipe(print))
              make_model   mpg  efficient
    0          Mazda RX4  21.0      False
    1      Mazda RX4 Wag  21.0      False
    2         Datsun 710  22.8       True
    3     Hornet 4 Drive  21.4      False
    4  Hornet Sportabout  18.7      False
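
The extraction step can be skipped by passing a single number rather than a list, in which case quantile returns a scalar. The sketch below illustrates the difference on a short made-up Series of mpg values (not the full data set, so the threshold differs from the one above).

```python
import pandas as pd

# Hypothetical mpg values for illustration only.
mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7])

# A list of quantiles returns a Series indexed by the quantiles.
as_series = mpg.quantile([0.75])

# A single quantile returns a scalar.
as_scalar = mpg.quantile(0.75)

print(type(as_series).__name__, type(as_scalar).__name__)

# The comparison already yields a boolean Series, so it can be
# used as the indicator directly.
efficient = mpg >= as_scalar
print(efficient.tolist())
```

Because the comparison itself produces True and False values, wrapping it in np.where is optional.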
  4. Create a variable that bins the values of hp using the following amounts of hp: 100, 170, 240, and 300.

    mtcars = (
        mtcars
            .assign(
                power = lambda df:
                    pd.cut(df['hp'],
                        bins=[-np.inf, 100, 170, 240, 300, np.inf],
                        labels=['gocart', 'slow', 'typical',
                                'fast', 'beast'])))
    
    (mtcars
        .loc[:, ['make_model', 'mpg', 'efficient', 'power']]
        .head()
        .pipe(print))
              make_model   mpg  efficient    power
    0          Mazda RX4  21.0      False     slow
    1      Mazda RX4 Wag  21.0      False     slow
    2         Datsun 710  22.8       True   gocart
    3     Hornet 4 Drive  21.4      False     slow
    4  Hornet Sportabout  18.7      False  typical
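
By default pd.cut closes each bin on the right, so a value that falls exactly on a cut point is placed in the lower bin. The sketch below checks this on a few made-up hp values, reusing the bins and labels from the solution above.

```python
import numpy as np
import pandas as pd

# Hypothetical hp values chosen to land in each bin; 100 sits on a cut point.
hp = pd.Series([93, 100, 110, 175, 245, 335])

# Same bins and labels as in the solution; right=True is the default,
# so the intervals are (-inf, 100], (100, 170], and so on.
power = pd.cut(
    hp,
    bins=[-np.inf, 100, 170, 240, 300, np.inf],
    labels=['gocart', 'slow', 'typical', 'fast', 'beast'])

print(pd.DataFrame({'hp': hp, 'power': power}))

# The result is an ordered categorical, so the labels can be compared.
print(power.cat.ordered)
```

Passing right=False to pd.cut would instead close the bins on the left, which moves a value of exactly 100 into the 'slow' bin.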