SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

5.1 Preparatory exercises

The skills practiced in these exercises are used in the exercises at the end of each discourse in this chapter. Take a moment to complete them and confirm that you are prepared for this chapter. If these exercises are difficult, review the prior chapters.

  1. Import the auto.csv data set.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    import plotnine as p9
    auto_path = Path('..') / 'datasets' / 'auto.csv'
    auto_in = pd.read_csv(auto_path)
    auto = auto_in.copy(deep=True)
    
    print(auto.dtypes)
    Unnamed: 0        int64
    mpg             float64
    cylinders         int64
    displacement    float64
    horsepower        int64
    weight            int64
    acceleration    float64
    year              int64
    origin            int64
    name             object
    dtype: object
  2. Is there any missing data in the name column?

    The data set description does not list any special missing identifiers.

    The following code checks whether any values in the name column were set to NaN by the read_csv() function. The comparison name != name is true only for NaN, because NaN is never equal to itself.

    (auto
        .query('name != name')
        .pipe(print))
    Empty DataFrame
    Columns: [Unnamed: 0, mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, name]
    Index: []

    There are no rows with np.nan in the name column.
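
    The same check can also be written with isna(), which some readers find easier to read than the self-comparison idiom. The following is a minimal sketch. The toy data frame is hypothetical and stands in for auto so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the auto data frame (not the real data).
toy = pd.DataFrame({
    'year': [75, 75, 81],
    'name': ['ford pinto', np.nan, 'plymouth reliant']})

# isna() flags missing values; sum() counts the flagged rows.
n_missing = toy['name'].isna().sum()
print(n_missing)

# Select the rows with a missing name for inspection.
missing_rows = toy.loc[toy['name'].isna()]
print(missing_rows)
```

    Applied to auto, a count of zero would agree with the empty data frame printed above.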

  3. Are there any duplicated observations in the data set? Hint: look at the columns to determine what an observation is in this data set.

    I start with the assumption that year and name (which holds the make and model) uniquely identify an observation.

    The following code identifies any duplicate year and name pairs in the auto data set.

    dup_rows = (
        auto.duplicated(subset=['year', 'name'], keep=False))
    (auto
        .loc[dup_rows, 
             ['cylinders', 'horsepower', 'weight', 'year', 'name']]
        .pipe(print))
         cylinders  horsepower  weight  year              name
    166          4          83    2639    75        ford pinto
    172          6          97    2984    75        ford pinto
    334          4          84    2490    81  plymouth reliant
    338          4          84    2385    81  plymouth reliant

    There are two sets of matches. The first is for the ford pinto. There are two different engines for the pinto, a four-cylinder and a six-cylinder. The purpose of the study would determine whether these are different observations. (An observation might be defined by a unique name, year, and cylinders.)

    The second set of duplicates is for the plymouth reliant. Here the engine seems to be the same, with both rows having an 84 horsepower, four-cylinder engine. There is a difference in weight. It is unclear whether these are different observations. It could be that a different trim package for the car accounts for the difference in weight. It could also be an entry error. The purpose of the study may provide some direction for these duplicates.

    When possible, duplicates should be reviewed with the people responsible for creating the data set. This is not always possible.
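
    If the study treats cylinders as part of what defines an observation, the check can be repeated with cylinders added to the key. The following is a minimal sketch of that idea. It re-enters by hand the four rows printed above, so that it runs without the full data set, and the names dups and still_dup are chosen only for this illustration.

```python
import pandas as pd

# The four rows reported as duplicates above, re-entered by hand.
dups = pd.DataFrame(
    {'cylinders': [4, 6, 4, 4],
     'horsepower': [83, 97, 84, 84],
     'weight': [2639, 2984, 2490, 2385],
     'year': [75, 75, 81, 81],
     'name': ['ford pinto', 'ford pinto',
              'plymouth reliant', 'plymouth reliant']},
    index=[166, 172, 334, 338])

# Adding cylinders to the key separates the two pinto engines,
# so only rows that agree on all three columns remain flagged.
still_dup = dups.duplicated(
    subset=['year', 'name', 'cylinders'], keep=False)
print(dups.loc[still_dup])
```

    Under this definition only the plymouth reliant rows remain flagged, and they would still need to be reviewed.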

  4. Plot the horsepower, weight, and cylinders variables.

    print(
        p9.ggplot(data=auto, mapping=p9.aes(x='weight', y='horsepower')) + 
        p9.geom_point() +
        p9.facet_wrap('~ cylinders') +
        p9.theme_bw())
    <ggplot: (-9223371893263964500)>

  5. Does the plot from the prior problem show a relationship between horsepower and weight for all cylinder levels?

    No. The six-cylinder autos do not show a relationship between weight and horsepower.
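
    A visual impression like this one can be supplemented with a per-group correlation. The following is a minimal sketch of that check, not part of the original solution. The toy data frame is hypothetical, with values invented so that one group is perfectly linear and the other is uncorrelated. It is not the auto data and says nothing about the real correlations.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
toy = pd.DataFrame({
    'cylinders': [4, 4, 4, 4, 6, 6, 6, 6],
    'weight': [2000, 2200, 2400, 2600, 3000, 3200, 3400, 3600],
    'horsepower': [60, 70, 80, 90, 100, 110, 110, 100]})

# Pearson correlation of weight and horsepower within each
# cylinders group, collected into a Series indexed by cylinders.
corr_by_cyl = pd.Series({
    cyl: grp['weight'].corr(grp['horsepower'])
    for cyl, grp in toy.groupby('cylinders')})
print(corr_by_cyl)
```

    Replacing toy with auto would give one correlation per cylinder level. A value near zero for a group would be consistent with a panel that shows no linear relationship, although a correlation does not detect nonlinear patterns that a plot can reveal.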