5.1 Preparatory exercises
The skills in these exercises are used in the exercises at the end of the discourses of this chapter. Take a moment and complete these to confirm that you are prepared for this chapter. If these exercises are difficult, review the prior chapters.
Import the auto.csv data set.
from pathlib import Path
import pandas as pd
import numpy as np
import plotnine as p9
auto_path = Path('..') / 'datasets' / 'auto.csv'
auto_in = pd.read_csv(auto_path)
auto = auto_in.copy(deep=True)
print(auto.dtypes)
Unnamed: 0        int64
mpg             float64
cylinders         int64
displacement    float64
horsepower        int64
weight            int64
acceleration    float64
year              int64
origin            int64
name             object
dtype: object
Is there any missing data in the name column?
The data set description does not list any special missing identifiers.
The following code checks whether any of the values of name were set to NA by the read_csv() function.
(auto
    .query('name != name')
    .pipe(print))
Empty DataFrame
Columns: [Unnamed: 0, mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, name]
Index: []
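The name != name comparison works because a missing value never compares equal to itself, so only rows with a missing name satisfy the query. As a cross-check, the minimal sketch below uses a small made-up frame (toy is hypothetical and not taken from auto.csv) to show that the self-comparison and the more explicit isna() method flag the same row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; these rows are invented, not read from auto.csv.
toy = pd.DataFrame({
    'year': [70, 71, 72],
    'name': ['ford pinto', np.nan, 'plymouth reliant'],
})

# A missing value is never equal to itself, so this keeps only
# the rows where name is missing.
missing_by_query = toy.query('name != name')

# isna() states the same check explicitly.
missing_by_isna = toy.loc[toy['name'].isna()]

print(missing_by_query.index.tolist())
print(missing_by_isna.index.tolist())
```

Either form can be used on auto; isna() may be easier to read for someone who has not seen the self-comparison idiom.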
There are no rows with np.nan in the name column.
Are there any duplicated observations in the data set? Hint: look at the columns to determine what an observation is in this data set.
I start with the assumption that make (the name column) and year uniquely identify an observation.
The following code identifies any duplicate year and name pairs in the auto data set.
dup_rows = (
    auto.duplicated(subset=['year', 'name'], keep=False))
(auto
    .loc[dup_rows, ['cylinders', 'horsepower', 'weight', 'year', 'name']]
    .pipe(print))
     cylinders  horsepower  weight  year              name
166          4          83    2639    75        ford pinto
172          6          97    2984    75        ford pinto
334          4          84    2490    81  plymouth reliant
338          4          84    2385    81  plymouth reliant
There are two sets of matches. The first is for the ford pinto. There are two different engines for the pinto. The purpose of the study would determine whether these are different observations. (An observation might be defined by a unique make, year, and cylinders.)
The second set of duplicates is for the plymouth reliant. Here the engine seems to be the same, with both rows having an 84 horsepower, 4 cylinder engine. There is a difference in weight. It is unclear whether these are different observations. A different trim kit for the car could account for the difference in weight. It could also be a data entry error. The purpose of the study may provide some direction for these duplicates.
When possible, duplicates should be reviewed with the people responsible for creating the data set. This is not always possible.
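One way to explore the wider key suggested in the parenthetical above is to add cylinders to the subset and see which rows remain flagged. The sketch below re-types only the four rows printed above into a small frame (dups is a name introduced here for illustration). It illustrates the check; it is not a decision about how an observation should be defined.

```python
import pandas as pd

# The four rows reported as duplicates above, re-typed by hand.
dups = pd.DataFrame(
    {
        'cylinders': [4, 6, 4, 4],
        'horsepower': [83, 97, 84, 84],
        'weight': [2639, 2984, 2490, 2385],
        'year': [75, 75, 81, 81],
        'name': ['ford pinto', 'ford pinto',
                 'plymouth reliant', 'plymouth reliant'],
    },
    index=[166, 172, 334, 338],
)

# Flag rows that still match once cylinders is part of the key.
still_dup = dups.duplicated(
    subset=['year', 'name', 'cylinders'], keep=False)

print(dups.loc[still_dup, ['cylinders', 'weight', 'year', 'name']])
```

With this wider key only the two plymouth reliant rows remain flagged. The pinto rows differ in cylinders, while the reliant rows differ only in weight.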
Plot the horsepower, weight, and cylinders variables.
print(
    p9.ggplot(data=auto, mapping=p9.aes(x='weight', y='horsepower'))
    + p9.geom_point()
    + p9.facet_wrap('~ cylinders')
    + p9.theme_bw())
<ggplot: (-9223371893263964500)>
Does the plot from the prior problem show a relationship between horsepower and weight for all cylinder levels?
No. The six-cylinder autos do not show a relationship between weight and horsepower.
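A numeric summary can back up a visual reading of the facets. The sketch below computes the Pearson correlation between weight and horsepower within each cylinder group. It uses a tiny frame of invented values (toy) so that it runs on its own, and the helper name corr_by_cylinders is hypothetical. The numbers it prints describe only the invented data, not auto.csv.

```python
import pandas as pd


def corr_by_cylinders(df):
    """Return the weight-horsepower Pearson correlation per cylinder group."""
    return (
        df.groupby('cylinders')[['weight', 'horsepower']]
        .apply(lambda g: g['weight'].corr(g['horsepower']))
    )


# Invented values for illustration only; not taken from auto.csv.
toy = pd.DataFrame({
    'cylinders': [4, 4, 4, 6, 6, 6],
    'weight': [2000, 2200, 2400, 3000, 3200, 3400],
    'horsepower': [70, 80, 90, 100, 95, 100],
})

result = corr_by_cylinders(toy)
print(result)
```

Calling corr_by_cylinders(auto) would give the same summary for the real data. A correlation near zero in one group would be consistent with a facet that shows no clear trend, although a correlation only measures linear association.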