SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

5.7 Relationships between columns

These exercises use the Chile.csv data set.

  1. Import the Chile.csv file.

    from pathlib import Path
    import pandas as pd
    import numpy as np
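    # Build the path to the data file relative to the working directory.
    chile_path = Path('..') / 'datasets' / 'Chile.csv'
    chile_in = pd.read_csv(chile_path)
    chile_in = chile_in.rename(columns={'statusquo': 'status_quo'})
    # Work on a copy and drop the unnamed first column, which pandas labels 'Unnamed: 0'.
    chile = (
        chile_in
            .copy(deep=True)
            .drop('Unnamed: 0', axis='columns'))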
    
    print(chile.dtypes)

    region         object
    population      int64
    sex            object
    age           float64
    education      object
    income        float64
    status_quo    float64
    vote           object
    dtype: object
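
As a side note on the import step, the `'Unnamed: 0'` column can also be avoided at read time. The sketch below is one possible alternative, not the method used in the solution above. It assumes the unnamed first column holds row names, and it uses a small invented CSV string (`csv_text` is hypothetical) in place of Chile.csv so that it runs on its own.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for Chile.csv; the values are invented
# for illustration. The empty first header field is what pandas
# labels 'Unnamed: 0' on a default read.
csv_text = """,region,population,income
1,N,175000,35000
2,N,175000,
3,C,125000,7500
"""

# Default read: the row-name column arrives as an ordinary column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead,
# so there is nothing to drop afterwards.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(as_index.columns))
```

One trade-off to keep in mind: with `index_col=0` the row labels come from the file rather than from the default 0-based index, so row labels printed later may differ from those shown in these exercises.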

  2. Find all rows with a missing value in any column, using a method that operates across columns.

    chile_na_rows = (
        chile
            # any(axis='columns') already returns one boolean per row,
            # so no comparison with 1 is needed.
            .assign(missing=lambda df: df
                .isna()
                .any(axis='columns'))
            .query('missing')
            .drop('missing', axis='columns'))
    
    print(chile_na_rows.head())

       region  population sex   age education   income  status_quo vote
    12      N      175000   F  27.0        PS      NaN     1.43448    Y
    14      N      175000   M  36.0        PS  35000.0     1.49026  NaN
    27      N      175000   F  43.0         P      NaN     0.15489    A
    75      N      125000   F  32.0         S      NaN    -0.85035    N
    97      N      125000   F  34.0         P   2500.0     0.10807  NaN
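
The same row-wise check can be sketched without a helper column by indexing with a boolean mask. This is a minimal alternative under stated assumptions, not the approach the exercise solution takes: `toy` is a hypothetical frame with invented values, used so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (invented values), not the Chile data.
toy = pd.DataFrame({
    'region': ['N', 'N', 'C', 'S'],
    'income': [35000.0, np.nan, 7500.0, 15000.0],
    'vote': ['Y', 'A', None, 'N'],
})

# True for each row that has at least one missing value.
has_missing = toy.isna().any(axis='columns')

# Boolean indexing keeps the flagged rows; no helper column is needed.
toy_na_rows = toy[has_missing]

# Per-row count of missing values, for when "how many" matters more than "any".
n_missing = toy.isna().sum(axis='columns')

print(toy_na_rows)
```

The mask form is shorter, while the `assign` and `query` form keeps everything in a single method chain; which reads better is largely a matter of style.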