4.6 Subsets of a data frame

SSCC - Social Science Computing Cooperative

Supporting Statistical Analysis for Research

Import the PSID.csv data set that was imported in the prior section.

from pathlib import Path
import pandas as pd

psid_path = Path('..') / 'datasets' / 'PSID.csv'
psid_in = pd.read_csv(psid_path)
psid_in = (
    psid_in
        .rename( columns={
            'Unnamed: 0': 'obs_num',
            'intnum': 'intvw_num', 
            'persnum': 'person_id',
            'married': 'marital_status'}))
psid = psid_in.copy(deep=True)

print(psid.dtypes)

obs_num             int64
intvw_num           int64
person_id           int64
age                 int64
educatn           float64
earnings            int64
hours               int64
kids                int64
marital_status     object
dtype: object

The obs_num variable is retained for these examples. The examples in this section operate on row numbers, and this variable records the row number of each observation.
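One detail worth keeping in mind when reading the output in this section: obs_num counts rows starting at 1, while the positions used by .iloc[] start at 0. The sketch below illustrates the offset on a small made-up data frame. The name toy and its values are hypothetical stand-ins for psid, not part of the PSID data.

```python
import pandas as pd

# Hypothetical stand-in for psid; obs_num counts rows starting at 1.
toy = pd.DataFrame({'obs_num': [1, 2, 3, 4, 5]})

# Position 0 is the first row, so its obs_num is 1.
first_obs = int(toy.iloc[0, 0])

# Position -1 is the last row, so its obs_num equals the row count.
last_obs = int(toy.iloc[-1, 0])

print(first_obs, last_obs)
```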

Display the last three rows of the data frame using positional values to subset.

(psid
    .iloc[-3:, :]
    .pipe(print))

      obs_num  intvw_num  person_id  age  ...  earnings  hours  kids  marital_status
4853     4854       9302          1   37  ...     22045   2793    98        divorced
4854     4855       9305          2   40  ...       134     30     3         married
4855     4856       9306          2   37  ...     33000   2423     4         married

[3 rows x 9 columns]

Display the same rows using the tail() method to confirm that the correct three rows were selected.

(psid
    .tail(3)
    .pipe(print))

      obs_num  intvw_num  person_id  age  ...  earnings  hours  kids  marital_status
4853     4854       9302          1   37  ...     22045   2793    98        divorced
4854     4855       9305          2   40  ...       134     30     3         married
4855     4856       9306          2   37  ...     33000   2423     4         married

[3 rows x 9 columns]
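The two results can also be compared in code rather than by eye. This is a minimal sketch that assumes a small made-up data frame (toy and its values are hypothetical stand-ins for psid) and uses the equals() method to check that the negative slice and tail() return the same rows.

```python
import pandas as pd

# Hypothetical stand-in for psid; values are illustrative only.
toy = pd.DataFrame({
    'obs_num': range(1, 11),
    'age': [39, 41, 33, 30, 47, 44, 38, 52, 29, 36],
})

last_by_position = toy.iloc[-3:, :]  # last three rows by position
last_by_tail = toy.tail(3)           # same rows using tail()

# equals() compares the values, index, and column labels.
same_rows = last_by_position.equals(last_by_tail)
print(same_rows)
```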

Display the first, third, fifth, and seventh rows of all columns. A list of positions selects non-adjacent rows, and the : in the column position keeps every column.

(psid
    .iloc[[0, 2, 4, 6], :]
    .pipe(print))

   obs_num  intvw_num  person_id  age  ...  earnings  hours  kids  marital_status
0        1          4          4   39  ...     77250   2940     2         married
2        3          4          7   33  ...      8000    693     1         married
4        5          5          2   47  ...      6500   1683     5         married
6        7          6        172   38  ...      7000   1144     3         married

[4 rows x 9 columns]
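The column position of .iloc[] accepts a list of positions in the same way as the row position, so the result can be limited to columns two and three (positions 1 and 2). The sketch below assumes a small made-up data frame. The name toy and its values are hypothetical and only mimic the first few columns of psid.

```python
import pandas as pd

# Hypothetical stand-in for psid; values are illustrative only.
toy = pd.DataFrame({
    'obs_num': range(1, 9),
    'intvw_num': [4, 4, 4, 5, 5, 6, 6, 7],
    'person_id': [4, 6, 7, 1, 2, 4, 172, 4],
    'age': [39, 41, 33, 30, 47, 44, 38, 52],
})

# Rows at positions 0, 2, 4, 6 and columns at positions 1, 2.
rows_and_cols = toy.iloc[[0, 2, 4, 6], [1, 2]]
print(rows_and_cols)
```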

Create a smaller data frame using the first 20 rows.
```
# Positions 0 through 19; the stop value of a slice is excluded.
psid_small = psid.iloc[:20, :]
```
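Positional slices follow Python's half-open convention: the start position is included and the stop position is excluded. A slice of :20 therefore returns positions 0 through 19, whereas a slice that starts at 1 would skip the first row. The sketch below checks this on a made-up data frame (toy is a hypothetical stand-in for psid).

```python
import pandas as pd

# Hypothetical stand-in for psid with 30 rows.
toy = pd.DataFrame({'obs_num': range(1, 31)})

first_20 = toy.iloc[:20, :]      # positions 0..19, 20 rows
skips_first = toy.iloc[1:20, :]  # positions 1..19, only 19 rows

print(first_20.shape)
print(skips_first.shape)
```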