Supporting Statistical Analysis for Research
4.6 Subsets of a data frame
Import the PSID.csv data set that was imported in the prior section.

    from pathlib import Path
    import pandas as pd

    psid_path = Path('..') / 'datasets' / 'PSID.csv'
    psid_in = pd.read_csv(psid_path)
    psid_in = (
        psid_in
        .rename(
            columns={
                'Unnamed: 0': 'obs_num',
                'intnum': 'intvw_num',
                'persnum': 'person_id',
                'married': 'marital_status'}))
    psid = psid_in.copy(deep=True)

    print(psid.dtypes)
    obs_num             int64
    intvw_num           int64
    person_id           int64
    age                 int64
    educatn           float64
    earnings            int64
    hours               int64
    kids                int64
    marital_status     object
    dtype: object
The obs_num variable is retained for these examples. The examples in this section subset by row position, and this variable records each row's number in the original file. Note that obs_num starts at 1, while pandas positions start at 0.
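The offset between obs_num and the pandas position can be checked directly. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and are not taken from PSID.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from PSID.csv.
toy = pd.DataFrame({'obs_num': [1, 2, 3, 4], 'kids': [2, 0, 1, 3]})

# Positions are 0-based, so each obs_num should be its position plus one.
expected = [pos + 1 for pos in range(len(toy))]
offset_by_one = list(toy['obs_num']) == expected

# The row at position 2 is the third row of the file.
third_obs = toy.iloc[2]['obs_num']
print(offset_by_one, third_obs)
```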
Display the last three rows of the data frame using positional values to subset.
    (psid
        .iloc[-3:, :]
        .pipe(print))

          obs_num  intvw_num  person_id  age  ...  earnings  hours  kids marital_status
    4853     4854       9302          1   37  ...     22045   2793    98       divorced
    4854     4855       9305          2   40  ...       134     30     3        married
    4855     4856       9306          2   37  ...     33000   2423     4        married

    [3 rows x 9 columns]
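Negative positions count back from the end of the data frame and do not depend on the index labels. The following is a minimal sketch of this on a made-up data frame; `toy`, its values, and its index labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with index labels that do not start at 0.
toy = pd.DataFrame(
    {'obs_num': [1, 2, 3, 4, 5],
     'age': [39, 33, 47, 38, 40]},
    index=[10, 11, 12, 13, 14])

# -3: means "from the third-to-last position through the end",
# whatever the index labels happen to be.
last_three = toy.iloc[-3:, :]
print(last_three)
```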
Display the same rows using the tail() method to confirm that the correct three rows were displayed.

    (psid
        .tail(3)
        .pipe(print))

          obs_num  intvw_num  person_id  age  ...  earnings  hours  kids marital_status
    4853     4854       9302          1   37  ...     22045   2793    98       divorced
    4854     4855       9305          2   40  ...       134     30     3        married
    4855     4856       9306          2   37  ...     33000   2423     4        married

    [3 rows x 9 columns]
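Rather than comparing the two printouts by eye, the comparison can also be done programmatically. This is a sketch under the assumption that an exact match of values, labels, and dtypes is the desired check; it uses the pandas `DataFrame.equals` method on a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical toy data, not taken from PSID.csv.
toy = pd.DataFrame({
    'obs_num': range(1, 8),
    'hours': [2940, 0, 693, 120, 1683, 40, 1144]})

# equals() is True only when shape, labels, dtypes, and values all match.
same_rows = toy.iloc[-3:, :].equals(toy.tail(3))
print(same_rows)
```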
Display the first, third, fifth, and seventh rows of all columns.
    (psid
        .iloc[[0, 2, 4, 6], :]
        .pipe(print))

       obs_num  intvw_num  person_id  age  ...  earnings  hours  kids marital_status
    0        1          4          4   39  ...     77250   2940     2        married
    2        3          4          7   33  ...      8000    693     1        married
    4        5          5          2   47  ...      6500   1683     5        married
    6        7          6        172   38  ...      7000   1144     3        married

    [4 rows x 9 columns]
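The column position of `iloc` accepts a list of positions as well, so rows and columns can be restricted in the same call. The following is a minimal sketch that keeps the same four row positions but only the second and third columns (positions 1 and 2); the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the renamed PSID columns.
toy = pd.DataFrame({
    'obs_num': [1, 2, 3, 4, 5, 6, 7],
    'intvw_num': [11, 11, 12, 12, 13, 14, 14],
    'person_id': [1, 2, 1, 2, 1, 1, 2],
    'age': [39, 41, 33, 35, 47, 29, 38]})

# First list: row positions. Second list: column positions.
subset = toy.iloc[[0, 2, 4, 6], [1, 2]]
print(subset)
```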
Create a smaller data frame using the first 20 rows.
    psid_small = psid.iloc[:20, :]
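Positional slices are half-open: the start position is included and the stop position is excluded. Positions start at 0, so the first 20 rows are positions 0 through 19. The sketch below, on a hypothetical 30-row frame, compares a slice that starts at the beginning with one that starts at position 1.

```python
import pandas as pd

# Hypothetical toy data with 30 rows, not taken from PSID.csv.
toy = pd.DataFrame({'obs_num': range(1, 31)})

# Positions 0..19: the first 20 rows.
first_20 = toy.iloc[:20, :]

# Positions 1..19: only 19 rows, and the first row is skipped.
skips_first = toy.iloc[1:20, :]

print(len(first_20), len(skips_first))
```

Checking `len()` or `.shape` after subsetting is a quick way to catch an off-by-one slice.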