5.6 Related observations
5.6.1 Data Concepts
5.6.1.1 Relationships between observations
Observation level transformations use only information that is available in the row to create the new value. That is, no knowledge of other observations is needed. Most of the prior transformations discussed in this chapter have been observation level transformations.
It is common for data sets to have observations that are related to each other. For example, subjects with the same gender or ethnicity may have some commonality. These relationships (when known) are recorded as categorical variables. In sections 3.3 and 3.4, we used categorical variables to visually explore similarities within and differences between groups of subjects identified by categorical variables. In this section we will be looking at relationships of observations within groups.
When there is more than one categorical variable, also called a grouping variable, relationships between groups can exist. For example, in a study of primary school children, the data set might contain classroom, school, and ethnicity. A student belongs to only one classroom, school, and ethnicity. All the students of one classroom also attend the same school. If there is more than one classroom from the school in the data set, then the classroom groups from that school are nested within the school. If the study has only one classroom from a school, then that classroom level and school level identify redundant groups. Redundantly identified groups can not distinguish between the class level or school level of the redundant group. Ethnicity levels can be expected to exists in multiple classrooms. That is ethnicity at level 1 may be present in classroom 1 and 2. These are called crossed variables.
In the made-up data frame below, Ethnicity is crossed with the classroom 1 of school 1 and classroom 3 of school 2 groups. All observations of classroom 2, school 1 have the same ethnicity. These are not redundant groups since belonging to classroom 2 of school 1 does not uniquely identify ethnicity level 1 and likewise ethnicity level 1 does not uniquely identify classroom 2 of school 1. Classroom 1 and 2 of school 1 are nested within the 1 level of school. Classroom level 3 is redundant with level 2 of school. There are no other classrooms for school level 2.
student school classroom ethinicity
1 1 1 1
2 1 1 3
3 1 1 1
4 1 2 1
5 1 2 1
6 1 2 1
7 2 3 2
8 2 3 2
9 2 3 1
10 2 3 2
When a data frame contains nested levels, it is said to hierarchical. Level identifiers are sometimes reused in hierarchical data. For example, classroom 3 in school level 2 could be identified as classroom 1, that is, the first classroom at school 2. Reused level identifiers often need to be disambiguated before analysis (and exploration of the grouping variable.)
5.6.1.2 Ordering observations
Sorting observations is an ordering operation. Rank operations are used to access positions within a sort of observations. Rank is used for things like finding the first, last, or top ten.
5.6.2 Examples R
These examples use the Forbes2000.csv
data set.
We begin by loading the tidyverse, importing the csv file, and naming variables.
library(tidyverse)
forbes_path <- file.path("..", "datasets", "Forbes2000.csv") forbes_in <- read_csv(forbes_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
forbes_in <- forbes_in %>% rename(market_value = marketvalue) forbes <- forbes_in %>% select(-X1) glimpse(forbes)
Observations: 2,000 Variables: 8 $ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15... $ name <chr> "Citigroup", "General Electric", "American Intl G... $ country <chr> "United States", "United States", "United States"... $ category <chr> "Banking", "Conglomerates", "Insurance", "Oil & g... $ sales <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3... $ profits <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7.... $ assets <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ... $ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
Find the company with the largest market value in each country.
The
group_by()
andungroup()
functions are used in pairs. That is, when the work that relies on grouping is complete there should be anungroup()
.This example also makes use of the
rank()
function. This function returns an order number for each observation based on the assending value of the variable being ranked.largest_market_value <- forbes %>% group_by(country) %>% mutate( size_rank = rank(desc(market_value)) ) %>% filter(size_rank == 1) %>% ungroup() largest_market_value %>% select(name, country, market_value, sales, profits) %>% head()
# A tibble: 6 x 5 name country market_value sales profits <chr> <chr> <dbl> <dbl> <dbl> 1 General Electric United States 329. 134. 15.6 2 HSBC Group United Kingdom 178. 44.3 6.66 3 Toyota Motor Japan 115. 136. 7.99 4 ING Group Netherlands 54.6 94.7 4.73 5 Royal Dutch/Shell Gr~ Netherlands/ United Kin~ 163. 134. 8.4 6 Total France 117. 132. 8.84
Note: since we want the largest market value, the smallest rank is selected.
In the above display, you can see that the Netherlands is listed twice. This is due to some companies being listed with two countries separated by a slash. How this is corrected would depend on the nature of what is being studied. Two possible ways this could be corrected are to consider only the primary country (first listed) or consider both contries.
For this example we will assume that only the primary country is of importance in finding the largest market value.
largest_market_value <- forbes %>% separate(country, "country_mv", sep = "/", remove = FALSE, extra = "drop") %>% group_by(country_mv) %>% mutate( size_rank = rank(market_value) ) %>% filter(size_rank == max(size_rank)) %>% ungroup() largest_market_value %>% select(name, country, market_value, sales, profits) %>% head()
# A tibble: 6 x 5 name country market_value sales profits <chr> <chr> <dbl> <dbl> <dbl> 1 General Electric United States 329. 134. 15.6 2 HSBC Group United Kingdom 178. 44.3 6.66 3 Toyota Motor Japan 115. 136. 7.99 4 Royal Dutch/Shell Gr~ Netherlands/ United Kin~ 163. 134. 8.4 5 Total France 117. 132. 8.84 6 ENI Italy 76.1 53.3 4.82
Note: the
country
variable is displayed so that a reader would know that the company has a large presence in another country.Find the countries with at least three companies. Display the three companies with the largest market values for each country.
The
row_number()
function is a variant ofrank()
. It will number tied values with separate numbers. For example, if two values are tied for fifth place, one will be five and the other will be six.Since we are interested in the order of market value starting from the largest, the
desc()
function is used to reverse the order.largest_three <- forbes %>% separate(country, "country_mv", sep = "/", remove = FALSE, extra = "drop") %>% group_by(country_mv) %>% mutate( num_companies = n(), size_rank = row_number(desc(market_value)) ) %>% filter(num_companies >= 3) %>% ungroup() %>% arrange(country_mv, size_rank) largest_three %>% filter(size_rank <= 3) %>% arrange(country_mv, size_rank) %>% select(name, country, market_value, size_rank) %>% head()
# A tibble: 6 x 4 name country market_value size_rank <chr> <chr> <dbl> <int> 1 BHP Billiton Australia/ United Kingdom 57.4 1 2 News Corp Australia 55.4 2 3 Telstra Australia 47.2 3 4 Erste Bank Austria 8.12 1 5 Telekom Austria Austria 7.58 2 6 OMV Group Austria 4.45 3
5.6.3 Examples - Python
These examples use the Forbes2000.csv
data set.
We begin by loading the packages, importing the csv file, and naming the variables.
from pathlib import Path import pandas as pd import numpy as np
forbes_path = Path('..') / 'datasets' / 'Forbes2000.csv' forbes_in = pd.read_csv(forbes_path) forbes_in = ( forbes_in .rename(columns={'marketvalue': 'market_value'})) forbes = ( forbes_in .copy(deep=True) .drop(columns='Unnamed: 0') .assign( name=lambda df: df['name'].str[:20], country=lambda df: df['country'].str[:25])) print(forbes.dtypes)
rank int64 name object country object category object sales float64 profits float64 assets float64 market_value float64 dtype: object
There are some long string values in
name
andcountry
names`. These have been truncated to make displaying them easier in these examples.The GroupBy object.
The pandas
groupby()
method is used to group observations that have a common value in one or more variables. The returned object fromgroupby()
is not a data frame. This can be seen in the following code.country_group = forbes.groupby('country') print(type(country_group))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
GroupBy
objects are used for several types of operations. One of these is aggregation. This is seen in the following code, which finds the number of compainies in each country.num_comp_country = country_group.size() print(type(num_comp_country))
<class 'pandas.core.series.Series'>
The returned object is now a standard pandas series. It is not the same shape as the data frame it was constructed from. There is only one row for each value of
country
. This is seen in the following code.print(num_comp_country.head())
country Africa 2 Australia 37 Australia/ United Kingdom 2 Austria 8 Bahamas 1 dtype: int64
Another use of the
GroupBy
object is to filter. This is seen in the following code.countries_head = country_group.head(2) print(type(countries_head))
<class 'pandas.core.frame.DataFrame'>
print(countries_head.shape)
(108, 8)
A data frame object is returned after the
head()
method. This data frame has the same columns as the orignal data frame. There are fewer rows in this filtered data frame.(countries_head .loc[:, ['name', 'country']] .head(6) .pipe(print))
name country 0 Citigroup United States 1 General Electric United States 4 BP United Kingdom 6 HSBC Group United Kingdom 7 Toyota Motor Japan 10 UBS Switzerland
This head of the grouped data frame is useful for inpsecting the results of code that operates on groups.
Another use of the the
GroupBy
object is to construct new variables using thetransform()
method.num_comp_country = country_group['market_value'].transform('size') print(type(num_comp_country))
<class 'pandas.core.series.Series'>
print(num_comp_country.size)
2000
The new variable is of type
series
and the same length as the original data frame.The use of
groupby()
paired withtransform()
is the focus of this section.Find the company with the largest market value in each country.
This example makes use of the
rank()
method. This method returns an order number for each observation based on the value of the variable being ranked. There are several ways a rank variable can be calculated. Themethod
parameter allows you to select which of these approaches to use. See the documentation for an explaination of the various methods.The
rank()
method is used withgroupby()
in this example. Therank()
method produces a series of the same length and observation alignment as the data frame. This is the same as thetransform()
method. Here we are using therank()
method on themarket_value
variable.largest_market_value = ( forbes .assign(size_rank = lambda df: df .groupby('country') ['market_value'] .rank(method='min', ascending=False)) .query('size_rank == 1')) (largest_market_value .loc[:, ['name', 'country', 'market_value']] .head(10) .pipe(print))
name country market_value 1 General Electric United States 328.54 6 HSBC Group United Kingdom 177.96 7 Toyota Motor Japan 115.40 11 ING Group Netherlands 54.59 12 Royal Dutch/Shell Gr Netherlands/ United Kingd 163.45 16 Total France 116.64 36 ENI Italy 76.13 44 Samsung Electronics South Korea 72.72 54 PetroChina China 90.49 63 Novartis Group Switzerland 116.43
Note: since we want the largest market value, the smallest rank is selected.
In the above display, you can see that the Netherlands is listed twice. This is due to some companies being listed with two countries separated by a slash. How this is corrected would depend on the nature of what is being studied. Two possible ways this could be corrected are to consider only the primary country (first listed) or consider both contries.
For this example we will assume that only the primary country is of importance in finding the largest market value.
largest_market_value = ( forbes .assign(country_mv = lambda df: df['country'].str.extract('([^/]+)', expand=True)) .assign(size_rank = lambda df: df .groupby('country_mv') ['market_value'] .rank(method='min', ascending=False)) .query('size_rank == 1')) (largest_market_value .loc[:, ['name', 'country', 'market_value']] .head(10) .pipe(print))
name country market_value 1 General Electric United States 328.54 6 HSBC Group United Kingdom 177.96 7 Toyota Motor Japan 115.40 12 Royal Dutch/Shell Gr Netherlands/ United Kingd 163.45 16 Total France 116.64 36 ENI Italy 76.13 44 Samsung Electronics South Korea 72.72 54 PetroChina China 90.49 63 Novartis Group Switzerland 116.43 75 Tyco International Bermuda 58.41
Note, the
country
variable is displayed so that a reader would know that the company has a large presence in another country.Find the countries with at least three companies. Display the three companies with the largest market values for each country.
The
transform()
andsize()
methods are used to create a variable with number of companies in each country. Then aquery()
is done to save only the countries with three or more companies. This could also have been done using thefilter()
method.The
groupby()
andhead()
method are use to save only the three largest companies in each country.largest_three = ( forbes .assign(country_mv = lambda df: df['country'].str.extract('([^/]+)', expand=True)) .assign(num_companies = lambda df: df .groupby('country_mv') ['market_value'] .transform('size')) .query('num_companies >= 3') .assign(size_rank = lambda df: df .groupby('country_mv') ['market_value'] .rank(method='min', ascending=False)) .sort_values(by=['country_mv', 'size_rank']) .groupby('country_mv') .head(3)) (largest_three .loc[:, ['name', 'country', 'size_rank']] .head(10) .pipe(print))
name country size_rank 121 BHP Billiton Australia/ United Kingdom 1.0 109 News Corp Australia 2.0 163 Telstra Australia 3.0 362 Erste Bank Austria 1.0 1104 Telekom Austria Austria 2.0 719 OMV Group Austria 3.0 98 Dexia Belgium 1.0 417 Interbrew Belgium 2.0 157 Almanij Belgium 3.0 75 Tyco International Bermuda 1.0
5.6.4 Exercises
These exercises use the mtcars.csv
data set.
Import the
mtcars.csv
data set.Find the most efficient car (mpg) for each number of cylinders.
The weight of a car is a major contributor to how efficient it is. Create a variable that measures mpg per unit of weight. Plot this new variable against
hp
and thendisp
, These plots should consider the relationship with the number of cylinders. From these plots, doeshp
ordisp
seem to be more related to the new variable when considering the number of cylinders?Find the least efficient car (using the new variable that considers both mpg and weight) for each number of cylinders and gear combination. Exclude any combination that does not have at least two observations.