5.6 Related observations

SSCC - Social Science Computing Cooperative

Supporting Statistical Analysis for Research

5.6.1 Data Concepts

5.6.1.1 Relationships between observations

Observation level transformations use only information that is available in the row to create the new value. That is, no knowledge of other observations is needed. Most of the prior transformations discussed in this chapter have been observation level transformations.

It is common for data sets to have observations that are related to each other. For example, subjects with the same gender or ethnicity may have some commonality. These relationships (when known) are recorded as categorical variables. In sections 3.3 and 3.4, we used categorical variables to visually explore similarities within and differences between groups of subjects identified by categorical variables. In this section we will be looking at relationships of observations within groups.

When there is more than one categorical variable, also called a grouping variable, relationships between groups can exist. For example, in a study of primary school children, the data set might contain classroom, school, and ethnicity. A student belongs to only one classroom, school, and ethnicity. All the students of one classroom also attend the same school. If there is more than one classroom from the school in the data set, then the classroom groups from that school are nested within the school. If the study has only one classroom from a school, then that classroom level and school level identify redundant groups. Redundantly identified groups can not distinguish between the class level or school level of the redundant group. Ethnicity levels can be expected to exists in multiple classrooms. That is ethnicity at level 1 may be present in classroom 1 and 2. These are called crossed variables.

In the made-up data frame below, Ethnicity is crossed with the classroom 1 of school 1 and classroom 3 of school 2 groups. All observations of classroom 2, school 1 have the same ethnicity. These are not redundant groups since belonging to classroom 2 of school 1 does not uniquely identify ethnicity level 1 and likewise ethnicity level 1 does not uniquely identify classroom 2 of school 1. Classroom 1 and 2 of school 1 are nested within the 1 level of school. Classroom level 3 is redundant with level 2 of school. There are no other classrooms for school level 2.

    student  school classroom ethinicity
          1       1         1          1    
          2       1         1          3    
          3       1         1          1    
          4       1         2          1    
          5       1         2          1    
          6       1         2          1    
          7       2         3          2    
          8       2         3          2    
          9       2         3          1    
         10       2         3          2

When a data frame contains nested levels, it is said to hierarchical. Level identifiers are sometimes reused in hierarchical data. For example, classroom 3 in school level 2 could be identified as classroom 1, that is, the first classroom at school 2. Reused level identifiers often need to be disambiguated before analysis (and exploration of the grouping variable.)

5.6.1.2 Ordering observations

Sorting observations is an ordering operation. Rank operations are used to access positions within a sort of observations. Rank is used for things like finding the first, last, or top ten.

5.6.2 Examples R

These examples use the Forbes2000.csv data set.

We begin by loading the tidyverse, importing the csv file, and naming variables.

library(tidyverse)

forbes_path <- file.path("..", "datasets", "Forbes2000.csv")
forbes_in <- read_csv(forbes_path, col_types = cols())

Warning: Missing column names filled in: 'X1' [1]

forbes_in <-
  forbes_in %>%
  rename(market_value = marketvalue)

forbes <- 
  forbes_in %>%
  select(-X1)

glimpse(forbes)

Observations: 2,000
Variables: 8
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <chr> "Banking", "Conglomerates", "Insurance", "Oil & g...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...

Find the company with the largest market value in each country.

The group_by() and ungroup() functions are used in pairs. That is, when the work that relies on grouping is complete there should be an ungroup().

This example also makes use of the rank() function. This function returns an order number for each observation based on the assending value of the variable being ranked.

largest_market_value <-
  forbes %>%
  group_by(country) %>%
  mutate(
    size_rank = rank(desc(market_value))
  ) %>%
  filter(size_rank == 1) %>%
  ungroup()

largest_market_value %>%
  select(name, country, market_value, sales, profits) %>%
  head()

# A tibble: 6 x 5
  name                  country                  market_value sales profits
  <chr>                 <chr>                           <dbl> <dbl>   <dbl>
1 General Electric      United States                   329.  134.    15.6 
2 HSBC Group            United Kingdom                  178.   44.3    6.66
3 Toyota Motor          Japan                           115.  136.     7.99
4 ING Group             Netherlands                      54.6  94.7    4.73
5 Royal Dutch/Shell Gr~ Netherlands/ United Kin~        163.  134.     8.4 
6 Total                 France                          117.  132.     8.84

Note: since we want the largest market value, the smallest rank is selected.

In the above display, you can see that the Netherlands is listed twice. This is due to some companies being listed with two countries separated by a slash. How this is corrected would depend on the nature of what is being studied. Two possible ways this could be corrected are to consider only the primary country (first listed) or consider both contries.

For this example we will assume that only the primary country is of importance in finding the largest market value.

largest_market_value <-
  forbes %>%
  separate(country, "country_mv", sep = "/", remove = FALSE, extra = "drop") %>%
  group_by(country_mv) %>%
  mutate(
    size_rank = rank(market_value)
  ) %>%
  filter(size_rank == max(size_rank)) %>%
  ungroup()

largest_market_value %>%
  select(name, country, market_value, sales, profits) %>%
  head()

# A tibble: 6 x 5
  name                  country                  market_value sales profits
  <chr>                 <chr>                           <dbl> <dbl>   <dbl>
1 General Electric      United States                   329.  134.    15.6 
2 HSBC Group            United Kingdom                  178.   44.3    6.66
3 Toyota Motor          Japan                           115.  136.     7.99
4 Royal Dutch/Shell Gr~ Netherlands/ United Kin~        163.  134.     8.4 
5 Total                 France                          117.  132.     8.84
6 ENI                   Italy                            76.1  53.3    4.82

Note: the country variable is displayed so that a reader would know that the company has a large presence in another country.

Find the countries with at least three companies. Display the three companies with the largest market values for each country.

The row_number() function is a variant of rank(). It will number tied values with separate numbers. For example, if two values are tied for fifth place, one will be five and the other will be six.

Since we are interested in the order of market value starting from the largest, the desc() function is used to reverse the order.

largest_three <-
  forbes %>%
  separate(country, "country_mv", sep = "/", remove = FALSE, extra = "drop") %>%
  group_by(country_mv) %>%
  mutate(
    num_companies = n(),
    size_rank = row_number(desc(market_value))
  ) %>%
  filter(num_companies >= 3) %>%
  ungroup() %>%
  arrange(country_mv, size_rank)

largest_three %>%
  filter(size_rank <= 3) %>%
  arrange(country_mv, size_rank) %>%
  select(name, country, market_value, size_rank) %>%
  head()

# A tibble: 6 x 4
  name            country                   market_value size_rank
  <chr>           <chr>                            <dbl>     <int>
1 BHP Billiton    Australia/ United Kingdom        57.4          1
2 News Corp       Australia                        55.4          2
3 Telstra         Australia                        47.2          3
4 Erste Bank      Austria                           8.12         1
5 Telekom Austria Austria                           7.58         2
6 OMV Group       Austria                           4.45         3

5.6.3 Examples - Python

These examples use the Forbes2000.csv data set.

We begin by loading the packages, importing the csv file, and naming the variables.

from pathlib import Path
import pandas as pd
import numpy as np

forbes_path = Path('..') / 'datasets' / 'Forbes2000.csv'
forbes_in = pd.read_csv(forbes_path)
forbes_in = (
    forbes_in
        .rename(columns={'marketvalue': 'market_value'}))

forbes =  (
    forbes_in
        .copy(deep=True)
        .drop(columns='Unnamed: 0')
        .assign(
            name=lambda df: df['name'].str[:20],
            country=lambda df: df['country'].str[:25]))

print(forbes.dtypes)

rank              int64
name             object
country          object
category         object
sales           float64
profits         float64
assets          float64
market_value    float64
dtype: object

There are some long string values in name and country names`. These have been truncated to make displaying them easier in these examples.

The GroupBy object.

The pandas groupby() method is used to group observations that have a common value in one or more variables. The returned object from groupby() is not a data frame. This can be seen in the following code.
```
country_group =  forbes.groupby('country')

print(type(country_group))
```
```
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
```
GroupBy objects are used for several types of operations. One of these is aggregation. This is seen in the following code, which finds the number of compainies in each country.
```
num_comp_country = country_group.size()

print(type(num_comp_country))
```
```
<class 'pandas.core.series.Series'>
```
The returned object is now a standard pandas series. It is not the same shape as the data frame it was constructed from. There is only one row for each value of country. This is seen in the following code.
```
print(num_comp_country.head())
```
```
country
Africa                        2
Australia                    37
Australia/ United Kingdom     2
Austria                       8
Bahamas                       1
dtype: int64
```
Another use of the GroupBy object is to filter. This is seen in the following code.
```
countries_head = country_group.head(2)

print(type(countries_head))
```
```
<class 'pandas.core.frame.DataFrame'>
```
```
print(countries_head.shape)
```
```
(108, 8)
```
A data frame object is returned after the head() method. This data frame has the same columns as the orignal data frame. There are fewer rows in this filtered data frame.
```
(countries_head
    .loc[:, ['name', 'country']]
    .head(6)
    .pipe(print))
```
```
                name         country
0          Citigroup   United States
1   General Electric   United States
4                 BP  United Kingdom
6         HSBC Group  United Kingdom
7       Toyota Motor           Japan
10               UBS     Switzerland
```
This head of the grouped data frame is useful for inpsecting the results of code that operates on groups.

Another use of the the GroupBy object is to construct new variables using the transform() method.
```
num_comp_country = country_group['market_value'].transform('size')

print(type(num_comp_country))
```
```
<class 'pandas.core.series.Series'>
```
```
print(num_comp_country.size)
```
```
2000
```
The new variable is of type series and the same length as the original data frame.

The use of groupby() paired with transform() is the focus of this section.

Find the company with the largest market value in each country.

This example makes use of the rank() method. This method returns an order number for each observation based on the value of the variable being ranked. There are several ways a rank variable can be calculated. The method parameter allows you to select which of these approaches to use. See the documentation for an explaination of the various methods.

The rank() method is used with groupby() in this example. The rank() method produces a series of the same length and observation alignment as the data frame. This is the same as the transform() method. Here we are using the rank() method on the market_value variable.

largest_market_value =  (
    forbes
        .assign(size_rank = lambda df: df
            .groupby('country')
            ['market_value']
            .rank(method='min', ascending=False))
        .query('size_rank == 1'))

(largest_market_value
    .loc[:, ['name', 'country', 'market_value']]
    .head(10)
    .pipe(print))

                    name                    country  market_value
1       General Electric              United States        328.54
6             HSBC Group             United Kingdom        177.96
7           Toyota Motor                      Japan        115.40
11             ING Group                Netherlands         54.59
12  Royal Dutch/Shell Gr  Netherlands/ United Kingd        163.45
16                 Total                     France        116.64
36                   ENI                      Italy         76.13
44   Samsung Electronics                South Korea         72.72
54            PetroChina                      China         90.49
63        Novartis Group                Switzerland        116.43

Note: since we want the largest market value, the smallest rank is selected.

For this example we will assume that only the primary country is of importance in finding the largest market value.

largest_market_value =  (
    forbes
        .assign(country_mv = lambda df:
            df['country'].str.extract('([^/]+)', expand=True))
        .assign(size_rank = lambda df: df
            .groupby('country_mv')
            ['market_value']
            .rank(method='min', ascending=False))
        .query('size_rank == 1'))

(largest_market_value
    .loc[:, ['name', 'country', 'market_value']]
    .head(10)
    .pipe(print))

                    name                    country  market_value
1       General Electric              United States        328.54
6             HSBC Group             United Kingdom        177.96
7           Toyota Motor                      Japan        115.40
12  Royal Dutch/Shell Gr  Netherlands/ United Kingd        163.45
16                 Total                     France        116.64
36                   ENI                      Italy         76.13
44   Samsung Electronics                South Korea         72.72
54            PetroChina                      China         90.49
63        Novartis Group                Switzerland        116.43
75    Tyco International                    Bermuda         58.41

Note, the country variable is displayed so that a reader would know that the company has a large presence in another country.

Find the countries with at least three companies. Display the three companies with the largest market values for each country.

The transform() and size() methods are used to create a variable with number of companies in each country. Then a query() is done to save only the countries with three or more companies. This could also have been done using the filter() method.

The groupby() and head() method are use to save only the three largest companies in each country.

largest_three =  (
    forbes
        .assign(country_mv = lambda df:
            df['country'].str.extract('([^/]+)', expand=True))
        .assign(num_companies = lambda df: df
            .groupby('country_mv')
            ['market_value']
            .transform('size'))
        .query('num_companies >= 3')
        .assign(size_rank = lambda df: df
            .groupby('country_mv')
            ['market_value']
            .rank(method='min', ascending=False))
        .sort_values(by=['country_mv', 'size_rank'])
        .groupby('country_mv')
        .head(3))

(largest_three
    .loc[:, ['name', 'country', 'size_rank']]
    .head(10)
    .pipe(print))

                    name                    country  size_rank
121         BHP Billiton  Australia/ United Kingdom        1.0
109            News Corp                  Australia        2.0
163              Telstra                  Australia        3.0
362           Erste Bank                    Austria        1.0
1104     Telekom Austria                    Austria        2.0
719            OMV Group                    Austria        3.0
98                 Dexia                    Belgium        1.0
417            Interbrew                    Belgium        2.0
157              Almanij                    Belgium        3.0
75    Tyco International                    Bermuda        1.0

5.6.4 Exercises

These exercises use the mtcars.csv data set.

Import the mtcars.csv data set.
Find the most efficient car (mpg) for each number of cylinders.
The weight of a car is a major contributor to how efficient it is. Create a variable that measures mpg per unit of weight. Plot this new variable against hp and then disp, These plots should consider the relationship with the number of cylinders. From these plots, does hp or disp seem to be more related to the new variable when considering the number of cylinders?
Find the least efficient car (using the new variable that considers both mpg and weight) for each number of cylinders and gear combination. Exclude any combination that does not have at least two observations.