2.1 Data frames

SSCC - Social Science Computing Cooperative

Supporting Statistical Analysis for Research

2.1.1 Data concepts

2.1.1.1 Observations

An observation is a set of measures taken together on one subject/unit under study. To determine an observation, one needs to know what the unit or subject of the study is.

For example, consider a study designed to examine the effects of four different pesticides. The four pesticides will be used in fields that have been divided into four plots. The pesticides will be applied to all fields on the same day. The pesticides are randomly assigned to the plots of each field. There are fifteen fields, each of which will have only one of five different crops. The number of insects will be counted in three randomly chosen two-square-foot samples from each plot. These counts of insects will be done on all samples on the same day.

Here, the units of the study are the plots of a field. The observation level of the study is the two-square-foot sample. There are multiple observations for each plot. This is called repeated measures. The data collected for each observation are the field identifier, the crop grown on the field, the pesticide applied to the plot, the sample identifier, and the number of insects counted in the sample.
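
To make the observation level concrete, the layout of this study can be sketched in Python with pandas. This is only a sketch under assumptions that the description above does not state: the pesticide and crop labels are invented, the crops are assumed to be spread evenly (three fields per crop), a plot identifier column is added to tell the plots of a field apart, and the insect counts are left missing because no data has been collected.
```python
import random

import pandas as pd

random.seed(1)  # fixed seed so the random assignment can be reproduced

pesticides = ["P1", "P2", "P3", "P4"]  # hypothetical pesticide labels
crops = ["C1", "C2", "C3", "C4", "C5"]  # hypothetical crop labels

rows = []
for field in range(1, 16):  # fifteen fields
    crop = crops[(field - 1) % len(crops)]  # assumption: three fields per crop
    # shuffle the pesticides so each plot of the field gets a different one
    assigned = random.sample(pesticides, k=len(pesticides))
    for plot, pesticide in enumerate(assigned, start=1):  # four plots per field
        for sample in range(1, 4):  # three samples per plot
            rows.append(
                {
                    "field": field,
                    "crop": crop,
                    "plot": plot,  # assumed identifier, not listed above
                    "pesticide": pesticide,
                    "sample": sample,
                    "insects": None,  # count not yet observed
                }
            )

layout = pd.DataFrame(data=rows)

# one row per sample: 15 fields * 4 plots * 3 samples = 180 observations
print(layout.shape)
```
Each row of this layout is one observation. Every plot contributes three rows, which is the repeated measures structure described above.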

2.1.1.2 Data frames

A data frame is a table with rows and columns. The rows of a data frame are observations. Each column of a data frame contains one of the collected measures of the study. These measures are commonly referred to as variables. The data is related both across rows and over columns. Both of these relationships will be used when cleaning and transforming the data.
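
As a small illustration of the two directions in a data frame, the following Python sketch (using pandas, with made-up values) pulls out one row and one column. The names first_observation and variable_b are chosen only for this illustration.
```python
import pandas as pd

df = pd.DataFrame(data={"A": [1, 2, 3, 4, 5], "B": [5, 3, 2, 1, 4]})

# one row: the value of every variable for a single observation
first_observation = df.iloc[0]

# one column: the value of a single variable for every observation
variable_b = df["B"]

print(first_observation.tolist())
print(variable_b.tolist())
```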

2.1.2 Acquisition - Creating a data frame

Both R and Python provide a function to create a data frame. The first parameter to the function is the data used to populate the data frame. This data can come from already existing objects or it can be entered directly into your script. If the amount of data is small, directly entering the data is a reasonable option. As the amount of data gets larger, other methods for getting the data are preferred.

The examples will demonstrate constructing a small data frame from a small amount of data directly entered.
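
The two sources of data mentioned above, existing objects and directly entered values, can be sketched in Python as follows. The object names a_values, b_values, df_from_objects, and df_direct are made up for this illustration.
```python
import pandas as pd

# data that already exists as objects in the script
a_values = [1, 2, 3, 4, 5]
b_values = [5, 3, 2, 1, 4]
df_from_objects = pd.DataFrame(data={"A": a_values, "B": b_values})

# the same data entered directly in the call
df_direct = pd.DataFrame(
    data={
        "A": [1, 2, 3, 4, 5],
        "B": [5, 3, 2, 1, 4],
    }
)

# both approaches give the same data frame
print(df_from_objects.equals(df_direct))
```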

2.1.3 Explore - attributes of a data object

Objects in R and Python can have attributes. An attribute is information about the object. For example, data frames have attributes for the number of observations (rows) and the number of variables (columns). These attributes are also data objects, although they are not part of the data of the data frame object.

Functions and methods are used to inspect an attribute value. Functions to provide type/class and size information are commonly used in both languages.
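
A minimal Python sketch of inspecting size information is given here, using a small made-up pandas data frame. It also shows that the value of an attribute is itself an ordinary object, here a tuple.
```python
import pandas as pd

df = pd.DataFrame(data={"A": [1, 2, 3, 4, 5], "B": [5, 3, 2, 1, 4]})

# shape is a (rows, columns) tuple and can be unpacked into two names
n_rows, n_cols = df.shape

print(type(df.shape))    # the attribute value is a Python tuple
print(n_rows, n_cols)
print(len(df))           # len() of a data frame is its number of rows
print(list(df.columns))  # the names of the variables
```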

2.1.4 Examples - R

2.1.4.1 Acquisition - Creating a data frame

Data frames can be created by the tibble() function. A tibble is a tidyverse data frame. When a data frame is constructed this way, the data is given in columns as parameters to the tibble() function. This function is special in that the parameter names and positions are not set in advance. The parameter names of tibble() are used as the column names in the data frame. The data for a parameter is the variable (column of the data frame).

A column of a data frame is typically a vector, a one-dimensional structure. Each element of a vector must have the same type, such as numeric, character, etc. All vectors in a data frame must have the same length. The exception is a vector of length one, whose single value is repeated for each observation. The c() function is used here to create a vector that is used as a variable in the data frame.

The following code creates a data frame with 5 observations of 2 variables. The name given to this data set is df. The variables have been named A and B. The observations have not been given names.
```
# tibble() and glimpse() are provided by the tidyverse
library(tidyverse)

df <-
  tibble(
    A = c(1, 2, 3, 4, 5),
    B = c(5, 3, 2, 1, 4)
  )
```
Note the format of the code in the prior chunk. The data frame name and the assignment are on one line of code. The name of the function, here tibble(), is on the next line and indented two spaces. The name for each column starts a new line of code and is indented an additional two spaces. This makes a nice list that one can read down to see the variables of the data frame. The command is completed with the closing parenthesis on its own line, aligned under the function name. R does not require this formatting. It is done to make the code easier to read.

This formatting is meant to accomplish two goals. The first is to make it clear to the reader what code is nested together. The two columns of the data frame are at the same level of nesting and this nesting is within the tibble() function. The other goal is to make the key information easy to find on the left side of the code. The actions (functions and methods) and what is being acted on (the data) are highlighted by this approach. The details of the code are to the right.

2.1.4.2 Explore - Attributes of a data object

The class of an object is returned by the class() function. Note that the results of class() are displayed without using print().
```
class(df)
```
```
[1] "tbl_df"     "tbl"        "data.frame"
```
Note that the df object has three classes, tbl_df, tbl, and data.frame. The tbl_df and tbl classes are used by tidyverse functions. The data.frame class is used by base R functions. This set of classes serves as a bridge between tidyverse objects and the rest of R. Almost all base R and tidyverse functions that take a data frame as a parameter will function with either a data.frame or a tbl_df. We will continue to describe them by their function (data frame) rather than class, except where the class is needed.

The dimensions of a data frame are returned by the dim() function. Note, dim() is only informative for objects that have dimensions, such as data frames and matrices. It returns NULL for a plain vector.
```
dim(df)
```
```
[1] 5 2
```
We can see that the dimensions of this data frame are five rows (observations) of two columns (variables). (Rows are ordered before columns in R.)

2.1.4.3 Explore - Initial look at a data frame

The glimpse() function displays the dimensions of the data frame, the type associated with each column, and the first few values of each column of the data frame.
```
glimpse(df)
```
```
Observations: 5
Variables: 2
$ A <dbl> 1, 2, 3, 4, 5
$ B <dbl> 5, 3, 2, 1, 4
```
From this we can see that both the A and B variables are numeric. The <dbl> label stands for double, the type R uses for numbers with a fractional part.

Since the df data frame is small, the top portion of the data frame is the entire data frame. The glimpse() function is most useful when the data frame is larger.

The beginning of a data frame can be displayed in table format with the head() function. If a tibble has more columns than can be displayed in the width of a page, the columns that do not fit are left out of the displayed table.

2.1.5 Examples - Python

2.1.5.1 Acquisition - Creating a data frame

Data frames can be created by the DataFrame() function. This function is in the pandas package and, with the import used below, the call is written as pd.DataFrame(). When a data frame is constructed this way, the data is given as a dictionary passed to the data parameter of DataFrame(). The keys of the dictionary are used as the column names in the data frame and the values of the dictionary are the variables of the data frame (columns).

The following code creates a data frame with 5 observations of 2 variables. The name given to this data set is df. The variables have been named A and B. The observations have not been given names.
```
import pandas as pd
```
```
df = (
    pd.DataFrame(data={ 
        'A': [1, 2, 3, 4, 5],
        'B': [5, 3, 2, 1, 4]}))
```
Note the format of the above code. The data frame name and the assignment are on one line of code. This line ends with an open parenthesis. The unclosed parenthesis allows the code to be continued over multiple physical lines. The name of the function, here pd.DataFrame(), is on the next line and indented four spaces. The name for each column starts a new line of code and is indented an additional four spaces. The closing parentheses are at the end of the last line. This follows the usual coding style for Python, in which nesting is shown using indents of four spaces. This makes a nice list that one can read down to see the variables of the data frame. Python does not require this formatting.

This formatting is meant to accomplish two goals. The first is to make it clear to the reader what code is nested together. The two columns of the data frame are at the same level of nesting and this nesting is within the pd.DataFrame() function. The other goal is to make the key information easy to find on the left side of the code. The actions (functions and methods) and what is being acted on (the data) are highlighted by this approach. The details of the code are to the right.

The data parameter is given inside of { and }. This is a Python dictionary object. A dictionary is an object that maps names (keys) to objects (values). In this example it maps the names A and B to lists. A list is written with [ and ] and the values of the list are given inside the brackets. Each list is used as a column of the data frame.
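
The role of the dictionary can be seen by building it as a separate object first. The sketch below uses the made-up name columns for the dictionary. It also shows what is expected to happen when the lists differ in length: pandas raises a ValueError, since the lists could not form a rectangular table.
```python
import pandas as pd

# the dictionary maps each column name (key) to its list of values
columns = {
    "A": [1, 2, 3, 4, 5],
    "B": [5, 3, 2, 1, 4],
}
print(list(columns.keys()))

df = pd.DataFrame(data=columns)
print(list(df.columns))  # the dictionary keys became the column names

# lists of different lengths cannot be combined into one data frame
try:
    pd.DataFrame(data={"A": [1, 2, 3], "B": [1, 2]})
    length_error = None
except ValueError as err:
    length_error = err
print(type(length_error).__name__)
```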

2.1.5.2 Explore - Attributes of a data object

The class of an object is displayed by the Python type() function. This is not a pandas function and as such does not use the pd. prefix.
```
print(type(df))
```
```
<class 'pandas.core.frame.DataFrame'>
```
The dimensions of a data frame are given by the shape attribute of the data frame.
```
print(df.shape)
```
```
(5, 2)
```
We can see that the dimensions of this data frame are five rows (observations) of two columns (variables).

The dtypes attribute of a data frame gives the type associated with each column of the data frame.
```
print(df.dtypes)
```
```
A    int64
B    int64
dtype: object
```
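The column types are inferred from the values that were entered. Both columns above are int64 because every value was written as a whole number, whereas the R example reported <dbl> for the same values. A small sketch, using a made-up data frame named df_mixed, shows how writing values with a decimal point changes the inferred type.
```python
import pandas as pd

df_mixed = pd.DataFrame(
    data={
        "whole": [1, 2, 3],          # whole numbers are stored as int64
        "decimal": [1.0, 2.5, 3.0],  # values with a decimal point are stored as float64
    }
)

print(df_mixed.dtypes)
```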

2.1.5.3 Explore - Initial look at a data frame

The beginning of a data frame can be displayed in table format with the head() data frame method.
```
print(df.head())
```
```
   A  B
0  1  5
1  2  3
2  3  2
3  4  1
4  5  4
```
Since the df data frame is small, the top portion of the data frame is the entire data frame. The head() method is most useful when the data frame is larger.
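
To see head() on a larger data frame, the following sketch builds a made-up data frame named big with 100 rows. By default head() returns the first five rows, and a number can be given to ask for a different number of rows.
```python
import pandas as pd

# a made-up data frame with 100 observations of 2 variables
big = pd.DataFrame(
    data={
        "id": list(range(100)),
        "square": [i * i for i in range(100)],
    }
)

print(big.shape)
print(big.head().shape)  # the default is the first five rows
print(big.head(3))       # only the first three rows
```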

2.1.6 Exercises

Create a data frame with three observations of two variables. Name the variables x1 and x2. Make up numbers for the values of the observed variables.
Using any of the functions/methods from this section, display the number of observations and variables of the data frame.