2.1 Data frames
2.1.1 Data concepts
2.1.1.1 Observations
An observation is a set of measures taken together on one subject/unit under study. To determine an observation, one needs to know what the unit or subject of the study is.
For example, consider a study designed to examine the effects of four different pesticides. The four pesticides will be used in fields that have been divided into four plots. The pesticides will be applied to all fields on the same day. The pesticides are randomly assigned to the plots of each field. There are fifteen fields, each of which will have only one of five different crops. The number of insects will be counted in three randomly chosen two-square-foot samples from each plot. These counts of insects will be done on all samples on the same day.
Here, the units of the study are the plots of a field. The observation level of the study is the two-square-foot sample. There are multiple observations for each plot. This is called repeated measures. The data collected for each observation are the field identifier, the crop grown on the field, the pesticide applied to the plot, the sample identifier, and the number of insects counted in the sample.
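To make the observation level concrete, the following Python sketch lays out the rows this study design would produce, one row per sample. It is only an illustration under stated assumptions: the crop and pesticide labels, the plot column, and the even split of three fields per crop are hypothetical and are not given in the study description. The insect counts are left missing because no counts are reported.

```python
import random

import pandas as pd

random.seed(0)  # fixed seed so the illustrative random assignment is repeatable

# Hypothetical labels; the study description does not name crops or pesticides.
crops = ['crop_1', 'crop_2', 'crop_3', 'crop_4', 'crop_5']
pesticides = ['P1', 'P2', 'P3', 'P4']

rows = []
for field in range(1, 16):                       # fifteen fields
    crop = crops[(field - 1) % 5]                # assumption: three fields per crop
    assigned = random.sample(pesticides, k=4)    # random order = random assignment to plots
    for plot, pesticide in enumerate(assigned, start=1):    # four plots per field
        for sample in range(1, 4):               # three samples per plot
            rows.append({
                'field': field,
                'crop': crop,
                'plot': plot,                    # hypothetical plot identifier
                'pesticide': pesticide,
                'sample': sample,
                'insects': pd.NA})               # counts not yet collected

study = pd.DataFrame(rows)
print(study.shape)    # (180, 6)
```

Each of the 60 plots (15 fields times 4 plots) contributes three rows, which is what makes this a repeated measures layout.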
2.1.1.2 Data frames
A data frame is a table with rows and columns. The rows of a data frame are observations. Each column of a data frame contains one of the collected measures of the study. These measures are commonly referred to as variables. The data is related both across rows and over columns. Both of these relationships will be used when cleaning and transforming the data.
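The two directions of a data frame can be sketched in Python with pandas. This is a minimal illustration that reuses the small A and B values from the examples later in this section; the names first_observation and variable_a are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(data={
    'A': [1, 2, 3, 4, 5],
    'B': [5, 3, 2, 1, 4]})

# A row is one observation: the measures taken together on one unit.
first_observation = df.iloc[0]
print(first_observation.tolist())    # [1, 5]

# A column is one variable: the same measure across all observations.
variable_a = df['A']
print(variable_a.tolist())           # [1, 2, 3, 4, 5]
```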
2.1.2 Acquisition - Creating a data frame
Both R and Python provide a function to create a data frame. The first parameter to the function is the data used to populate the data frame. This data can come from already existing objects or it can be entered directly into your script. If the amount of data is small, directly entering the data is a reasonable option. As the amount of data gets larger, other methods for getting the data are preferred.
The examples will demonstrate constructing a small data frame from a small amount of data directly entered.
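As a rough Python sketch of the two sources of data described above, the following builds the same data frame once from values entered directly and once from objects that already exist in the script. The column names and values are made up for illustration.

```python
import pandas as pd

# Option 1: data entered directly in the call.
direct = pd.DataFrame(data={
    'x': [1, 2, 3],
    'y': [10, 20, 30]})

# Option 2: data that already exists as objects in the script.
x_values = [1, 2, 3]
y_values = [10, 20, 30]
from_objects = pd.DataFrame(data={'x': x_values, 'y': y_values})

# Both routes produce the same data frame.
print(direct.equals(from_objects))    # True
```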
2.1.3 Explore - attributes of a data object
Objects in R and Python can have attributes. An attribute is information about the object. For example, data frames have attributes for the number of observations (rows) and number of variables (columns). These attributes are also data objects, although they are not part of the data of the data frame object.
Functions and methods are used to inspect an attribute value. Functions to provide type/class and size information are commonly used in both languages.
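The point that an attribute is itself a data object can be seen in a short Python sketch. It assumes pandas and the small A and B data frame used in the examples of this section; the names dimensions, n_rows, and n_columns are illustrative.

```python
import pandas as pd

df = pd.DataFrame(data={
    'A': [1, 2, 3, 4, 5],
    'B': [5, 3, 2, 1, 4]})

# shape is an attribute, so it is read without parentheses.
dimensions = df.shape
print(type(dimensions))        # <class 'tuple'>

# The attribute value is an ordinary object that can be unpacked and reused.
n_rows, n_columns = dimensions
print(n_rows, n_columns)       # 5 2

# len() is a function that reports the number of rows.
print(len(df))                 # 5
```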
2.1.4 Examples - R
2.1.4.1 Acquisition - Creating a data frame
Data frames can be created by the tibble() function. A tibble is a tidyverse data frame. When a data frame is constructed this way, the data is given in columns as parameters to the tibble() function. This function is special in that the parameter names and positions are not set in advance. The parameter names of tibble() are used as the column names in the data frame. The data for a parameter is the variable (column of the data frame).
A column of a data frame is typically a vector, a one-dimensional structure. Each element of a vector must have the same type, such as numeric, character, etc. All vectors in a data frame must have the same length. The one exception is a single value, which is repeated for each observation. The c() function is used here to create a vector that is used as a variable in the data frame.
The following code creates a data frame with 5 observations of 2 variables. The name given to this data set is df. The variables have been named A and B. The observations have not been given names.

    library(tidyverse)  # provides tibble() and glimpse()

    df <-
      tibble(
        A = c(1, 2, 3, 4, 5),
        B = c(5, 3, 2, 1, 4)
      )

Note the format of the code in the prior chunk. The data frame name and the assignment are on one line of code. The name of the function, here tibble(), is on the next row and indented two spaces. The name for each column starts a new line of code and is indented an additional two spaces. This makes a nice list that one can read down to see the variables of the data frame. The command is completed with the closing parenthesis on its own line, aligned under the function name. R does not require this formatting; it is done to make the code easier to read. The formatting is meant to accomplish two goals. The first is to make it clear to the reader what code is nested together. The two columns of the data frame are at the same level of nesting, and this nesting is within the tibble() function. The other goal is to make the key information easy to find on the left side of the code. The actions (functions and methods) and what is being acted on (the data) are highlighted by this approach. The details of the code are to the right.
2.1.4.2 Explore - Attributes of a data object
The class of an object is returned by the class() function. Note that the results of class() are displayed without using print().

    class(df)

    [1] "tbl_df"     "tbl"        "data.frame"

Note that the df object has three classes: tbl_df, tbl, and data.frame. The tbl_df and tbl classes are used by tidyverse functions. The data.frame class is used by base R functions. This set of classes serves as a bridge between tidyverse objects and the rest of R. Almost all base R and tidyverse functions that take a data frame as a parameter will function with either a data.frame or a tbl_df. We will continue to describe them by their function (data frame) rather than class, except where the class is needed.
The dimensions of a data frame are returned by the dim() function. Note that dim() is intended for objects with two or more dimensions, such as data frames and matrices; for a plain vector it returns NULL.

    dim(df)

    [1] 5 2

We can see that the dimensions of this data frame are five rows (observations) of two columns (variables). (Rows are ordered before columns in R.)
2.1.4.3 Explore - Initial look at a data frame
The glimpse() function displays the dimensions of the data frame, the type associated with each column, and the first few values of each column of the data frame.

    glimpse(df)

    Observations: 5
    Variables: 2
    $ A <dbl> 1, 2, 3, 4, 5
    $ B <dbl> 5, 3, 2, 1, 4

From this we can see that the type of both the A and B variables is numeric (shown as <dbl>, for double).
Since the df data frame is small, the top portion of the data frame is the entire data frame. The glimpse() function is most useful when the data frame is larger.
The beginning of a data frame can be displayed in table format with the head() function. The head() function will truncate the number of columns of a tibble if there are more than can be displayed in the width of a page.
2.1.5 Examples - Python
2.1.5.1 Acquisition - Creating a data frame
Data frames can be created by the DataFrame() function. This function is in the pandas package, so with pandas imported as pd the call is written as pd.DataFrame(). When a data frame is constructed this way, the data is given in columns through the data parameter of the DataFrame() function. The data parameter here is a dictionary. The keys of the dictionary are used as the column names in the data frame and the values of the dictionary are the variables of the data frame (columns).
The following code creates a data frame with 5 observations of 2 variables. The name given to this data set is df. The variables have been named A and B. The observations have not been given names.

    import pandas as pd

    df = (
        pd.DataFrame(data={
            'A': [1, 2, 3, 4, 5],
            'B': [5, 3, 2, 1, 4]}))

Note the format of the above code. The data frame name and the assignment are on one line of code. This line ends with an open parenthesis. The unclosed parenthesis allows the code to be continued over multiple physical lines. The name of the function, here pd.DataFrame(), is on the next row and indented four spaces. The name for each column starts a new line of code and is indented an additional four spaces. The closing parentheses are at the end of the last line. This follows the common coding style for Python, in which nesting is shown using indents of four spaces. This makes a nice list that one can read down to see the variables of the data frame. Python does not require this formatting. The formatting is meant to accomplish two goals. The first is to make it clear to the reader what code is nested together. The two columns of the data frame are at the same level of nesting, and this nesting is within the pd.DataFrame() function. The other goal is to make the key information easy to find on the left side of the code. The actions (functions and methods) and what is being acted on (the data) are highlighted by this approach. The details of the code are to the right.
The data parameter is given inside of { and }. This is a Python dictionary object. A dictionary is an object that maps names to objects. In this example it maps the names A and B to lists. A list is written with [ and ], and the values of the list are given inside the brackets. Each list is used as a column of the data frame.
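The role of the dictionary can be sketched by building it as a separate object before passing it to pd.DataFrame(). The name columns is chosen here for illustration. The sketch also shows, as a hedged aside, that pandas rejects lists of different lengths.

```python
import pandas as pd

# The dictionary maps each column name to a list of values.
columns = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 3, 2, 1, 4]}

print(type(columns))           # <class 'dict'>
print(list(columns.keys()))    # ['A', 'B']

# The dictionary keys become the column names of the data frame.
df = pd.DataFrame(data=columns)
print(list(df.columns))        # ['A', 'B']

# Lists of different lengths cannot be combined into one data frame.
try:
    pd.DataFrame(data={'A': [1, 2, 3], 'B': [5, 3]})
    unequal_lengths_rejected = False
except ValueError:
    unequal_lengths_rejected = True
print(unequal_lengths_rejected)    # True
```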
2.1.5.2 Explore - Attributes of a data object
The class of an object is displayed by the Python type() function. This is not a pandas function and as such does not use the pd. prefix.

    print(type(df))

    <class 'pandas.core.frame.DataFrame'>

The dimensions of a data frame are returned by the shape attribute of a data frame.

    print(df.shape)

    (5, 2)

We can see that the dimensions of this data frame are five rows (observations) of two columns (variables).
The dtypes attribute of a data frame is the type associated with each column of the data frame.

    print(df.dtypes)

    A    int64
    B    int64
    dtype: object

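The column type is inferred from the values supplied. As a small sketch, whole numbers produce an integer column while numbers written with a decimal point produce a floating point column. The names whole and decimal are illustrative only.

```python
import pandas as pd

# Whole numbers are stored in an integer column.
whole = pd.DataFrame(data={'A': [1, 2, 3]})

# Values written with a decimal point are stored in a floating point column.
decimal = pd.DataFrame(data={'A': [1.0, 2.0, 3.0]})

print(whole.dtypes['A'])      # an integer type
print(decimal.dtypes['A'])    # float64
```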
2.1.5.3 Explore - Initial look at a data frame
The beginning of a data frame can be displayed in table format with the head() data frame method.

    print(df.head())

       A  B
    0  1  5
    1  2  3
    2  3  2
    3  4  1
    4  5  4

Since the df data frame is small, the top portion of the data frame is the entire data frame. The head() method is most useful when the data frame is larger.
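A brief sketch with a larger, made-up data frame shows head() returning only the top rows. The data frame name larger and its columns are hypothetical. By default head() returns five rows, and a different number can be requested as its argument.

```python
import pandas as pd

# A hypothetical data frame with 100 observations of 2 variables.
larger = pd.DataFrame(data={
    'id': list(range(1, 101)),
    'squared': [i ** 2 for i in range(1, 101)]})

print(larger.shape)           # (100, 2)
print(larger.head().shape)    # (5, 2)

# Ask for a specific number of rows.
top_three = larger.head(3)
print(top_three)
```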
2.1.6 Exercises
Create a data frame with three observations of two variables. Name the variables x1 and x2. Make up numbers for the values of the observed variables.
Using any of the functions/methods from this section, display the number of observations and variables of the data frame.