5.5 Date and time variables
5.5.1 Data concepts - measures of time
There are several time standards. UTC (Coordinated Universal Time) is a world wide standard for measuring time. It is also called Greenwich Mean Time (GMT.) POSIX measures time in the number of second from 00:00:00 January 1st, 1970 UTC.
The POSIX scale works like any other scale. Intervals have a start and end time. Distance between points of time can be measured. Time periods can be added and subtracted.
5.5.2 Examples - R
The lubridate
package provides most of the date
functions and methods for the tidyverse.
These functions and methods integrate well with
the rest of the tidyverse with one exception:
the interval
and duration
classes are not
supported in a data frame (tibble.)
This does not cause much of an issue with most
wrangling tasks.
Date time in non-default format.
This first code block provides a non-default formated date variable to
read_csv()
. The default format for date is year before month and date.library(tidyverse) library(lubridate)
date_format_in <- read_csv( "id, date 1, 01-05-2017 05:01:07 2, 01-06-2017 14:15:35" ) head(date_format_in)
# A tibble: 2 x 2 id date <dbl> <chr> 1 1 01-05-2017 05:01:07 2 2 01-06-2017 14:15:35
The format of the date column is character. The
read_csv()
function did not automatically convert this format to a POSIX class.The
mdy_hms()
will convert a character variable formated as month, day, and year to POSIX.date_format <- date_format_in %>% mutate( date = mdy_hms(date) ) head(date_format)
# A tibble: 2 x 2 id date <dbl> <dttm> 1 1 2017-01-05 05:01:07 2 2 2017-01-06 14:15:35
There is a set of date and time functions like
mdy_hms()
that allow for different orders of the date and time components. See the documentation for theymd_hms()
function for a list.This example could have also been done with
parse_datetime()
function. The format of the data and time coding would need to specified using theformat
parameter. Theformat
is given as a string. Here we use%m-%d-%Y
to specify the date information is given as month, date, and year (four digit year.) The-
specifies that the month, day, and year are separated by-
. The time format is similarly given with:
separating the hours, minutes, and seconds. See the documentation forparse_datetime()
for further details.date_format <- date_format_in %>% mutate( date = parse_datetime(date, format = "%m-%d-%Y %H:%M:%S") ) head(date_format)
# A tibble: 2 x 2 id date <dbl> <dttm> 1 1 2017-01-05 05:01:07 2 2 2017-01-06 14:15:35
When there is no time component,
parse_date()
can be used. There are alsoas_datetime()
,as_date()
, andas.date()
(base R) functions that can be used. In most cases these will not be needed.
The remaining examples use the airAccs.csv
data set.
We begin by importing the csv file and setting variable names.
airAccs_path <- file.path("..", "datasets", "airAccs.csv") air_accidents_in <- read_csv(airAccs_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
air_accidents_in <- air_accidents_in %>% rename( date = Date, plane_type = planeType, dead = Dead, aboard = Aboard, ground = Ground ) air_accidents <- air_accidents_in %>% select(-X1) glimpse(air_accidents)
Observations: 5,666 Variables: 7 $ date <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19... $ location <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",... $ operator <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr... $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"... $ dead <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23... $ aboard <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23... $ ground <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Create a weekend (Saturday or Sunday) indicator variable.
The
date
column was imported as a typedate
. We do not need to convert its type.The
wday()
function is used to determine the day of the week for a POSIX variable. Thelabel
parameter specifies if the day is returned as a character variable.air_accidents <- air_accidents %>% mutate( day = wday(date, label=TRUE), weekend = day == "Sat" | day == "Sun" ) air_accidents %>% select(date, location, day, weekend) %>% tail()
# A tibble: 6 x 4 date location day weekend <date> <chr> <ord> <lgl> 1 2014-02-16 Sandhikhark, Nepal Sun TRUE 2 2014-02-21 Grombalia, Tunisia Fri FALSE 3 2014-02-26 Near Lanai Airport, Hawaii Wed FALSE 4 2014-03-07 South Indian Ocean Fri FALSE 5 2014-03-18 Seattle, Washington Tue FALSE 6 2014-03-22 Caboolture, Queensland, Australia Sat TRUE
How long (in days) was the largest period without an air accident.
The
lag()
function creates a variable that shifts the rows by one position. What was in row one is moved to row two, two, becomes three, and so on. The first row value becomesNA
, since there is no row zero and the last row value is not used. The laggeddate
variable aligns the prior accident time with the current stop time.The following code show a lagged
date
variable.air_accidents %>% arrange(date) %>% mutate( prior_date = lag(date) ) %>% select(date, prior_date) %>% print(n=3)
# A tibble: 5,666 x 2 date prior_date <date> <date> 1 1908-09-17 NA 2 1912-07-12 1908-09-17 3 1913-08-06 1912-07-12 # ... with 5,663 more rows
Note, the use of
arrange()
is needed for the lag to provide the prior dateand time.The
difftime()
function is used calculate the difference between two time values. Theunits
parameter is used to specify if the difference is to be measured in minutes, hours, days, etc.air_accidents <- air_accidents %>% arrange(date) %>% mutate( prior_date = lag(date), time_gap = difftime(date, prior_date, units = "days") ) air_accidents %>% select(date, prior_date, time_gap) %>% top_n(10, time_gap) %>% arrange(desc(time_gap)) %>% print(n=10)
# A tibble: 10 x 3 date prior_date time_gap <date> <date> <drtn> 1 1912-07-12 1908-09-17 1394 days 2 1915-03-05 1913-10-17 504 days 3 1913-08-06 1912-07-12 390 days 4 1916-07-28 1915-09-03 329 days 5 1915-09-03 1915-03-05 182 days 6 1918-04-07 1917-10-20 169 days 7 1919-05-25 1918-12-16 160 days 8 1920-03-10 1919-10-30 132 days 9 1918-12-16 1918-08-11 127 days 10 1923-08-27 1923-05-14 105 days
5.5.3 Examples - Python
Date time in non-default format.
This first code blocks creates and imports a data set with a date and time variable.
from pathlib import Path import pandas as pd import numpy as np
date_format_text = ( 'id,date\n1,01-05-2017 05:01:07\n2,01-06-2017 14:15:35') date_format_in = pd.read_csv(pd.compat.StringIO(date_format_text)) print(date_format_in.dtypes)
id int64 date object dtype: object
print(date_format_in)
id date 0 1 01-05-2017 05:01:07 1 2 01-06-2017 14:15:35
The format of the date column is character. The
read_csv()
function did not automatically convert this format to a date and time variable.The pandas
to_datetime()
function will convert a character variable to a date and time variable. Theinfer_datetime_format
parameter whenTrue
will try to determine the format of the date and time data in the variable. This is what is done in the following code.date_format = date_format_in date_format = ( date_format .assign(date = lambda df: pd.to_datetime(df['date'],infer_datetime_format=True))) print(date_format.dtypes)
id int64 date datetime64[ns] dtype: object
print(date_format)
id date 0 1 2017-01-05 05:01:07 1 2 2017-01-06 14:15:35
This example could have also been done with using the
format
parameter to specify how to determine the year, month, day, hours, etc. Theformat
is given as a string. Here we use%m-%d-%Y
to specify the date information is given as month, date, and year (four digit year.) The-
specifies that the month, day, and year are separated by-
. The time format is similarly given with:
separating the hours, minutes, and seconds. See the documentation forpd.to_datetime()
for further details.date_format = date_format_in date_format = ( date_format .assign(date = lambda df: pd.to_datetime(df['date'], format='%m-%d-%Y %H:%M:%S'))) print(date_format['date'].dtype)
datetime64[ns]
print(date_format)
id date 0 1 2017-01-05 05:01:07 1 2 2017-01-06 14:15:35
The remaining examples use the airAccs.csv
data set.
We begin by importing the csv file and setting variable names.
airAccs_path = Path('..') / 'datasets' / 'airAccs.csv' air_accidents_in = pd.read_csv(airAccs_path) air_accidents_in = ( air_accidents_in.rename( columns={ 'Date': 'date', 'planeType': 'plane_type', 'Dead': 'dead', 'Aboard': 'aboard', 'Ground': 'ground'})) air_accidents = air_accidents_in.copy(deep=True) print(air_accidents.head())
Unnamed: 0 date ... aboard ground 0 1 1908-09-17 ... 2.0 0.0 1 2 1912-07-12 ... 5.0 0.0 2 3 1913-08-06 ... 1.0 0.0 3 4 1913-09-09 ... 20.0 0.0 4 5 1913-10-17 ... 30.0 0.0 [5 rows x 8 columns]
print(air_accidents['date'].dtype)
object
Create a weekend (Saturday or Sunday) indicator variable.
We start by converting the
date
column to typedatetime
.air_accidents = ( air_accidents .assign(date = lambda df: pd.to_datetime(df['date'],infer_datetime_format=True))) print(air_accidents['date'].dtype)
datetime64[ns]
print(air_accidents.head())
Unnamed: 0 date ... aboard ground 0 1 1908-09-17 ... 2.0 0.0 1 2 1912-07-12 ... 5.0 0.0 2 3 1913-08-06 ... 1.0 0.0 3 4 1913-09-09 ... 20.0 0.0 4 5 1913-10-17 ... 30.0 0.0 [5 rows x 8 columns]
The
dt.weekday_name
property is used to determine the day of the week for a date and time variable.air_accidents = ( air_accidents .assign( day = lambda df: df['date'].dt.weekday_name, weekend = lambda df: df['day'].isin(['Saturday', 'Sunday']))) (air_accidents .loc[:, ['date', 'location', 'day', 'weekend']] .tail() .pipe(print))
date location day weekend 5661 2014-02-21 Grombalia, Tunisia Friday False 5662 2014-02-26 Near Lanai Airport, Hawaii Wednesday False 5663 2014-03-07 South Indian Ocean Friday False 5664 2014-03-18 Seattle, Washington Tuesday False 5665 2014-03-22 Caboolture, Queensland, Australia Saturday True
How long (in days) was the largest period without an air accident.
The
shift()
method creates a variable that shifts the rows by specified number of positions. Here we will use1
to lag by one row. What was in row one is moved to row two, two, becomes three, and so on. The first row value becomesNA
, since there is no row zero and the last row value is not used. The laggeddate
variable aligns the prior accident time with the current stop time.The following code show a lagged
date
variable.(air_accidents .sort_values(by=['date']) .assign( prior_date = lambda df: df['date'].shift(1)) .loc[:, ['date', 'prior_date', 'time_gap']] .head(3) .pipe(print))
date prior_date time_gap 0 1908-09-17 NaT NaN 1 1912-07-12 1908-09-17 NaN 2 1913-08-06 1912-07-12 NaN
The subtraction operator,
-
, is used calculate the difference between two time values.air_accidents = ( air_accidents .sort_values(by=['date']) .assign( prior_date = lambda df: df['date'].shift(1)) .assign( time_gap = lambda df: df['date'] - df['prior_date'])) print(air_accidents['time_gap'].dtype)
timedelta64[ns]
(air_accidents .loc[:, ['date', 'prior_date', 'time_gap']] .nlargest(10, 'time_gap') .pipe(print))
date prior_date time_gap 1 1912-07-12 1908-09-17 1394 days 5 1915-03-05 1913-10-17 504 days 2 1913-08-06 1912-07-12 390 days 7 1916-07-28 1915-09-03 329 days 6 1915-09-03 1915-03-05 182 days 19 1918-04-07 1917-10-20 169 days 23 1919-05-25 1918-12-16 160 days 29 1920-03-10 1919-10-30 132 days 22 1918-12-16 1918-08-11 127 days 78 1923-08-27 1923-05-14 105 days
5.5.4 Exercises
These exercises use the MplsStops.csv
data set
Import the
MplsStops.csv
file.Create a day of the week variable.
Create a variable that measures the amount of time that has passed between the prior stop and the current stop.
On September 8th, 2017 Minneapolis swore in new police chief (story.) Create an indicator variable that identifies observations that occurred on September 9th or later in the data frame.