SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

5.5 Date and time variables

5.5.1 Data concepts - measures of time

There are several time standards. UTC (Coordinated Universal Time) is a world wide standard for measuring time. It is also called Greenwich Mean Time (GMT.) POSIX measures time in the number of second from 00:00:00 January 1st, 1970 UTC.

The POSIX scale works like any other scale. Intervals have a start and end time. Distance between points of time can be measured. Time periods can be added and subtracted.

5.5.2 Examples - R

The lubridate package provides most of the date functions and methods for the tidyverse. These functions and methods integrate well with the rest of the tidyverse with one exception: the interval and duration classes are not supported in a data frame (tibble.) This does not cause much of an issue with most wrangling tasks.

  1. Date time in non-default format.

    This first code block provides a non-default formated date variable to read_csv(). The default format for date is year before month and date.

    library(tidyverse)
    library(lubridate)
    date_format_in <-
      read_csv(
        "id, date
        1, 01-05-2017 05:01:07
        2, 01-06-2017 14:15:35"
        )
    
    head(date_format_in)
    # A tibble: 2 x 2
         id date               
      <dbl> <chr>              
    1     1 01-05-2017 05:01:07
    2     2 01-06-2017 14:15:35

    The format of the date column is character. The read_csv() function did not automatically convert this format to a POSIX class.

    The mdy_hms() will convert a character variable formated as month, day, and year to POSIX.

    date_format <-
      date_format_in %>%
      mutate(
        date = mdy_hms(date)
        )
    
    head(date_format)
    # A tibble: 2 x 2
         id date               
      <dbl> <dttm>             
    1     1 2017-01-05 05:01:07
    2     2 2017-01-06 14:15:35

    There is a set of date and time functions like mdy_hms() that allow for different orders of the date and time components. See the documentation for the ymd_hms() function for a list.

    This example could have also been done with parse_datetime() function. The format of the data and time coding would need to specified using the format parameter. The format is given as a string. Here we use %m-%d-%Y to specify the date information is given as month, date, and year (four digit year.) The - specifies that the month, day, and year are separated by -. The time format is similarly given with : separating the hours, minutes, and seconds. See the documentation for parse_datetime() for further details.

    date_format <-
      date_format_in %>%
      mutate(
        date = parse_datetime(date, format = "%m-%d-%Y %H:%M:%S")
        )
    
    head(date_format)
    # A tibble: 2 x 2
         id date               
      <dbl> <dttm>             
    1     1 2017-01-05 05:01:07
    2     2 2017-01-06 14:15:35

    When there is no time component, parse_date() can be used. There are also as_datetime(), as_date(), and as.date() (base R) functions that can be used. In most cases these will not be needed.

The remaining examples use the airAccs.csv data set.

  1. We begin by importing the csv file and setting variable names.

    airAccs_path <- file.path("..", "datasets", "airAccs.csv")
    air_accidents_in <- read_csv(airAccs_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    air_accidents_in <-
      air_accidents_in %>%
      rename(
        date = Date,
        plane_type = planeType,
        dead = Dead,
        aboard = Aboard,
        ground = Ground
        )
    
    air_accidents <-
      air_accidents_in %>%
      select(-X1)
    
    glimpse(air_accidents)
    Observations: 5,666
    Variables: 7
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
    $ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
  2. Create a weekend (Saturday or Sunday) indicator variable.

    The date column was imported as a type date. We do not need to convert its type.

    The wday() function is used to determine the day of the week for a POSIX variable. The label parameter specifies if the day is returned as a character variable.

    air_accidents <- 
      air_accidents %>%
      mutate(
        day = wday(date, label=TRUE),
        weekend = day == "Sat" | day == "Sun"
      )
    
    air_accidents %>%
      select(date, location, day, weekend) %>%
      tail()
    # A tibble: 6 x 4
      date       location                          day   weekend
      <date>     <chr>                             <ord> <lgl>  
    1 2014-02-16 Sandhikhark, Nepal                Sun   TRUE   
    2 2014-02-21 Grombalia, Tunisia                Fri   FALSE  
    3 2014-02-26 Near Lanai Airport, Hawaii        Wed   FALSE  
    4 2014-03-07 South Indian Ocean                Fri   FALSE  
    5 2014-03-18 Seattle, Washington               Tue   FALSE  
    6 2014-03-22 Caboolture, Queensland, Australia Sat   TRUE   
  3. How long (in days) was the largest period without an air accident.

    The lag() function creates a variable that shifts the rows by one position. What was in row one is moved to row two, two, becomes three, and so on. The first row value becomes NA, since there is no row zero and the last row value is not used. The lagged date variable aligns the prior accident time with the current stop time.

    The following code show a lagged date variable.

    air_accidents %>%
      arrange(date) %>%
      mutate(
        prior_date = lag(date)
      ) %>%
      select(date, prior_date) %>%
      print(n=3)
    # A tibble: 5,666 x 2
      date       prior_date
      <date>     <date>    
    1 1908-09-17 NA        
    2 1912-07-12 1908-09-17
    3 1913-08-06 1912-07-12
    # ... with 5,663 more rows

    Note, the use of arrange() is needed for the lag to provide the prior dateand time.

    The difftime() function is used calculate the difference between two time values. The units parameter is used to specify if the difference is to be measured in minutes, hours, days, etc.

    air_accidents <- 
      air_accidents %>%
      arrange(date) %>%
      mutate(
        prior_date = lag(date),
        time_gap = difftime(date, prior_date, units = "days")
      )
    
    air_accidents %>%
      select(date, prior_date, time_gap) %>%
      top_n(10, time_gap) %>%
      arrange(desc(time_gap)) %>%
      print(n=10)
    # A tibble: 10 x 3
       date       prior_date time_gap 
       <date>     <date>     <drtn>   
     1 1912-07-12 1908-09-17 1394 days
     2 1915-03-05 1913-10-17  504 days
     3 1913-08-06 1912-07-12  390 days
     4 1916-07-28 1915-09-03  329 days
     5 1915-09-03 1915-03-05  182 days
     6 1918-04-07 1917-10-20  169 days
     7 1919-05-25 1918-12-16  160 days
     8 1920-03-10 1919-10-30  132 days
     9 1918-12-16 1918-08-11  127 days
    10 1923-08-27 1923-05-14  105 days

5.5.3 Examples - Python

  1. Date time in non-default format.

    This first code blocks creates and imports a data set with a date and time variable.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    date_format_text = (
      'id,date\n1,01-05-2017 05:01:07\n2,01-06-2017 14:15:35') 
    
    date_format_in = pd.read_csv(pd.compat.StringIO(date_format_text))
    
    print(date_format_in.dtypes)
    id       int64
    date    object
    dtype: object
    print(date_format_in)
       id                 date
    0   1  01-05-2017 05:01:07
    1   2  01-06-2017 14:15:35

    The format of the date column is character. The read_csv() function did not automatically convert this format to a date and time variable.

    The pandas to_datetime() function will convert a character variable to a date and time variable. The infer_datetime_format parameter when True will try to determine the format of the date and time data in the variable. This is what is done in the following code.

    date_format = date_format_in
    
    date_format = (
        date_format
            .assign(date = lambda df:
                pd.to_datetime(df['date'],infer_datetime_format=True)))
    
    print(date_format.dtypes)
    id               int64
    date    datetime64[ns]
    dtype: object
    print(date_format)
       id                date
    0   1 2017-01-05 05:01:07
    1   2 2017-01-06 14:15:35

    This example could have also been done with using the format parameter to specify how to determine the year, month, day, hours, etc. The format is given as a string. Here we use %m-%d-%Y to specify the date information is given as month, date, and year (four digit year.) The - specifies that the month, day, and year are separated by -. The time format is similarly given with : separating the hours, minutes, and seconds. See the documentation for pd.to_datetime() for further details.

    date_format = date_format_in
    
    date_format = (
        date_format
            .assign(date = lambda df:
                pd.to_datetime(df['date'], format='%m-%d-%Y %H:%M:%S')))
    
    print(date_format['date'].dtype)
    datetime64[ns]
    print(date_format)
       id                date
    0   1 2017-01-05 05:01:07
    1   2 2017-01-06 14:15:35

The remaining examples use the airAccs.csv data set.

  1. We begin by importing the csv file and setting variable names.

    airAccs_path = Path('..') / 'datasets' / 'airAccs.csv'
    air_accidents_in = pd.read_csv(airAccs_path)
    air_accidents_in = (
        air_accidents_in.rename(
            columns={
                'Date': 'date',
                'planeType': 'plane_type',
                'Dead': 'dead',
                'Aboard': 'aboard',
                'Ground': 'ground'}))
    air_accidents =  air_accidents_in.copy(deep=True)
    
    print(air_accidents.head())
       Unnamed: 0        date  ... aboard ground
    0           1  1908-09-17  ...    2.0    0.0
    1           2  1912-07-12  ...    5.0    0.0
    2           3  1913-08-06  ...    1.0    0.0
    3           4  1913-09-09  ...   20.0    0.0
    4           5  1913-10-17  ...   30.0    0.0
    
    [5 rows x 8 columns]
    print(air_accidents['date'].dtype)
    object
  2. Create a weekend (Saturday or Sunday) indicator variable.

    We start by converting the date column to type datetime.

    air_accidents = (
        air_accidents
            .assign(date = lambda df:
                pd.to_datetime(df['date'],infer_datetime_format=True)))
    
    print(air_accidents['date'].dtype)
    datetime64[ns]
    print(air_accidents.head())
       Unnamed: 0       date  ... aboard ground
    0           1 1908-09-17  ...    2.0    0.0
    1           2 1912-07-12  ...    5.0    0.0
    2           3 1913-08-06  ...    1.0    0.0
    3           4 1913-09-09  ...   20.0    0.0
    4           5 1913-10-17  ...   30.0    0.0
    
    [5 rows x 8 columns]

    The dt.weekday_name property is used to determine the day of the week for a date and time variable.

    air_accidents = (
        air_accidents
            .assign(
                day = lambda df:
                    df['date'].dt.weekday_name,
                weekend = lambda df:
                    df['day'].isin(['Saturday', 'Sunday'])))
    
    (air_accidents
        .loc[:, ['date', 'location', 'day', 'weekend']]
        .tail()
        .pipe(print))
               date                           location        day  weekend
    5661 2014-02-21                 Grombalia, Tunisia     Friday    False
    5662 2014-02-26         Near Lanai Airport, Hawaii  Wednesday    False
    5663 2014-03-07                 South Indian Ocean     Friday    False
    5664 2014-03-18                Seattle, Washington    Tuesday    False
    5665 2014-03-22  Caboolture, Queensland, Australia   Saturday     True
  3. How long (in days) was the largest period without an air accident.

    The shift() method creates a variable that shifts the rows by specified number of positions. Here we will use 1 to lag by one row. What was in row one is moved to row two, two, becomes three, and so on. The first row value becomes NA, since there is no row zero and the last row value is not used. The lagged date variable aligns the prior accident time with the current stop time.

    The following code show a lagged date variable.

    (air_accidents
        .sort_values(by=['date'])
        .assign(
            prior_date = lambda df:
                df['date'].shift(1))
        .loc[:, ['date', 'prior_date', 'time_gap']]
        .head(3)
        .pipe(print))
            date prior_date  time_gap
    0 1908-09-17        NaT       NaN
    1 1912-07-12 1908-09-17       NaN
    2 1913-08-06 1912-07-12       NaN

    The subtraction operator, -, is used calculate the difference between two time values.

    air_accidents = (
        air_accidents
            .sort_values(by=['date'])
            .assign(
                prior_date = lambda df:
                    df['date'].shift(1))
            .assign(
                time_gap = lambda df:
                    df['date'] - df['prior_date']))
    
    print(air_accidents['time_gap'].dtype)
    timedelta64[ns]
    (air_accidents
        .loc[:, ['date', 'prior_date', 'time_gap']]
        .nlargest(10, 'time_gap')
        .pipe(print))
             date prior_date  time_gap
    1  1912-07-12 1908-09-17 1394 days
    5  1915-03-05 1913-10-17  504 days
    2  1913-08-06 1912-07-12  390 days
    7  1916-07-28 1915-09-03  329 days
    6  1915-09-03 1915-03-05  182 days
    19 1918-04-07 1917-10-20  169 days
    23 1919-05-25 1918-12-16  160 days
    29 1920-03-10 1919-10-30  132 days
    22 1918-12-16 1918-08-11  127 days
    78 1923-08-27 1923-05-14  105 days

5.5.4 Exercises

These exercises use the MplsStops.csv data set

  1. Import the MplsStops.csv file.

  2. Create a day of the week variable.

  3. Create a variable that measures the amount of time that has passed between the prior stop and the current stop.

  4. On September 8th, 2017 Minneapolis swore in new police chief (story.) Create an indicator variable that identifies observations that occurred on September 9th or later in the data frame.