8 Dates
There are several kinds of computation we typically want to do with dates:
- convert character vectors with date values into dates
- extract categories of time (year, month, day of the week)
- calculate elapsed time (differences between dates)
- increment or decrement dates (a month later, a week earlier)
8.1 Representing Dates
Dates (and times) can be awkward to work with. To begin with, we usually reference points on the calendar (a specific date) with a set of category labels - “year”-“month”-“day”. To compute with these, it is useful to translate them to a number line - each date is a point on one continuous time line. By thinking of calendar dates as points on a line, say \(a\) and \(b\), it becomes clear how they are ordered and how to measure the distance between two points: \(\lvert b-a \rvert\).
However, a second difficulty with dates is that our time units - the category labels “year”, “month”, and “day” - all vary in length. That is, some years have 365 days while others have 366. The length of a month varies from 28 to 31 days. And some days have 23 hours while others have 24 or 25 hours (switching from standard time to daylight saving time and back). If two dates are 30 days apart, has more than a “month” passed, exactly a “month”, or not quite a “month”?
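To make the varying unit lengths concrete, here is a small sketch in base R. It relies on as.Date(), which is introduced in the next subsection, and the variable names are made up for illustration. Subtracting one date from another counts the days between them.

```r
# Days in a non-leap year and in a leap year
year_2019 <- as.numeric(as.Date("2020-01-01") - as.Date("2019-01-01"))
year_2020 <- as.numeric(as.Date("2021-01-01") - as.Date("2020-01-01"))

# Days in three different "months"
feb_2021 <- as.numeric(as.Date("2021-03-01") - as.Date("2021-02-01"))
feb_2020 <- as.numeric(as.Date("2020-03-01") - as.Date("2020-02-01"))
jul_2021 <- as.numeric(as.Date("2021-08-01") - as.Date("2021-07-01"))

c(year_2019, year_2020)          # 365 366
c(feb_2021, feb_2020, jul_2021)  # 28 29 31
```

So a span of 30 days is longer than one of these months, and shorter than another.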
8.1.1 The Time Line
In R there are several different ways to solve the dilemmas posed by our measures of dates and times, with different assumptions and constraints. The simplest of these is the Date class (there are also two datetime classes). The Date class translates calendar dates to a time line of integers, where 0 is “1970-01-01”, 1 is “1970-01-02”, -1 is “1969-12-31”, etc. The fundamental unit is one day.
In the following example, we take a date given as a character string and convert it to numeric form. Numeric values with class Date print in a human-readable format. If we coerce a numeric date to a plain numeric class, we can see the underlying number.
x <- "1970-01-01"
y <- as.Date(x)
print(y)
[1] "1970-01-01"
class(y)
[1] "Date"
as.numeric(y)
[1] 0
Today’s date is
Sys.Date()
[1] "2024-03-14"
as.numeric(Sys.Date())
[1] 19796
In other words, today (when this document was last updated) is 19796 days after 1970-01-01.
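As a brief sketch of the number-line idea, the example below uses two arbitrary dates on either side of the origin (the names a and b follow the notation used earlier). It shows that dates before 1970-01-01 are negative, that dates compare like numbers, and that the distance between them is an absolute difference.

```r
a <- as.Date("1969-12-31")
b <- as.Date("1970-01-02")

as.numeric(a)   # -1, one day before the origin
as.numeric(b)   #  1, one day after the origin
a < b           # TRUE, dates are ordered like points on a line

# Distance between the two points, in days
abs(as.numeric(b) - as.numeric(a))   # 2
```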
8.1.2 Date Formats
When converting labeled dates to numeric dates, an initial problem is the huge variety of ways in which we record dates as character strings. You might encounter “2020-11-03” (international standard), “11/03/2020” (a typical American representation), or even “November 3, 2020” (another typical American representation), all of which label the same point on the calendar.
The international standard is the R default, so it needs no special handling. Typical American date representations require you to specify a format to make the conversion to a Date.
In this context, a format is a character string that specifies the template for reading dates. We can see help(strptime) to find formatting codes, which start with %. For example, %m is “Month as decimal number (01–12)”, while %B is “Full month name in the current locale.” Between these codes, we can use spaces, slashes, commas, or whatever other symbols are used in the dates. If our dates follow the default format of YYYY-MM-DD, we do not need a format code.
as.Date("2020-11-03") # default format, %Y-%m-%d
[1] "2020-11-03"
as.Date("11/03/2020", format = "%m/%d/%Y")
[1] "2020-11-03"
as.Date("November 3, 2020", format = "%B %e, %Y")
[1] "2020-11-03"
Taking another look at the final line above, notice that the separators (here, spaces and a comma) are included when specifying the format. %B is a complete month name (November), %e is a day of the month (3) preceded by a space and followed by a comma and a space, and %Y is a four-digit year (2020).
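A few more format templates may help, sketched here with made-up input strings that all describe the same day. The example also shows what happens when the template does not match the string, and that format() applies a template in the other direction.

```r
# Day.month.year with dots as separators
d1 <- as.Date("03.11.2020", format = "%d.%m.%Y")

# Two-digit year: %y instead of %Y
d2 <- as.Date("11/03/20", format = "%m/%d/%y")

# A template that does not match the string gives NA rather than an error
d3 <- as.Date("11/03/2020", format = "%Y-%m-%d")

d1   # "2020-11-03"
d2   # "2020-11-03"
d3   # NA

# format() turns a Date back into a character string using the same codes
format(d1, "%m/%d/%Y")   # "11/03/2020"
```

Because a mismatched template silently returns NA, it is worth checking for missing values after any conversion.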
Specifying the format manually as we just did gives us control over exactly how dates are to be interpreted. We can also have R parse the date string for us by ordering y, m, and d into a function name, such as mdy() for month-day-year dates. For this approach, the formats are allowed to vary within our vector. If R cannot figure out the date, it will return NA. Just be sure that it does not try to interpret anything unexpected in the data! To use these y-m-d functions, load the lubridate package first.
library(lubridate)
mdy(c("11/03/2020", "November 3, 2020", "11032020"))
[1] "2020-11-03" "2020-11-03" "2020-11-03"
mdy(c("feb 29 2021", "hello", "2020-11-03"))
Warning: 3 failed to parse.
[1] NA NA NA
ymd("2020-11-03")
[1] "2020-11-03"
In the first set of dates, notice that we can supply one of our date parsers with multiple formats. In the second set, see how all three dates simply return NA, but for different reasons - February 29 does not exist in 2021, “hello” is clearly not a date, and “2020-11-03” is ymd and not mdy. This last value is correctly handled by ymd() in the third example.
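Since failed parses come back as NA, one way to find the offending strings is to index the original vector with is.na(). The sketch below uses a small made-up vector (raw, parsed, and failed are illustrative names) and the quiet argument of the lubridate parsers to suppress the warning.

```r
library(lubridate)

raw <- c("11/03/2020", "hello", "11-03-2020")

# quiet = TRUE suppresses the "failed to parse" warning
parsed <- mdy(raw, quiet = TRUE)
parsed

# Which input strings could not be read as month-day-year dates?
failed <- raw[is.na(parsed)]
failed   # "hello"
```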
8.2 Extracting Date Categories
The same formats are used when we want to extract category labels - months or years - from a Date. We use the strftime() function to convert from a numeric Date to a category label.
In this example we extract the year part of several dates.
dates <- c("04/10/1964", "06/18/1965", "09/21/1966")
ndates <- as.Date(dates, format="%m/%d/%Y")
strftime(ndates, format="%Y")
[1] "1964" "1965" "1966"
Notice that these are returned as character values!
There are several ways we might label months: with a full name, with an abbreviated name, or with a numeral. Each of these has its own format code.
strftime(ndates, format="%b")
[1] "Apr" "Jun" "Sep"
strftime(ndates, format="%m")
[1] "04" "06" "09"
Again, the result is a vector of character values.
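If we need numbers rather than labels, one option in base R is to wrap the strftime() call in as.integer(). The sketch below rebuilds the ndates vector from above so it runs on its own, and also uses %j, the code for day of the year.

```r
dates <- c("04/10/1964", "06/18/1965", "09/21/1966")
ndates <- as.Date(dates, format = "%m/%d/%Y")

# Year as an integer instead of a character string
yr <- as.integer(strftime(ndates, format = "%Y"))
yr    # 1964 1965 1966

# Day of the year (001-366), converted the same way
doy <- as.integer(strftime(ndates, format = "%j"))
doy   # 101 169 264
```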
lubridate gives us the option of extracting numeric values rather than character values with aptly named functions such as year(), month(), day(), and quarter(). Beyond these, we can extract the day of the year (yday()), day of the quarter (qday()), and day of the week (wday(), where 1 is Sunday by default), as well as the week of the year (week()). For even more, see help(day).
year(ndates)
[1] 1964 1965 1966
month(ndates)
[1] 4 6 9
day(ndates)
[1] 10 18 21
quarter(ndates)
[1] 2 2 3
yday(ndates)
[1] 101 169 264
qday(ndates)
[1] 10 79 83
wday(ndates)
[1] 6 6 4
week(ndates)
[1] 15 25 38
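The numbering of weekdays depends on which day is treated as the start of the week. As a short sketch, wday() takes a week_start argument, and both wday() and month() take label = TRUE to return names instead of numbers. The labels depend on the locale, so they are not shown as output here.

```r
library(lubridate)

ndates <- as.Date(c("04/10/1964", "06/18/1965", "09/21/1966"), format = "%m/%d/%Y")

wday(ndates)                   # 6 6 4, counting from Sunday = 1
wday(ndates, week_start = 1)   # 5 5 3, counting from Monday = 1

# Labels instead of numbers (returned as ordered factors)
wday(ndates, label = TRUE)
month(ndates, label = TRUE)
```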
8.3 Elapsed Time
Storing dates as numeric values makes it easy to compute elapsed times: you just subtract one date from another. The difference is the number of days that have passed.
How many days have passed since January 1, 2000?
daysgoneby <- Sys.Date() - as.Date("2000-01-01")
daysgoneby
Time difference of 8839 days
The result is numeric data, but with a new class, difftime.
The largest time unit supported by difftimes is days (actually, weeks, but these are just seven days), since the larger units (months and years) vary in length. Sometimes, we have a good reason for dealing with these ambiguous units, such as when we want to calculate ages from birth dates.
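A difftime carries its units with it. The sketch below uses a fixed end date in place of Sys.Date() so that the numbers are reproducible. It shows how to inspect the units, strip them off, and ask for weeks instead of days.

```r
d <- as.Date("2024-03-14") - as.Date("2000-01-01")

units(d)                         # "days"
as.numeric(d)                    # 8839
as.numeric(d, units = "weeks")   # 8839 / 7, about 1262.7

# difftime() lets us choose the units up front
w <- difftime(as.Date("2024-03-14"), as.Date("2000-01-01"), units = "weeks")
units(w)                         # "weeks"
```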
To do this, we should use objects with class Interval rather than difftime, and pass these objects to lubridate’s time_length() function. When we give intervals to time_length(), it will account for varying month and year lengths and give us the results we would expect, whereas with difftimes, it will assume that all years are 365.25 days long and all months are 30.4375 (365.25/12) days long.
We can give two dates to the interval() function, and then pass the result to time_length() and specify unit = "years" or unit = "months". Note that interval() calculates the difference as the second date minus the first date, rather than the first date minus the second date as with difftime(), so the sign on the result is reversed if the dates are given in the same order.
The first example uses a “regular” non-leap year with 365 days. difftime() returns a difference of -365/365.25 = -0.999 years, while interval() returns 1. In the second example with a leap year, difftime() gives us an answer slightly over 1 in magnitude (-366/365.25 = -1.002), while interval() still calculates it as one year.
time_length(difftime(as.Date("2019-01-01"), as.Date("2020-01-01")), unit = "years")
[1] -0.9993155
time_length(interval(as.Date("2019-01-01"), as.Date("2020-01-01")), unit = "years")
[1] 1
time_length(difftime(as.Date("2020-01-01"), as.Date("2021-01-01")), unit = "years")
[1] -1.002053
time_length(interval(as.Date("2020-01-01"), as.Date("2021-01-01")), unit = "years")
[1] 1
This means that, if we are working with whole dates (no hours, minutes, seconds, etc.) and single years, time_length() will never return a whole number when working with a difftime object.
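To see the fixed-length assumption directly, we can build difftimes by hand with as.difftime(). This is only an illustrative sketch, since real calendar dates never differ by a fractional number of days: a span of exactly 365.25 days comes out as one year, and 30.4375 days as one month.

```r
library(lubridate)

# An "average" year and month, expressed in days
avg_year <- as.difftime(365.25, units = "days")
avg_month <- as.difftime(30.4375, units = "days")

time_length(avg_year, unit = "years")     # 1
time_length(avg_month, unit = "months")   # 1
```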
The same is true of months, since time_length() will assume 30.4375 days per month with difftimes. We can observe this if we pass months of 28, 29, 30, and 31 days to either difftime() or interval() when calculating the time_length() in months:
df <- data.frame(start_date = ymd(20210201, 20200201, 20200401, 20210301),
end_date = ymd(20210301, 20200301, 20200501, 20210401))
df$n_days <- as.numeric(df$end_date - df$start_date)
df$length_difftime <- time_length(difftime(df$end_date, df$start_date), unit = "months")
df$length_interval <- time_length(interval(df$start_date, df$end_date), unit = "months")
df
start_date end_date n_days length_difftime length_interval
1 2021-02-01 2021-03-01 28 0.9199179 1
2 2020-02-01 2020-03-01 29 0.9527721 1
3 2020-04-01 2020-05-01 30 0.9856263 1
4 2021-03-01 2021-04-01 31 1.0184805 1
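The age calculation mentioned earlier follows the same pattern. The sketch below uses a hypothetical birth date and a fixed "as of" date (both made up for illustration), and applies floor() to count only completed years or months.

```r
library(lubridate)

birth <- as.Date("1990-06-15")   # hypothetical birth date
as_of <- as.Date("2024-03-14")   # fixed reference date instead of Sys.Date()

age <- interval(birth, as_of)

# Completed years and completed months
age_years <- floor(time_length(age, unit = "years"))
age_months <- floor(time_length(age, unit = "months"))

age_years    # 33
age_months   # 404
```

Because the interval keeps its actual start and end dates, leap years and uneven months between them are accounted for.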
8.4 Incrementing and Decrementing Dates
Another limitation of the Date class is that incrementing or decrementing by units greater than days is awkward - again the ambiguity of months and years is an obstacle.
Suppose we wanted to increment some dates by one month. We could try
dates <- as.Date(c("2004-02-10", "2005-06-18", "2007-07-21"))
dates + 30
[1] "2004-03-11" "2005-07-18" "2007-08-20"
The first and third values here are probably not what we had in mind!
We usually think of retaining the same day, but incrementing (or decrementing) the month category. This can be accomplished with lubridate’s add_with_rollback() function. Provide the function with a date and a period (a pluralized date component: years(), months(), weeks(), or days()).
We can add one month to each date in our dates vector with months(1):
add_with_rollback(dates, months(1))
[1] "2004-03-10" "2005-07-18" "2007-08-21"
Giving a negative number to the period function allows us to subtract that period from the date:
add_with_rollback(dates, years(-1))
[1] "2003-02-10" "2004-06-18" "2006-07-21"
add_with_rollback(dates, months(-2))
[1] "2003-12-10" "2005-04-18" "2007-05-21"
add_with_rollback(dates, weeks(-3))
[1] "2004-01-20" "2005-05-28" "2007-06-30"
add_with_rollback(dates, days(-4))
[1] "2004-02-06" "2005-06-14" "2007-07-17"
When adding or subtracting months or years to dates, we are forced to deal with uneven month lengths. What is one month after January 31? What is one year after February 29?
As the function name suggests, add_with_rollback() will subtract days until the date is legitimate. Adding one month to January 31 will return the last day of February, and adding one year to February 29 will result in February 28 of the following year.
add_with_rollback(ymd(20210131), months(1))
[1] "2021-02-28"
add_with_rollback(ymd(20200229), years(1))
[1] "2021-02-28"
If we instead want to end up with March 1 in either case above, the first day of the next month, add the argument roll_to_first = TRUE, which is FALSE by default.
add_with_rollback(ymd(20210131), months(1), roll_to_first = TRUE)
[1] "2021-03-01"
add_with_rollback(ymd(20200229), years(1), roll_to_first = TRUE)
[1] "2021-03-01"
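For comparison, here is a short sketch of what happens without rollback, along with lubridate's %m+% and %m-% operators. These give the same rollback behavior as add_with_rollback() with its default settings. The variable jan31 is just an illustrative name.

```r
library(lubridate)

jan31 <- ymd(20210131)

# Plain addition of a period yields NA, since February 31 does not exist
jan31 + months(1)

# %m+% rolls back to the last valid day of the month
jan31 %m+% months(1)   # "2021-02-28"

# %m-% does the same when subtracting (November has 30 days)
jan31 %m-% months(2)   # "2020-11-30"
```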
8.5 Exercises
Date formats: Other software uses other conventions for labeling date values. SAS and Stata both print dates as “10apr2004” by default. Convert the following SAS/Stata dates to R Dates:
10apr2004 18jun2005 21sep2006 12jan2007
Extracting date categories: Using the extract vector of dates below, extract the years, months, days, and days of the week. How many are Wednesdays?
extract <- ymd("2013-06-11", "2015-03-10", "2017-08-13", "2011-05-29", "2010-12-13")
Elapsed time: Calculate your age in years, months, and days, as of today (use Sys.Date()). Be sure to account for irregular month and year lengths.
Selecting data based on a date cutoff: Given the following vector x, create an indicator showing which observations occur on or after July 1 (whether they fall in fiscal year 2021). How many of these observations are there? (The set.seed() function makes it so that if we give the same number to the function, we will produce the same random numbers for x.)
set.seed(112)
x <- as.Date(sample(1:365, 10), origin="2020-01-01")
8.6 Advanced Exercises
Average and standard deviation of dates: Using the dates from the first exercise, calculate an average date. What class is the returned value? Calculate the standard deviation. What class is this? Why should the mean and standard deviation return values of different classes?
Dates from date components: Occasionally you will work with data where the month, day, and year components of dates are stored as separate variables. To convert these to dates, first paste them together. (Recall that, to reference a column in a dataframe, use $, as in df$day.)
df <- data.frame(day = c(10, 18, 22), month = c(4, 6, 9), year = c(2004, 2005, 2006))
Creating dates from integers: In the exercise using sample() above, R converts random integers into dates, provided that we specify the origin date. Most often this will be the same as the origin for Date values, “1970-01-01”.
Convert the integers 0:5 to R dates, assuming the usual R origin.
Other software uses other origins for its timelines. Date values in SAS and Stata use 01jan60 as their origin. Now assume the integers 0:5 are SAS/Stata date values, using their default origin. Then convert these to R dates. What values do they take?
Extracting day of the week: Use strftime() to get the day of the week (Sunday, Monday, etc.) for each observation of this vector, which reuses the earlier sample() call with the usual R origin:
set.seed(112)
x <- as.Date(sample(1:365, 10), origin="1970-01-01")