x <- "1970-01-01"
class(x)
[1] "character"
x
[1] "1970-01-01"
Data Wrangling in R
What would you expect from these date operations?
Today + 1
Today - 1
Today - Tomorrow
Today + Tomorrow
The mean of Yesterday and Today
The standard deviation of Yesterday and Today
Objective: To convert strings into dates and get date components from dates.
Why it matters: Dates can be encoded in a wide variety of formats, so a first step in using dates is converting them into a format that R recognizes. Some research questions involve a specific component of a date, such as weekday versus weekend, or before or after a certain year, so extracting date components is an important skill in working with dates.
Learning outcomes:
Fundamental Skills | Extended Skills |
|
|
Key functions:
as.Date()
Sys.Date()
mdy(), ymd(), etc.
year()
month()
day()
wday()
interval()
time_length()
In R, dates are numbers (numeric vectors that count days) with the Date class.
Create a character vector, x, with the date January 1, 1970:
Coerce it to a date with the as.Date() function, and print it again:
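A minimal sketch of this coercion, assuming x holds the string created above:

```r
x <- "1970-01-01"   # character vector
x <- as.Date(x)     # coerce to the Date class
x                   #> [1] "1970-01-01"
class(x)            #> [1] "Date"
```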
It prints the same as when it was a character vector, but its new class gives it useful properties. The first thing we can do is coerce it to a numeric vector with as.numeric():
January 1, 1970, is the zero-point in R’s date timeline. Dates before this are negative numbers (e.g., December 31, 1969, is -1), and dates after this are positive (e.g., January 2, 1970, is 1).
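The zero-point and the two neighboring days can be checked directly. This short sketch uses only base R:

```r
x <- as.Date("1970-01-01")
as.numeric(x)                       #> [1] 0
as.numeric(as.Date("1969-12-31"))   #> [1] -1
as.numeric(as.Date("1970-01-02"))   #> [1] 1
```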
Get today’s date with the Sys.Date() function:
Coerce this to a numeric:
In other words, today (when this document was last updated) is 20283 days after 1970-01-01.
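A sketch of these two steps follows. The object name today is an assumption for illustration, and the printed values will differ depending on the day the code is run. The last line goes the other way and converts the day count quoted above back into a calendar date.

```r
today <- Sys.Date()   # hypothetical object name
today                 # prints the current date
as.numeric(today)     # days since 1970-01-01; changes daily

# Convert a day count back to a date (origin given explicitly for older R versions)
as.Date(20283, origin = "1970-01-01")   #> [1] "2025-07-14"
```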
The fact that dates are just numbers with a special class will allow us to perform operations like calculating the difference between two dates (e.g., age in days given date of birth and date of observation, length of residence given date of immigration and date of observation) and add to and subtract from dates (e.g., add one month or year).
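As a small base R illustration of this arithmetic, the sketch below uses two made-up dates, dob and obs (hypothetical names and values, not taken from any dataset):

```r
dob <- as.Date("2024-01-01")   # hypothetical date of birth
obs <- as.Date("2024-03-01")   # hypothetical date of observation

obs - dob               #> Time difference of 60 days
as.numeric(obs - dob)   #> [1] 60
dob + 1                 #> [1] "2024-01-02"
```

Subtracting two dates returns a difftime object, and wrapping it in as.numeric() leaves a plain number of days.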
The complication with these two operations is that units of time vary in length. How long is a month? 28-31 days. How long is a year? 365-366 days. Thankfully, the lubridate package from the tidyverse has functions for handling units of varying length.
Load the tidyverse.
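A minimal sketch of the setup step. It assumes both packages are installed. Since tidyverse 2.0.0, lubridate is attached along with the other core packages, and the explicit second call only matters on older versions.

```r
library(tidyverse)
# Attached automatically by tidyverse 2.0.0 and later; harmless to repeat
library(lubridate)
```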
An initial problem in working with dates is telling R that a string is a date. Dates can be encoded in a wide variety of formats. For example, August 3, 2022, could be written as:
and many, many more ways.
lubridate has a collection of functions with the letters y, m, and d in different orders. y stands for year, m for month, and d for day. If a date is in the year-month-day format, use the ymd() function:
If the order is day/month/year, use dmy():
If the format is “Month Day, Year”, use mdy():
These functions handle various punctuation and delimiters without our needing to tell the function where the commas, dashes, and slashes are.
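The three parsers can be sketched on the running example of August 3, 2022. The input strings below are illustrative assumptions, and parsing the month name assumes an English locale.

```r
library(lubridate)

# Each call returns the Date 2022-08-03
ymd("2022-08-03")
ymd("2022/8/3")
dmy("3/8/2022")
dmy("03.08.2022")
mdy("August 3, 2022")   # month name; assumes an English locale
mdy("08-03-2022")
```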
At times, the function will fail to parse an ambiguous format:
This issue can be resolved by inserting characters to separate the units:
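A sketch of the problem and the fix, assuming the ambiguous string is "2283" and that the intended order is year, month, day:

```r
library(lubridate)

ymd("2283")     # fails to parse: the digits cannot be split into year, month, and day
ymd("22/8/3")   #> [1] "2022-08-03"
```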
To convert character vectors to dates, you may need to first work with them as characters. In the example above, we could convert “2283” to “22/8/3” like so:
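One way to do this, sketched with stringr and assuming a fixed layout of a two-digit year, a one-digit month, and a one-digit day (x_sep is a hypothetical name):

```r
library(stringr)
library(lubridate)

x <- "2283"
# Cut the string by position and rejoin the pieces with slashes
x_sep <- str_c(str_sub(x, 1, 2), str_sub(x, 3, 3), str_sub(x, 4, 4), sep = "/")
x_sep        #> [1] "22/8/3"
ymd(x_sep)   #> [1] "2022-08-03"
```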
This approach assumes a fixed four-digit format for the dates. If that is not the case, we would need to handle dates with different numbers of characters in different ways. That could be done by first using a regex to take a subset of x (which we have expanded to include a five-digit date to test the regex) that has four characters:
Or better yet, four numbers:
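Both subsets can be sketched with str_subset(). The values in x are hypothetical, and the third element is added here only to show where the two patterns differ.

```r
library(stringr)

# Hypothetical values: a four-digit date, a five-digit date, and a four-character non-numeric string
x <- c("2283", "22113", "22/8")

str_subset(x, "^.{4}$")     #> [1] "2283" "22/8"   (any four characters)
str_subset(x, "^\\d{4}$")   #> [1] "2283"          (exactly four digits)
```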
See the previous chapter for more about working with character vectors and regex.
A date like “August 3, 2022” contains three pieces of information (month, day, and year), and it can be used to figure out more pieces of information such as the day of the week or the quarter of the year.
lubridate has a set of functions for extracting these pieces of information, or date components. First, create a vector of dates to work with. (Note we can use inconsistently formatted dates, as long as the order is constant!)
[1] "2022-08-03" "2025-02-14" "2023-09-15"
[1] 2022 2025 2023
[1] 8 2 9
[1] 3 14 15
[1] 3 1 3
[1] 215 45 258
[1] 34 45 77
[1] 4 6 6
[1] 31 7 37
These functions all return numeric vectors.
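One sequence of calls that reproduces the values printed above is sketched here. The input strings and the name dates are assumptions, since the original strings are not shown. The three dates themselves come from the first line of output.

```r
library(lubridate)

# Assumed inputs: month-day-year order throughout, with inconsistent separators
dates <- mdy(c("8/3/2022", "02-14-2025", "9 15 2023"))

dates            #> [1] "2022-08-03" "2025-02-14" "2023-09-15"
year(dates)      #> [1] 2022 2025 2023
month(dates)     #> [1] 8 2 9
day(dates)       #> [1]  3 14 15
quarter(dates)   #> [1] 3 1 3
yday(dates)      #> [1] 215  45 258   (day of the year)
qday(dates)      #> [1] 34 45 77      (day of the quarter)
wday(dates)      #> [1] 4 6 6         (day of the week, Sunday = 1 by default)
week(dates)      #> [1] 31  7 37      (week of the year)
```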
One especially problematic one is wday(), which returns a number for the day of the week. What is the first day of the week? Answers will differ across cultures, so it is best to turn on the label argument to return the day names:
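A short sketch of the labeled output, reusing the three dates from above. The day names printed depend on the locale. The week_start argument is shown as an alternative way to make the numbering explicit.

```r
library(lubridate)

dates <- ymd(c("2022-08-03", "2025-02-14", "2023-09-15"))

wday(dates, label = TRUE)                 # abbreviated day names as an ordered factor
wday(dates, label = TRUE, abbr = FALSE)   # full day names
wday(dates, week_start = 1)               #> [1] 3 5 5   (Monday counted as day 1)
```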
To calculate the difference between two dates, use the time_length() and interval() functions following this format:
time_length(interval(start_date, end_date), unit = "units")
These functions account for units of varying lengths.
Take the difference between two pairs of January 1, one that spans a leap year and another that does not. Return the difference in days:
[1] 365
[1] 366
One is 365 days and the other 366, but requesting the difference in years returns a difference of one in both cases:
[1] 1
[1] 1
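The comparison can be sketched as follows. The specific years are assumptions chosen so that the first span has no leap day and the second contains February 29, 2024.

```r
library(lubridate)

# Assumed dates: 2022 to 2023 has no leap day; 2024 to 2025 includes one
no_leap <- interval(ymd("2022-01-01"), ymd("2023-01-01"))
leap    <- interval(ymd("2024-01-01"), ymd("2025-01-01"))

time_length(no_leap, unit = "days")    #> [1] 365
time_length(leap,    unit = "days")    #> [1] 366
time_length(no_leap, unit = "years")   #> [1] 1
time_length(leap,    unit = "years")   #> [1] 1
```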
We can also specify unit = "months" to handle months of varying lengths. Create two vectors, start_dates and end_dates:
The lengths of these months range from 28 to 31 days:
But they are all one “month” long:
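A sketch with assumed values for start_dates and end_dates, chosen so that each pair spans one calendar month of a different length:

```r
library(lubridate)

# Assumed values: each pair runs from the first of one month to the first of the next
start_dates <- ymd(c("2023-01-01", "2023-02-01", "2023-04-01", "2024-02-01"))
end_dates   <- ymd(c("2023-02-01", "2023-03-01", "2023-05-01", "2024-03-01"))

time_length(interval(start_dates, end_dates), unit = "days")     #> [1] 31 28 30 29
time_length(interval(start_dates, end_dates), unit = "months")   #> [1] 1 1 1 1
```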
Units of varying lengths also complicate adding to and subtracting from dates.
Imagine you are conducting a study where you need to follow up with participants exactly one month after their intake survey, and again after exactly one year. When would you schedule the follow-up meetings for participants with these intake dates?
We usually think of retaining the same day but adding to the month, and then adjusting year as necessary.
For these operations, use the add_with_rollback() function. This function takes a date vector and the units we want to add and how many of those units. Adding one month is done with months(1), and we can add one year with years(1):
intake <-
  ymd("2023-12-10",
      "2024-01-15",
      "2024-01-31",
      "2024-02-29")
add_with_rollback(intake, months(1))
[1] "2024-01-10" "2024-02-15" "2024-02-29" "2024-03-29"
add_with_rollback(intake, years(1))
[1] "2024-12-10" "2025-01-15" "2025-01-31" "2025-02-28"
As the function name suggests, add_with_rollback() will subtract days until the date is legitimate. Adding one month to January 31 will return the last day of February, and adding one year to February 29 will result in February 28 of the following year.
If we want to instead end up with March 1 in either case above, the first day of the next month, add the argument roll_to_first = TRUE, which is FALSE by default.
[1] "2024-01-10" "2024-02-15" "2024-03-01" "2024-03-29"
[1] "2024-12-10" "2025-01-15" "2025-01-31" "2025-03-01"
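The calls that would produce the two results above can be sketched as follows, using the same intake dates:

```r
library(lubridate)

intake <- ymd("2023-12-10", "2024-01-15", "2024-01-31", "2024-02-29")

# Invalid dates roll forward to the first day of the next month
add_with_rollback(intake, months(1), roll_to_first = TRUE)
#> [1] "2024-01-10" "2024-02-15" "2024-03-01" "2024-03-29"
add_with_rollback(intake, years(1), roll_to_first = TRUE)
#> [1] "2024-12-10" "2025-01-15" "2025-01-31" "2025-03-01"
```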
To subtract from dates, simply supply a negative number to the unit function:
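A sketch of subtraction with the same intake dates. Going back one year from February 29, 2024, lands on a date that does not exist, so it rolls back to February 28, 2023.

```r
library(lubridate)

intake <- ymd("2023-12-10", "2024-01-15", "2024-01-31", "2024-02-29")

add_with_rollback(intake, months(-1))
#> [1] "2023-11-10" "2023-12-15" "2023-12-31" "2024-01-29"
add_with_rollback(intake, years(-1))
#> [1] "2022-12-10" "2023-01-15" "2023-01-31" "2023-02-28"
```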
Other software uses other conventions for labeling date values. SAS and Stata both print dates as “10apr2004” by default. Convert the following SAS/Stata dates to R Dates:
10apr2004
18jun2005
21sep2006
12jan2007
Occasionally you will work with data where the month, day, and year components of dates are stored as separate variables. To convert these to dates, first paste them together. For extra credit, vary the order and try it with and without a separator.
Using the extract vector of dates below, extract the years, months, days, and days of the week. How many are Tuesdays?
Calculate your age in years, months, and days, as of today (Sys.Date()).
Using the last day of this month, add one, two, and three months. If the day does not exist in a month, make it roll back to the last day of the month. Then, make it roll forward to the first day of the next month.