7  Character Vectors

7.1 Warm-Up

Your dataset has times formatted as HHMM, without the colon (:) separator (e.g., 1234 instead of 12:34). How would you go about converting these numbers into times?

952
956

How about these numbers?

1000
1004

Did your answer differ for the two sets? What would you do if they were all in a single variable?

952
956
1000
1004

7.2 Outcomes

Objective: To combine, separate, clean, and compare character vectors.

Why it matters: In your datasets, you may find that a single variable is spread across multiple columns (vectors), or that a single column contains multiple variables. At other times, you may need to clean character values by removing symbols or standardizing capitalization, or produce an indicator variable for the presence of a certain string.

Learning outcomes:

Fundamental Skills:

  • Combine multiple character vectors into one.
  • Separate single character values into multiple values.
  • Clean character vectors by removing characters and standardizing case.
  • Compare strings with basic regular expressions.

Extended Skills:

  • Use regular expressions to clean and compare character vectors.

Key functions:

paste()
paste0()
str_sub()
str_split()
str_split_i()
str_to_lower()
str_to_upper()
str_replace()
str_remove()
str_detect()
str_subset()

7.3 Enter the tidyverse

To work with character data, we will use the stringr package, which is installed and loaded with the tidyverse. We will need other tidyverse packages for wrangling date and categorical vectors, importing data, and working with dataframes.

Install the tidyverse with:

install.packages("tidyverse")

Run library(tidyverse) to load stringr and several other packages all at once:

library(tidyverse)

7.4 Character Data

Character data, or strings, is text. It can vary in length and complexity, from individual letters to Likert scale responses to social media posts to multi-page PDFs.

Character data is valuable because it can take on an infinite number of values, but this presents unique challenges for data wrangling.

7.5 Combining

Sometimes we need to combine multiple variables into a single variable. This may happen if a dataset has three columns giving the year, month, and day. We first need to combine these values in order to work with them as dates, so that we can then perform useful operations like extracting the day of the week or calculating elapsed time. (Learn how to work with dates in the next chapter.)

Create numeric vectors with the year, month, and day of three past presidential elections:

year <- c(2016, 2020, 2024)
month <- c(11, 11, 11)
day <- c(8, 3, 5)

Combine them with the paste() function. This function takes any number of comma-separated arguments. Inputs are coerced to character before combining.

paste(year, month, day)
[1] "2016 11 8" "2020 11 3" "2024 11 5"

paste() by default space-separates its output. We can change this to another separator with the sep argument:

paste(year, month, day, sep = "-")
[1] "2016-11-8" "2020-11-3" "2024-11-5"

If we want no separator, specify sep = "":

paste(year, month, day, sep = "")
[1] "2016118" "2020113" "2024115"

A shorthand for pasting with no separator is paste0():

paste0(year, month, day)
[1] "2016118" "2020113" "2024115"

Tip

Thinking ahead to when we work with dates in R, or if we just want to show this output to another person, it is a good idea to include a separator in this data.

If there is no separator, how will we know if 2016118 is 2016-11-8 or 2016-1-18?
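The ambiguity can be checked directly. As a small sketch (the January 18 date is made up for illustration and is not one of the election dates above), two different dates collapse to the same string when pasted without a separator:

```r
# Two different dates: November 8 and (hypothetically) January 18
nov_8  <- paste0(2016, 11, 8)
jan_18 <- paste0(2016, 1, 18)

# Without a separator, both become "2016118"
nov_8 == jan_18
# [1] TRUE

# With a separator, the two dates stay distinct
nov_8_sep  <- paste(2016, 11, 8, sep = "-")
jan_18_sep <- paste(2016, 1, 18, sep = "-")
nov_8_sep == jan_18_sep
# [1] FALSE
```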

The vectors we provide to paste() or paste0() can be of different lengths, in which case the shorter vectors are recycled to match the longest one. Paste together the letter “Q” and the numbers 1 through 5 to create variable names for a survey:

paste0("Q", 1:5)
[1] "Q1" "Q2" "Q3" "Q4" "Q5"
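
Recycling also applies when the shorter vector has more than one element. Here is a minimal sketch with made-up values, where a two-element vector is reused until the four-element vector is exhausted:

```r
# c("a", "b") is recycled to a, b, a, b to match the length of 1:4
labels <- paste0(c("a", "b"), 1:4)
labels
# [1] "a1" "b2" "a3" "b4"
```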

7.6 Separating

At other times, a single variable may contain several pieces of information, and we need to separate each value into multiple values. We can do this by position (number of characters) or by delimiter (a symbol or string separating the values).

7.6.1 By Position

Separate strings into pieces with a fixed number of characters with the str_sub() function. This function takes three arguments:

  • string: the vector to operate on
  • start: the position of the first character to extract
  • end: the position of the last character to extract

Create a vector of times with hours, minutes, and am/pm:

x <- c("1:08am", "8:42am", "11:26am", "12:10pm")

Extract just the first character to see how the function works:

str_sub(x, 1, 1)
[1] "1" "8" "1" "1"

To get the hours, we sometimes need only the first character, and sometimes the first two characters. The start and end arguments can take negative values to count from the right instead of the left. We still want to start at the first character, but we need to stop at the sixth character from the right. For “1:08am” this will return just “1”, while for “11:26am” it will return “11”:

str_sub(x, 1, -6)
[1] "1"  "8"  "11" "12"

Extract the hours and minutes:

str_sub(x, 1, -3)
[1] "1:08"  "8:42"  "11:26" "12:10"

Extracting just the am/pm requires starting from the second character from the right, so give a negative number to start as well:

str_sub(x, -2, -1)
[1] "am" "am" "am" "pm"
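
The same idea works for the minutes alone, which always sit at the fourth and third characters from the right. A minimal sketch, redefining the x vector from above so that it runs on its own:

```r
library(stringr)

x <- c("1:08am", "8:42am", "11:26am", "12:10pm")

# Minutes are always characters -4 through -3, whatever the width of the hour
minutes <- str_sub(x, -4, -3)
minutes
# [1] "08" "42" "26" "10"
```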

On Your Own

Revisit the warm-up and extract just the hours, and then just the minutes.

y <- c(952, 956, 1000, 1004)

7.6.2 By Delimiter

We can also split character vectors by some symbol or string that is one or more characters. The str_split() function takes a character vector and a delimiter. Separate the times in x by the colon:

str_split(x, ":")
[[1]]
[1] "1"    "08am"

[[2]]
[1] "8"    "42am"

[[3]]
[1] "11"   "26am"

[[4]]
[1] "12"   "10pm"

If we just want the hours instead of all the pieces, we can use the str_split_i() function. This takes one more argument, the number of the element we want to extract. If, after separating the vector, we just want the first element, we would run:

str_split_i(x, ":", 1)
[1] "1"  "8"  "11" "12"

This also simplifies our output from a list to a vector.
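
One way to see what str_split_i() does is to compare it with indexing each piece of the str_split() list by hand. This is only an illustrative sketch using base R's sapply(); it is not a description of how stringr implements the function:

```r
library(stringr)

x <- c("1:08am", "8:42am", "11:26am", "12:10pm")

# Split into a list, then take the first piece of each list element
pieces  <- str_split(x, ":")
by_hand <- sapply(pieces, function(p) p[1])

# str_split_i() gives the same character vector in one step
shortcut <- str_split_i(x, ":", 1)
identical(by_hand, shortcut)
# [1] TRUE
```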

If we want the second element, we could put 2. Or, if we want the last element (which here is the second element), we can again count from the right with a negative number:

str_split_i(x, ":", -1)
[1] "08am" "42am" "26am" "10pm"

stringr includes some built-in character vectors we can use to experiment: words, fruit, and sentences.

Get the first three elements of sentences and separate them by spaces:

str_split(sentences[1:3], " ")
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
[8] "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"        
[6] "dark"        "blue"        "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

Get the first word of each sentence:

str_split_i(sentences[1:3], " ", 1)
[1] "The"  "Glue" "It's"

Or the last word of each sentence:

str_split_i(sentences[1:3], " ", -1)
[1] "planks."     "background." "well."      

Now, create a character vector with the addresses for Sewell, Memorial Union, and Memorial Library:

address <- 
  c("Sewell Social Sciences, 1180 Observatory Dr, Madison, WI 53706",
    "Memorial Union, 800 Langdon St, Madison, WI 53706",
    "Memorial Library, 728 State St, Madison, WI 53706")

Our task is to split them into separate vectors for the building, street address, city, state, and ZIP code.

The first three are simple enough. Use ", " (with the space after the comma) to separate address and then extract the first, second, and third elements.

building <- str_split_i(address, ", ", 1)
street <- str_split_i(address, ", ", 2)
city <- str_split_i(address, ", ", 3)

State is separated from city by a comma, but from ZIP code by a space. First, create a vector that holds state and ZIP code together:

state <- str_split_i(address, ", ", 4)
state
[1] "WI 53706" "WI 53706" "WI 53706"

Now, extract the ZIP code into its own object, which is the second element after separating by space:

zip <- str_split_i(state, " ", 2)
zip
[1] "53706" "53706" "53706"

Now that the ZIP code has been extracted, write over the existing state with just the first element after separating by space:

state <- str_split_i(state, " ", 1)
state
[1] "WI" "WI" "WI"

Combine all the addresses into a dataframe:

data.frame(building,
           street,
           city,
           state,
           zip)
                building              street    city state   zip
1 Sewell Social Sciences 1180 Observatory Dr Madison    WI 53706
2         Memorial Union      800 Langdon St Madison    WI 53706
3       Memorial Library        728 State St Madison    WI 53706
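
A quick way to check that nothing was lost in the separation is to paste the pieces back together and compare the result with the original. The sketch below repeats the steps above so that it runs on its own; the names state_zip and rebuilt are introduced here only for illustration:

```r
library(stringr)

address <-
  c("Sewell Social Sciences, 1180 Observatory Dr, Madison, WI 53706",
    "Memorial Union, 800 Langdon St, Madison, WI 53706",
    "Memorial Library, 728 State St, Madison, WI 53706")

# Separate by ", " and then split the last piece by " "
building  <- str_split_i(address, ", ", 1)
street    <- str_split_i(address, ", ", 2)
city      <- str_split_i(address, ", ", 3)
state_zip <- str_split_i(address, ", ", 4)
state     <- str_split_i(state_zip, " ", 1)
zip       <- str_split_i(state_zip, " ", 2)

# Recombine: state and ZIP with a space, everything else with ", "
rebuilt <- paste(building, street, city, paste(state, zip), sep = ", ")
identical(rebuilt, address)
# [1] TRUE
```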

7.7 Cleaning

Wherever individuals are allowed to input data, whether free response data on a survey or manually entered student records, there will be variations in case and spelling that need to be standardized before using the data in any meaningful way.

Tip

When administering a survey, use multiple choice as much as possible. Determine likely choices through pilot studies or focus groups. A little more effort in survey design will save you a lot of effort in data cleaning.

7.7.1 Standardizing Capitalization

Create a vector with responses to the question, “What is your favorite color?”

colors <- c("red", "RED", "red", "blue", "bLue")

What is the most popular color?

table(colors)
colors
blue bLue  red  RED 
   1    1    2    1 

The most popular color is red, followed by a three-way tie among blue, bLue, and RED.

You can see the problem. R is case-sensitive not only in object names but also in character values, so values that differ only in capitalization are treated as distinct.

This can be corrected if we simply standardize the case. We can make everything uppercase with str_to_upper():

str_to_upper(colors)
[1] "RED"  "RED"  "RED"  "BLUE" "BLUE"

Or lowercase with str_to_lower():

str_to_lower(colors)
[1] "red"  "red"  "red"  "blue" "blue"

Tabulating either one of these yields correct counts:

table(str_to_lower(colors))

blue  red 
   2    3 
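
As a quick check that either function works, the sketch below tabulates both versions and compares the counts. Only the labels differ in case:

```r
library(stringr)

colors <- c("red", "RED", "red", "blue", "bLue")

upper_counts <- table(str_to_upper(colors))
lower_counts <- table(str_to_lower(colors))

# Labels are "BLUE", "RED" in one table and "blue", "red" in the other
names(upper_counts)
names(lower_counts)

# The counts themselves are the same: 2 blue and 3 red
identical(as.vector(upper_counts), as.vector(lower_counts))
# [1] TRUE
```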

7.7.2 Removing and Replacing Characters

The previous example was too optimistic. No respondents included extra characters or spaces, and they all spelled the colors correctly.

Create a slightly more realistic set of responses:

colors2 <- c("red ", "RED!", "red", "blue!", "bLeu")

Simply making everything lowercase no longer works:

table(str_to_lower(colors2))

 bleu blue!   red  red   red! 
    1     1     1     1     1 

It is not immediately obvious that one of the “red” values has an extra space. Return the unique values as a character vector with unique():

unique(str_to_lower(colors2))
[1] "red "  "red!"  "red"   "blue!" "bleu" 

"red " (note the trailing space) and "red" are considered unique values.

To clean up this vector of color names, our to-do list is as follows:

  • standardize case
  • correct misspellings (“bleu” to “blue”)
  • remove exclamation points and spaces

First, standardize case:

colors2 <- str_to_lower(colors2)
colors2
[1] "red "  "red!"  "red"   "blue!" "bleu" 

Second, standardize spellings with str_replace(). This function takes our vector, the pattern we want to replace, and what we want to replace that pattern with:

colors2 <- str_replace(colors2, "bleu", "blue")
colors2
[1] "red "  "red!"  "red"   "blue!" "blue" 

Last, remove the exclamation points and spaces. The str_remove() function can be used here, or we can continue to use str_replace(). To remove something is to replace it with nothing:

colors2 <- str_replace(colors2, "!", "")
colors2
[1] "red " "red"  "red"  "blue" "blue"

Now remove the spaces, but this time with str_remove(). The only difference in syntax is that we do not need to specify the replacement value because it is assumed to be "":

colors2 <- str_remove(colors2, " ")
colors2
[1] "red"  "red"  "red"  "blue" "blue"
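
If the same cleaning has to be applied to more than one vector, the steps can be bundled into a function. This is a sketch only: clean_colors() is a hypothetical helper name, and it handles just the problems seen in this example (one known misspelling, one exclamation point, and one space per value):

```r
library(stringr)

# Hypothetical helper that bundles the cleaning steps from this section
clean_colors <- function(x) {
  x <- str_to_lower(x)                 # standardize case
  x <- str_replace(x, "bleu", "blue")  # correct the known misspelling
  x <- str_remove(x, "!")              # drop the first exclamation point
  str_remove(x, " ")                   # drop the first space
}

raw_colors <- c("red ", "RED!", "red", "blue!", "bLeu")
cleaned <- clean_colors(raw_colors)
cleaned
# [1] "red"  "red"  "red"  "blue" "blue"

table(cleaned)  # 2 blue, 3 red
```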

Note

This approach only removes the first occurrence of the space within each value. To remove all occurrences, use str_remove_all(). This function would also remove internal spaces, changing " hello there " to "hellothere". stringr provides other functions for these kinds of situations:

  • str_trim() removes spaces from the beginning and end of strings
    • e.g., " hello there " to "hello there"
  • str_squish() removes spaces from the beginning and end of strings, and it reduces repeated internal spaces to a single space
    • e.g., " hello    there " to "hello there"
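
The difference between these functions is easiest to see side by side. A minimal sketch, using a made-up string with two leading spaces, three internal spaces, and two trailing spaces:

```r
library(stringr)

# Two leading spaces, three internal spaces, two trailing spaces
messy <- "  hello   there  "

str_remove_all(messy, " ")  # "hellothere": every space is removed
str_trim(messy)             # "hello   there": only the ends are trimmed
str_squish(messy)           # "hello there": ends trimmed, internal run collapsed
```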

7.8 Comparing

At other times, we may want to create variables that indicate whether a certain string is present in the values of a character vector.

For this, use str_detect(). It returns a logical vector indicating whether a pattern (the second argument) is present in each element of the character vector (the first argument).

Create an indicator for whether the word “apple” appears in each element of sentences:

apples <- str_detect(sentences, "apple")

Now, use this logical vector to index sentences to find sentences with “apple” in them:

sentences[apples]
[1] "The fruit of a fig tree is apple shaped."
[2] "Hedge apples may stain your hands green."
[3] "The big red apple fell to the ground."   

The str_subset() function is a shortcut for this detect-then-index process:

str_subset(sentences, "apple")
[1] "The fruit of a fig tree is apple shaped."
[2] "Hedge apples may stain your hands green."
[3] "The big red apple fell to the ground."   

Thinking ahead, the apples vector could be used in a statistical model or plot as an indicator variable. With more useful data, we could imagine checking whether a collection of social media posts mentioned some topic by name, such as immigration or welfare.
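
As a rough sketch of that idea, the example below builds an indicator from three invented posts. The posts vector and its contents are hypothetical, made up purely for illustration:

```r
library(stringr)

# Hypothetical posts, invented for illustration
posts <- c("Immigration reform passed today",
           "Great weather in Madison",
           "New immigration rules announced")

# Lowercase first so "Immigration" and "immigration" both match
mentions <- str_detect(str_to_lower(posts), "immigration")
mentions
# [1]  TRUE FALSE  TRUE

# Logical values count as 0 and 1, so sum() gives the number of matches
sum(mentions)
# [1] 2
```

Taking mean(mentions) instead would give the share of posts that mention the topic.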

7.9 Regular Expressions

Up to this point, we have only removed or detected literal strings, including punctuation such as !.

All of the pattern arguments in the functions we have been using can also take regular expressions. Regular expressions, or regex for short, are a powerful tool for specifying patterns in strings.

For example, if we want to find values of fruit that do not just contain the letters “st” but start with them, we would use the regex ^st. The caret ^ denotes the start of a string, so we only find fruits with “st” at the beginning:

str_subset(fruit, "^st")
[1] "star fruit" "strawberry"

Compare that to searching for the literal string “st”:

str_subset(fruit, "st")
[1] "purple mangosteen" "star fruit"        "strawberry"       

This returns “purple mangosteen”, which contains but does not start with “st”.

Another useful regex operator is $, which marks the end of a string. Find fruits that end with the letters “ant” with ant$:

str_subset(fruit, "ant$")
[1] "blackcurrant" "currant"      "eggplant"     "redcurrant"  

And compare that to the vector of fruits that just contain the letters “ant”, which returns those above in addition to cantaloupe:

str_subset(fruit, "ant")
[1] "blackcurrant" "cantaloupe"   "currant"      "eggplant"     "redcurrant"  
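
The two anchors can also be used together to require that the whole string match. A minimal sketch with a small made-up vector (not the fruit data):

```r
library(stringr)

# Made-up values for illustration
candidates <- c("ant", "currant", "antler")

str_detect(candidates, "^ant")   # starts with "ant":  TRUE FALSE  TRUE
str_detect(candidates, "ant$")   # ends with "ant":    TRUE  TRUE FALSE
str_detect(candidates, "^ant$")  # is exactly "ant":   TRUE FALSE FALSE
```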

Several symbols have special meaning in the context of regular expressions:

. + * ? ^ $ | \ ( ) [ ] { }

What if these characters are part of a string that we want to compare or clean?

This oddly formatted phone number has the +, (, ), and . symbols:

phone <- "+1 (608) 262.9917"

Try to remove the plus sign +:

str_remove(phone, "+")
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`+`)

Or change the period . to a hyphen -:

str_replace(phone, ".", "-")
[1] "-1 (608) 262.9917"

A period is a wildcard for any single character, so the first character was replaced with a hyphen -. Using str_replace_all() instead replaces every character:

str_replace_all(phone, ".", "-")
[1] "-----------------"

Even the spaces were replaced!

To match a character literally rather than as regex, wrap the pattern in the fixed() function.

Try again to drop the +:

str_remove(phone, fixed("+"))
[1] "1 (608) 262.9917"

Or change the period to a hyphen:

str_replace_all(phone, fixed("."), "-")
[1] "+1 (608) 262-9917"
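
An alternative to fixed() that stays within regex is to escape the special character with a backslash. Because the backslash is itself special inside an R string, it has to be doubled. A brief sketch using the same phone number:

```r
library(stringr)

phone <- "+1 (608) 262.9917"

# "\\+" is the regex \+, which matches a literal plus sign
no_plus <- str_remove(phone, "\\+")
no_plus
# [1] "1 (608) 262.9917"

# "\\." is the regex \., which matches a literal period
hyphenated <- str_replace_all(phone, "\\.", "-")
hyphenated
# [1] "+1 (608) 262-9917"
```

Escaping is useful when a literal symbol has to be combined with other regex operators in the same pattern, which fixed() does not allow.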

Regex can be used for much more than finding characters at the beginning or end of strings. Real-world data wrangling tasks that use regex include

  • Iterating over a large number of files: get a vector of file names (see list.files()) with some pattern, such as containing a four-digit year between 1990 and 2010
  • Web scraping faculty profile pages: find the line with their doctorate (starting with “PhD” or “Ph.D.”) and then extract the school name and year
  • Parsing text from PDFs: detect, validate, and extract email addresses (pattern is 1+ characters, @, 1+ characters, ., 1+ characters) and phone numbers (some number of digits, with or without parentheses or spaces or dashes, with or without country code, and not starting with or containing certain patterns such as xxx-555-xxxx)
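
To give a flavor of two of these tasks, here is a rough sketch. The file names and email strings are invented, and the patterns are deliberately simple: the year pattern uses character ranges like [0-9] and the alternation operator |, and the email pattern is the loose "1+ characters, @, 1+ characters, ., 1+ characters" description above, not a full email validator:

```r
library(stringr)

# Hypothetical file names, as might be returned by list.files()
files <- c("survey_1989.csv", "survey_1995.csv",
           "survey_2010.csv", "survey_2011.csv", "notes.txt")

# 1990-1999, or 2000-2009, or exactly 2010
year_pattern <- "199[0-9]|200[0-9]|2010"
matched_files <- str_subset(files, year_pattern)
matched_files
# [1] "survey_1995.csv" "survey_2010.csv"

# Hypothetical strings to screen for email-like values
emails <- c("bucky@wisc.edu", "not an email", "a@b")

# .+ is one or more of any character; \\. is a literal period
looks_like_email <- str_detect(emails, ".+@.+\\..+")
looks_like_email
# [1]  TRUE FALSE FALSE
```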

To learn more about regular expressions, see https://www.regular-expressions.info/

7.10 Exercises

7.10.1 Fundamental

  1. Combine the single-letter vectors letters and LETTERS into a two-letter vector that goes “Aa”, “Bb”, etc.

  2. Separate this vector of dates formatted as month/day into two vectors called month and day:

    dates <- c("1/23", "5/12", "11/24")

  3. Clean this vector of state names so that table(states) returns 3 MN and 2 WI.

    states <- c("Minnesota", "MN", "Minesota", "Wisconsin", "wisc")

  4. Some of the state names in state.name consist of two words. Find those that are two words and extract the first word. Use table() to tabulate the first words. Which is most common?

  5. Find a sentence in sentences that has one or more commas. Split it by commas.

7.10.2 Extended

  1. The prices vector has special characters, so coercing it to numeric returns missing values:

    prices <- c("$10", "$11.99", "$1,011.01")
    as.numeric(prices)
    Warning: NAs introduced by coercion
    [1] NA NA NA

    Clean it so that coercing it to numeric returns this output:

    [1]   10.00   11.99 1011.01

  2. Some countries use a comma rather than a period as the decimal separator, and a period as the thousands delimiter. For example, instead of writing one thousand two hundred thirty-four dollars and fifty-six cents as $1,234.56, they may write it as $1.234,56. The currency symbol may also be placed after the amount, such as 20$ rather than $20. Use the currency vector below:

    currency <- c("$1.234,56", "20$", "$12,99", "5.555 $")

    Reformat it so that coercing to numeric returns this output:

    [1] 1234.56   20.00   12.99 5555.00