7 Character

A third fundamental type of data is character data (also called string data). In R, character vectors may be used as data for analysis, and also as names of data objects or elements. (R is its own macro language, and we use the same functions to manipulate language elements as we use to manipulate data values for analysis. In R, it is all “data”.)

As data for analysis, character values can signify categories (see Chapter 9). An example might be a variable that classifies people as “Democrat”, “Green”, “Independent”, “Libertarian”, or “Republican” (American political affiliations).

affiliations <- c("Dem", "Dem", "Rep", "Rep", "Ind", "Lib")
table(affiliations)
affiliations
Dem Ind Lib Rep 
  2   1   1   2 

A single character value might also represent multiple categorical variables. The FIPS code “55025” is a combination of state (“55” for Wisconsin) and county (“025” for Dane) codes. And the date “2020-12-23” is a combination of a year, a month, and a day code (see Chapter 8).

You will want to be able to combine character values into a single value, and to separate a single value into parts.
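As a minimal sketch of both operations, the FIPS code above can be split into its state and county parts and then reassembled. This uses substr() and paste0(), both of which are covered later in this chapter; the variable names are only illustrative.

```r
fips <- "55025"

# Separate: characters 1-2 are the state, characters 3-5 are the county
state  <- substr(fips, start = 1, stop = 2)
county <- substr(fips, start = 3, stop = 5)
state
# [1] "55"
county
# [1] "025"

# Combine: join the parts back together with no separator
paste0(state, county)
# [1] "55025"
```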

Another aspect of working with character data is that they may represent the raw input from multiple people. If you have ever used social media you will appreciate that people’s views of acceptable capitalization, spelling, and punctuation vary enormously. Cleaning raw data is another important part of working with character data.

7.1 Combining Character Values

One basic task when working with character data is to combine elements from two or more vectors. This is useful whenever you need to construct a single variable to represent a value identified by multiple other variables. For example, you might have data about calendar dates given as separate month, day, and year variables. To combine these into a single vector, use the paste() function (see help(paste)).

month <- c("Apr", "Dec", "Jan")
day   <- c(3, 13, 23)
year  <- c(2001, 2009, 1997)

date_str <- paste(year, month, day, sep="-")
date_str
[1] "2001-Apr-3"  "2009-Dec-13" "1997-Jan-23"

The paste() operation is vectorized in much the same way that numeric operations are. Notice that the results are character values.

The sep argument specifies a character value to place between the data elements being combined. The default separator is a space. To have nothing added between the elements being combined, we can either specify a null string, sep="" (quotes with NO space between), or we can use the paste0() function.
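The three separator choices described above can be compared side by side. This is a small sketch; the vectors first and last are made up for illustration.

```r
first <- c("Ada", "Grace")
last  <- c("Lovelace", "Hopper")

paste(first, last)            # default separator is a single space
# [1] "Ada Lovelace" "Grace Hopper"
paste(first, last, sep = "")  # null string: nothing between elements
# [1] "AdaLovelace" "GraceHopper"
paste0(first, last)           # same result as sep = ""
# [1] "AdaLovelace" "GraceHopper"
```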

You might also use paste() or paste0() to construct a set of variable names with a common prefix. Notice the recycling in this example.

paste("Q", 1:4, sep="")
[1] "Q1" "Q2" "Q3" "Q4"
paste0("Q", 3, c("a", "b", "c"))
[1] "Q3a" "Q3b" "Q3c"

paste() and paste0() recycle each argument so that it matches the length of the longest argument, and then concatenate element-wise. In the paste() statement above, the longest argument (1:4) is four elements long, so all others (here, just "Q") are recycled to length four (c("Q", "Q", "Q", "Q")). In the paste0() statement, the longest argument (c("a", "b", "c")) has three elements, so the others ("Q" and 3) are recycled until they are three elements long (c("Q", "Q", "Q") and c(3, 3, 3)). Then, they are concatenated element-by-element (the first element of each vector, the second element of each, and so on). Note that paste() and paste0() will recycle an argument a non-whole number of times without a warning. Try paste0(c("a", "b", "c"), 1:2, "z") and notice how 1:2 is recycled to c(1, 2, 1) to have a length of three.

7.2 Working Within Character Values

A character value is an ordered collection of characters drawn from some alphabet. R is capable of working within a “local” alphabet, converting locales, or working in Unicode (a universal alphabet). The details of switching alphabets get complicated quickly, so we will skip that here.

The most basic manipulations of character data values are selecting specific characters in a value (matching), removing selected characters, or adding characters.

Matching can be done either by position within a value, or by character.
In the character value “12:08pm” we could operate on the fourth and fifth characters (to find “08”), or we can specify that we want to operate on the character pair “08” (finding the fourth position). We are either looking for an arbitrary character that occupies a specific position, or we are looking for an arbitrary position occupied by a specific character.

7.3 Position Indexing

The substr() and substring() functions use positions and return character values. The regexpr() function matches characters and returns starting positions (it is an index function).

x <- c("12:08pm", "12:10pm")
substr(x, start=4, stop=5)
[1] "08" "10"
regexpr("08", x)
[1]  4 -1
attr(,"match.length")
[1]  2 -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

Here the character string sought is found at the fourth position in the first value, and not at all (-1) in the second value. (Everything after the first line of output is metadata, which makes this output hard to read.)
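If only the starting positions are of interest, one option is to drop the metadata. The sketch below assumes that wrapping the result in as.integer() is acceptable for your purpose; it strips the attributes and leaves a plain vector of positions.

```r
x <- c("12:08pm", "12:10pm")

# as.integer() drops the match.length and other attributes
pos <- as.integer(regexpr("08", x))
pos
# [1]  4 -1

# A position of -1 means "no match", so this flags the values with a match
pos > 0
# [1]  TRUE FALSE
```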

Although this example of regexpr() is very simple, a word of warning is in order. Character matching functions in R rely on regular expressions, a system of specifying character patterns that includes literal characters (as in the example above), wildcards, and position anchors. We’ll come back to this, below.

To work with positions, it is often useful to know the length of a character value, for which we have the nchar() function.

If we wanted the last two characters of each character value, we could specify

x <- c("12:08pm", "12:10pm", "1:08pm")
x_len <- nchar(x)
substr(x, start=x_len-1, stop=x_len)
[1] "pm" "pm" "pm"

By using substr() on the left-hand side of the assignment operator, we can substitute in new substrings by position.

substr(x, start=x_len-1, stop=x_len) <- "am"
x
[1] "12:08am" "12:10am" "1:08am" 
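The substring() function mentioned at the start of this section works similarly, but its arguments are named first and last, and last defaults to a very large number (effectively the end of the string). A brief sketch, reusing the same example times:

```r
x <- c("12:08pm", "12:10pm", "1:08pm")

# From the second-to-last character through the end of each value
substring(x, first = nchar(x) - 1)
# [1] "pm" "pm" "pm"

# Everything except the last two characters
substr(x, start = 1, stop = nchar(x) - 2)
# [1] "12:08" "12:10" "1:08"
```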

7.4 Character Matching

While some character value manipulations are easily handled by position indexing, many others are handled more gracefully through character matching.

7.4.1 Global Wildcards

You may already be familiar with the concept of wildcards to specify patterns - computer operating systems all allow wildcards in searching for file names. These are sometimes referred to as global or glob wildcards.

For example, on a Windows computer you could open a Command window and type the following command

dir /b 2*.csv

Or on Mac or Linux, run this command in a terminal:

ls 2*.csv

to get a list of all the CSV files in your current folder beginning with the character “2” (if there are any).

The asterisk (*) wildcard matches any characters (zero or more) after the literal “2” at the beginning of file names, while the “.csv” literal matches only files which end with that file extension.

Similarly, a question mark matches a single arbitrary character.

With global wildcards, the pattern to match always specifies a string from beginning to end. So

  • "02*" matches any character string beginning with “02”
  • "*02" matches any string ending with “02”
  • "*02*" matches any string containing “02”.

R does not use global wildcards directly, but the glob2rx function can translate this type of wildcard into a regular expression for you.

glob2rx("02*")
[1] "^02"
glob2rx("*02")
[1] "^.*02$"
glob2rx("*02*")
[1] "^.*02"
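To connect this back to the file name example, the translated pattern can be passed to grep(), a function discussed in the next section. This is a sketch only; the vector files is a made-up stand-in for a folder listing.

```r
# Hypothetical file names standing in for a directory listing
files <- c("2019.csv", "2020.csv", "2020.txt", "a2020.csv")

# Translate the glob pattern; note that the literal dot is escaped
rx <- glob2rx("2*.csv")
rx
# [1] "^2.*\\.csv$"

# Keep only the names that begin with "2" and end with ".csv"
grep(rx, files, value = TRUE)
# [1] "2019.csv" "2020.csv"
```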

7.4.2 Regular Expression Wildcards

Regular expressions expand on the concept of wildcards, and allow us to match elements of arbitrary character vectors with more precise patterns than global wildcards. We expand the concept of a wildcard by separating

  • what characters to match
  • how many characters to match
  • where to match (what position)

A single arbitrary character is specified as a period, “.”, much like the global question mark, “?”. For example, one way to get a vector of column names that are at least four letters long using a regular expression would be

cars <- mtcars
grep("....", names(cars), value=TRUE)
[1] "disp" "drat" "qsec" "gear" "carb"

The grep() function searches a character vector for elements that match a pattern. It returns position indexes by default, or values that contain a match with the value=TRUE argument. The grepl() (grep logical) function returns a logical vector indicating which elements matched. These two functions give us all three methods of specifying indexes along a vector.
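A small sketch of the three kinds of result, using a made-up vector fruit. Either the positions or the logical vector can then be used to index the original vector.

```r
fruit <- c("apple", "kiwi", "banana")

grep("an", fruit)                # positions of matching elements
# [1] 3
grep("an", fruit, value = TRUE)  # the matching values themselves
# [1] "banana"
grepl("an", fruit)               # one TRUE/FALSE per element
# [1] FALSE FALSE  TRUE

# Position and logical indexes select the same elements
fruit[grep("i", fruit)]
# [1] "kiwi"
fruit[grepl("i", fruit)]
# [1] "kiwi"
```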

In addition to wildcard characters, we can also match literal characters, and literal substrings.

grep("a", names(cars), value=TRUE)
[1] "drat" "am"   "gear" "carb"
grep("ar", names(cars), value=TRUE)
[1] "gear" "carb"

7.4.2.1 Position

In contrast to global wildcards, these patterns match anywhere within a character value - they are position-less. To specify positions, we have two regular expression anchors, which tie a pattern to the beginning (“^”) or the end (“$”) of a string.

grep("m", names(cars), value=TRUE)  # any m
[1] "mpg" "am" 
grep("^m", names(cars), value=TRUE) # begins with m
[1] "mpg"
grep("m$", names(cars), value=TRUE) # ends with m
[1] "am"

Although we have added position qualifiers to our patterns, notice that we are still specifying partial strings, not whole strings. To specify a complete string, we use both anchors! One way to find column names that are exactly two characters long would be

grep("^..$", names(cars), value=TRUE)
[1] "hp" "wt" "vs" "am"

Without both anchors, this example would find all column names at least two characters long, including those with three and four characters.

7.4.2.2 Repetition

So far we have been specifying one character at a time, but the regular expression syntax also includes the concept of repetition. There are six ways to specify how many matches are required:

  • a question mark, “?”, matches zero or one time, making a character specification optional
  • an asterisk, “*”, matches zero or more times, so a character is optional but also may be repeated
  • a plus, “+”, matches one or more times, a character is required and may be repeated
  • braces with a number, “{n}” matches exactly n times
  • braces with two numbers, “{n,m}”, matches at least n times and no more than m times
  • no repetition qualifier means match exactly once

So another way to get two-letter column names would be to specify

grep("^.{2}$", names(cars))
[1] 4 6 8 9

While the global wildcard “?” is replaced by the dot in regular expressions, the global wildcard “*” is replaced with the regular expression “.*”.
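The repetition qualifiers listed above can be compared on a single example. In this sketch the vector x is made up, and each qualifier applies to the letter “u” that precedes it.

```r
x <- c("color", "colour", "colouur")

grepl("^colou?r$", x)      # zero or one "u"
# [1]  TRUE  TRUE FALSE
grepl("^colou*r$", x)      # zero or more "u"
# [1] TRUE TRUE TRUE
grepl("^colou+r$", x)      # one or more "u"
# [1] FALSE  TRUE  TRUE
grepl("^colou{2}r$", x)    # exactly two "u"
# [1] FALSE FALSE  TRUE
grepl("^colou{1,2}r$", x)  # one or two "u"
# [1] FALSE  TRUE  TRUE
```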

7.4.2.3 Character Class

So far we have introduced arbitrary matches and literal matches, but regular expressions are able to work between these two extremes as well. We can specify classes (sets) of characters to match, and we can do this by itemizing the whole class, or using a shortcut name for some classes.

Square brackets, “[ ]”, are used to itemize classes, and to specify shortcut names. As an arbitrary example, column names that begin with “a” or “b” or “c” could be specified

grep("^[abc]", names(cars), value=TRUE)
[1] "cyl"  "am"   "carb"

Notice that this interacts with the repetition qualifiers. To require the first two characters to belong to the same character class we would specify

grep("^[abc]{2}", names(cars), value=TRUE)
[1] "carb"

The twelve shortcut names that are predefined in R are documented on the help("regex") page, and they include

  • [:alpha:], alphabetic characters
  • [:digit:], numerals
  • [:punct:], punctuation marks

These shortcuts come from the POSIX regular expression standard rather than being unique to R, and must themselves be used within class brackets. Contrast the first (correct) example with the second (incorrect) example. Why does the second line pick out “def”?

grep("[[:digit:]]", c("abc", "123", "def"), value=TRUE)
[1] "123"
grep("[:digit:]", c("abc", "123", "def"), value=TRUE)
[1] "def"
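The other shortcut names listed above work the same way. A brief sketch with a made-up vector, combining the classes with anchors and the “+” qualifier:

```r
x <- c("abc", "a.b", "12-3", "42")

grep("[[:punct:]]", x, value = TRUE)     # contains a punctuation mark
# [1] "a.b"  "12-3"
grep("^[[:alpha:]]+$", x, value = TRUE)  # letters only, start to end
# [1] "abc"
grep("^[[:digit:]]+$", x, value = TRUE)  # numerals only, start to end
# [1] "42"
```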

7.4.2.4 Regular Expression Metacharacters

We can match arbitrary characters, specified classes of characters, and most literal characters. However, we are using some characters as metacharacters with special meaning in regular expression patterns. The dot (period, .), asterisk (*), question mark (?), plus sign (+), caret (^), dollar sign ($), square brackets ([, ]), braces ({, }), dash (-), and a few more we haven’t discussed, all have a non-literal meaning. What if we want to use these as literal characters?

There are generally two ways to take a metacharacter and use it as a literal. We can specify it within a square bracket class, or we can escape it. Either method comes with caveats.

To “escape” a character, that is, to ignore its special meaning and use it as a literal, we typically think of preceding it with a backslash, “\”. However, the backslash is also the escape character inside R character strings, so to pass a single backslash through to the regular expression we must double it. That is, to use an escape character, we need to first escape the escape character!

For example, to find a literal dollar sign contrast the correct specification with two mistakes.

grep("\\$", c("$12.95", "40502"), value=TRUE) # correct
[1] "$12.95"
grep("$", c("$12.95", "40502"), value=TRUE)  # wrong: no backslash = end of string
[1] "$12.95" "40502" 
grep("\$", c("$12.95", "40502"), value=TRUE) # wrong: error
Error: '\$' is an unrecognized escape in character string (<text>:3:8)

An alternative is to write (most) metacharacters within a class.

grep("[$]", c("$12.95", "40502"), value=TRUE) # correct
[1] "$12.95"

The caveat here is that the caret, dash, and backslash all have special meaning within character classes.

(A third approach is to turn off regular expression matching and use only literal matching. Use the fixed=TRUE argument.)
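A minimal sketch of the fixed=TRUE approach, reusing the same two values (stored here in a vector named prices for convenience). With literal matching, neither the dollar sign nor the dot needs escaping.

```r
prices <- c("$12.95", "40502")

# "$" is treated as a literal dollar sign, not as the end-of-string anchor
grep("$", prices, value = TRUE, fixed = TRUE)
# [1] "$12.95"

# "." is treated as a literal period, not as "any character"
grep(".", prices, value = TRUE, fixed = TRUE)
# [1] "$12.95"
```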

7.4.2.5 More

There is more to regular expression specification: negation, alternation, substrings, and other less used elements. Whole books have been written about regular expressions. To learn more, a useful reference is

Regular Expressions Info
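As a brief taste of two of these elements, the sketch below shows negation (a caret as the first character inside a class) and alternation (a vertical bar between alternatives). The vector v is made up, and this is only an illustration of syntax that this chapter does not otherwise cover.

```r
v <- c("am", "carb", "gear", "hp")

# Negation: first character is anything except "a" or "c"
grep("^[^ac]", v, value = TRUE)
# [1] "gear" "hp"

# Alternation: the whole value is either "am" or "hp"
grep("^(am|hp)$", v, value = TRUE)
# [1] "am" "hp"
```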

7.5 Substitution

Simply identifying matching values is useful for creating indicator variables (grepl) or for creating sets of variable names (grep). But for data values, we often want to manipulate the values we identify. One of the main tools we will use for this is substitution (sub) and repeated substitution (gsub).

Substitution where a regular expression pattern appears at most once in each value is straightforward.

Returning to the time example, which we previously solved by positional substitution, we can use the very simple regular expression “pm” to identify matching characters to replace with “am”.

x <- c("12:08pm", "12:10pm", "1:08pm")
sub("pm", "am", x)
[1] "12:08am" "12:10am" "1:08am" 

Substitution also works as a method of deleting matched characters, when the replacement is the null string (quotes with no space).
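A short sketch of deletion, which also shows the difference between sub() (first match only) and gsub() (every match). The phone number is a made-up value.

```r
x <- c("12:08pm", "12:10pm", "1:08pm")

# Replace "pm" with the null string to delete it
sub("pm", "", x)
# [1] "12:08" "12:10" "1:08"

phone <- "608-555-1234"
sub("-", "", phone)   # removes only the first dash
# [1] "608555-1234"
gsub("-", "", phone)  # removes every dash
# [1] "6085551234"
```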

7.6 Exercises

  1. Percent to proportion: Given a character vector with values in percent form, convert these to numerical proportions, values between 0 and 1.

    x <- sprintf("%4.2f%%", runif(5)*100)
    x
    [1] "72.91%" "29.23%" "91.91%" "86.25%" "65.51%"
  2. Currency is sometimes denoted with both a currency symbol and commas. Convert these to numeric values.

    x <- c("$10", "$11.99", "$1,011.01")
  3. Inconsistent capitalization is a problem with some alphabets. The output of table(colors2) indicates that we have four unique values, although only two colors are present. Standardize the capitalization with either tolower() or toupper() so that table() correctly tabulates our values.

    colors2 <- c("Red", "blue", "red", "blue", "RED")

7.7 Advanced Exercises

  1. Some countries use a comma rather than a period to separate the decimal, and a period as a thousands delimiter. For example, instead of writing one thousand two hundred thirty-four dollars and fifty-six cents as $1,234.56, they may write it as $1.234,56. The currency symbol may also be placed after the amount, such as 20$ rather than $20. Convert these alternative currency expressions into numeric values:

    currency <- c("$1.234,56", "20$", "$12,99", "5.555 $")
  2. Translating wildcard patterns

    The glob2rx() function translates character strings with wildcards (* for any string, ? for a single character) into regular expressions. We can translate “Merc *” (a string starting with “Merc” and a space, followed by anything) into “^Merc ” (note the trailing space). Combining this with grep() allows us to select rows from our mtcars data frame. (Note that value = TRUE returns values, while the default value = FALSE returns positions.)

    glob2rx("Merc *")
    [1] "^Merc "
    grep(glob2rx("Merc *"), row.names(mtcars), value=TRUE)
    [1] "Merc 240D"   "Merc 230"    "Merc 280"    "Merc 280C"   "Merc 450SE"  "Merc 450SL"  "Merc 450SLC"
    grep(glob2rx("Merc *"), row.names(mtcars))
    [1]  8  9 10 11 12 13 14
    mtcars[grep(glob2rx("Merc *"), row.names(mtcars)), ]
                 mpg cyl  disp  hp drat   wt qsec vs am gear carb
    Merc 240D   24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
    Merc 230    22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
    Merc 280    19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
    Merc 280C   17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
    Merc 450SE  16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
    Merc 450SL  17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
    Merc 450SLC 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3

    Now, try selecting rows from mtcars where the row name…

    • starts with “Toyota” and a space
    • starts with any four characters and then a space
    • ends with 0
    • ends with a space and then any three characters