SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

5.2 Character variables

5.2.1 Data concepts

Character strings are ordered one dimensional data. Like other one dimensional data objects, they can be indexed and subsetted.
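
As a small illustration, here is a sketch in Python of indexing and subsetting a character string. The variable names are made up for this example.

```python
# A character string is an ordered sequence of characters.
text = "carrots"

first_char = text[0]     # index a single character (positions start at 0)
last_char = text[-1]     # negative positions count from the end
first_three = text[0:3]  # subset the characters at positions 0, 1, and 2

print(first_char, last_char, first_three)
```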

5.2.1.1 Concatenation

Concatenation is joining two ordered data objects by placing one after the other. For example, the concatenation of abc with xyz would be abcxyz.
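
The abc and xyz example can be sketched in Python as follows. The variable names are only for illustration, and the + operator is one of several ways to concatenate strings.

```python
left = "abc"
right = "xyz"

# The + operator places the second string after the first.
joined = left + right

# str.join() concatenates any number of strings from a list.
also_joined = "".join([left, right])

print(joined)
```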

5.2.1.2 Patterns in text

Finding and changing patterns in character strings is done using regular expressions, a syntax for describing the patterns of characters to be matched. This section provides some of the commonly used matching tools from regular expressions. There is far more to regular expressions than what is covered here.

In regular expressions you create a character description of the pattern you want to match. If you know the exact character string you want to find, you provide that as the search string. Often the pattern we are interested in has some variability in it. For example, if we wanted to find all substrings that start with w and end with d, we would need a way to represent one or more unknown characters. Regular expressions provide . for any character and + to repeat what precedes it one or more times. Using these allows us to write w.+d to search for substrings that start with w and end with d.
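
A minimal sketch of the w.+d pattern, assuming Python's built-in re module. The two test strings are made up for this example.

```python
import re

pattern = "w.+d"  # a w, then one or more of any character, then a d

found = re.search(pattern, "a word here")
not_found = re.search(pattern, "wd")  # nothing between the w and d, so no match

print(found.group())
print(not_found)
```

The search returns None for wd because + requires at least one character between the w and the d.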

The following tables list some of the commonly used regular expression symbols. These symbols fall into four groups.

The first group is used to identify a particular set of characters to match.

operator   description
.          Matches any character
\w         Matches any word character (a letter, digit, or underscore)
\d         Matches any digit
\s         Matches any white space character
\S         Matches any character that is not white space
[ ]        Used to create a list of possible characters to match
[^ ]       Used to create a list of possible characters to not match
|          Or operator, allows matching one of a set of strings

A few examples searching the string carrots are vegetables

  • car matches a c followed by an a and then an r. All three have to be present and be in exactly that order. This would match the car of carrots.

  • rac would not find a match. All of these characters are in the string. But, not in this exact order anywhere.

  • [rac] matches an r, a, or c. It will match the c of carrots. It can only match one character, so it would not match any characters after the c.

  • [^rac] matches any character that is not an r, a, or c. This would match the o of carrots, the first non r, a, or c character.

  • ab|ar matches ab, or ar. It will match only one of these two strings of characters. This would match the ar of carrots, since it appears before the ab of vegetables.

  • e\w matches an e followed by another word character. This would match the eg of vegetables, since it appears before the et of vegetables. It would not match the e of are since the blank character is not a word character.

  • \s matches a space, tab, line break, or form feed. It will match only one such character. This would match the space between carrots and are.
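
The examples above can be checked with a short sketch using Python's built-in re module. The helper function first_match() is made up for this illustration. The patterns that contain a backslash are written as raw strings, r'...', so the backslash reaches the regular expression unchanged.

```python
import re

text = "carrots are vegetables"

def first_match(pattern, string):
    """Return the text of the first match, or None when nothing matches."""
    found = re.search(pattern, string)
    return found.group() if found else None

patterns = ["car", "rac", "[rac]", "[^rac]", "ab|ar", r"e\w", r"\s"]
results = {pattern: first_match(pattern, text) for pattern in patterns}

for pattern, matched in results.items():
    print(f"{pattern!r:10} -> {matched!r}")
```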

The second group is used to indicate how many times a character, or a character from a set, is to be matched.

operator   description
+          Match the prior character or set one or more times
*          Match the prior character or set zero or more times
?          Match the prior character or set at most one time

A few examples searching the string carrots are vegetables

  • ar+o matches an a followed by one or more r characters and then an o. This would match the arro of carrots.

  • ar?o matches an a followed by at most one r character and then an o. This would not find a match.

  • t[able]*s matches a t followed by any sequence, possibly empty, made up of only a, b, l, and e characters and then an s. This would match the ts of carrots, since there are zero a, b, l, and e characters between the t and s.
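
A sketch of these three patterns using Python's built-in re module. The variable names are made up for this example. re.findall() is used for the last pattern to show every match in the string, not only the first.

```python
import re

text = "carrots are vegetables"

one_or_more = re.search("ar+o", text)   # a, one or more r, then o
at_most_one = re.search("ar?o", text)   # a, zero or one r, then o
zero_or_more = re.search("t[able]*s", text)

# findall() returns every non-overlapping match, in order.
all_t_to_s = re.findall("t[able]*s", text)

print(one_or_more.group())
print(at_most_one)
print(zero_or_more.group())
print(all_t_to_s)
```

The first match for t[able]*s is the ts of carrots. The later match, tables, is only found when all matches are requested.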

The third group is used to identify where in a string to start or end a match.

operator   description
\b         Matches the empty string at either edge of a word
^          Matches the start of the character string
$          Matches the end of the character string

A few examples searching the string carrots are vegetables

  • s\b matches an s at the end of a word, that is, an s followed by a non-word character or by the end of the string. This would match the s of carrots.

  • s$ matches an s at the end of the string. This would match the s of vegetables.
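
A sketch of the two anchors using Python's built-in re module. The variable names are made up for this example. The start() method reports the position of a match, counting from 0.

```python
import re

text = "carrots are vegetables"

end_of_word = re.search(r"s\b", text)    # first s at the end of a word
end_of_string = re.search(r"s$", text)   # an s at the end of the string

# Both words ending in s are found when all matches are requested.
all_word_final_s = re.findall(r"s\b", text)

print(end_of_word.start())
print(end_of_string.start())
print(len(all_word_final_s))
```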

The fourth group is used to reference the characters matched inside parentheses. The repetition operators are greedy, so a set of parentheses will capture as large a string as can be matched to the pattern inside the (). More than one sub-string can be captured by using additional sets of (). A captured sub-string is referenced using \n, where n is an integer giving the position of the () to be used, counting from left to right, for example \1 for the first set. This form of referencing is useful when you want to include part of what was matched in a replacement string.

operator   description
()         Captures the matched sub-string so it can be back referenced

A few examples searching the string carrots are vegetables

  • \s(ar) matches a white space character followed by ar, and captures the ar. This would capture the ar of are, but not the ar of carrots, which is not preceded by a space.

  • \s(\w+)$ captures the last word of a string, provided the string ends with a word character and the last word is preceded by a white space character. This would capture vegetables.
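
A sketch of capturing and back referencing using Python's built-in re module. The variable names and the replacement string are made up for this example. group(0) is the whole match and group(1) is what the first set of () captured.

```python
import re

text = "carrots are vegetables"

after_space = re.search(r"\s(ar)", text)
last_word = re.search(r"\s(\w+)$", text)

print(repr(after_space.group(0)))  # whole match, including the space
print(after_space.group(1))        # only the captured characters
print(last_word.group(1))

# \1 in the replacement string refers back to the captured last word.
bracketed = re.sub(r"\s(\w+)$", r" [\1]", text)
print(bracketed)
```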

In R and Python, regular expressions are specified as a character string. The backslash, \, is a special character in strings and is used for things such as \n for new line. To use a backslash in a string, it needs to be escaped. That is, two backslashes together in a string equal one backslash character. For example, to use \b in a pattern you have to write \\b. The first \ is the escape and tells the string to treat the next character as a normal character, as opposed to a character with special meaning in the string.
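
The effect of escaping can be seen in a short Python sketch. The variable names are made up for this example. Python also allows a raw string, written with an r before the opening quote, in which backslashes are kept as typed.

```python
import re

text = "carrots are vegetables"

escaped = "\\bare\\b"   # each \\ becomes one backslash in the string
raw = r"\bare\b"        # raw string: backslashes are kept as typed
unescaped = "\bare\b"   # here \b is turned into a backspace character

print(escaped == raw)
print(len(escaped), len(unescaped))

word_match = re.search(escaped, text)   # finds the word are
no_match = re.search(unescaped, text)   # looks for backspace characters
print(word_match.group())
print(no_match)
```

The unescaped pattern finds nothing because it searches for two backspace characters, which do not appear in the text.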

5.2.2 Examples - R

These examples use the TitanicSurvival.csv data set.

  1. We begin by loading the tidyverse, importing the csv file, and setting variable names.

    library(tidyverse)
    titanic_path <- file.path("..", "datasets", "TitanicSurvival.csv")
    titanic_in <- read_csv(titanic_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    titanic <- 
      titanic_in %>%
      rename(
        name = X1,
        class = passengerClass
      )
    
    glimpse(titanic)
    Observations: 1,309
    Variables: 5
    $ name     <chr> "Allen, Miss. Elisabeth Walton", "Allison, Master. Hu...
    $ survived <chr> "yes", "yes", "no", "no", "no", "yes", "yes", "no", "...
    $ sex      <chr> "female", "male", "female", "male", "female", "male",...
    $ age      <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 6...
    $ class    <chr> "1st", "1st", "1st", "1st", "1st", "1st", "1st", "1st...

    The types of all the variables appear to be correct. That is, the column with numbers is numeric and the text columns are type character.

  2. Check for duplicates.

    I think that the name values will uniquely identify each passenger. I will check for duplicate names and display name, age, and class.

    dups <-
      titanic %>%
      filter(
        duplicated(titanic[,"name"]) |
          duplicated(titanic[,"name"], fromLast=TRUE)
        ) %>%
      select(name, age, class)
    
    glimpse(dups)
    Observations: 0
    Variables: 3
    $ name  <chr> 
    $ age   <dbl> 
    $ class <chr> 

    The dups data frame is empty. There are no duplicates.

  3. Using concatenation to change a variable.

    Here we concatenate the string " class" to the values of the class variable.

    The str_c() function will be used to concatenate.

    titanic <- mutate(titanic, class = str_c(class, " class"))
    
    titanic %>%
      select(name, class) %>%
      head()
    # A tibble: 6 x 2
      name                            class    
      <chr>                           <chr>    
    1 Allen, Miss. Elisabeth Walton   1st class
    2 Allison, Master. Hudson Trevor  1st class
    3 Allison, Miss. Helen Loraine    1st class
    4 Allison, Mr. Hudson Joshua Crei 1st class
    5 Allison, Mrs. Hudson J C (Bessi 1st class
    6 Anderson, Mr. Harry             1st class
  4. Create a variable for first name and one for last name.

    This example uses the separate() function. The separate() function looks for a delimiting string, the sep parameter, in a character variable and divides the variable into separate columns. The sep parameter is a string containing a regular expression. You provide the names of the new columns to separate(). The extra parameter defines what to do if there is a mismatch between the number of new columns defined and the number of separate values found for a row.

    The name variable uses a comma to separate a person's last name from their title and first name. There may be more than one comma in some of the names. We use the merge option of the extra parameter here to include any extra strings with the last string.

    The first variable is then separated again to split the person's title from their first name.

    titanic <-
      titanic %>%
      separate(
        name,
        c("last", "first"),
        sep = ",",
        extra = "merge",
        remove = FALSE
        ) %>%
      separate(first, c("title", "first"), sep = "\\. ", extra = "merge")
    
    glimpse(titanic)
    Observations: 1,309
    Variables: 8
    $ name     <chr> "Allen, Miss. Elisabeth Walton", "Allison, Master. Hu...
    $ last     <chr> "Allen", "Allison", "Allison", "Allison", "Allison", ...
    $ title    <chr> " Miss", " Master", " Miss", " Mr", " Mrs", " Mr", " ...
    $ first    <chr> "Elisabeth Walton", "Hudson Trevor", "Helen Loraine",...
    $ survived <chr> "yes", "yes", "no", "no", "no", "yes", "yes", "no", "...
    $ sex      <chr> "female", "male", "female", "male", "female", "male",...
    $ age      <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 6...
    $ class    <chr> "1st class", "1st class", "1st class", "1st class", "...

    If you look closely at the new name variables you will find observations where this process did not fully identify the person's name correctly. This is not uncommon when wrangling data. More clean up work would be needed. This example is enough to demonstrate the skills that would be needed to finish the clean up.

  5. Create a variable that identifies both class and gender.

    The unite() function is the inverse of the separate() function. It concatenates strings from multiple columns into a single column.

    We will use unite() here to create a new class_sex variable that has the gender value separated from the class value by a space character.

    titanic <-
      titanic %>%
      unite(class_sex, class, sex, sep = " ", remove = FALSE)
    
    titanic %>%
      select(name, class_sex, class, sex) %>%
      head()
    # A tibble: 6 x 4
      name                            class_sex        class     sex   
      <chr>                           <chr>            <chr>     <chr> 
    1 Allen, Miss. Elisabeth Walton   1st class female 1st class female
    2 Allison, Master. Hudson Trevor  1st class male   1st class male  
    3 Allison, Miss. Helen Loraine    1st class female 1st class female
    4 Allison, Mr. Hudson Joshua Crei 1st class male   1st class male  
    5 Allison, Mrs. Hudson J C (Bessi 1st class female 1st class female
    6 Anderson, Mr. Harry             1st class male   1st class male  
  6. Create a variable to identify the observations that are women using only the title column. This is a more advanced example.

    This is an artificial task (since there is already a sex variable) to demonstrate using patterns.

    This example uses the regular expression symbols to identify patterns that identify a row as being a woman. There are two common designations for women in the title variable, Mrs. and Miss. Both begin with an M and end with an s. None of the other titles meet both of these conditions. The matching pattern is then M.+s$, a pattern that starts with M, followed by at least one other character, and ends with an s.

    There are a few other titles used by women. They are Mme., Lady, Mlle., and the Countess. of.

    The str_detect() function returns a logical variable indicating if the searched for string was found.

    titanic <-
      titanic %>%
      mutate(woman = str_detect(title, "M.+s$|M.+e$|Lady$|Countess"))
    
    titanic %>%
      select(title, woman) %>%
      head()
    # A tibble: 6 x 2
      title     woman
      <chr>     <lgl>
    1 " Miss"   TRUE 
    2 " Master" FALSE
    3 " Miss"   TRUE 
    4 " Mr"     FALSE
    5 " Mrs"    TRUE 
    6 " Mr"     FALSE

    There are many more tidyverse functions that operate on character variables. Google the tidyverse package stringr to read about these other useful functions.

5.2.3 Examples - Python

These examples use the TitanicSurvival.csv data set.

  1. We begin by loading the packages, importing the csv file, and renaming the variables.

    from pathlib import Path
    import pandas as pd
    titanic_path = Path('..') / 'datasets' / 'TitanicSurvival.csv'
    titanic_in = pd.read_csv(titanic_path)
    titanic_in = (
        titanic_in
            .rename(
                columns={
                    'Unnamed: 0': 'name',
                    'passengerClass': 'pass_class'}))
    titanic = titanic_in.copy(deep=True)
    
    print(titanic.dtypes)
    name           object
    survived       object
    sex            object
    age           float64
    pass_class     object
    dtype: object

    The types of all the variables appear to be correct. That is, the column with numbers is numeric and the text columns are type object.

    Note, the passengerClass variable was renamed to pass_class and not class, since class is a reserved word in Python.

  2. Check for duplicates.

    I think that the name values will uniquely identify each passenger. I will check for duplicate names and display name, age, and pass_class.

    dup_rows = titanic.duplicated(subset=['name'], keep=False)
    dups = titanic.loc[dup_rows.values, ['name', 'age', 'pass_class']]
    
    print(dups.head(10))
    Empty DataFrame
    Columns: [name, age, pass_class]
    Index: []

    The dups data frame is empty. There are no duplicates.

  3. Using concatenation to change a variable.

    The plus operator, +, does concatenation when used with string variables. We use it here to concatenate the string ' class' to the pass_class variable.

    titanic = (
        titanic
            .assign(pass_class=lambda df: df['pass_class'] + ' class'))
    
    (titanic
        .loc[:, ['name', 'pass_class']]
        .head()
        .pipe(print))
                                  name pass_class
    0    Allen, Miss. Elisabeth Walton  1st class
    1   Allison, Master. Hudson Trevor  1st class
    2     Allison, Miss. Helen Loraine  1st class
    3  Allison, Mr. Hudson Joshua Crei  1st class
    4  Allison, Mrs. Hudson J C (Bessi  1st class
  4. Create a variable for first name and one for last name.

    This example uses the str.extract() method to subset each of the character strings in the name variable. This method uses a regular expression to identify the characters that will be subsetted from each string in the variable.

    Python strings can be specified in raw form by placing an r in front of the opening quote. This allows backslashes to be used as backslashes without needing to be escaped, so r'\d' is the same as '\\d'. In a raw string, sequences such as \n are not converted to special characters such as new line. A pattern such as '\d', written without the r, often still works, because Python leaves a backslash in place when it is followed by a character that does not form a recognized escape sequence. Recent versions of Python warn about such sequences, and I would avoid relying on this behavior. I would use the double backslash for a backslash or explicitly use a raw string by placing the r before the opening quote.

    The name variable uses a comma to separate a person's last name from their title and first name. We will extract the last name using () to identify which characters in the match are to be subsetted. We expect one or more characters, +, in the name. These characters of the name will not be a comma, [^,].

    The title is expected to be one or more characters, +, that are not a period, [^\.], that follow a comma and a space, ", ", and are followed by a period, \..

    The first name is then everything that follows the period.

    titanic = (
        titanic
            .assign(
                last=lambda df: 
                    df['name'].str.extract('([^,]+),', expand=True),
                title=lambda df: 
                    df['name'].str.extract(', ([^\\.]+)\\.', expand=True),
                first=lambda df: 
                    df['name'].str.extract('\\. (.+)', expand=True)))
    
    (titanic
        .loc[:, ['name', 'title', 'first', 'last']]
        .head()
        .pipe(print))
                                  name   title               first     last
    0    Allen, Miss. Elisabeth Walton    Miss    Elisabeth Walton    Allen
    1   Allison, Master. Hudson Trevor  Master       Hudson Trevor  Allison
    2     Allison, Miss. Helen Loraine    Miss       Helen Loraine  Allison
    3  Allison, Mr. Hudson Joshua Crei      Mr  Hudson Joshua Crei  Allison
    4  Allison, Mrs. Hudson J C (Bessi     Mrs   Hudson J C (Bessi  Allison

    If you look closely at the new name variables you will find observations where this process did not fully identify the person's name correctly. This is not uncommon when wrangling data. More clean up work would be needed. This example is enough to demonstrate the skills that would be needed to finish the clean up.

  5. Create a variable that identifies both class and gender.

    We will use the concatenation operator here to create a new class_sex variable that has the gender value separated from the pass_class variable by a space character.

    titanic = (
        titanic
            .assign(class_sex=lambda df: df['pass_class'] + ' ' + df.sex))
    
    (titanic
        .loc[:, ['name', 'class_sex', 'pass_class', 'sex']]
        .head()
        .pipe(print))
                                  name         class_sex pass_class     sex
    0    Allen, Miss. Elisabeth Walton  1st class female  1st class  female
    1   Allison, Master. Hudson Trevor    1st class male  1st class    male
    2     Allison, Miss. Helen Loraine  1st class female  1st class  female
    3  Allison, Mr. Hudson Joshua Crei    1st class male  1st class    male
    4  Allison, Mrs. Hudson J C (Bessi  1st class female  1st class  female
  6. Create a variable to identify the observations that are women using only the title column. This is a more advanced example.

    This is an artificial task (since there is already a sex variable) to demonstrate using patterns.

    This example uses regular expressions to identify patterns that identify a row as being a woman. There are two common designations for women in the title variable, Mrs. and Miss. Both begin with an M and end with an s. None of the other titles meet both of these conditions. The matching pattern is then M.+s$, a pattern that starts with M, followed by at least one other character, and ends with an s.

    There are a few other titles used by women. They are Mme., Lady, Mlle., and the Countess. of.

    The str.contains() method returns a logical variable indicating if the searched for string was found.

    titanic = (
        titanic
            .assign(
                woman=lambda df: 
                    df['title'].str.contains('M.+s$|M.+e$|Lady$|Countess')))
    
    (titanic
        .loc[:, ['name', 'title', 'woman']]
        .head()
        .pipe(print))
                                  name   title  woman
    0    Allen, Miss. Elisabeth Walton    Miss   True
    1   Allison, Master. Hudson Trevor  Master  False
    2     Allison, Miss. Helen Loraine    Miss   True
    3  Allison, Mr. Hudson Joshua Crei      Mr  False
    4  Allison, Mrs. Hudson J C (Bessi     Mrs   True
  7. Other pandas string methods.

    There are many more pandas string methods that operate on character variables. Google the pandas str methods to read about these other useful methods. A few of the more useful methods are str.replace(), str.slice(), and str.pad().
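
A brief sketch of these three methods, assuming pandas is installed. The two names are copied from the examples above, and the variable names are made up for this illustration.

```python
import pandas as pd

names = pd.Series(["Allen, Miss. Elisabeth", "Anderson, Mr. Harry"])

# str.replace() with a regular expression. \1 and \2 refer back to the
# first and second sets of parentheses in the pattern.
swapped = names.str.replace(r"^([^,]+), (.+)$", r"\2 \1", regex=True)

# str.slice() subsets each string by position.
first_five = names.str.slice(0, 5)

# str.pad() pads each string to a minimum width, here on the left side.
padded = first_five.str.pad(8, side="left", fillchar=".")

print(swapped.tolist())
print(first_five.tolist())
print(padded.tolist())
```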

5.2.4 Exercises

These exercises use the mtcars.csv data set.

  1. Import the mtcars.csv data set.

  2. Divide the column that has the car name into columns that contain the make and model of the car.

  3. Do all observations have a make and model value? If there are missing values, can you fix them? (Hint, use google to help you.)

  4. Some car companies have more than one make. In this data Chrysler, Plymouth, and Dodge were all made by Chrysler. Likewise Cadillac and Pontiac are made by GM, and Lincoln and Ford are both made by Ford. Create a company variable based on the data in the make variable.

  5. Create a name for use in displaying results. It is a character string composed of the make, a space character, the company in parentheses, (), if the company name is not the same as the make, and the model.