5.2 Character variables
5.2.1 Data concepts
Character strings are ordered, one-dimensional data. Like other one-dimensional data objects, they can be indexed and subsetted.
5.2.1.1 Concatenation
Concatenation is joining two ordered data objects by placing one after the other. For example, the concatenation of abc with xyz would be abcxyz.
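As a minimal sketch in Python (the variable names first, second, and joined are illustrative assumptions, not part of the text), concatenation, indexing, and subsetting of character strings could look like this:

```python
# Illustrative strings; the variable names are assumptions for this sketch.
first = "abc"
second = "xyz"

# Concatenation places the second string after the first.
joined = first + second
print(joined)        # abcxyz

# Strings are ordered, so they can be indexed and subsetted by position.
print(joined[0])     # a
print(joined[1:4])   # bcx
```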
5.2.1.2 Patterns in text
Finding and changing patterns in character strings is done using regular expressions, a syntax for describing patterns of characters to be matched. This section provides some of the commonly used matching tools from regular expressions. There is far more to regular expressions than what is covered here.
In regular expressions you create a character description of the pattern you want to match. If you know the exact character string you want to find, you would provide that as the search string. Often the pattern we are interested in has some variability in it. For example, if we wanted to find all subsets that start with w and end with d, we would need a way to represent one or more unknown characters. Regular expressions provide . for any character and + to repeat what precedes it one or more times. Using these allows us to write w.+d to search for subsets that start with w and end with d.
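A minimal sketch of the w.+d pattern, using Python's standard re module. The strings password and wd are hypothetical test values chosen for this illustration.

```python
import re

# Hypothetical test strings, chosen only for illustration.
found = re.search(r"w.+d", "password")
print(found.group())   # word

# + requires at least one character between the w and the d,
# so a w immediately followed by a d is not a match.
not_found = re.search(r"w.+d", "wd")
print(not_found)       # None
```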
The following tables list some of the commonly used regular expression symbols. These symbols fall into four groups.
The first group is used to identify a particular set of characters to match.
operator | description |
---|---|
. | Matches any character |
\w | Matches any word character |
\d | Matches any digit |
\s | Matches any white space |
\S | Matches any character that is not white space |
[ ] | Used to create a list of possible characters to match |
[^ ] | Used to create a list of possible characters to not match |
| | Or operator, allows matching a set of strings |
A few examples searching the string carrots are vegetables:

car matches a c followed by an a and then an r. All three have to be present and be in exactly that order. This would match the car of carrots.
rac would not find a match. All of these characters are in the string, but not in this exact order anywhere.
[rac] matches an r, a, or c. It will match the c of carrots. It can only match one character, so it would not match any characters after the c.
[^rac] matches any character that is not an r, a, or c. This would match the o of carrots, the first character that is not an r, a, or c.
ab|ar matches ab or ar. It will match only one of these two strings of characters. This would match the ar of carrots, since it appears before the ab of vegetables.
e\w matches an e followed by another word character. This would match the eg of vegetables, since it appears before the et of vegetables. It would not match the e of are, since the blank character is not a word character.
\s matches a space, tab, line break, or form feed. It will match only one such character. This would match the space between carrots and are.
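The examples above can be checked with a short sketch using Python's standard re module. The function re.search() returns the first (leftmost) match, which is the match described in each example. The names text, patterns, and results are assumptions for this sketch.

```python
import re

text = "carrots are vegetables"
patterns = ["car", "rac", "[rac]", "[^rac]", "ab|ar", r"e\w", r"\s"]

# Record the first match for each pattern, or None if there is no match.
results = {}
for pattern in patterns:
    match = re.search(pattern, text)
    results[pattern] = match.group() if match else None
    print(repr(pattern), "->", repr(results[pattern]))
```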
The second group is used to indicate how many times a particular character, or a character from a set of characters, is to be matched.
operator | description |
---|---|
+ | Match the prior character one or more times |
* | Match the prior character zero or more times |
? | Match the prior character at most one time |
A few examples searching the string carrots are vegetables:

ar+o matches an a followed by one or more r characters and then an o. This would match the arro of carrots.
ar?o matches an a followed by at most one r character and then an o. This would not find a match.
t[able]*s matches a t followed by any sequence of a, b, l, and e characters and then an s. This would match the ts of carrots, since there are zero a, b, l, and e characters between the t and s.
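A minimal sketch of the three repetition examples, using Python's standard re module. The variable names are assumptions for this sketch. The re.findall() call is added to show that t[able]*s has a second, later match in the string.

```python
import re

text = "carrots are vegetables"

one_or_more = re.search(r"ar+o", text)
print(one_or_more.group())    # arro

at_most_one = re.search(r"ar?o", text)
print(at_most_one)            # None

zero_or_more = re.search(r"t[able]*s", text)
print(zero_or_more.group())   # ts

# findall() returns every non-overlapping match, not only the first one.
all_matches = re.findall(r"t[able]*s", text)
print(all_matches)            # ['ts', 'tables']
```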
The third group is used to identify where in a string to start or end a match.
operator | description |
---|---|
\b | Used to identify empty string at either edge of a word |
^ | Used to identify the start of the character string |
$ | Used to identify the end of the character string |
A few examples searching the string carrots are vegetables:

s\b matches an s at the end of a word, that is, an s followed by a non-word character or by the end of the string. This would match the s of carrots.
s$ matches an s at the end of the string. This would match the s of vegetables.
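A minimal sketch of the anchor examples, using Python's standard re module. The span() method reports the start and end positions of a match, which makes it possible to see which s was matched. The ^ examples are additions for illustration and the variable names are assumptions.

```python
import re

text = "carrots are vegetables"

word_end = re.search(r"s\b", text)
print(word_end.span())      # (6, 7), the s of carrots

string_end = re.search(r"s$", text)
print(string_end.span())    # (21, 22), the s of vegetables

# ^ only matches at the start of the string.
at_start = re.search(r"^car", text)
print(at_start.span())      # (0, 3)

not_at_start = re.search(r"^are", text)
print(not_at_start)         # None
```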
The fourth group is used to reference the characters matched inside parentheses. Matching inside the parentheses is greedy. That is, the parentheses will capture as large a string as can be matched to the pattern inside of the (). More than one subset of characters can be matched by using additional sets of (). A matched sub-string is referenced using \n, where n is an integer identifying the ordered number of the () to be used. This form of referencing is useful when you want to include part of what was matched in a replacement string.
operator | description |
---|---|
() | Used to back reference a string |
A few examples searching the string carrots are vegetables:

\s(ar) matches an a followed by an r, if the a is preceded by a space. This would match the ar of are, but not the ar of carrots.
\s(\w+)$ matches the last word of a string, provided the string ends with a word. This would match vegetables.
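A minimal sketch of groups and back references, using Python's standard re module. The group(1) method returns what the first () captured. The greedy and swapped examples are additions for illustration, and the variable names are assumptions.

```python
import re

text = "carrots are vegetables"

# The space is part of the match, but only ar is captured by the ().
after_space = re.search(r"\s(ar)", text)
print(after_space.group(1), after_space.start(1))   # ar 8

last_word = re.search(r"\s(\w+)$", text)
print(last_word.group(1))    # vegetables

# Greedy matching: .+ runs to the last s in the string, not the first.
greedy = re.search(r"(c.+s)", text)
print(greedy.group(1))       # carrots are vegetables

# \1 and \2 in the replacement refer to the first and second ().
swapped = re.sub(r"^(\w+) are (\w+)$", r"\2 are \1", text)
print(swapped)               # vegetables are carrots
```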
In R and Python, regular expressions are specified as a character string. The backslash, \, is a special character in strings and is used for things such as \n for new line. To use a backslash in a string, it needs to be escaped. That is, two backslashes together in a string equal one backslash character. For example, to use \b in a pattern you have to write \\b. The first \ is the escape and tells the string to consider the next character a normal character, as opposed to a character with special meaning in the string.
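A minimal sketch of the escaping rule in Python, using the standard re module. The variable names are assumptions for this sketch. In a Python string, an unescaped \b is the backspace character, so the pattern no longer describes a word edge.

```python
import re

text = "carrots are vegetables"

# Two backslashes in the string literal give one backslash character.
escaped = "s\\b"
print(escaped)        # s\b
print(len(escaped))   # 3

# Without the escape, \b is a single backspace character.
unescaped = "s\b"
print(len(unescaped)) # 2

with_escape = re.search(escaped, text)
print(with_escape.span())    # (6, 7)

without_escape = re.search(unescaped, text)
print(without_escape)        # None
```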
5.2.2 Examples - R
These examples use the TitanicSurvival.csv data set.
We begin by loading the tidyverse, importing the csv file, and setting variable names.
library(tidyverse)
titanic_path <- file.path("..", "datasets", "TitanicSurvival.csv")
titanic_in <- read_csv(titanic_path, col_types = cols())
Warning: Missing column names filled in: 'X1' [1]
titanic <-
  titanic_in %>%
  rename(
    name = X1,
    class = passengerClass
    )

glimpse(titanic)
Observations: 1,309
Variables: 5
$ name     <chr> "Allen, Miss. Elisabeth Walton", "Allison, Master. Hu...
$ survived <chr> "yes", "yes", "no", "no", "no", "yes", "yes", "no", "...
$ sex      <chr> "female", "male", "female", "male", "female", "male",...
$ age      <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 6...
$ class    <chr> "1st", "1st", "1st", "1st", "1st", "1st", "1st", "1st...
The types of all the variables appear to be correct. That is, the column with numbers is numeric and the text columns are type character.
Check for duplicates.
I think that the name values will uniquely identify each passenger. I will check for duplicate names and display name, age, and class.

dups <-
  titanic %>%
  filter(
    duplicated(titanic[, "name"]) |
    duplicated(titanic[, "name"], fromLast = TRUE)
    ) %>%
  select(name, age, class)

glimpse(dups)
Observations: 0
Variables: 3
$ name  <chr>
$ age   <dbl>
$ class <chr>
The dups data frame is empty. There are no duplicates.
Using concatenation to change a variable.
Here we concatenate the string " class" to the values of the class variable.

The str_c() function will be used to concatenate.

titanic <- mutate(titanic, class = str_c(class, " class"))

titanic %>%
  select(name, class) %>%
  head()
# A tibble: 6 x 2
  name                            class
  <chr>                           <chr>
1 Allen, Miss. Elisabeth Walton   1st class
2 Allison, Master. Hudson Trevor  1st class
3 Allison, Miss. Helen Loraine    1st class
4 Allison, Mr. Hudson Joshua Crei 1st class
5 Allison, Mrs. Hudson J C (Bessi 1st class
6 Anderson, Mr. Harry             1st class
Create a variable for first name and one for last name.
This example uses the separate() function. The separate() function looks for a delimiting string, the sep parameter, in a character variable and divides the variable into separate columns. The sep parameter is a string containing a regular expression. You provide the names of the new columns to separate(). The extra parameter defines what to do if there is a mismatch between the number of new columns defined and the number of separate values found for a row.

The name variable uses a comma to separate a person's last name from their title and first name. There may be more than one comma in some of the names. We use the merge option of the extra parameter here to include any extra strings with the last string.

The first variable is then separated to separate the person's title from their first name. Note that because sep is a regular expression, the period in ". " matches any character followed by a space, not only a literal period.

titanic <-
  titanic %>%
  separate(
    name,
    c("last", "first"),
    sep = ",",
    extra = "merge",
    remove = FALSE
    ) %>%
  separate(first, c("title", "first"), sep = ". ", extra = "merge")

glimpse(titanic)
Observations: 1,309
Variables: 8
$ name     <chr> "Allen, Miss. Elisabeth Walton", "Allison, Master. Hu...
$ last     <chr> "Allen", "Allison", "Allison", "Allison", "Allison", ...
$ title    <chr> " Miss", " Master", " Miss", " Mr", " Mrs", " Mr", " ...
$ first    <chr> "Elisabeth Walton", "Hudson Trevor", "Helen Loraine",...
$ survived <chr> "yes", "yes", "no", "no", "no", "yes", "yes", "no", "...
$ sex      <chr> "female", "male", "female", "male", "female", "male",...
$ age      <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 6...
$ class    <chr> "1st class", "1st class", "1st class", "1st class", "...
If you look closely at the new name variables you will find observations where this process did not fully identify the person's name correctly. This is not uncommon when wrangling data. More clean up work would be needed. This example is enough to demonstrate the skills that would be needed to finish the clean up.
Create a variable that identifies both class and gender.
The unite() function is the inverse of the separate() function. It concatenates strings from multiple columns into a single column.

We will use unite() here to create a new class_sex variable that has the gender value separated from the class value by a space character.

titanic <-
  titanic %>%
  unite(class_sex, class, sex, sep = " ", remove = FALSE)

titanic %>%
  select(name, class_sex, class, sex) %>%
  head()
# A tibble: 6 x 4
  name                            class_sex        class     sex
  <chr>                           <chr>            <chr>     <chr>
1 Allen, Miss. Elisabeth Walton   1st class female 1st class female
2 Allison, Master. Hudson Trevor  1st class male   1st class male
3 Allison, Miss. Helen Loraine    1st class female 1st class female
4 Allison, Mr. Hudson Joshua Crei 1st class male   1st class male
5 Allison, Mrs. Hudson J C (Bessi 1st class female 1st class female
6 Anderson, Mr. Harry             1st class male   1st class male
Create a variable to identify the observations that are women using only the title column. This is a more advanced example.
This is an artificial task (since there is already a sex variable) to demonstrate using patterns.

This example uses the regular expression symbols to identify patterns that identify a row as being a woman. There are two common designations for women in the title variable, Mrs. and Miss. Both begin with an M and end with an s. None of the other titles meet both of these conditions. The matching pattern is then M.+s$, a pattern that starts with M followed by at least one other character and ending with an s.

There are a few other titles used by women. They are Mme., Lady, Mlle., and the Countess. of. The alternatives M.+e$, Lady$, and Countess are added to the pattern with the or operator to match these.

The str_detect() function returns a logical variable indicating if the searched for string was found.

titanic <-
  titanic %>%
  mutate(woman = str_detect(title, "M.+s$|M.+e$|Lady$|Countess"))

titanic %>%
  select(title, woman) %>%
  head()
# A tibble: 6 x 2
  title     woman
  <chr>     <lgl>
1 " Miss"   TRUE
2 " Master" FALSE
3 " Miss"   TRUE
4 " Mr"     FALSE
5 " Mrs"    TRUE
6 " Mr"     FALSE
There are many more tidyverse functions that operate on character variables. Google the tidyverse package stringr to read about these other useful functions.
5.2.3 Examples - Python
These examples use the TitanicSurvival.csv data set.
We begin by loading the packages, importing the csv file, and renaming the variables.

from pathlib import Path
import pandas as pd

titanic_path = Path('..') / 'datasets' / 'TitanicSurvival.csv'
titanic_in = pd.read_csv(titanic_path)
titanic_in = (
    titanic_in
        .rename(
            columns={
                'Unnamed: 0': 'name',
                'passengerClass': 'pass_class'}))
titanic = titanic_in.copy(deep=True)

print(titanic.dtypes)
name           object
survived       object
sex            object
age           float64
pass_class     object
dtype: object
The types of all the variables appear to be correct. That is, the column with numbers is numeric and the text columns are type object.
Note, the passengerClass variable was renamed to pass_class and not class, since class is a reserved word in Python.

Check for duplicates.

I think that the name values will uniquely identify each passenger. I will check for duplicate names and display name, age, and pass_class.

dup_rows = titanic.duplicated(subset=['name'], keep=False)
dups = titanic.loc[dup_rows.values, ['name', 'age', 'pass_class']]

print(dups.head(10))
Empty DataFrame
Columns: [name, age, pass_class]
Index: []
The dups data frame is empty. There are no duplicates.
Using concatenation to change a variable.
The plus operator, +, does concatenation when used with string variables. We use it here to concatenate the string ' class' to the pass_class variable.

titanic = (
    titanic
        .assign(pass_class=lambda df: df['pass_class'] + ' class'))

(titanic
    .loc[:, ['name', 'pass_class']]
    .head()
    .pipe(print))
                              name  pass_class
0    Allen, Miss. Elisabeth Walton   1st class
1   Allison, Master. Hudson Trevor   1st class
2     Allison, Miss. Helen Loraine   1st class
3  Allison, Mr. Hudson Joshua Crei   1st class
4  Allison, Mrs. Hudson J C (Bessi   1st class
Create a variable for first name and one for last name.
This example uses the str.extract() method to subset each of the character strings in the name variable. This method uses a regular expression to identify the characters that will be subsetted from each string in the variable.

Python strings can be specified in raw form by placing an r in front of the opening quote. This allows backslashes to be used as backslashes without needing to be escaped, so r'\d' would be the same as '\\d'. A raw string does not interpret special string characters, such as \n for new line. The string for the pat parameter, the first parameter of str.extract(), seems to accept strings in raw form without the use of r preceding the opening quote. This works because Python leaves a backslash in place when it is not part of a recognized escape sequence. Recent versions of Python warn about such sequences, and I would avoid relying on this behavior. I would use the double backslash for a backslash or explicitly use a raw string by placing the r before the opening quote.

The name variable uses a comma to separate a person's last name from their title and first name. We will extract the last name using () to identify which characters in the match are to be subsetted. We expect one or more characters, +, in the name. These characters of the name will not be a comma, [^,].

The title is expected to be one or more characters, +, that are not a period, [^\.], that follow a comma and space, and are followed by a period, \.

The first name is then everything that follows the period.
titanic = (
    titanic
        .assign(
            last=lambda df:
                df['name'].str.extract('([^,]+),', expand=True),
            title=lambda df:
                df['name'].str.extract(', ([^\\.]+)\\.', expand=True),
            first=lambda df:
                df['name'].str.extract('\\. (.+)', expand=True)))

(titanic
    .loc[:, ['name', 'title', 'first', 'last']]
    .head()
    .pipe(print))
                              name   title               first     last
0    Allen, Miss. Elisabeth Walton    Miss    Elisabeth Walton    Allen
1   Allison, Master. Hudson Trevor  Master       Hudson Trevor  Allison
2     Allison, Miss. Helen Loraine    Miss       Helen Loraine  Allison
3  Allison, Mr. Hudson Joshua Crei      Mr  Hudson Joshua Crei  Allison
4  Allison, Mrs. Hudson J C (Bessi     Mrs   Hudson J C (Bessi  Allison
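The three patterns can also be tried on a single name with Python's standard re module. This is a sketch and not the pandas code above: re.search() with group(1) is used here as a stand-in for what str.extract() does for each row, which is to return what the () captured in the first match. The variable names are assumptions for this sketch.

```python
import re

# The first name value shown in the output above.
name = "Allen, Miss. Elisabeth Walton"

# One or more non-comma characters, followed by a comma.
last = re.search('([^,]+),', name).group(1)

# Comma and space, one or more non-period characters, then a period.
title = re.search(', ([^\\.]+)\\.', name).group(1)

# Period and space, then everything that follows.
first = re.search('\\. (.+)', name).group(1)

print(last, title, first, sep=" | ")   # Allen | Miss | Elisabeth Walton

# A raw string gives the same pattern without doubling the backslashes.
same_pattern = r', ([^\.]+)\.' == ', ([^\\.]+)\\.'
print(same_pattern)                    # True
```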
If you look closely at the new name variables you will find observations where this process did not fully identify the person's name correctly. This is not uncommon when wrangling data. More clean up work would be needed. This example is enough to demonstrate the skills that would be needed to finish the clean up.
Create a variable that identifies both class and gender.
We will use the concatenation operator here to create a new class_sex variable that has the gender value separated from the pass_class variable by a space character.

titanic = (
    titanic
        .assign(class_sex=lambda df: df['pass_class'] + ' ' + df.sex))

(titanic
    .loc[:, ['name', 'class_sex', 'pass_class', 'sex']]
    .head()
    .pipe(print))
                              name         class_sex  pass_class     sex
0    Allen, Miss. Elisabeth Walton  1st class female   1st class  female
1   Allison, Master. Hudson Trevor    1st class male   1st class    male
2     Allison, Miss. Helen Loraine  1st class female   1st class  female
3  Allison, Mr. Hudson Joshua Crei    1st class male   1st class    male
4  Allison, Mrs. Hudson J C (Bessi  1st class female   1st class  female
Create a variable to identify the observations that are women using only the title column. This is a more advanced example.
This is an artificial task (since there is already a sex variable) to demonstrate using patterns.

This example uses regular expressions to identify patterns that identify a row as being a woman. There are two common designations for women in the title variable, Mrs. and Miss. Both begin with an M and end with an s. None of the other titles meet both of these conditions. The matching pattern is then M.+s$, a pattern that starts with M followed by at least one other character and ending with an s.

There are a few other titles used by women. They are Mme., Lady, Mlle., and the Countess. of. The alternatives M.+e$, Lady$, and Countess are added to the pattern with the or operator to match these.

The str.contains() method returns a logical variable indicating if the searched for string was found.

titanic = (
    titanic
        .assign(
            woman=lambda df:
                df['title'].str.contains('M.+s$|M.+e$|Lady$|Countess')))

(titanic
    .loc[:, ['name', 'title', 'woman']]
    .head()
    .pipe(print))
                              name   title  woman
0    Allen, Miss. Elisabeth Walton    Miss   True
1   Allison, Master. Hudson Trevor  Master  False
2     Allison, Miss. Helen Loraine    Miss   True
3  Allison, Mr. Hudson Joshua Crei      Mr  False
4  Allison, Mrs. Hudson J C (Bessi     Mrs   True
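A quick way to see what the pattern does and does not match is to try it on a list of titles with Python's standard re module. This is a sketch: the list of titles is an assumption put together for illustration from the titles named in the text, with Dr and Ms added as hypothetical extra cases. The Ms case shows a limit of the pattern, since M.+s$ requires at least one character between the M and the s.

```python
import re

pattern = "M.+s$|M.+e$|Lady$|Countess"

# Illustrative titles; Dr and Ms are hypothetical additions for this sketch.
titles = ["Miss", "Mrs", "Mr", "Master", "Mme", "Mlle",
          "Lady", "the Countess", "Dr", "Ms"]

# True when the pattern is found somewhere in the title.
is_woman = {title: bool(re.search(pattern, title)) for title in titles}

for title, flag in is_woman.items():
    print(f"{title:13} {flag}")
```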
Other pandas string methods.
There are many more pandas string methods that operate on character variables. Google the pandas str methods to read about these other useful methods. A few of the more useful methods are str.replace(), str.slice(), and str.pad().
5.2.4 Exercises
These exercises use the mtcars.csv data set.

Import the mtcars.csv data set.

Divide the column that has the car name into columns that contain the make and model of the car.

Do all observations have a make and model value? If there are missing values, can you fix them? (Hint, use google to help you.)

Some car companies have more than one make. In this data Chrysler, Plymouth, and Dodge were all made by Chrysler. Likewise Cadillac and Pontiac are made by GM, and Lincoln and Ford are both made by Ford. Create a company variable based on the data in the make variable.

Create a name for use in displaying results that is a character string composed of make, a space character, the company in parentheses () if the company name is not the same as the make, and model.