Data Wrangling with R
March 2024
1 Data Objects
The examples in these materials were run with R version 4.3.2. To ensure that the code runs properly, be sure to update your R to at least this version.
Data objects in R have three main characteristics:
- Type
- Structure
- Class
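As a rough first look, base R provides functions that report on each of these characteristics. The sketch below assumes that typeof(), str(), and class() are reasonable stand-ins for "type", "structure", and "class" as used here. It works on an unnamed vector of made-up numbers.

```r
# The numbers are invented for illustration only
typeof(c(34, 27, 51))  # how the values are stored
# [1] "double"
class(c(34, 27, 51))   # how R functions treat the object
# [1] "numeric"
str(c(34, 27, 51))     # a compact summary of the object's structure
```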
A data object may be created, used, and discarded again all in the same step (anonymous data), or a data object may be saved for use in a later step.
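A minimal sketch of the difference, using made-up numbers and the hypothetical name heights:

```r
# Anonymous data: the vector is created, averaged, and discarded in one step
mean(c(160, 172, 181))
# [1] 171

# Named data: 'heights' (a made-up name) is stored and can be reused
heights <- c(160, 172, 181)
mean(heights)
# [1] 171
range(heights)
# [1] 160 181

rm(heights)  # discard the example object so it does not linger in the session
```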
1.1 Giving Names to Data
To store a data object, you give it a name. This is called assignment: we assign data values a name.
Assignment may be written several ways. The two most common are the “left arrow”, <-, and the single equals, =. The left arrow is generally preferred, because it is unambiguous. (The single equals sign is also used in function arguments, such as rnorm(n = 10, mean = 5).) But the equals symbol is often used by people coming to R from other programming languages, so its use is fairly common. (See help("assignOps").)
x <- 7
y = 8
x + y
[1] 15
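One way to see why the left arrow counts as unambiguous is to use each symbol inside a function call. This is only a sketch: the name draws is made up, and the first check assumes you have not already created an object called n.

```r
# Inside a call, = only matches the value to the argument n; no object n is created
draws <- rnorm(n = 3)
exists("n", inherits = FALSE)  # FALSE, assuming no object called n existed before

# The left arrow inside a call assigns first, then passes the value along
draws <- rnorm(n <- 3)
n
# [1] 3

rm(draws, n)  # tidy up the example objects
```

Assigning inside a call like this is legal but hard to read, so it is shown here only to make the distinction visible.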
Assigning a name to data stores the data in your computer’s memory. You can list the objects available in your current session with
ls()
[1] "x" "y"
This will match what you have in the Environment pane in the top right corner of RStudio.
1.1.1 Good Names
There are a lot of naming conventions and advice to be found on the internet. Our free advice:
- Use a consistent naming scheme. This will help you and anyone else who might need to read your code in the future.
- Meaningful names are helpful. For instance, use “age” rather than “x”.
- Short names are easier to read and type. Long names make your code harder to scan. There is a trade-off between having meaningful names and having short names.
Some naming rules:
- Capitalization matters. (age, Age, AGE, and aGe are all different.)
- Begin a name with a letter. (Try income2000 instead of 2000income.)
- Keep names compact - no spaces.
- Avoid most special characters (!@#$%^&*). The main exceptions are periods (.) and underscores (_), which are helpful in creating easy-to-read multi-word variables, like Petal.Length or birth_weight.
(You will eventually encounter names that violate these rules, non-syntactic names. For example, the base R function lm() (linear regression) assigns the name (Intercept) to a coefficient. Non-syntactic names just make life more difficult.)
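The naming rules can be explored with the base R function make.names(), which shows how R would repair a name to make it syntactic. The sketch below also fits a small model to see the (Intercept) name. The choice of the built-in mtcars data and the name fit are assumptions made for illustration.

```r
# make.names() converts each string into a syntactic name
make.names(c("income2000", "2000income", "birth weight"))
# [1] "income2000"   "X2000income"  "birth.weight"

# lm() labels the constant term with the non-syntactic name "(Intercept)"
# (mtcars ships with R; the model itself is only an illustration)
fit <- lm(mpg ~ wt, data = mtcars)
names(coef(fit))
# [1] "(Intercept)" "wt"

# Quoting the name inside square brackets sidesteps the naming problem
coef(fit)["(Intercept)"]

rm(fit)  # tidy up the example object
```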
1.2 Removing Data
It is a good idea to clean up your workspace as you go, removing data objects that are no longer needed. This makes it easier to keep track of the key data objects you want to work with.
To clean up your workspace, use
remove(x)
ls()
[1] "y"
A common alias for remove() is rm(). You can use these functions to remove several objects at once. (See help("remove").)
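As a short sketch of removing several objects at once, the example below creates three scratch objects (tmp1, tmp2, and tmp3 are made-up names) and removes them in two different ways.

```r
# Hypothetical scratch objects, created only for this illustration
tmp1 <- 1
tmp2 <- 2
tmp3 <- 3

rm(tmp1, tmp2)      # several names in one call
rm(list = "tmp3")   # or names supplied as a character vector

exists("tmp1", inherits = FALSE)
# [1] FALSE
```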
1.3 Reusing Names
When you reuse an object name for assignment, you are throwing out the old data. This is considered a routine action, so there is no warning or error.
y
[1] 8
y <- c("red", "green", "blue")
y
[1] "red" "green" "blue"
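If the old values might still be needed, one option is to give them a second name before reusing the first. The sketch below uses the made-up names shade and shade_backup.

```r
# 'shade' and 'shade_backup' are hypothetical names for this sketch
shade <- c("red", "green", "blue")
shade_backup <- shade     # a second name holding the same values

shade <- 3                # reusing the name discards the character data
typeof(shade)
# [1] "double"
shade_backup              # the backup is unaffected
# [1] "red"   "green" "blue"

rm(shade, shade_backup)   # tidy up the example objects
```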
1.4 Exercises
Assign the value 3 the name “v”, and the value 2 the name “w”. Then calculate \(v + w\).
Bad names: assign the value 7 the name “one”, and the value 2 the name “three”. Calculate \(one^{three}\). Never write code that looks like this!
Capitalization: assign 0 to “a” and 2 to “A”. Calculate the mean of “a” and “A”. Note that if we want to give mean() two numbers, we need to combine them into a single vector with c(): mean(c(a, A)). Here, “a” and “A” look different enough that this might be acceptable. However, “x” and “X” look similar enough that they would be a poor choice of names here.
1.5 Advanced Exercises
Try assigning a data value to the name “1$” (without the quotes). Try the name “1a” (again, no quotes). It is interesting that this gives two different error messages - any idea why?
Remarkably, R allows you to use names like these. However, such non-syntactic names require you to use backticks (back quotes). In principle you could use Unicode symbols as names, like \(\mu\) or \(\sigma\). However, current R code editing software does not make this easy - maybe in the future?
Tidy up by removing all the data objects from your global environment.