2  Data Objects

2.1 Warm-Up

sample(1:10, 1) returns one randomly chosen integer from 1 to 10.

Run the two blocks of code below in R.

  1. In the first block, after you print x to the console, is x + 1 what you expect?

  2. In the second block, after you print sample(1:10, 1), is sample(1:10, 1) + 1 what you expect?

Run both blocks several more times. Does either of your answers change? Why?

x <- sample(1:10, 1)
x
x + 1

sample(1:10, 1)
sample(1:10, 1) + 1

2.2 Outcomes

Objective: To create, modify, and remove data objects.

Why it matters: Almost all of your work in R will involve data objects, everything from importing datasets to creating plots to fitting statistical models. An understanding of basic object operations is foundational for all your work in R.

Learning outcomes:

Fundamental Skills

  • Create and modify objects through assignment.

  • List all objects in the environment.

  • Remove individual objects from the environment.

Extended Skills

  • Apply style guidelines when naming objects.

  • Remove all objects from the environment.

Key functions and operators:

<-
ls()
rm()

2.3 Naming Data

Data objects in R have three main characteristics:

  • type
  • structure
  • class

A data object may be created, used, and discarded in a single step, as we did in the second block of code in the warm-up. This is called anonymous data. Or, we can save a data object for use in a later step, as we did with the first block of code in the warm-up.

To store a data object in memory for later use, give it a name. Naming a data object is called assignment and is done with the assignment operator: <-

x <- 2
y <- 3

After running that code, x and y are stored in our computer’s memory. You can list all data objects available in your current session with the ls() function:

ls()
[1] "x" "y"

This will match what you have in the Environment pane in the top-right corner of RStudio.

When we close RStudio, our computer deletes all objects in memory. This helps us achieve reproducibility and avoid hidden dependencies across projects.

Important

If your Environment has objects in it when you open RStudio, change your settings following the instructions here. If you want to preserve specific objects across R sessions or share objects with colleagues, learn how to save objects here.

2.3.1 Rules for Names

You can choose the names for your data objects, but you should follow a few conventions:

  • Use a consistent case. R is case-sensitive, so x and X are different, as are age and Age and AGE and aGe.

  • Begin names with a letter. Try income2000 instead of 2000income.

  • Balance length and meaning. Shorter names are easier to read and type, but they are less informative to whoever reads your code after you (which is probably future you!). Longer names are more informative but also more prone to typos.

    • Compare the object names below. Which would you like to type today? Which would you like to read next year?

      x <- 25 # very short, very uninformative
      age <- 25
      age_start <- 25
      age_start_of_study <- 25
      age_at_beginning_of_data_collection <- 25 # very long, very informative

  • Combine multi-word names with camel case, dot case, or snake case. Spaces are not allowed within names, nor are most special characters. Using them can produce errors or unexpected results:

    birth year <- 2000
    Error in parse(text = input): <text>:1:7: unexpected symbol
    1: birth year
              ^
    x$year <- 2000
    Warning in x$year <- 2000: Coercing LHS to a list
    y-year <- 2000
    Error in y - year <- 2000: could not find function "-<-"

    Instead, create multi-word names with your favorite of these three strategies:

    birthYear <- 2000 # camel case
    birth.year <- 2000 # dot case
    birth_year <- 2000 # snake case
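
One way to check whether a candidate name follows these rules is to compare it with what base R's make.names() would turn it into: a name that comes back unchanged is already syntactic. This is only a sketch. The helper is_syntactic() and the vector candidates are hypothetical names introduced here, not built-in objects, and they are not part of the chapter's running example (so they do not appear in the ls() output shown later).

```r
# is_syntactic() is a hypothetical helper, not part of base R.
# make.names() rewrites invalid names into valid ones, so a name
# that is returned unchanged was already syntactic.
is_syntactic <- function(name) {
  make.names(name) == name
}

candidates <- c("birth_year", "birth.year", "birthYear",
                "birth year", "2000income", "y-year")

is_syntactic(candidates)
# TRUE TRUE TRUE FALSE FALSE FALSE
```

The first three names pass unchanged. The last three fail because make.names() would replace the space, prefix the leading digit, and replace the hyphen.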

2.3.2 Non-Syntactic Names

You will encounter names that break these rules, called “non-syntactic” names. To work with these objects, surround their names with backticks: `object_name`

2000income <- 50000
2000income
Error in parse(text = input): <text>:1:6: unexpected symbol
1: 2000income
         ^
`2000income` <- 50000
`2000income`
[1] 50000

Statistical models in R often contain a term named (Intercept), which you can reference with `(Intercept)`. You will also need this strategy to pull up the documentation for operators like %in%:

?`%in%`
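
As a small illustration of the (Intercept) case, the sketch below fits a simple regression to the built-in anscombe dataset and extracts the intercept by name. The object names fit and coefs are chosen here for illustration only and are not part of the chapter's running example.

```r
# Fit a simple linear regression using the built-in anscombe dataset.
fit <- lm(y1 ~ x1, data = anscombe)

# Store the coefficients in a list so they can be accessed with $.
coefs <- as.list(coef(fit))
names(coefs)
# "(Intercept)" "x1"

# (Intercept) is non-syntactic, so it needs backticks after $.
coefs$`(Intercept)`

# x1 is syntactic, so no backticks are needed.
coefs$x1
```

Without the backticks, R would try to parse the parentheses as code rather than as part of a name.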

2.3.3 Reusing Names

When you reuse a name by assigning new data to it, the old data is overwritten. This is considered a routine action, so there is no warning or error, nor a need to tell R you want to replace the old data.

y
[1] 3
y <- 10
y
[1] 10

R will even allow you to reuse the name with data of a different type or structure:

y <- "hello"
y
[1] "hello"
y <- anscombe # built-in dataset
y
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
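
Reusing a name is also how you modify an object: R evaluates the right-hand side of <- first, using the object's current value, and then stores the result under the same name. A minimal sketch, using the hypothetical name count (not part of the chapter's running example):

```r
count <- 1           # create the object
count <- count + 1   # right-hand side uses the old value (1), then overwrites it
count
# 2

count <- count * 10  # again computed from the current value (2)
count
# 20
```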

2.4 Removing Data

Sometimes you will need to remove an object from your environment to free up memory or de-clutter your environment. One situation where you may need to do this is if you accidentally make multiple copies of a very large dataset.

Remove individual objects with rm(). Remove the object x and then verify it is no longer in the list returned by ls():

rm(x)
ls()
[1] "2000income"                          "age"                                
[3] "age_at_beginning_of_data_collection" "age_start"                          
[5] "age_start_of_study"                  "birth_year"                         
[7] "birth.year"                          "birthYear"                          
[9] "y"                                  

R does not have an “undo” button, so after you remove an object or reuse its name, the only way to restore the data is to rerun your previous code.
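
rm() also accepts more than one name in a single call. The sketch below uses throwaway objects with hypothetical names (tmp_a, tmp_b, tmp_c) and the base function exists() to confirm which ones remain.

```r
# Create three throwaway objects (hypothetical names).
tmp_a <- 1
tmp_b <- 2
tmp_c <- 3

# Remove two objects in a single call.
rm(tmp_a, tmp_b)

exists("tmp_a")  # FALSE: removed
exists("tmp_c")  # TRUE: still in the environment

# Clean up the remaining throwaway object.
rm(tmp_c)
```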

Warning

An alternative strategy that I do not recommend is to create objects with different names at every stage as a form of version control, like data_raw then data_renamed then data_renamed_ver2 then data_cleaned then data_for_plotting then data_merged2_ver3_final_final. This approach often leads to confusion and mistakes.

Instead, break your scripts apart into smaller scripts that accomplish a single task. Your script should not be 3000 lines long and do everything from cleaning the data to plotting it to fitting statistical models. Shorter scripts are easier to manage, and they are easier to understand when you revisit your code years later. We will discuss project organization in First Steps with Dataframes.

2.5 Exercises

2.5.1 Fundamental

  1. Give x the value 3. Then give it the value 5. Print x after each command to check its value.

  2. List all objects in the environment.

  3. Remove x from the environment.

2.5.2 Extended

  1. Make these object names syntactic, consistent, and meaningful. There is no one right answer. You can apply your own style and make decisions about what a name might mean.

    income1
    INCOME2
    income2
    3income
    birth date
    y
    year_of_birth
    state$of$residence

  2. Run this code, which will create 26 objects in your environment. Then, remove all of them. Hint: see ?rm.

    for (i in seq_along(letters)) {
      assign(letters[i], i)
    }