1 Base R

This chapter is under construction.

1.1 Overview

This chapter provides an introduction to the following.

Using RStudio projects, folders, and paths to organize your work and simplify your workflow.
Writing and testing code using the console, scripts, and R Markdown files.
Using functions, parameters, and their results.
Understanding vectorized functions.
Installing, loading, and using packages.

1.2 Introduction

R is a powerful programming language for statistical computing and graphics generation. Its flexibility, extensibility, and lack of cost have contributed to R's wide use in academic environments and among statisticians.

R is open source and is supported by an extensive user community. The R Development Core Team and CRAN are at the center of the user community. The core team oversees the evolution of the base set of functionality which is included when R is installed. CRAN, the Comprehensive R Archive Network, is a repository of additional functionality, called packages. A great deal of additional functionality is available through CRAN.

RStudio provides an integrated development environment (IDE) for R users. This IDE provides support for project organization, source control, and document generation. RStudio will help you write R code faster and more efficiently. The use of RStudio is included in this article.

1.3 The RStudio IDE

1.3.1 Opening RStudio

RStudio is installed on Winstat. RStudio is started on Winstat, or another Windows computer, similarly to other programs.

Click the Windows logo button in the lower left corner of the screen.
From the menu select All Programs. Then select the RStudio folder. Then select the RStudio program.
The navigation to RStudio is displayed in the following image.

Figure 1.1: RStudio in the Windows Start menu

1.3.2 RStudio window

RStudio's window looks like the following. Its window is divided into panes. If no source files are open, the top left pane will not be displayed.

Figure 1.2: RStudio window

The size of the panes can be adjusted by moving the gray lines which separate the four panes. The panes can be minimized or maximized using the icons on the right side of the gray bar at the top of each pane.

The location of the four panes within the window can be changed using the Pane Layout window. This window is accessed by selecting Global Options from the Tools drop down menu. The tool tabs can be moved between the two tab panes using this window as well. Navigation to the Pane Layout window is shown below.

Figure 1.3: Tools drop down menu for RStudio

Figure 1.4: Options window for RStudio

1.3.3 Console pane

The Console pane is on the left side on the bottom. This is where results are displayed. The Console tab is opened when RStudio is opened. Additional tabs within this pane are opened if another program type is used, such as an R Markdown file. This pane might be full height on the left side if no file is open in the Source pane.

1.3.4 Source / Editor pane

The source pane is on the left side on top. This is where you will write and edit your R programs and documents. The pane will have a tab for each open file. This pane is only present if there are files opened in the editor.

1.3.5 The other panes: Environment, History, Git, Files, Packages, Plots, Help

There are two tab panes on the right side, one on top and the other on the bottom. These panes contain tabs which allow quick access to additional tools. The tabs are for the following functions.

Environment displays data objects defined in the current R session.
History is a list of prior commands which have been executed.
Git is used for version control.
Files is a folder browser.
Plots displays plots you create.
Packages is where packages can be installed and loaded from.
Help is where help on R commands can be found.
Viewer is where web content can be viewed.

1.4 RStudio projects

A project in RStudio is a collection of work organized in a folder. RStudio provides tools that will help you manage your work on projects. Some of the many tools are as follows.

RStudio remembers what files you had open and what tabs were displayed, when you close a project. When you open the project again, RStudio will open the same files and display the same tabs. This will allow you to quickly pick up your work again. A new R session is started when you open a project, so some previously executed commands may need to be run again.
Additional debugging tools, such as setting break points.
Integrated source control. This is an important part of reproducible code.
Integration of code results with documents.

Almost any R work can be structured as a project. You might consider creating a project for an individual class, your thesis, or a research project. While RStudio can work with individual files which are not in a project, you will likely find it quicker to develop R code using projects.

1.4.1 Creating a project

RStudio can create new projects using three different methods: New Directory, Existing Directory, or Version Control. All three of these methods can be useful. New users of R will likely want to use the New Directory method to get started. As your R knowledge and skills increase, the other methods may be useful. To create a new project, select New Project from the File drop down menu.

Figure 1.5: The file menu of RStudio

The New Project window will allow you to select which method of project creation you want.

Figure 1.6: The file menu of RStudio

The New Directory option is used when a new project is being started. This will create a .Rproj file. If you have git installed, you can also start source control for the project.

You will have the option to create an Empty Project or an R Package. Packages are written by experienced R programmers. R Packages allow you to share finished R code with other users.

The Empty Project is what most new R users will use.

Figure 1.7: The file menu of RStudio

The Existing Directory option is used to create a project in a directory which already contains R programs. This will create the RStudio project (the .Rproj file), but does not set up source control for this project.

The Version Control option is used to start a local project folder from an existing project, provided the existing project is using Git. This is often a project which is shared by researchers.

1.4.2 Opening a project

An existing project can be opened by double clicking on the .Rproj file from a file explorer. The .Rproj file will have an icon of a blue cube with the letter R inside of it.

1.5 Console Basics

The basics of using the console are as follows.

> is the command prompt. R will not display the command prompt until it has completed running the prior command. If the prompt is not displayed, R is not ready for a new command.
+ is the prompt for the continuation of a command. If R reaches the end of a line and the current R command is not complete, R assumes the next line continues the prior line. Splitting some commands across multiple lines can improve the readability of your source code by allowing the structure of the command or data to be seen visually.
The escape key will end a command. This is handy if R thinks the current command is not finished and you see an error in what has already been entered.
The up and down arrow keys are used to scroll through the history of prior commands. A prior command can be recalled from the history, edited if needed, and then run again.
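
As a small illustration of the continuation prompt, the command below is split across several lines. The object name scores and its values are made up for this sketch. When typed into the console, each line after the first would be shown with the + prompt until the closing parenthesis completes the command.

```r
# One command split across lines so the structure of the data is visible.
# The names and values are hypothetical.
scores <- c(
  math    = 90,
  reading = 85,
  science = 78
)

# A second, single-line command that uses the completed object
mean(scores)
```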

1.6 Organizing projects

Like projects, organizing your work in folders will make you more productive. The following organization approaches are helpful.

Keep data sets in a separate folder from scripts and other documents.
Separate different analyses into their own folders.
Separate exploratory work into its own folder.
Create a separate folder for each document or presentation.

For a student in a class this could be done by creating a folder for a class. Create a data folder inside the class folder, as well as a separate folder for each assignment or work product that is turned in.

The following image shows how this might look.

Figure 1.8: Column index of a data frame
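
A folder layout like the one described for a class could also be created from R. The following is only a sketch: the folder names are hypothetical, and the folders are created under R's temporary directory so that none of your own files are touched.

```r
# Hypothetical class folder, placed in the session's temporary directory
class_dir <- file.path(tempdir(), "stat_class")

# A data folder plus one folder per assignment
dir.create(file.path(class_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(class_dir, "assignment1"), showWarnings = FALSE)
dir.create(file.path(class_dir, "assignment2"), showWarnings = FALSE)

# List the folders that now exist inside the class folder
list.dirs(class_dir, full.names = FALSE, recursive = FALSE)
```

In your own work you would normally create these folders inside your project folder, either from R or from a file explorer.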

1.7 Running your code

1.7.1 Scripts

An R script is a series of commands in a file. Scripts are ordinary text files with a file extension of .R. Commands can be sent to the console from a script using the Run icon at the top of the Source pane. An entire command can be sent, a part of a command can be run, or multiple commands can be run together. Running part of a command is a useful tool to debug nested functions.

The default working directory for a script is the project folder, when the script is part of a project. The working directory is where R will look for other files your script might need.

1.7.2 R Markdown files

An R Markdown file integrates R code with markdown, a language used to create formatted documents from text files. The R code in an R Markdown file can be inline, where the output of a single expression is displayed within a paragraph, or in chunks, where code is run and displayed independently of the text paragraphs.

Knitting an R Markdown file causes all the R code to be run, and the output from the code is put into a markdown file. The markdown file is then processed to format the final document, including the R code results. This all happens behind the scenes, and you do not need to know the details. Each knit runs the R code in a separate session, an instance of the R interpreter without any user-created objects. The working directory for these sessions is the folder that contains the R Markdown file.

The chunks of code are where most of your development and testing time will be spent. All the code in a chunk can be run using the Run Current Chunk button, a green triangle at the top right of the code chunk. All chunks before the current chunk can be run using the Run All Chunks Above button, the gray triangle pointing down that is just to the left of the Run Current Chunk button. Part of the code in a chunk can be run using the Run drop down menu on the tool bar at the top of the Source pane.

1.8 Functions and parameters

A function is a set of commands that has been given a name. The commands of a function are designed to accomplish a specific task. A function may need data to accomplish its task. Data objects that are passed to a function are called parameters. Functions can return a data object when they have completed their task. In summary, data is given to a function, the function does a specific task using this data, and the function returns data.

Most functions have multiple parameters. We need to tell the function which parameter each value is to be used for. This can be done by using the function's names for the parameters or by placing the values in the right position in the list of parameters.

Examples

Passing parameters by name

The seq() function generates a sequence of numbers. We need to specify a starting number, ending number, and the value to count by, for seq() to give us the sequence we want. The names of these parameters are from, to, and by. They are also in this order in the parameter list.
```
seq(from = 1, to = 10, by = 2)
```
```
[1] 1 3 5 7 9
```
While naming every parameter makes clear how seq() will use each value, the meaning of from and to is clear even without the names, since a sequence has a start value and an end value.

Passing parameters by position

This example uses the same function and the same parameter values as the prior example.
```
seq(1, 10, 2)
```
```
[1] 1 3 5 7 9
```
Passing the parameters by position is nice when the parameters are clearly understood, as the 1 and the 10 are. It is less clear what the 2 would be used for.

Some functions have more than 10 parameters. Counting commas to determine which parameter a value is given to would be burdensome.

Using both position and names for parameters

The convention in R is to pass the first and maybe second parameters by position, if it is clear what they are for. Parameters further in the list are passed by name.

The prior examples would be better parameterized as follows.
```
seq(1, 10, by = 2)
```
```
[1] 1 3 5 7 9
```
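
One more point about names may be worth a quick sketch: when parameters are passed by name, their order in the call does not matter. The object names by_name and mixed below are made up for this illustration.

```r
# Named parameters can be given in any order
by_name <- seq(by = 2, to = 10, from = 1)

# Position for the first two parameters, a name for the third
mixed <- seq(1, 10, by = 2)

# Both calls describe the same sequence
identical(by_name, mixed)
```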

A function can return only one data object. When a function has more than one object to return, the objects are collected into a list and the list is returned. The calling code would then access the list as needed.
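
The following is a minimal sketch of a function that returns two results in a list. The function describe_values() and the element names center and spread are hypothetical, written only for this illustration.

```r
# Hypothetical function that has two results to return,
# so it collects them into a single list
describe_values <- function(x) {
  list(center = mean(x), spread = range(x))
}

result <- describe_values(c(2, 4, 9))

# The calling code picks the pieces it needs out of the list
result$center
result$spread
```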

Many tasks require more than one function to be completed. This is by design in R. R provides functions that are building blocks that are used to accomplish tasks. There are several styles of writing code that uses multiple functions. One is nesting, where the result of one function is passed as a parameter to another function. The other is to save the results of functions as intermediate values and then pass these intermediate values to the next function.

Examples

Nested functions

The following code creates a vector of 15 random numeric values. This vector is then rounded to two decimal places, sorted in descending order, and then head() displays a few of the largest values.
```
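# Set the seed so the random numbers are reproducible
set.seed(749875)
number_data <- runif(n = 15, min = 0, max = 1000)

# Read from the inside out: round, then sort, then head
head(sort(round(number_data, digits = 2), decreasing = TRUE))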
```
```
[1] 997.62 813.26 797.96 733.98 732.67 675.45
```
To read the above base R code, one reads from the innermost parentheses to the outermost. This nesting of functions can make reading base R code challenging, because the parameters can end up far from the function they are associated with.

Saving intermediate values

This example does the same set of rounding, sorting, and head as in the prior example.
```
number_round <- round(number_data, digits = 2)
number_sort <- sort(number_round, decreasing = TRUE)
head(number_sort)
```
```
[1] 997.62 813.26 797.96 733.98 732.67 675.45
```
This provides a more natural order of the functions. It does require the intermediate results to be saved. In this example, each intermediate result is used only by the function on the next line, so a lot of code is written just to manage the intermediate values.

Neither of these coding styles is perfect. They both have advantages and disadvantages. Most programmers use a blend of both approaches using the style that makes their code as clear as possible.

The tidyverse uses the pipe operator, which allows for the natural order of functions found in the intermediate value approach without the need to save intermediate values.
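
To give a feel for the pipe style, here is a sketch of the same rounding, sorting, and head steps. It assumes R 4.1 or later and uses base R's native pipe, |>, so that no packages are needed; the tidyverse pipe, %>%, is written in much the same way. The object name piped_result is made up for this illustration.

```r
# Same data as the earlier examples
set.seed(749875)
number_data <- runif(n = 15, min = 0, max = 1000)

# Each step passes its result on as the first parameter of the next function
piped_result <- number_data |>
  round(digits = 2) |>
  sort(decreasing = TRUE) |>
  head()

piped_result
```

The steps read top to bottom in the order they are run, and no intermediate objects are saved along the way.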

1.9 Vectorized functions

An R vector is a column of values. All of the values of a vector have to be of the same type, such as number or character. The values of a vector can be accessed based on the order of the values.
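
As a brief sketch of accessing values by their order, square brackets select values by position. The vector temps and its values are hypothetical.

```r
# A vector of four numbers (made-up values)
temps <- c(61, 64, 70, 75)

# The second value
temps[2]

# The first and fourth values
temps[c(1, 4)]

# All the values share one type
class(temps)
```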

Many R functions take vectors as parameters. Similarly, many operators operate on vectors. These functions and operators work on all the values of a vector together. There are two common types of vector operations, transforming and aggregating. Aggregating is also known as summarizing.

Examples

Math with vectors

Adding two vectors is a transforming operation. The values of the two vectors are added element by element.
```
vec1 <- c(7, 5, 3, 1)
vec2 <- c(1, 2, 4, 8)

vec1 + vec2
```
```
[1] 8 7 7 9
```
Multiplying works similarly.
```
vec1 * vec2
```
```
[1]  7 10 12  8
```

Aggregating functions

The sum() function adds all the values of a vector together.
```
sum(vec1)
```
```
[1] 16
```
```
sum(vec2)
```
```
[1] 15
```
```
sum(vec1 * vec2)
```
```
[1] 37
```
```
sum(vec1, vec2)
```
```
[1] 31
```
Some other aggregating functions are mean(), sd(), and median().
```
mean(vec1)
```
```
[1] 4
```
```
median(vec2)
```
```
[1] 3
```
```
mean(vec1 * vec2)
```
```
[1] 9.25
```
```
sd(c(vec1, vec2))
```
```
[1] 2.642374
```
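
One way to see the difference between the two types of operations is to compare the length of their results. In this sketch, which reuses the vectors from the examples above, the object names transformed and aggregated are made up for illustration.

```r
vec1 <- c(7, 5, 3, 1)
vec2 <- c(1, 2, 4, 8)

# Transforming: the result has one value per element of the inputs
transformed <- vec1 + vec2
length(transformed)

# Aggregating: the result is a single summary value
aggregated <- sum(vec1)
length(aggregated)
```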

1.10 Paths and working directories

Data files have a name and are located in a folder. (A folder is the same as a directory. You will see both of these names in common use.) The folder containing the file may be nested within another folder, and that folder may be in yet another folder, and so on. The list of folders to travel through, followed by the file name, is called a path. A path that starts at the root folder of the computer is called an absolute path. A relative path starts at a given folder and lists the folders and file from that folder. Using relative paths makes your programs easier to move and share, and is considered good programming practice.

A path is made up of folder names. If the path is to a file, then the path ends with a file name. The folders and files of a path are separated by a directory separator. Different operating systems use different directory separators. In R, the function file.path() is used to fill in the directory separator. It knows which separator to use for the operating system it is running on.

There are a few special directory names. A single period, ., indicates the current working directory. Two periods, .., indicates moving up a directory. The following image shows how .. would be used to get a data file in the folder structure used in the project organization section.

Figure 1.9: The path to a data file

When R starts a session, it has a location where it looks for other files. This location is called the current working directory, often shortened to the working directory. Relative paths in a program are specified as starting at the current working directory.

Examples

Getting the current working directory

The getwd() function returns the absolute path to the current working directory.
```
getwd()
```
```
[1] "Z:/R/R_intros"
```
Creating a path to a file

The file.path() function will be used to create a relative path to a data file.
```
data_path <- file.path("..", "data", "data1.csv")
data_path
```
```
[1] "../data/data1.csv"
```
The data_path object can be used for functions such as read.csv() and read_csv() to import the data1.csv data.
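
To make the relative path concrete, the sketch below builds a small, hypothetical folder layout in R's temporary directory, writes a made-up data1.csv into its data folder, and then reads the file back using the .. path from an assignment folder. The folder names, the data values, and the temporary change of working directory with setwd() are all assumptions made for this illustration.

```r
# Hypothetical layout: class_demo/data and class_demo/assignment1
project_dir <- file.path(tempdir(), "class_demo")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "assignment1"), showWarnings = FALSE)

# A small made-up data set saved in the data folder
write.csv(
  data.frame(id = 1:3, score = c(90, 85, 78)),
  file.path(project_dir, "data", "data1.csv"),
  row.names = FALSE
)

# Work from the assignment folder, remembering the old working directory
old_wd <- setwd(file.path(project_dir, "assignment1"))

# Up one folder, then into data, then the file
data_path <- file.path("..", "data", "data1.csv")
data1 <- read.csv(data_path)

# Restore the original working directory
setwd(old_wd)

data1
```

In a project you would not normally call setwd() yourself; it is used here only so the example can run anywhere.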

1.11 Packages

The packages which make up the core functions and commands of R are loaded when R is started. There are many packages which extend R's capabilities beyond the core. These extension packages need to be loaded in each R session before you can use the functions they contain. The functions in these extensions range from widely used functions to obscure functions used by only a small number of people.
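
If you are curious which packages are attached in your current session, base R's search() function lists them. This is only a quick sketch; the exact list depends on how R was started and what has been loaded, and the object name attached is made up here.

```r
# The packages (and other environments) R currently searches for functions
attached <- search()
attached

# The base package is always part of this list
"package:base" %in% attached
```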

A package needs to be installed on your computer before you can load it into your session. R and RStudio manage a library of packages that have been installed on your computer. Winstat has a number of common packages installed for you. The packages installed in your library can be seen in the Packages tab.

Figure 1.10: Column index of a data frame

To install a package on your computer, click the Install icon in the Packages tab and then enter the name of the package you want installed.
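
Packages can also be installed from the console with the install.packages() function. The sketch below first checks whether the tidyverse is already in your library using requireNamespace(). The install line is left commented out because it downloads from CRAN and only needs to be run once per computer. The object name is_installed is made up for this illustration.

```r
# TRUE if the tidyverse package is already installed, FALSE otherwise
is_installed <- requireNamespace("tidyverse", quietly = TRUE)

if (!is_installed) {
  # Uncomment to download and install from CRAN (needs an internet connection)
  # install.packages("tidyverse")
  message("tidyverse is not installed yet")
}
```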

Example

Loading the tidyverse package

The library() function is used to load a package into your R session. The name of the package to be loaded is the first parameter. Sometimes nothing is displayed by this function. Other times the function will display information about what is loaded and any conflicts that exist.
```
library(tidyverse)
```
The tidyverse tells us that a number of packages are attached. Also, loading the tidyverse has caused a few conflicts with functions in other packages that were previously loaded. The functions of the tidyverse are now available for use.

You may see code using require() instead of library(). The require() function is designed to be used as a condition inside other code: it returns FALSE, with a warning rather than an error, when the package cannot be loaded. The result is that the script continues running even if the package did not load. (Yeah, the name of the function is really misleading.) This is undesirable behavior and can cause issues when sharing your work. Use library(), not require(), to load a package.
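
The difference between the two functions can be sketched with a package name that is assumed not to exist. The name notARealPackage12345 is made up for this illustration, and tryCatch() is used only so the error from library() can be shown without stopping the example.

```r
# Hypothetical package name, assumed not to be installed
pkg <- "notARealPackage12345"

# require() only warns and returns FALSE, so a script would keep running
loaded <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)
loaded

# library() stops with an error, which is caught here for display
result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
result
```

With library(), the problem shows up at the line that loads the package instead of later, when a function from the package is first used.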
