This article is part of the R for Researchers series. For a list of topics covered by this series, see the Introduction article. If you're new to R we highly recommend reading the articles in order.
This article will introduce you to R commands, R programs called scripts, and Git central repositories.
This introduction covers commands used to get help and prepare an R session for your work.
Git central repositories are useful to provide a backup of your project and to coordinate the work done on multiple computers. Eventually most of us will need to work with others on projects. These Git tools will allow you the freedom to work on your part independently while easily coordinating with the others.
You will get the most from this article if you follow along with the examples in RStudio. Working the exercises will further enhance your skills with the material. The following steps will prepare your RStudio session to run this article's examples.
R is an interpreted language. This means R commands can be entered and run individually, without being part of a compiled program. This allows great flexibility to interactively explore and analyze data. While exploring data interactively is effective, a researcher's work also needs to be reproducible. The workflow used in this article series allows for both interactive exploration and reproducibility. This is achieved by doing our exploration in scripts and keeping those scripts under source control.
Our typical work flow will be
The R Markdown and source control steps are important for their roles in reproducibility. We will practice these steps through this article series to help you incorporate them into your work habits.
Some will find it more natural to interleave creating chunks of R code with writing the accompanying document text. Others will find it more natural to write all or most of the R code and then write the document. The nature of your research project might also influence when to create the markdown document. When your document is highly influenced by the analysis, delaying writing the document may be preferable. When the document form is not dependent on the analysis, there may be benefits to writing the document more in parallel with the analysis. With either approach there are typically some adjustments to the R code and Markdown document at the end to clarify the results of the analysis in the paper.
The console is where R commands are entered and run, and where text results are displayed. Commands can be entered into the console at the prompt or sent to the console from a script or R markdown file. To support reproducibility, we will be entering our commands in a script and then sending them to the console. Even though we will not be entering many commands at the console, you will need to know a little about how the console works to use it with your scripts.
The basics of using the console are as follows.
R commands are similar to commands from general computing languages like C++ or Python. This is a little different from the syntax of languages such as Stata or SAS. R commands typically either assign values to an object or control which commands will get run. R objects will be covered in the next article. For now you can think of an object as a variable.
An R function is similar to a Stata or SAS command. A function performs some action and the action taken is adjusted based on the parameters given. In this series we use the term command in a very loose sense to refer to a function as well as to formal R commands.
Syntax and use of functions
functionName(parameterList)
functionName is the name that identifies the function in R.
parameterList is a list of parameters. Parameters in the list are separated by commas. Parameters can be identified by either their position in the list or by a name. In most instances using the parameter name enhances the readability of your code. We will primarily use parameter names in this article series. There is one case where we will drop the use of parameter names. This is when the function name makes it clear what the first parameter is. An example of this is provided below with the help function.
An R function returns an object as its result. The returned object must be saved if it is to be used again.
The use of "()" directly following a name identifies the name as a function and the contents of the parentheses as parameters. The parentheses are needed for a function even if there are no parameters. Parentheses can also be used in an expression to identify order of operations. When used for order of operations, the parentheses will not directly follow a name. In the Data preparation article you will also see the use of the square brackets "[]". When "[]" follows a name, the name identifies a data object and not a function. The use of these brackets will be demonstrated further in this and the following lesson.
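As a minimal sketch of the syntax described above, the calls below use base R's round(), seq(), and getwd() functions and the built-in letters object. These were chosen here only for illustration and are not part of this article's own examples.

```r
# Parameters identified by position: the first is the number to round,
# the second is the number of digits to keep.
round(3.14159, 2)

# The same call with the second parameter identified by name.
round(3.14159, digits = 2)

# Named parameters may be given in any order.
seq(from = 1, to = 10, by = 3)
seq(by = 3, to = 10, from = 1)

# Parentheses are required even when there are no parameters.
getwd()

# Square brackets after a name select from a data object.
# letters is a built-in object holding the lower case alphabet.
letters[2]
```

The two round() calls give the same result, as do the two seq() calls. Only the way the parameters are identified differs.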
An Expression in R is any text which, when interpreted in R, results in a data object. Beneath this simple definition is one of the powerful constructs of R. An expression can also be used anywhere that a value is expected. This allows simple functions to be linked together to do much more sophisticated operations. This approach of building more specific results using this feature will be demonstrated in this article series as we learn more commands.
Expressions include numeric, logical, or character values with their associated operators. These values can be a variable, a constant, or the returned value from a function.
The numeric operators include +, -, *, /, and ^, which are addition, subtraction, multiplication, division, and exponentiation respectively. The functions log(expression) and exp(expression) give the natural log of expression and the constant \(e\) raised to the power given by expression.
The logical operators include: == the logical test for equality, < less than, <= less than or equal, > greater than, >= greater than or equal, | logical or, and & logical and.
The character operators are more special purpose and will not in general be covered. A few of them will be introduced where needed in subsequent articles.
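The operators above can be tried at the console. The following is a small sketch using arbitrary values picked for illustration. The last line shows an expression used where a value is expected, with the result of one function passed directly to another.

```r
# Numeric operators
7 + 2        # addition
7 - 2        # subtraction
7 * 2        # multiplication
7 / 2        # division
7 ^ 2        # exponentiation
log(exp(1))  # natural log of the constant e

# Logical operators return TRUE or FALSE
7 == 2             # test for equality
7 >= 2             # greater than or equal
(7 > 2) & (7 < 5)  # logical and
(7 > 2) | (7 < 5)  # logical or

# The value returned by log() is used as the first parameter of round().
round(log(exp(2) * 10), digits = 3)
```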
The first commands you will need are provided below.
Implied display command. This is used to display an R object at the console.
expression
The value of expression will be displayed on the console.
Enter the following at the console.
(3 * 5 + 1) / 2
The following will be displayed at the console
[1] 8
Note, since the "()" did not follow a name, they were used as grouping operators and not as a means of identifying a parameter list.
Assignment command
object <- expression
Object is set to the value of expression.
Enter the following at the console.
x <- (3 * 5 + 1) / 2
Note, nothing was displayed. To see what was assigned to x, you would have to enter "x" at the console.
A Comment is text to remind yourself, and others, of how to use your code and how it works. Comments are ignored (not treated as commands) by R.
# reminder text
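The display, assignment, and comment commands are often used together. The short sketch below reuses the expression from the examples above. The object name y is introduced here only for illustration.

```r
# Save the result of an expression in an object named x.
x <- (3 * 5 + 1) / 2

# Entering the object's name on its own line displays its value.
x

# A saved object can be reused in later expressions.
y <- x * 2   # a comment can also follow a command on the same line
y
```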
The help and example commands provide assistance with an object. Help provides a description of an object. The object may be a function or dataset. Example shows an example of the use of a function.
help(topic)
example(topic)
help() and example() are functions. There are a number of parameters which could be used with these functions. We are only interested in the first parameter of these functions, which is named topic.
We will use help to get some information on the read.table function
Enter the following at the command prompt in the console.
help(read.table)
Press the enter key to run the command.
Select the Help tab in the tools pane.
This could have also been entered as help(topic=read.table). The use of "topic=" does not enhance the readability of the code. This is one of the cases where we will identify the parameter by position.
A description of the function and its parameters will be displayed as is seen in the image below.
R's commands (functions) were written by many individuals over a large number of years. No central authority exists which controls naming conventions. This has led to differences in names for similar objects (functions, parameters, etc.). This causes no problems for R, but a little extra time is typically needed to learn R's parameter and function names. Use the help function to remind yourself as you learn the names in R.
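When only a function's parameter names are needed, a quicker check than reading the full help page is possible. The sketch below uses base R's args() and formals() functions, which are not otherwise covered in this article. The object name paramNames is hypothetical.

```r
# Display the parameter list of read.table, with any default values.
args(read.table)

# Save just the parameter names as a character vector.
paramNames <- names(formals(read.table))
paramNames
```

The help page remains the place to look for what each parameter means.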
An R script is a series of commands in a file. R scripts have a file extension of .R. Scripts are ordinary text files and can be written using any text editor. We will use RStudio's editor to write our scripts. Using R or RStudio's editors makes it easy to work interactively, running commands as they are written. The editors in R and RStudio do not automatically save changes made to a script. You will need to save on a regular basis when you use either of these editors.
Keeping your R scripts to a reasonable length makes them easier to work with. It is easier to find code if the file is smaller. Also you will likely want to use some, but not all, of your code in your R Markdown files. Segmenting your work with this in mind will also make creating your Markdown documents easier. Multiple scripts can be collected in a single script. This allows all your code to be run together. Running a script from another script will be demonstrated in the Data presentation article using the source() function.
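As a brief preview, and only as a sketch, the code below shows source() running the commands stored in another file. A temporary file stands in for one of your own scripts so that the sketch is self-contained. The names partPath and total are hypothetical.

```r
# Write a one-line script to a temporary file. In practice this would be
# one of your own .R files.
partPath <- tempfile(fileext = ".R")
writeLines("total <- 2 + 3", partPath)

# source() runs every command in the named file.
source(partPath)

# The object created by the sourced script is now in the workspace.
total
```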
We are going to create a script for you to enter and run the example commands from this lesson.
Open a new R script file.
Save it as SalAnalysis.
An R session is started when R or RStudio is started. R creates a workspace for each session. The workspace contains the objects that R knows about. Your data and functions are added to the workspace as you create them in your session. R loads a set of core functions as part of starting a session.
There are two aspects to getting your session ready to use. The first is loading any non-core commands you need. R commands (functions) are grouped in packages. A package typically includes a set of related functions. It is typical to load at least a few packages when starting R. The second aspect of session preparation is setting the work directory, where R will look for files.
The packages which make up the core functions and commands of R are loaded when R is started. There are many packages which extend R's commands beyond the core commands. These extension packages need to be loaded in each R session before you can use the functions they contain. The functions in these extensions range from widely used functions to obscure functions used by only a small number of people.
A package needs to be installed on your computer before you can load it into your session. R and RStudio manage a library of packages that have been installed on your computer.
Winstat has a number of common packages installed for you. The packages installed in your library can be seen in the packages tab.
Installing a package is shown here with the ggplot2 package being installed.
Select the Packages tab. The tab should look like the image below with a list of packages already in your library.
Select the Install icon. An Install Packages window will open. Enter the package you want installed in the Packages box. The text will autocomplete once you have entered enough characters to distinguish the package.
Leave the other boxes as the defaults.
Click the install button in the Install Packages window and the package will be installed.
The ggplot2 package is now installed and is in your library. The ggplot2 package can now be loaded in an R session.
The command to load a package into your R session is
library(packageName)
PackageName is the package which is to be loaded.
There are a few packages we will use in this article series. We will add code to your script to load these packages.
Enter the following commands into your SalAnalysis script. (The lines can be copied and pasted from this file into your script. Remember to save after updating your script.)
#####################################################
#####################################################
##
## Demonstration from the R For Researchers series
##
## The focus of the analysis in these articles is
## on demonstrating the use of R functions in the
## analysis of data. These analyses are not
## complete analyses. They include only the steps
## needed to demonstrate the use of the R
## functions.
##
## Name Date
##
#####################################################
#####################################################
#####################################################
#####################################################
##
## Session Setup
##
#####################################################
#####################################################
library(faraway) # glm support
library(MASS) # negative binomial support
library(car) # regression functions
library(lme4) # random effects
library(ggplot2) # plotting commands
library(reshape2) # wide to tall reshaping
library(xtable) # nice table formatting
library(knitr) # kable table formatting
library(grid) # units function for ggplot
Click the run icon, which is on the right in the top row of icons in the source pane. The run icon has a green arrow pointing to the right. If you hover over the icon you will see the text "Run the current line or selection" as seen in the image below.
The results will be displayed in the console. The results should be similar to the following.
Attaching package: 'car'
The following objects are masked from 'package:faraway':
logit, vif
Loading required package: Matrix
Notice that these packages required other packages to be loaded. R loaded these packages automatically. If any of the packages do not load, the package likely needs to be installed on your computer.
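One way to check which packages still need to be installed, before running the library commands, is sketched below. It uses base R's requireNamespace() function. The object names needed, isInstalled, and missingPkgs are hypothetical, and the install step is left commented out because it downloads from the internet.

```r
# Packages loaded by the session setup code in the script above.
needed <- c("faraway", "MASS", "car", "lme4", "ggplot2",
            "reshape2", "xtable", "knitr", "grid")

# requireNamespace() returns TRUE when a package is installed and FALSE
# otherwise. It does not attach the package to the session.
isInstalled <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed.
missingPkgs <- needed[!isInstalled]
missingPkgs

# Uncomment to install the missing packages from the console.
# install.packages(missingPkgs)
```

An empty result means every package in the list is already in your library.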
Several comment blocks were added to the script with the library functions. The first is used to identify what this script is for. The second is used to identify the beginning of the setup section of the script. Comment blocks make it easier to find sections of functionality in scripts. It's a good practice to use comment blocks to separate sections of your scripts.
It is also a good practice to load packages at the beginning of a script or R markdown file. If while working you discover you need another package loaded, add the library command for it at the beginning of the file with the other library commands. By keeping the package loading at the beginning of the files, commands can be used anywhere in the file. This avoids having to look through a script to see if you have loaded a package for a function you need.
Setting the work directory allows you to reference files without giving a full path name to the file. There are several advantages to setting your working directory. The most important advantage occurs when the working directory is the same as the project folder. If you move the project folder, you have one line in your script to change to point to the new folder. The script and project become more portable with this approach. Setting the work directory will save you typing in your scripts, since you will not need to enter the full path to the file. RStudio sets the work directory to the project directory when a project is opened. If a script will be run outside of the project, the work directory will need to be set.
We are going to set our work directory in our script.
Enter the following commands into your script and run them.
saveDir <- getwd() # get the current working directory
saveDir # show me the saved directory
wd <- "u:/RFR" # path to my project
setwd(wd) # set this path as my work directory
Your console should display a similar working path. (Note that your path will differ from the displayed path.)
[1] "u:/RFR"
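The sketch below illustrates how a set work directory lets files be referenced by name alone, and how the directory saved with getwd() can be restored afterwards. The folder returned by tempdir() stands in for a project folder so the sketch can run on any computer. The file name demo.csv and the object name demo are hypothetical.

```r
saveDir <- getwd()            # remember the current working directory

# tempdir() stands in for a project folder here.
setwd(tempdir())

# With the work directory set, a file is referenced by its name alone.
write.csv(data.frame(id = 1:3), "demo.csv", row.names = FALSE)
demo <- read.csv("demo.csv")
nrow(demo)

# file.path() builds the full path when one is needed.
file.path(getwd(), "demo.csv")

unlink("demo.csv")            # remove the demonstration file
setwd(saveDir)                # restore the original working directory
```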
Your script now has the code needed to set up your session. This is a good time to commit the changes to SalAnalysis.
Create a new script titled AlfAnalysis. This script will be used for the exercises in this article series.
Set up the session for the AlfAnalysis. Load the same packages as were used in SalAnalysis.
Commit your changes to AlfAnalysis.
Git has a number of commands to access prior versions of your source code. RStudio has implemented only one of these functions, which RStudio calls revert. This function changes the working directory file to match the state of the head. The choice of the name "revert" for this function is unfortunate in that this is not what a Git revert does. The RStudio revert is like a Git reset with the hard option. This article series will refer to this function as RStudio's revert to distinguish it from a Git Revert.
It is important to recognize that the intent of an RStudio revert is to overwrite files in the working directory. The overwritten changes in the working directory would not have been committed and as such are not part of the project's history. The overwritten changes are permanently removed from the working directory and cannot be recovered from the project's history. RStudio's revert needs to be used with great care.
There are times when RStudio's revert is what is needed. An example would be trying a new approach to a calculation for an analysis and determining it is not as good as the prior approach. Going back to the prior version of the calculation and not saving the failed improvement code might be what you want.
We recommend the use of a Git GUI to access source files from commits prior to the branch's head, the last commit.
We will make a meaningless change to our script and use RStudio's revert to restore the file to its committed state.
Add the following line to your SalAnalysis script.
# Silly comment used to test RStudio's revert
From the tools menu in the Git tab, select Revert.
The comment line has been removed from the file.
This example would have been easier to do using the undo function in the editor. There are times when the editor undo is either not an option or would be difficult to use. For example when a file has been closed undo is no longer available, or when many changes involving multiple files have been made.
A central repository is a repository which is used solely to store the project. No development is done in a central repository. Development is done in local repositories. Our RFR repository is a local repository. Central repositories are useful as a backup for a local repository or to coordinate work done in multiple local repositories.
A central repository is a remote repository. It is called remote because it is remote with respect to our local repository. Remote does not necessarily mean far away. The remote repository might be saved to the same storage device as the local repository.
We will create two folders, named cen and home, to demonstrate the use of a central repository. The cen folder will be where our central repository will be stored. The home folder will hold a second local repository for the RFR project. These two additional repositories will be stored on the U drive as is your current RFR project, all on the same device. This is done for convenience of this example. In practice, additional development repositories, local repositories, would likely be stored on different devices.
We need to create the central repository for RFR and then connect our local repository to the central repository. RStudio does not support this functionality. We will do these steps using the shell.
Select the Shell option from the tools drop down menu in the Git tab.
A shell window will open. The prompt shows the current folder. This is the folder in which commands will be executed. The prompt should include the path /u/RFR in the shell.
We need to change our folder to where the central repository will be stored and then create the repository.
Enter the following commands in the shell.
cd ../cen
git init --bare RFR.git
The init --bare command and parameter tell Git that the new repository will not be a development repository. RFR.git is used to identify the name of the repository. We will call all three of the repositories RFR. We will use the repositories' locations to distinguish them and not their names.
We now need to connect our local RFR repository to the central repository.
Enter the following commands in the shell.
cd ../RFR
git remote add central ../cen/RFR.git
git push -u central master
The remote add command and parameter tell Git to add a path to a remote repository. The added path will be named central in the RFR repository and the path to it is ../cen/RFR.git.
The push -u command and parameter copy the master branch of the local RFR repository to the central repository. The -u parameter also sets central/master as the upstream branch, which later pushes and pulls will use by default.
The shell should now look similar to the following image.
Close the shell window by clicking the red X in the upper right corner.
The Pull and Push icons, in the top row of icons on the Git tab, should be fully displayed and not greyed out. You may need to click the refresh button in the Git tab to fully see the push and pull icons.
Click on the history icon in the Git tab.
In the Review Changes window you should see that the most recent commit, the top one, now has three identifiers associated with it. The first identifier, HEAD, indicates this commit is the head of our currently checked out branch. The second identifier, central/master, indicates this commit is the head of the master branch in the central repository. The third identifier, master, indicates this commit is the head of the master branch in this repository. When there is only one branch in a repository, HEAD and master will point to the same commit.
Clicking the Push button will now move to the central repository any commits in the local repository which are not in the central repository.
If your computer is connected to the SSCC network, we recommend that you use one of the network drives for your project, such as U, V, etc. If you are working on a computer not connected to the network, we recommend you set up a central repository on a network drive and push to the repository on a regular basis.
A central repository is also useful for coordinating work in multiple local repositories. One situation for this would be a project in which you work both on a University computer connected to the network and a computer not typically connected to the network, such as a laptop or home computer. This situation will be demonstrated in the next example. The procedure would be similar if the repositories were associated with different members of the team.
We will set up another repository to work on the RFR project. We will set up this second repository in the home folder you created above.
Open a second instance of RStudio.
From the File drop down menu, select New Project.
Select Version Control from the Create project menu in the New Project window.
Select Git from the Create project from Version Control menu in the New Project window.
Enter "file:///U:/cen/RFR.git" in the Repository URL box. The path to the central repository needs to include "file:///" to address some issues with network drives when cloning a repository. This prefix is not needed after the repository has been cloned.
Enter "RFR" in the Project directory name box.
Navigate to "U:/home" in the Create project as a subdirectory of box.
Click the Create Project icon at the bottom of the New Project window.
The path "U:/home/RFR - master" should now be displayed in the upper left of the RStudio window next to the RStudio icon.
Select the Git tab in the tools pane.
Click on the history icon.
The Git log can be seen in the Review changes window.
All of the commits we made in our primary repository can be seen in the new home/RFR repository. The name of the remote repository here is origin. This is the same repository that is named central in our primary RFR repository.
Close the Review Changes window.
There are now three RFR repositories on your U drive. Two of these repositories are local repositories. The first is our primary working repository, U:/RFR, and the second is the home repository, U:/home/RFR. Work done in one of these repositories can be shared with the other by pushing and pulling through the central repository.
As an example of sharing project work, we will make a change in the home repository and move the change to our primary working repository.
You should have both of the local RFR projects open. If they are not both open, open them. The path displayed in the upper left corner of the RStudio window will identify which project the window is associated with. The home RFR project will display the path "U:/home/RFR".
Open the SalAnalysis.R script in the home RFR project.
Add the following comment line to the SalAnalysis.R script in the home RFR project. Add this line after the other content in SalAnalysis.
# Comment added in home/RFR
Save the SalAnalysis file.
Commit the change to the SalAnalysis file with the commit message "Added comment in home project".
Open the Git history in the home RFR project and the log will show this new commit in the U:/home/RFR repository and not in the origin repository, which is our central repository.
Open the Git history in our primary RFR project and the log will not show this new commit in the repository.
We will now move the new commit from the home RFR repository to the central repository. Click the Push icon in the Git tab of the U:/home/RFR project.
Click the close button in the Git Push window.
Click the refresh icon in the Review changes window for the U:/home/RFR project.
The U:/home/RFR Git log shows that the origin repository now contains the new commit.
Clicking on the refresh icon in the Review changes window for our primary RFR project will show that this repository has not changed. Also note that there is no indication that there are changes in the central repository waiting to be pulled into this repository.
We will now move the new commit from the central repository to the primary RFR project. Click the Pull icon in the Git tab of the U:/RFR project.
Click the close button in the Git Pull window.
Click the refresh icon in the Review changes window for the U:/RFR project.
The commit which was done in the U:/home/RFR project is now seen in the log of the U:/RFR repository.
Click on the SalAnalysis.R file. You will see that the change made to this file in the U:/home/RFR project is now in this file.
The changes made in the U:/home/RFR project have been applied to the U:/RFR project.
This example worked this easily because there were no conflicts in the file pulled from the central repository. If there are conflicts, you would need to resolve them. There are tools to support this process. We will not cover these tools in this article series. While the tools can help, the heart of the conflict resolution process is you deciding what changes will be made in your project.
Last Revised: 11/24/2015