SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

1.1 Programming basics

This section introduces some of the basics of programming. It is meant for both those new to programming and those new to R or Python. Experienced programmers who are new to R or Python can skip the discussion and go straight to the examples, which show the basics in these languages.

1.1.1 Data concepts - Objects

Programming languages store and interact with data using constructs similar to named files and folders. A stored data item in a program is given a name, like a file is on a computer. This name is what allows one to communicate to the program what data is to be manipulated by any command. Without a name, data created or manipulated by a command will be lost. In R and Python named data items are objects. Objects can also contain other objects similar to how a folder can contain additional folders and files. This is a means to organize data in a program like folders do on a computer.

Objects have types and classes. The type and class of an object provide information to the programming language on what can be done with the object. For example, if the object is some kind of numeric data, then multiplication can be done. If the object is instead some kind of character value, then multiplication should not be allowed on the object. In this regard, the type and class of an object are like a file extension. For example, .xlsx files are edited by Excel and not a text editor.
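As a small illustration in Python, the sketch below shows objects with different types and one object that contains others. The names age, city, and person are hypothetical and chosen only for this example. The = used to name each object is the assignment command covered later in this section.

```python
# Hypothetical names, used only for illustration.
age = 42            # a numeric object
city = "Madison"    # a character (string) object

# type() reports what kind of object a name refers to.
print(type(age))    # <class 'int'>
print(type(city))   # <class 'str'>

# An object can contain other objects, much like a folder contains files.
person = {"age": age, "city": city}
print(type(person))       # <class 'dict'>

# Multiplication is allowed because the contained object is numeric.
print(person["age"] * 2)  # 84
```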

1.1.2 Programming skills

1.1.2.1 Scripts

Languages such as R, Python, and Stata are interpreted. This means that commands are run as they are entered. Results, if there are any, are given immediately after a command in an interpreted language. Not all computer languages work in this way. Some languages require that all commands be entered and compiled prior to running any of the commands.

A script is a file that contains commands for an interpreted programming language. R, RStudio, IPython, and Stata provide development environments that include an editor to write scripts. These built-in editors have tools that make it easy to send commands to the console (more on the console coming soon) and also provide editing features, such as syntax highlighting and auto-completion. Any plain text editor, such as Notepad, can also be used to write scripts.

Large programs can be split between multiple script files. For example, you may have one script that imports a project’s data sets and prepares the data for analysis and another script for the analysis of the data.

1.1.2.2 Console

The console is where commands are entered and text results are displayed. Commands can be entered directly at the console or can be sent to the console from a script. While commands can be entered directly at the console, you should strive to do all of your work in scripts. Scripts are an important way for you to document your work and to be able to recreate your work when needed. This does not mean you will never enter code directly into the console. Directly entered console commands are commonly used to debug errors in your scripts. The corrected commands are then put into the script.

1.1.2.3 Assignment command

One of the most basic commands in a programming language is the assignment command. It is used to name an object in a program. It is this act of naming that gives the object permanence in your program.

The syntax for an assignment is:

    <name> <assignment operator> <data>

The name is on the left and the code to create the object, <data>, is on the right side of the assignment operator. This syntax for an assignment is used in many programming languages. The assignment operator differs between programming languages.

The assignment operator for R is <- and for Python it is =. An assignment is therefore written as

    <name> <- <data>

or

    <name> = <data>

The object that is associated with a name can be changed. This is also done by assigning a new object to the name. A name is only associated with the last object assigned to it.

When the name of an object is used in a program, the program operates on the object and not the name. The name is only a reference to the object.
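The following Python sketch illustrates both points: a name can be re-assigned to a new object, and a name is only a reference to an object. The values are arbitrary and chosen only for illustration.

```python
x = 7          # the name x refers to the object 7
y = x          # y now refers to the same object as x
print(y is x)  # True: two names, one object

x = "seven"    # x is re-assigned; it now refers to a different object
print(x)       # seven
print(y)       # 7 (y still refers to the original object)
```

Re-assigning x did not change y, because y was associated with the object 7 and not with the name x.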

1.1.2.4 Implied print

An object can be displayed to the console by referencing it. That is, by entering only the name of the object on a line, without an assignment being done. This implied print is supported in most interpreted languages.
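In Python, the implied print at the interactive console displays the object's representation, which is the same text the built-in repr() function returns. The sketch below compares that with an explicit print() call. The name console_echo is hypothetical and used only for illustration.

```python
y = 'x'

# At the interactive console, entering just `y` echoes the object's
# representation, which for a character value includes the quotes.
console_echo = repr(y)
print(console_echo)   # 'x'

# An explicit print() shows the value without the quotes.
print(y)              # x
```

When a Python script is run as a whole rather than sent to the console line by line, a line containing only a name displays nothing, so print() is needed there.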

1.1.2.5 Paths

Data files have a name and are located in a folder. (A folder is the same as a directory. You will see both of these names in common use.) The folder containing the file may be nested within another folder and that folder may be in yet another folder and so on. The specification of the list of folders to travel and the file name is called a path. A path that starts at the root folder of the computer is called an absolute path. A relative path starts at a given folder and provides the folders and file starting from that folder. Using relative paths will make a number of things easier when writing programs and is considered a good programming practice.

When a program starts, it has a location to look for other files. This path is called the current working directory, often shortened to the working directory. Relative paths in a program are specified as starting at the current working directory.

The current working directory of a program can be changed, if needed.
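The relationship between relative paths and the working directory can be sketched in Python using the standard os and tempfile modules. The file name data/survey.csv is hypothetical and does not need to exist, and the temporary folder stands in for whatever folder you might change to.

```python
import os
import tempfile

start_dir = os.getcwd()   # the current working directory
print(start_dir)

# A relative path is resolved against the current working directory.
# "data/survey.csv" is a hypothetical file; it does not need to exist.
relative_path = os.path.join("data", "survey.csv")
print(os.path.abspath(relative_path))

# Changing the working directory changes where relative paths point.
with tempfile.TemporaryDirectory() as new_dir:
    os.chdir(new_dir)
    moved_path = os.path.abspath(relative_path)
    print(moved_path)
    os.chdir(start_dir)   # return to where we started
```

The same relative path produces two different absolute paths, one for each working directory. This is why a script that uses relative paths keeps working when the project folder is moved.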

1.1.2.6 Projects

A project is a collection of work that accomplishes a particular task. Wrangling projects are typically organized in a folder. This folder is called the project folder. The project folder contains the program scripts, any needed data files, and hopefully papers/reports.

The examples in this book are designed to be used in an RStudio project. This choice offers a number of conveniences, a few of which are:

  • RStudio supports both R and Python.

  • RStudio sets your working directory to your project folder.

  • RStudio allows interpreted execution of individual commands within a code notebook.

1.1.3 Examples - R

1.1.3.1 Console

Open your DWE RStudio project.

In the console you will see the greater than character, >, on the last line. This is the R prompt. When the prompt is displayed, the program is ready for a new command. When a prompt is not displayed, prior commands are still being worked on. Some commands can take a while to run and no prompt will be displayed while they are running.

  1. R will be used as a calculator to demonstrate the operation of the console. The following calculates \((3 + 9) / 2\).

    > (3 + 9) / 2
    [1] 6

    The displayed result was not assigned to a name, so the calculation of (3 + 9) / 2 created an object with no name. This is called an anonymous object. Since there was no assignment, this is an implied print and results are displayed at the console. The prompt is given again after the calculation is completed. The program is ready for the next command.

  2. Continuing a command across multiple lines.

    If R reaches the end of a line and the command is not complete, R assumes the next line continues the command on the prior line. When this occurs, R uses the plus symbol, +, to prompt for the continuation of the command. Splitting some commands across multiple lines can improve the readability of your source code by allowing the structure of the command or data to be seen visually. This will be demonstrated when we know more of the R language.

    The following example calculates the same expression as the prior example. Here, the 2 that follows the division operator is not given on the same line as the rest of the expression. R knows that the command is not complete because the divisor has not been given. The + prompt is seen at the start of the next line. No divisor is given to complete the command on that line, so R gives the + prompt again on the third line of the example. The 2 is provided here and the calculation is completed.

    > (3 + 9) /
    +   
    + 2
    [1] 6

    Splitting this expression across multiple lines did not make this code more readable. It was done only to demonstrate the continuation prompt.

    You may get the continuation prompt and realize that you made an error and want to stop the command instead of finishing the command. This is done by pressing Esc.

  3. Recalling commands in the console.

    Prior commands can be recalled, edited if needed, and then run again. This is done using the Page Up and Page Down keys to scroll through the history of prior commands. Prior commands can also be recalled from RStudio's history tab.

1.1.3.2 Assignment

  1. The following example assigns to the name x the value 7. An implied print is used to display the data object x references.

    > x <- 7
    > x
    [1] 7

  2. This example assigns to y the data object that x references (not the character value "x"). The object displayed by the reference of y is also the value 7.

    > y <- x
    > y
    [1] 7

  3. This example assigns the character value "x" to the name y. The x is quoted to make it a character value. Without the quotes, R assumes that x is the name of an object.

    > y <- "x"
    > y
    [1] "x"

  4. This example attempts to add y to 5. The data that y references is a character value and R will issue an error because addition is not defined for character values.

    > 5 + y
    Error in 5 + y: non-numeric argument to binary operator

  5. In this example the addition works because the data object that x references is a numeric value.

    > 5 + x
    [1] 12

1.1.3.3 Paths

  1. This example displays the current working directory. In R one gets the working directory using getwd().

    > getwd()
    [1] "U:/projects/DWE/book"

    The path displayed here is the book sub-folder of the RStudio project that created this book. If you run this command in your console, you will see the path is the folder of your DWE project. Relative paths in your scripts will use this as their starting point.

1.1.4 Examples - Python

1.1.4.1 Console

If you are using RStudio, open your DWE RStudio project.

  1. Entering Python mode in the R console.

    RStudio's native language is R. It supports other languages as well, one of which is Python. To get RStudio ready to run Python commands, the following library() call is needed. (library calls are explained in the next section.) For now, just include this library call when you want to use Python in RStudio.

    library(reticulate)

    The console is still in R mode after the library(reticulate) call. The following is used to put the console in Python mode.

    repl_python()

    In the console you will see three greater than characters, >>>, on the last line. This is the Python prompt. When the prompt is displayed, the program is ready for a new command. When a prompt is not displayed, prior commands are still being worked on. Some commands can take a while to run and no prompt will be displayed while they are running.

If you are using Spyder (or another Python IDE), open it now.

  1. Python will be used as a calculator to demonstrate the operation of the console. The following calculates \((3 + 9) / 2\).

    >>> (3 + 9) / 2
    6.0

    The results of the calculation of (3 + 9) / 2 created an object with no name. This is called an anonymous object. This anonymous object is passed to the print function and the results are displayed at the console.

    The prompt is given again after the calculation is completed. The program is ready for the next command.

    If Python reaches the end of a line and the command is not complete, it will generally issue an error. (The exception is a line that ends inside an open parenthesis, bracket, or brace, which Python continues automatically.) To continue a Python command to the next line, the backslash character, \, is added to the end of the current line. When a command is continued to the next line, Python uses three periods, ..., to prompt for the continuation of the command. Splitting some commands across multiple lines can improve the readability of your source code by allowing the structure of the command or data to be seen visually.

  2. This example calculates the same expression as the prior example. Here, the 2 that follows the division operator is not given on the same line as the rest of the expression. Python knows that the command is not complete because of the \. The ... prompt is seen at the start of the next line. The ... prompt is given until the command is complete. The 2 is provided on the third line and the command is completed, since there is no \ on that line.

    >>> (3 + 9) / \
    ... \
    ... 2
    6.0

    Splitting this expression across three lines did not make the code more readable. It was done only to demonstrate continuing a command across multiple lines.

    The escape key, often labeled Esc on a keyboard, is used to stop console commands. In RStudio, pressing Esc also exits Python mode in the console. You will need to enter repl_python() to start Python mode again.

  3. Recalling commands in the console.

    Prior commands can be recalled, edited if needed, and then run again. This is done using the page up and page down keys to scroll through the history of prior commands. Prior commands can also be recalled from RStudio's history tab.

  4. If you are using RStudio as the IDE, Python mode is exited by entering exit.
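The line continuation rules from this section apply in scripts as well as at the console. The sketch below shows the backslash form used above next to Python's implicit continuation inside parentheses, which needs no backslash. The names half_a and half_b are hypothetical and chosen only for illustration.

```python
# Backslash continuation: the expression continues on the next line.
half_a = (3 + 9) / \
    2

# Implicit continuation: Python keeps reading while a parenthesis is open,
# so no backslash is needed.
half_b = (
    (3 + 9)
    / 2
)

print(half_a, half_b)  # 6.0 6.0
```

Wrapping a long expression in parentheses is often preferred to backslashes, since a stray space after a backslash causes an error.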

1.1.4.2 Assignment

  1. This example assigns to the name x the value 7. An implied print is used to display the data object x references.

    >>> x = 7
    >>> x
    7

  2. This example assigns to y the data object that x references. This is not the character value 'x'. The object displayed by the reference of y is also the value 7.

    >>> y = x
    >>> y
    7

  3. This example assigns the character value 'x' to the name y. The x is quoted to make it a character value. Without the quotes, Python assumes that x is the name of an object.

    >>> y = 'x'
    >>> y
    'x'

  4. This example attempts to add y to 5. The data that y references is a character value and Python will issue an error because addition between character and numeric values is not defined.

    >>> 5 + y
    Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: unsupported operand type(s) for +: 'int' and 'str'
    
    Detailed traceback: 
      File "<string>", line 1, in <module>

  5. In this example the addition works because the data object that x references is a numeric value.

    >>> 5 + x
    12

1.1.4.3 Paths

  1. This example displays the current working directory. In Python one gets the working directory using getcwd(). The getcwd() function is in the os library so we need to tell Python to look for it there. (Functions and libraries will be introduced in the next section.)

    >>> import os
    >>> print(os.getcwd())
    U:\projects\DWE\book

    The path displayed here is the book sub-folder of the RStudio project that created this book. If you run this command in your console, you will see the path is the folder of your DWE project. Relative paths in your scripts will use this as their starting point.

1.1.5 Exercises

  1. Create an object with the value i referenced by the name first_object.

  2. Create an object with the value 52 referenced by the name second_object.