2.7 Code Examples: R & Python

Below we present examples of equivalent code in R and Python for easier comparison. The subsections dedicated to R and Python should be studied beforehand to get a general overview of the two languages, as this chapter only summarizes the functionality and provides a side-by-side comparison of selected operations. This is not an exhaustive comparison; any additional operations that may be needed will be covered in their relevant topics.

You can get additional information on many functions in either R or Python:

  • in R:

You can use ?function_name or help(function_name) to get help about any function, for example:

## Print Values
## 
## Description:
## 
##      'print' prints its argument and returns it _invisibly_ (via
##      'invisible(x)').  It is a generic function which means that new
##      printing methods can be easily added for new 'class'es.
## 
## Usage:
## 
##      print(x, ...)
##      
## ...
  • in Python:

You can use help(function_name) to get help about any function, for example:

## Help on built-in function print in module builtins:
## 
## print(...)
##     print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
##     
##     Prints the values to a stream, or to sys.stdout by default.
##     Optional keyword arguments:
##     file:  a file-like object (stream); defaults to the current sys.stdout.
##     sep:   string inserted between values, default a space.
##     end:   string appended after the last value, default a newline.
##     flush: whether to forcibly flush the stream.
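
The Python help text above is the documentation string of the built-in print function. A minimal sketch of two ways to reach it (the variable name doc_text is only illustrative):

```python
# Show the help page for the built-in print function.
help(print)

# The same text is available as a plain string via the __doc__ attribute.
doc_text = print.__doc__
print(doc_text.splitlines()[0])
```

The exact wording of the help text depends on the installed Python version.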

Note that the documentation quality will vary (e.g. some functions may have more detailed explanations and examples in R than similar ones in Python, and vice versa).

2.7.1 Working with data arrays

We begin by examining a data array.

We can create a simple vector with some values:

We can select specific elements. Note that in Python indexing starts at 0, while in R it starts at 1:
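
As a minimal Python sketch of creating a vector and selecting single elements, assuming a hypothetical list my_vec holding the values 1 to 8:

```python
# A hypothetical vector of eight values, stored as a Python list.
my_vec = [1, 2, 3, 4, 5, 6, 7, 8]

first = my_vec[0]    # first element (index 0 in Python, index 1 in R)
third = my_vec[2]    # third element
last = my_vec[-1]    # negative indices count from the end of the list

print(first, third, last)
```

Note that a negative index means something different in R, where my_vec[-1] drops the first element instead of selecting the last one.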

Note that when selecting multiple values, in R the index range is 1:3 = {1, 2, 3}, while in Python the index range 0:3 = {0, 1, 2} - i.e. the last value is not included. Similarly, we can also create a range of values:
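
The half-open ranges described above can be sketched in Python as follows (my_vec and the other variable names are illustrative assumptions):

```python
my_vec = [1, 2, 3, 4, 5, 6, 7, 8]

# The slice 0:3 takes indices 0, 1 and 2; the stop index 3 is excluded.
first_three = my_vec[0:3]

# range(start, stop) follows the same rule, so stop is one past the last value.
zero_to_two = list(range(0, 3))
one_to_three = list(range(1, 4))   # the same values as 1:3 in R

print(first_three, zero_to_two, one_to_three)
```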

We can get the length of the data array:

We can loop through our data array and print the elements:
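
A minimal Python sketch of getting the length and looping over the elements, again assuming a hypothetical list my_vec with the values 1 to 8:

```python
my_vec = [1, 2, 3, 4, 5, 6, 7, 8]

# len() returns the number of elements, like length() in R.
n = len(my_vec)
print(n)

# Iterate over the values directly ...
for value in my_vec:
    print(value)

# ... or over the indices 0, ..., n - 1 when the position is needed.
for i in range(n):
    print(i, my_vec[i])
```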

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8

We can select every 3rd element, starting from the 2nd element:

Or every 2nd element starting from the 1st:
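
In Python both selections can be written with the extended slice syntax start:stop:step. A short sketch, assuming the hypothetical list my_vec from 1 to 8:

```python
my_vec = [1, 2, 3, 4, 5, 6, 7, 8]

# Start at index 1 (the 2nd element) and take every 3rd element.
every_third_from_second = my_vec[1::3]

# Start at index 0 (the 1st element) and take every 2nd element.
every_second_from_first = my_vec[::2]

print(every_third_from_second)
print(every_second_from_first)
```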

We can add two vectors together:

In this case, we use zip to create pairs from the two lists:

## [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]

Then we iterate through the pairs and sum the elements in each pair.
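
The two steps can be sketched as follows; the values of my_vec_1 and my_vec_2 are inferred from the printed pairs above, and the names pairs and sums are illustrative:

```python
my_vec_1 = [1, 2, 3, 4, 5]
my_vec_2 = [5, 4, 3, 2, 1]

# zip pairs up the elements at the same positions of the two lists.
pairs = list(zip(my_vec_1, my_vec_2))
print(pairs)

# Sum the two elements of each pair to get the element-wise sum.
sums = [a + b for a, b in pairs]
print(sums)
```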

Note that the addition of two vectors my_vec_1 + my_vec_2 in R has a different meaning in Python:

## vec. 1 & 2:  1 2 3 4 5 5 4 3 2 1
## vec. 1 & 2:  [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]

If we use numpy.array instead of a list, then we can add the two vectors together in Python like we do in R:
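
A minimal sketch contrasting the two meanings of + in Python, assuming numpy is installed and using the same example values as above:

```python
import numpy as np

my_vec_1 = [1, 2, 3, 4, 5]
my_vec_2 = [5, 4, 3, 2, 1]

# For lists, + concatenates the two lists.
concatenated = my_vec_1 + my_vec_2
print("vec. 1 & 2: ", concatenated)

# For numpy arrays, + adds the elements position by position, as in R.
elementwise = np.array(my_vec_1) + np.array(my_vec_2)
print("vec. 1 + 2: ", elementwise)
```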

## vec. 1 + 2:  [6 6 6 6 6]

We can multiply the vector elements by a constant:

## 3 6 9 12 15
## [3, 6, 9, 12, 15]

Again, for a list in Python, the * has a different meaning:

## 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

Though if we use numpy.array, then it works in a similar way as in R:
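
The three variants of multiplying by a constant can be sketched side by side (variable names are illustrative, numpy is assumed to be installed):

```python
import numpy as np

my_vec = [1, 2, 3, 4, 5]

# A list comprehension multiplies each element by the constant.
scaled_list = [3 * value for value in my_vec]

# The * operator on a list repeats the list instead.
repeated = my_vec * 3

# On a numpy array, * multiplies element-wise, as in R.
scaled_array = np.array(my_vec) * 3

print(scaled_list)
print(repeated)
print(scaled_array)
```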

## [ 3  6  9 12 15]

2.7.2 Working with strings

Similarly, we can create character strings. R has dedicated functions to modify strings, while Python implements these operations as methods of the string class itself, so we can always see which ones are available by typing the dot operator (.) after a string and pressing Tab in the editor to list the methods of that class:
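
A few of the built-in string methods, as a minimal sketch with a hypothetical example string:

```python
my_string = "Hello, World"

upper = my_string.upper()                        # all upper case
lower = my_string.lower()                        # all lower case
replaced = my_string.replace("World", "Python")  # substitute a substring
words = my_string.split(", ")                    # split into a list of strings
joined = " - ".join(words)                       # join a list back into a string
length = len(my_string)                          # number of characters

print(upper, lower, replaced, words, joined, length)
```

Comparable base R functions include toupper, tolower, gsub, strsplit, paste and nchar.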

2.7.3 Variables containing different data type values

It is often desirable to have a variable contain different types of information: integer, string and boolean values.

  • in R:
## List of 3
##  $ name        : chr "Joe"
##  $ grades      : num [1:3] 8 7 9
##  $ has_attended: logi TRUE
  • in Python:
## <class 'dict'>
## {'name': 'Joe', 'grades': [8, 7, 9], 'has_attended': True}
## List of 1
##  $ name: chr "Joe"
## $name
## [1] "Joe"
##  chr "Joe"
## [1] "Joe"
## List of 1
##  $ grades: num [1:3] 8 7 9
## List of 1
##  $ has_attended: logi TRUE

As we can see, these variables are able to hold values of different types. In R we can select the relevant values in a number of different ways.
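
In Python, the dictionary printed above could be built and queried as in the following sketch (the variable name student is an assumption):

```python
# A dictionary maps keys to values of arbitrary, possibly different, types.
student = {"name": "Joe", "grades": [8, 7, 9], "has_attended": True}

print(type(student))
print(student)

# Values are selected by key; nested values are selected by chaining indices.
name = student["name"]
second_grade = student["grades"][1]
print(name, second_grade)
```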

Often, however, we have more than one observation with different properties (e.g. a database of people with names, unique IDs, email addresses, an indicator of whether the person is a new member, etc.) and we want to have a matrix-like structure (i.e. a table) to hold those values.

## 'data.frame':    3 obs. of  3 variables:
##  $ name       : chr  "John" "Sam" "Tim"
##  $ wage       : num  800 600 700
##  $ is_employed: logi  TRUE TRUE FALSE
##   name wage is_employed
## 1 John  800        TRUE
## 2  Sam  600        TRUE
## 3  Tim  700       FALSE

Note that stringsAsFactors = FALSE is required (in R versions before 4.0.0, where the default was TRUE) so that the name column is a character vector instead of a factor (a factor is a vector of integer values with a corresponding set of character values to use when the factor is displayed).

## <class 'pandas.core.frame.DataFrame'>
##    name  wage  is_employed
## 0  John   800         True
## 1   Sam   600         True
## 2   Tim   700        False
## Index(['name', 'wage', 'is_employed'], dtype='object')

The columns = my_dataset.keys() is required to preserve the order of the columns.
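
A sketch of how such a DataFrame might be created from a dictionary, assuming pandas is installed; the values are taken from the printed table above:

```python
import pandas as pd

# Each key becomes a column name, each list a column of values.
my_dataset = {
    "name": ["John", "Sam", "Tim"],
    "wage": [800, 600, 700],
    "is_employed": [True, True, False],
}

# Passing the keys as columns fixes the column order explicitly.
df = pd.DataFrame(my_dataset, columns=list(my_dataset.keys()))

print(type(df))
print(df)
print(df.columns)
```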

P.S. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s.

Values can be accessed directly:

##   name
## 1 John
## 2  Sam
## 3  Tim
## [1]  TRUE  TRUE FALSE
## 0    John
## 1     Sam
## 2     Tim
## Name: name, dtype: object
## 0     True
## 1     True
## 2    False
## Name: is_employed, dtype: bool

Note that using $ in R returns a vector, while specifying ["name"] returns a data.frame with one column (as is evident from the output format).
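
pandas makes a similar distinction, sketched below under the assumption that pandas is installed: a single column label returns a one-dimensional Series, while a list of labels returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Sam", "Tim"],
    "wage": [800, 600, 700],
    "is_employed": [True, True, False],
})

name_series = df["name"]      # a Series (one-dimensional)
name_frame = df[["name"]]     # a DataFrame with a single column
employed = df["is_employed"]  # a boolean Series

print(name_series)
print(name_frame)
print(employed)
```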

2.7.4 Defining Functions

We can also define our own functions.

Let’s define a simple function, which compares two values:
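
One possible Python version of such a comparison, with a hypothetical function name and return messages:

```python
def compare_values(a, b):
    """Return a short description of how a relates to b."""
    if a > b:
        return "a is greater than b"
    elif a < b:
        return "a is less than b"
    else:
        return "a is equal to b"


print(compare_values(3, 2))
print(compare_values(1, 2))
print(compare_values(2, 2))
```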

Let’s say we want to define a custom addition function, which increases the values of two elements by one before adding them together:
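
A minimal sketch of such a function in Python (the name add_one_then_sum is hypothetical):

```python
def add_one_then_sum(x, y):
    """Increase both arguments by one, then add them together."""
    return (x + 1) + (y + 1)


result = add_one_then_sum(2, 3)
print(result)
```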

We can also create a function which creates a summary of a data array:
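
A Python sketch of such a summary function; the function name and the input values [-2, 1, 5] are assumptions, chosen to be consistent with the summary printed further below:

```python
def summarize(values):
    """Return the minimum, maximum, average and sum of a sequence of numbers."""
    total = sum(values)
    return {
        "min": min(values),
        "max": max(values),
        "average": total / len(values),
        "sum": total,
    }


my_summary = summarize([-2, 1, 5])
print(my_summary["min"], my_summary["max"])
print(my_summary)
```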

We note the different data types - a list in R and a dictionary in Python. We can also use the print function with the whole variable:

## $min
## [1] -2
## 
## $max
## [1] 5
## 
## $average
## [1] 1.333333
## 
## $sum
## [1] 4
## {'min': -2, 'max': 5, 'average': 1.3333333333333333, 'sum': 4}

2.7.5 Lists in Python are mutable

The original values were changed in Python, even though we assigned the function output to a new variable!
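
A minimal sketch that reproduces this behaviour (the function name add_one_in_place and the example values are hypothetical):

```python
def add_one_in_place(a_list):
    """Add 1 to every element of the list that was passed in."""
    for i in range(len(a_list)):
        a_list[i] = a_list[i] + 1
    return a_list


original = [1, 2, 3]
result = add_one_in_place(original)

print(result)              # the increased values
print(original)            # also increased: the function changed the caller's list
print(result is original)  # both names refer to the same list object
```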

In order to not modify the object we are passing, we can create a copy of it inside our function:

Note that if we write b_list = a_list instead of b_list = a_list[:] (or instead of b_list = list(a_list)), then we will again modify the original variable!
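
The copying approach can be sketched as follows (the function name add_one_copy is hypothetical, a_list and b_list follow the text above):

```python
def add_one_copy(a_list):
    """Return a new list with 1 added to every element; a_list is left unchanged."""
    b_list = a_list[:]   # shallow copy; list(a_list) works the same way
    for i in range(len(b_list)):
        b_list[i] = b_list[i] + 1
    return b_list


original = [1, 2, 3]
result = add_one_copy(original)

print(original)  # unchanged
print(result)    # increased values in a separate list
```

A slice copy is shallow, so it is sufficient for lists of numbers or strings; nested lists would need copy.deepcopy from the standard library.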

Nevertheless, it is still useful to pass by reference instead of by value if we do not want our function to return anything but still change the original values.

2.7.6 Creating Matrices

We will create the following matrix: \[ \mathbf{X} = \begin{bmatrix} 1 & 4\\ 2 & 5\\ 3 & 6 \end{bmatrix} \]

Note that we can transpose the matrix (so the column elements become the row elements and vice versa) in the following way - using t(...) in R and np.transpose(...) in Python:

##    [,1] [,2] [,3]
## x1    1    2    3
## x2    4    5    6
## Column-stacked lists:
##  [[1 4]
##  [2 5]
##  [3 6]]
## Transposed row-stacked lists:
##  [[1 4]
##  [2 5]
##  [3 6]]
## Row-stacked lists:
##  [[1 2 3]
##  [4 5 6]]
## Transposed column-stacked lists:
##  [[1 2 3]
##  [4 5 6]]

Note the different ways that the matrix is constructed in Python - using np.column_stack creates \(\mathbf{X}\), whereas using np.vstack creates \(\mathbf{X}^\top\).
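
The two constructions can be sketched as follows, assuming numpy is installed; the list names x1 and x2 follow the row names in the R output, and X_t is an illustrative name:

```python
import numpy as np

x1 = [1, 2, 3]
x2 = [4, 5, 6]

# Stack the lists as columns: a 3 x 2 matrix, i.e. X.
X = np.column_stack((x1, x2))

# Stack the lists as rows: a 2 x 3 matrix, i.e. the transpose of X.
X_t = np.vstack((x1, x2))

print("Column-stacked lists:\n", X)
print("Transposed row-stacked lists:\n", np.transpose(X_t))
print("Row-stacked lists:\n", X_t)
print("Transposed column-stacked lists:\n", np.transpose(X))
```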

We can access different elements from the matrix \(\mathbf{X}\):
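
A short sketch of element, column and row selection with numpy (variable names are illustrative). As with lists, the indices start at 0, so X[0, 1] in Python corresponds to X[1, 2] in R:

```python
import numpy as np

X = np.array([[1, 4],
              [2, 5],
              [3, 6]])

element = X[0, 1]      # first row, second column
first_column = X[:, 0] # all rows of the first column
second_row = X[1, :]   # all columns of the second row

print(element)
print(first_column)
print(second_row)
```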

We can multiply different matrices: \[ \mathbf{X}^\top \mathbf{X}= \begin{bmatrix} 1 & 2 & 3\\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 & 4\\ 2 & 5\\ 3 & 6 \end{bmatrix} = \begin{bmatrix} 14 & 32\\ 32 & 77\\ \end{bmatrix} \]

##    x1 x2
## x1 14 32
## x2 32 77
## [[14 32]
##  [32 77]]
## [[14 32]
##  [32 77]]

Note that using np.column_stack allows us to implement the formula as it is written, i.e. by transposing the first matrix.
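
The product above can be computed in numpy in at least two equivalent ways, sketched here with illustrative variable names:

```python
import numpy as np

X = np.column_stack(([1, 2, 3], [4, 5, 6]))

# np.dot performs matrix multiplication for two-dimensional arrays.
XtX = np.dot(np.transpose(X), X)

# The @ operator is an equivalent, more compact notation.
XtX_alt = X.T @ X

print(XtX)
print(XtX_alt)
```

Note that * between two numpy arrays is element-wise multiplication, not the matrix product.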

2.7.7 Classes in R and Python

Note: there are so-called S3, S4 and Reference classes in R, though their use depends on the individual package creator. For some tutorials on creating classes in R, see this page and the mini-examples for both S4 and Reference Classes.

We will restrict ourselves to a simple example for both R and Python:

  • Our class will have 3 variables: x, y and z;
  • Our class will have a function to print hello;
  • Our class will have a function to double the value(s) of y;
  • Our class will have a function which increases the value of the passed variable by 1;
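
A Python class meeting these four requirements might be sketched as follows. The method names are hypothetical, and double_y is assumed to return the doubled values rather than overwrite y:

```python
class MyClass:
    """A small example class with three fields and three methods."""

    def __init__(self, x=None, y=None, z=None):
        self.x = x
        self.y = y
        self.z = z

    def print_hello(self):
        print("hello")

    def double_y(self):
        # Return the doubled values without changing self.y.
        return [2 * value for value in self.y]

    def add_one(self, value):
        # Increase the passed value by 1.
        return value + 1


my_object = MyClass(y=list(range(1, 11)), z=["a", "b", "cd"])

print(my_object.x)
print(my_object.y)
print(my_object.z)
my_object.print_hello()
print(my_object.double_y())
print(my_object.add_one(5))
```

Returning a plain list from double_y keeps the sketch free of dependencies; wrapping y in a numpy array and multiplying it by 2 would work as well.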
## Reference class object of class "MyClass"
## Field "x":
## NULL
## Field "y":
##  [1]  1  2  3  4  5  6  7  8  9 10
## Field "z":
## [1] "a"  "b"  "cd"
## <__main__.MyClass object at 0x000000006B9EADC0>

Now we can access the values and functions in the initialized class objects:

## NULL
##  [1]  1  2  3  4  5  6  7  8  9 10
## [1] "a"  "b"  "cd"
## [1] "hello"
##  [1]  2  4  6  8 10 12 14 16 18 20
## [1] 6
## None
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
## ['a', 'b', 'cd']
## hello
## [ 2  4  6  8 10 12 14 16 18 20]
## 6

2.7.8 Plotting data

Plotting data is very similar in both R and Python:

Using plt.figure() we can number the plots in Python in order to separate them as well as edit them if we ever need to by specifying plt.figure(number), where the number is the ID of the figure that we want to edit (or create a new figure).

Note that matplotlib.pyplot.hist() returns a tuple containing the values of the histogram bins, n, the edges of the bins, and a list of the individual patches used to draw the histogram. In some IDEs these returned values are not printed by default, but for the notes in this book we need to suppress them manually.
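
One way to keep the returned tuple from being printed is to assign it to variables. A minimal sketch with made-up data, assuming matplotlib is installed; the Agg backend is selected here only so that the sketch also runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in an interactive session
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # made-up example values

plt.figure(1)  # create figure number 1
# Unpacking the returned tuple keeps it from being echoed by the IDE.
n, bin_edges, patches = plt.hist(data, bins=5)
plt.title("Histogram of the example data")

# Later on, the same figure can be selected again by its number and edited.
fig = plt.figure(1)
plt.xlabel("value")

print(n)
print(bin_edges)
```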

If we need to access the plot figure that we created previously then, as long as we did not delete it from the working environment, we can select it using the plt.figure method:

We can also plot multiple figures:

We can specify a 1-row, 2-column layout using par(mfrow = c(1, 2)):

We can specify the 1-row, 2-column layout using add_subplot(1, 2, c) or the three-digit shorthand add_subplot(12c), e.g. add_subplot(121), where c is the position (integer number) of the plot (c = 1 - plot in the first layout space, c = 2 - plot in the second layout space).
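
A sketch of the 1-row, 2-column layout with made-up data, assuming matplotlib is installed (the Agg backend is an assumption so that the code runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in an interactive session
import matplotlib.pyplot as plt

x = list(range(10))

fig = plt.figure(figsize=(8, 3))

# Position 1 of a 1 x 2 grid, written with three separate arguments.
ax1 = fig.add_subplot(1, 2, 1)
ax1.plot(x, [value ** 2 for value in x])
ax1.set_title("Left plot")

# Position 2 of the same grid, written in the three-digit shorthand.
ax2 = fig.add_subplot(122)
ax2.plot(x, [-value for value in x])
ax2.set_title("Right plot")

fig.tight_layout()
```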

We can also plot an odd number of plots:

By using the layout function, we can specify a matrix layout of how we want our plots positioned, where the number indicates the plot number - the larger the matrix, the more precise our positioning can be. A 0 indicates that nothing is plotted at that position.

Using add_subplot(abc) we can specify either an odd or even number of plots by specifying different subplot layouts (e.g. the same number of rows but different columns) for some of the plots, but their positions must not overlap!
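
For example, three plots can be arranged with two plots in the top row and one wide plot in the bottom row by mixing a 2 x 2 grid with a 2 x 1 grid. A minimal sketch with made-up data, assuming matplotlib is installed (the Agg backend is an assumption so that the code runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit in an interactive session
import matplotlib.pyplot as plt

x = list(range(10))

fig = plt.figure()

# Top row: positions 1 and 2 of a 2 x 2 grid.
ax_top_left = fig.add_subplot(221)
ax_top_left.plot(x, x)

ax_top_right = fig.add_subplot(222)
ax_top_right.plot(x, [value ** 2 for value in x])

# Bottom row: position 2 of a 2 x 1 grid, which spans the full width
# and does not overlap the two plots in the top row.
ax_bottom = fig.add_subplot(212)
ax_bottom.plot(x, [-value for value in x])

fig.tight_layout()
```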

2.7.9 Advanced plot libraries

There are also advanced plot capabilities in both R (e.g. ggplot2, plotly) and Python (e.g. plotnine, Bokeh, plotly). Some examples for ggplot2 can be found here.

In R use install.packages("ggplot2") to install the package. For Python the package is called plotnine - in Anaconda Navigator use conda install -c conda-forge plotnine: