A programming primer for data science
This section aims to:
- Present the core programming concepts used in various statistics, econometrics and data science tasks.
- Provide examples of both R and Python.
- Provide a convenient interface to toggle between R and Python to see the similarities and differences between the two programming languages.
- Highlight some language-specific concepts.
Packages
In general packages are collections of various functions, classes, sample datasets and their documentation. Packages usually focus on a specific topic, for example, a package may focus on plot creation, while another may be created for estimating various (or only a specific subset of) statistical models. In other cases, packages may contain executables, such as shiny for R (and Python), or streamlit for Python.
Some packages are available with the base installation of the programming language, while others need to be explicitly downloaded and installed (see install.packages in R and pip in Python for more).
Packages need to be loaded only once per session. It is recommended to load all of the required packages at the beginning of your script/notebook file.
An example of loading some installed packages is provided below.
The library() function loads the specified library, if it exists. If such a library isn’t found, an error is raised.
Whenever we want to refer to an object from a specific library we can either:
- Load the whole package and call a specific function:
- Without loading the package, we can use the :: notation (package_name::function_name) to call a specific function:
# A function from the `ggplot2` package
ggplot2::ggplot()
As noted, the second example works when we don’t want to load the package (but we still need to have it installed).
An important caveat: if multiple libraries have the same function names, the functions from the library loaded last will mask any same-named functions from the libraries loaded earlier. If you only need a few functions from a specific library, it may be best to use the :: notation for those functions, instead of loading the whole library.
import pandas as pd
import plotnine as plt
import statsmodels.api as sm
The import statement searches for the specified module (or package) and then binds the result of that search to a name. If no name is specified, then the assigned name is the same as the module name. If no such module/package is found, then an error is raised. (Note: see package and module definitions for more specifics.)
- If the package is loaded, we can call specific functions/classes using the dot (.) notation:
import plotnine as plt
# A class from the `plotnine` package
plt.ggplot()
- Alternatively, we can choose to import only specific objects (functions, classes or submodules) from a package:
from plotnine import ggplot
# A class from the `plotnine` package
ggplot()
Note that in Python functions are independent blocks of code that can be called from anywhere, while methods are tied to objects or classes and need an object or class instance to be invoked. For example: the array() function in numpy and the fit() method of the OLS class in statsmodels.
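The distinction can be sketched with built-in objects only, so that no additional package is needed. The variable names below are chosen purely for illustration:

```python
word = "cat"

# A function is called on its own and receives the object as an argument
n_chars = len(word)

# A method is looked up on the object itself using the dot notation
shouted = word.upper()

print(n_chars)
print(shouted)
```

Here len() plays the role of a function such as array(), while upper() plays the role of a method such as fit().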
Additional packages can be found at the following repositories:
Operators
There are a number of operators available in R and Python, which are used in mathematical calculations, value comparisons and value assignments.
Operator | Type | R | Python |
---|---|---|---|
addition | arithmetic | x + y | x + y |
subtraction | arithmetic | x - y | x - y |
multiplication | arithmetic | x * y | x * y |
division | arithmetic | x / y | x / y |
exponentiation (\(x^y\)) | arithmetic | x^y (recommended) or x**y | x**y |
modulus (x mod y) | arithmetic | x %% y | x % y |
integer division | arithmetic | x %/% y | x // y |
matrix multiplication | arithmetic | x %*% y | x @ y |
equal | logical (comparison) | x == y | x == y |
not equal | logical (comparison) | x != y | x != y |
\(x\) is less than \(y\) | logical (comparison) | x < y | x < y |
\(x\) is more than \(y\) | logical (comparison) | x > y | x > y |
\(x\) is less than or equal to \(y\) | logical (comparison) | x <= y | x <= y |
\(x\) is more than or equal to \(y\) | logical (comparison) | x >= y | x >= y |
\(x\) and \(y\) | logical (comparison) | x & y | x and y or (elementwise) numpy.logical_and(x, y) |
\(x\) or \(y\) | logical (comparison) | x \| y | x or y or (elementwise) numpy.logical_or(x, y) |
not \(x\) | logical (comparison) | !x | not x |
containment test (which x values are in a set of y values) | other | x %in% y | numpy.isin(x, y) |
assign value | assignment | x <- 2 or x <<- 2 (global) or x = 2 | x = 2 |
add y to x and assign to x | assignment | x <- x + y | x += y or x = x + y |
subtract y from x and assign to x | assignment | x <- x - y | x -= y or x = x - y |
multiply x by y and assign to x | assignment | x <- x * y | x *= y or x = x * y |
divide x by y and assign to x | assignment | x <- x / y | x /= y or x = x / y |
exponentiate \(x^y\) and assign to x | assignment | x <- x^y | x **= y or x = x ** y |
Note: we omit bitwise operators as they are less common in general data analysis and modelling.
A comprehensive list of operators is available in the Python documentation.
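A few of the Python operators from the table can be tried out directly. The following is a minimal sketch with arbitrarily chosen values (x, y and the list of values are assumptions made only for this example):

```python
x, y = 7, 2

print(x % y)    # modulus
print(x // y)   # integer division
print(x ** y)   # exponentiation

# Augmented assignment: a shorter way of writing total = total + y
total = x
total += y
print(total)

# and / or / not operate on single true/false values
both_positive = (x > 0) and (y > 0)
print(both_positive)

# For several values stored in a plain list, the comparison can be
# applied element by element with a list comprehension
values = [1, 5, 10]
in_range = [(v > 2) and (v < 10) for v in values]
print(in_range)
```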
Data Types
There are a number of built-in (and library-specific) data types available in R and Python. Data types are used to represent specific values (or collections of values or objects) and have a pre-defined functionality for various operators.
Numbers
Numeric values can be values from \(\mathbb{Z}\) (integer), \(\mathbb{R}\) (real number) or \(\mathbb{C}\) (complex number) sets.
We use the assignment operator to assign values to variables:
x1 <- as.integer(1)
x2 <- 2
x3 <- complex(real = 3, imaginary = 1)
x4 <- 4 + 2i
x1 = 1
x2 = 2.0
x3 = complex(real = 3, imag = 1)
x4 = 4 + 2j
We can print these values using the print() function:
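A minimal Python sketch of the printing step, with the variables redefined so that the block runs on its own:

```python
x1 = 1
x2 = 2.0
x3 = complex(real=3, imag=1)
x4 = 4 + 2j

print(x1)
print(x2)
print(x3)
print(x4)
```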
As well as check the types of our values:
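In Python the type of a value can be inspected with the built-in type() function. A minimal sketch, again redefining the variables so that the block is self-contained:

```python
x1 = 1
x2 = 2.0
x3 = complex(real=3, imag=1)
x4 = 4 + 2j

# Collect the type of each value
value_types = [type(x1), type(x2), type(x3), type(x4)]

for value_type in value_types:
    print(value_type)
```

Note that both ways of creating a complex number result in the same type.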
We can add, subtract, multiply and divide multiple values together:
x5 <- x1 + x2 + 3
x6 <- x3 + x4
x7 <- x1 - x2 - 5
x8 <- x3 - x4
x9 <- x1 * x2 * (-1)
x10 <- x3 * x4
x11 <- x1 / 2
x12 <- x3 / x4
x13 <- x2^2
x14 <- x4^3
x5 = x1 + x2 + 3
x6 = x3 + x4
x7 = x1 - x2 - 5
x8 = x3 - x4
x9 = x1 * x2 * (-1)
x10 = x3 * x4
x11 = x1 / 2
x12 = x3 / x4
x13 = x2**2
x14 = x4**3
div and mod operations can also be carried out as follows:
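The definitions of x15 and x16 do not appear in the text. The sketch below assumes the operands 5 and 3, which are hypothetical and were picked only because they are consistent with the printed results:

```python
# Hypothetical operands (not taken from the original text)
a, b = 5, 3

x15 = a % b    # remainder of the division (mod)
x16 = a // b   # integer division (div)
```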
print("Remainder of a division (mod): ", x15)
Remainder of a division (mod): 2
print(f"Integer division (div): {x16:02d}")
Integer division (div): 01
Here we use the 02d format notation to specify a zero-padded integer that is at least two digits wide.
Text/Strings/Characters
Strings can be any combination of various symbols.
s1 <- "This is a sentence"
s2 <- "cat"
s3 <- "1"
s1 = "This is a sentence"
s2 = "cat"
s3 = "1"
Unlike numbers, strings do not have a clear definition for mathematical operations¹:
s3 + 1
Error in s3 + 1: non-numeric argument to binary operator
s3 + 1
can only concatenate str (not "int") to str
Nevertheless, we may need to modify various strings of characters in our data. To make this process easier, a number of functions are available in R and Python.
String transformations
Firstly, we may be interested in concatenating multiple strings together. We can do so as follows:
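In Python there are several common ways of concatenating strings. A minimal sketch (the result names and the joining text are chosen only for illustration):

```python
s1 = "This is a sentence"
s2 = "cat"
s3 = "1"

# The + operator joins strings directly
joined_plus = s1 + " about a " + s2

# str.join() places a separator between the elements of a list
joined_join = " ".join([s2, s3])

# f-strings insert values into a template
joined_fstr = f"{s2} number {s3}"

print(joined_plus)
print(joined_join)
print(joined_fstr)
```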
We may also want to change the capitalization of our text:
[1] "THIS IS A SENTENCE"
[1] "this is a sentence"
print(stringr::str_to_sentence(s2))
[1] "Cat"
print(stringr::str_to_title(s1))
[1] "This Is A Sentence"
print(s1.upper())
THIS IS A SENTENCE
print(s2.lower())
cat
print(s2.capitalize())
Cat
print(s1.title())
This Is A Sentence
We can also calculate the number of characters in our string:
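In Python the built-in len() function returns the number of characters, including spaces. A minimal sketch:

```python
s1 = "This is a sentence"
s2 = "cat"

n_s1 = len(s1)
n_s2 = len(s2)

print(n_s1)
print(n_s2)
```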
We might be interested in extracting part of a string as follows:
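In Python, parts of a string can be extracted with slicing, where indexing starts at 0 and the end position is excluded. A minimal sketch (the chosen positions are only examples):

```python
s1 = "This is a sentence"

first_word = s1[0:4]    # characters at positions 0, 1, 2 and 3
second_word = s1[5:7]   # characters at positions 5 and 6
last_word = s1[-8:]     # the last eight characters

print(first_word)
print(second_word)
print(last_word)
```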
We may also wish to split a string into separate segments:
print(s1.split(" "))
['This', 'is', 'a', 'sentence']
print(s1.split("a"))
['This is ', ' sentence']
print(s1.split("is"))
['Th', ' ', ' a sentence']
Regular expressions
A regular expression (regex) is a sequence of characters that specifies a pattern in text. We can use regular expressions to:
- Check if a specific sequence exists in a string;
- Replace a sequence with another one;
- Capture portions of the match as placeholders and re-use them.
Additional regex syntax options can be found at Python’s regex syntax docs, Python’s Regular Expression HOWTO docs, as well as R’s regex docs.
r1 <- "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 321, 11,2, 3"
r1 = "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 321, 11,2, 3"
We can check whether a specific sequence of symbols exists in our string as follows:
import re
#
print(re.search('with letters and', r1))
None
print(re.search('with letters And', r1))
<re.Match object; span=(19, 35), match='with letters And'>
print(bool(re.search('with letters and', r1)))
False
print(bool(re.search('with letters And', r1)))
True
We can write more general expressions using various special characters:
- . (dot) - matches any character except a newline;
- ^ (caret) - matches the start of the string;
- $ (dollar sign) - matches the end of the string;
- * (asterisk) - causes the resulting RE (regular expression) to match 0 or more repetitions of the preceding RE. For example, ab* will match a followed by zero or more repetitions of b, while .* will match zero or more repetitions of any character;
- + - causes the resulting RE to match 1 or more repetitions of the preceding RE. For example, ab+ will match a followed by any non-zero number of bs; it will not match just a;
- ? - causes the resulting RE to match 0 or 1 repetitions of the preceding RE. For example, ab? will match either a or ab;
- {m} - specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six a characters, but not five;
- {m,n} - causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 a characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand a characters followed by a b, but not aaab;
- [] - used to indicate a set of characters. For example, the set of characters [amk] will match a, m, or k. Ranges of characters can be indicated by giving two characters and separating them by a ‘-’, for example [a-z] will match any lowercase ASCII letter, while [0-5][0-9] will match all the two-digit numbers from 00 to 59;
- () - matches whatever regular expression is inside the parentheses. For example, (abc) will match abc;
- | - matches either one of two REs. For example, A|B (where A and B can be arbitrary REs) creates a regular expression that will match either A or B. Can also be used inside () to match part of an expression. For example, a(bc|d)e will match either abce or ade;
- \ (in Python) or \\ (in R) - escapes special characters. For example, \- allows us to match the symbol -, and the same goes for \?, \+, \., \(, \[, etc.;
- We might want to re-use the contents of one or more groups captured in (), referring to each group by its number. In Python we can use \number (e.g. \1, \2, etc.), while in R we would use \\number (e.g. \\1, \\2, etc.). See the example at the end of this section.
More special characters are available in Python’s re docs and R’s regex docs.
print(bool(re.search('this', r1)))
True
print(bool(re.search('^this', r1)))
False
print(bool(re.search('^This', r1)))
True
print(bool(re.search("2$", r1)))
False
print(bool(re.search("[0-9]$", r1)))
True
print(bool(re.search("^This.*3$", r1)))
True
# Python's re module has no POSIX classes such as [:digit:]; use \d to match a digit
print(bool(re.search(r'\d', r1)))
True
print(bool(re.search('numbers [0-9],', r1)))
False
print(bool(re.search('numbers [0-9]+,', r1)))
True
print(bool(re.search('[0-9]+.*[0-9].*[0-9]', r1)))
True
We can replace characters by substituting them with another set of characters:
[1] "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 000, 00,0, 0"
[1] "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 0, 0,0, 0"
[1] "0T0h0i0s0 0i0s0 0a0 0s0e0n0t0e0n0c0e0 0w0i0t0h0 0l0e0t0t0e0r0s0 0A0n0d0 0t0h0i0s0 0i0s0 0a0 0b0u0n0c0h0 0o0f0 0$0y0m0b0o0l0s0 0A0N0d0 0n0u0m0b0e0r0s0 0,0 0,0,0 0"
[1] "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 321, 11,2, 0"
print(re.sub('[0-9]', '0', r1))
This is a sentence with letters And this is a bunch of $ymbols ANd numbers 000, 00,0, 0
print(re.sub('[0-9]+', '0', r1))
This is a sentence with letters And this is a bunch of $ymbols ANd numbers 0, 0,0, 0
print(re.sub('[0-9]*', '0', r1))
0T0h0i0s0 0i0s0 0a0 0s0e0n0t0e0n0c0e0 0w0i0t0h0 0l0e0t0t0e0r0s0 0A0n0d0 0t0h0i0s0 0i0s0 0a0 0b0u0n0c0h0 0o0f0 0$0y0m0b0o0l0s0 0A0N0d0 0n0u0m0b0e0r0s0 00,0 00,00,0 00
print(re.sub('[0-9]$', '0', r1))
This is a sentence with letters And this is a bunch of $ymbols ANd numbers 321, 11,2, 0
print(re.sub(r'\$', 's', r1))
This is a sentence with letters And this is a bunch of symbols ANd numbers 321, 11,2, 3
print(re.sub('\\$', 's', r1))
This is a sentence with letters And this is a bunch of symbols ANd numbers 321, 11,2, 3
We can also chain multiple substitutions:
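In Python, chaining amounts to passing the result of one re.sub() call into the next one. A minimal sketch, where the two patterns are chosen only as an example:

```python
import re

r1 = "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 321, 11,2, 3"

# Step by step: first replace the dollar sign, then every run of digits
step1 = re.sub(r'\$', 's', r1)
step2 = re.sub(r'[0-9]+', '0', step1)

# The same chain written as nested calls
nested = re.sub(r'[0-9]+', '0', re.sub(r'\$', 's', r1))

print(step2)
print(nested)
```

Writing the steps on separate lines is usually easier to read and debug than nesting the calls.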
As well as search and replace specific repetitions of patterns:
[1] "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 00, 0,0, 0"
[1] "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 0, 0,2, 3"
[1] "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 01, 0,2, 3"
[1] "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 0, 11,2, 3"
print(re.sub('[0-9]{1,2}', '0', r1))
This is a sentence with letters And this is a bunch of $ymbols ANd numbers 00, 0,0, 0
print(re.sub('[0-9]{2,3}', '0', r1))
This is a sentence with letters And this is a bunch of $ymbols ANd numbers 0, 0,2, 3
print(re.sub('[0-9]{2}', '0', r1))
This is a sentence with letters And this is a bunch of $ymbols ANd numbers 01, 0,2, 3
print(re.sub('[0-9]{3}', '0', r1))
This is a sentence with letters And this is a bunch of $ymbols ANd numbers 0, 11,2, 3
Finally, we can capture portions of text and re-use them. For example, we might want to modify everything else except a specific portion of text:
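A minimal Python sketch of capturing groups with () and re-using them with \1 and \2. The patterns and the replacement text are assumptions made only for this example:

```python
import re

r1 = "This is a sentence with letters And this is a bunch of $ymbols ANd numbers 321, 11,2, 3"

# Keep only the captured group and replace everything around it
kept = re.sub(r'^.*(numbers [0-9]+).*$', r'Only kept: \1', r1)

# Capture two groups and swap their order
swapped = re.sub(r'(letters) (And)', r'\2 \1', r1)

print(kept)
print(swapped)
```

Raw strings (the r prefix) are used so that the backslashes reach the regular expression engine unchanged.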
Regular expressions can be confusing at times (e.g. you might write a complex regular expression and later, after a couple of weeks, forget how it worked). Fortunately, there are various online resources (such as regexr.com) that provide helpful highlighting for specific parts of regular expressions. Furthermore, you can try to split a single larger regular expression into multiple smaller ones and carry out text cleaning/replacement in multiple lines of code, instead of one single (and complex) expression.
Boolean values
Boolean values can only have two values: true (sometimes also represented by 1) or false (sometimes also represented by 0).
b1 <- TRUE
b2 <- FALSE
x1 <- 4
x2 <- 1
b1 = True
b2 = False
x1 = 4
x2 = 1
We can print the values and their logical negations:
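In Python the logical negation is written with the not keyword. A minimal sketch:

```python
b1 = True
b2 = False

not_b1 = not b1
not_b2 = not b2

print(b1, not_b1)
print(b2, not_b2)
```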
If we perform numeric operations, then the true/false values are treated as numeric 1/0:
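A minimal Python sketch of this behaviour (the result names are chosen only for illustration):

```python
b1 = True
b2 = False
x1 = 4

total = b1 + b2                    # True counts as 1, False as 0
scaled = b1 * x1                   # 1 * 4
n_true = sum([True, False, True])  # counts the number of True values

print(total)
print(scaled)
print(n_true)
```

Summing boolean values is a common way of counting how many observations satisfy a condition.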
We can also use logical operators:
We can also chain multiple logical operators:
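Both single and chained logical operators can be sketched in Python as follows, using the values defined above (the result names are hypothetical):

```python
b1 = True
b2 = False
x1 = 4
x2 = 1

# Single logical operators
both = b1 and b2
either = b1 or b2
compared = (x1 > x2) and b1

# Several operators chained together; parentheses make the order explicit
chained_1 = (x1 > x2) and (b1 or b2) and not b2
chained_2 = (x1 < x2) or (b1 and b2)

print(both, either, compared)
print(chained_1, chained_2)
```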
Finally, we can cast numeric values to boolean ones and vice versa (note the difference for:
print(as.logical(1))
[1] TRUE
print(as.logical(0))
[1] FALSE
print(as.logical(100))
[1] TRUE
print(as.logical(-100))
[1] TRUE
print(bool(1))
True
print(bool(0))
False
print(bool(100))
True
print(bool(-100))
True
Special values
Special values are usually reserved to represent missing or undefined values.
In Python, most special values are not part of the base installation, but are defined in the numpy and pandas packages:
import numpy as np
import pandas as pd
The NA (not available)
Used to define missing values.
Arithmetic operators are undefined for NA values:
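A minimal sketch in Python, assuming that the pandas package is installed. An arithmetic operation involving pd.NA generally returns pd.NA again:

```python
import pandas as pd

na_sum = pd.NA + 1
na_product = pd.NA * 2

print(na_sum)
print(na_product)
```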
On the other hand, some of the logical operators have different results, depending on the logical conclusion of the comparison:
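A minimal sketch with pd.NA, assuming that the pandas package is installed. The result is known whenever the non-missing operand alone determines the outcome, and is missing otherwise:

```python
import pandas as pd

# The outcome does not depend on the missing value
or_true = pd.NA | True      # anything or True is True
and_false = pd.NA & False   # anything and False is False

# The outcome depends on the missing value, so it stays missing
and_true = pd.NA & True
or_false = pd.NA | False

print(or_true, and_false)
print(and_true, or_false)
```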
See the Python discussion below on the differences between the and operator and the & operator when dealing with missing values.
In Python, the and operator is not the same as the & operator. The and operator cannot be overridden, whereas the & operator (implemented by the __and__ method) can. Hence the choice to use & in the numpy and pandas packages.
print(pd.NA or pd.NA)
boolean value of NA is ambiguous
print(pd.NA and True)
boolean value of NA is ambiguous
x and y first evaluates bool(x). If x evaluates to false, then x itself is returned; otherwise y is evaluated and returned. If x is a vector (i.e. contains multiple values) or NA, then its true/false value cannot be determined and an error is raised.
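The difference between and and & can be sketched without any additional packages. The Flag class below is hypothetical and exists only to mimic an object (such as NA or a vector) that refuses to give a single true/false value:

```python
# and checks the truth value of the first operand and returns one of the operands
left = 0 and "right"
right = 1 and "right"
print(left)
print(right)


class Flag:
    """Hypothetical class showing that & can be overridden, while and cannot."""

    def __init__(self, value):
        self.value = value

    def __and__(self, other):
        # Called for: self & other
        return Flag(self.value and other.value)

    def __bool__(self):
        # Called by and / or / not; here the truth value is refused
        raise TypeError("truth value of Flag is ambiguous")


combined = Flag(True) & Flag(False)
print(combined.value)

try:
    Flag(True) and Flag(False)
    raised = False
except TypeError as err:
    raised = True
    print(err)
```

The & operator works because the class defines its own behaviour for it, while and fails as soon as it asks for the truth value of the first operand.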
We also have a number of functions defined in order to check whether a value is a special value:
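A minimal sketch using pd.isna(), assuming that the numpy and pandas packages are installed. The list of values is chosen only for illustration:

```python
import numpy as np
import pandas as pd

values = [pd.NA, np.nan, None, 1.0]

# pd.isna() flags NA, NaN and None as missing
is_missing = [bool(pd.isna(v)) for v in values]

print(is_missing)
```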
The NaN (not a number)
Represents the result of a numeric calculation that is undefined. In general, a division by zero is undefined, however this ambiguity is presented differently in R and Python:
print(1 / 0)
division by zero
print(0 / 0)
division by zero
print(1 / np.float64(0))
inf
<string>:1: RuntimeWarning: divide by zero encountered in scalar divide
print(np.float64(0) / np.float64(0))
nan
<string>:1: RuntimeWarning: invalid value encountered in scalar divide
The Inf (infinite)
Infinite values are also represented as special values:
print(pd.isnull(np.inf))
False
print(np.isnan(np.inf))
False
print(pd.isna(np.inf))
False
print(np.isinf(np.inf))
True
print(np.isinf(1e500))
True
The NULL/None (undefined)
Note that undefined values are treated differently in R and Python:
print(pd.isnull(None))
True
print(np.isnan(None))
ufunc 'isnan' not supported for the input types, and the inputs could not be
safely coerced to any supported types according to the casting rule ''safe''
print(pd.isna(None))
True
print(np.isinf(None))
ufunc 'isinf' not supported for the input types, and the inputs could not be
safely coerced to any supported types according to the casting rule ''safe''
¹ With the exception of the + operator for two strings in Python.