8  Probability theory

8.1 Randomness

There are various ways to generate random variables and sets in Python. Throughout this chapter we will rely on basic concepts from probability theory such as probability density function (pdf) and cumulative density function (cdf).

Furthermore, we will be working with the normal, uniform and other distributions. Familiarity with those is assumed. Whenever you run into a mathematical or probabilistic aspect you do not understand, please look it up and otherwise ask the teacher about it.

We will use functionality from the random subpackage of NumPy. This subpackage is not to be confused with the random package of Python, which has a lot of similar functionality.

import numpy as np

8.1.1 Samples

The random subpackage has functions that can generate so-called pseudo-random numbers. Informally speaking, these numbers “behave” as random numbers when performing numerical experiments with them, but they are not truly random (which can be an issue, e.g., if you would use such numbers of cryptographic purposes).

We next discuss some functions to generate random numbers from well-known distributions.

np.random.rand(m,n): Generate an m \times n two-dimensional array with numbers uniformly drawn form the interval [0,1] (i.e., every number is equally likely).

m,n = 2, 4

M = np.random.rand(m,n)
print(M)
[[0.23136259 0.47881998 0.87221619 0.94563415]
 [0.22826329 0.78297817 0.44031319 0.54456063]]

When m and n are not inputted, a single random number is returned.

x = np.random.rand()
print(x)
0.8482055679428805

np.random.randn(m,n): Generates an m \times n two-dimensional array with numbers drawn from the normal distribution with mean \mu = 0 and standard deviation \sigma = 0.

m,n = 3, 5

M = np.random.randn(m,n)
print(M)
[[-1.51603105  0.47547762 -0.29631789 -0.03874392  0.99558761]
 [ 2.30047018 -0.6723288  -0.16862183 -2.05111568 -2.16306347]
 [ 2.03360073 -1.27258503  0.32191497 -1.00213027 -0.30535286]]

We conclude with sampling numbers from the discrete uniform distribution over a set \{a,a+1,\dots,b-1\}. Here every number i in this discrete interval is generated with equal probability 1/(b - a).

np.random.randint(a,b,(m,n)): Generates an m \times n two-dimensional array with numbers drawn from the discrete uniform distribution on \{a,a+1,\dots,b-1\}.

a, b = 4, 10
m,n = 3, 15

M = np.random.randint(a,b,(m,n))
print(M)
[[9 4 7 5 6 7 6 9 5 8 4 9 5 9 4]
 [7 7 7 8 6 5 9 9 8 6 4 4 4 4 6]
 [7 6 5 9 9 5 6 8 6 8 4 9 6 4 6]]

For almost every well-known probability distribution we can get samples from its distribution. See the documentation for a list of all the distributions.

8.1.2 Subsets

Next to the generation of random numbers, it is also possible to generate random subsets of elements of a given array using the choice() function. It takes as input an array from which we want to obtain a subset and we an set the size of the subset that we want to have using the size keyword argument.

k = 15
x = np.arange(0,11,1)

subset = np.random.choice(x,size=k)

print(subset)
[ 1  1  1  1  4  0  2  0  5  7  5 10  4  3 10]

As can be seen from the output above, some numbers appear twice in the subset, meaning that choice() samples a subset with replacement. If you want to sample without replacement, you can set the keyword replace to False.

k = 8
x = np.arange(0,11,1)

subset = np.random.choice(x,size=k,replace=False)

print(subset)
[5 1 9 8 6 2 0 4]

Finally, you can also specify the probability with which every element should be samples using the keyword argument p.

k = 8

# Set {0,1,2,...,9}
x = np.arange(0,10,1)

# Probabilities for elements in set
prob = np.array([1/3,1/3,0,0,0,0,0,0,0,1/3])

# Generating random subset
subset = np.random.choice(x,size=k,p=prob)

# Only 0, 1, 9 have positive probabilities
print(subset)
[1 9 1 0 0 1 1 0]

There is also the permutation() function that returns the elements in an array in a random order. That is, it creates a so-called random permutation of the elements in the array.

x_perm = np.random.permutation(x)

print(x_perm)
[5 3 6 2 1 9 4 0 8 7]

If you would apply this function on a two-dimensional array, it returns the same array in which the rows are randomly permuted (i.e., the inner lists are randomly permuted).

X = np.arange(0,18,1).reshape(3,6)

x_perm = np.random.permutation(X)

print(x_perm)
[[ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [ 0  1  2  3  4  5]]

8.1.3 Seed

When writing code that involves random numbers of objects, it can sometimes be useful to “fix” the randomness in the script, e.g., when debugging. This can be done by setting a so-called random seed using the seed() function.

For sake of comparison, let us first generate two random numbers from [0,1].

a = np.random.rand()
b = np.random.rand()

print(a,b)
0.4954377273158773 0.7597195722227699

If you rerun the code above it will give different outputs every time. Try this yourself by copying the code into Spyder.

We next do the same, but with a fixed seed s using seed(s). Different choices of s fix the randomness in a different way. If you copy the code below into Spyder and rerun it a couple of times, the output will always be the same.

#Set seed to be s = 3
np.random.seed(3)

a = np.random.rand()
b = np.random.rand()

print(a,b)
0.5507979025745755 0.7081478226181048

8.2 Probability distributions

The stats module of SciPy has many built-in probability distributions. Each distribution can be seen as an object on which various methods can be performed (such accessing its probability density function or summary statistics like the mean and median). You should think of an ‘object’ in the context of object oriented programming, see, e.g., here to recall the basics of this paradigm.

import scipy

In this section we will focus on continuous probability distributions. SciPy also has many built-in discete probability distributions.

A list of all continuous distributions that are present in the stats module can be found here; they are so-called stats.rv_continuous objects. We can instantiate a distributional object by using scipy.stats.dist_name where dist_name is the name of a built-in (continuous) probability distribution in the mentioned list.

Many distributions have input parameters scale and loc that model the scale and location of the distribution, respectively. Depending on the distribution that is considered, these parameters have different meanings.

As an example, the normal distribution has probability density function f(x) = \frac{1}{\sqrt{2\pi\sigma^2} } e^{-\frac{(x-\mu)^2}{2\sigma^2}} which is parameterized by \mu and \sigma.

In Python \mu is the loc parameter, and \sigma the scale parameter. To figure out the function of the scale and loc parameter, you can check the documentation (which can be found here for the Normal distribution).

Documentation of the normal distribution

All distributions have default values for these parameters, which are typically loc=0 and scale = 1.

# Create normal distribution object with mu=0, sigma=1
dist_norm = scipy.stats.norm(loc=0, scale=1)

Once a distribution object has been instantiated, we can use methods (i.e., functions) to obtain various properties of the distribution, such as its probability density function (pdf), cumulative density function (cdf) and summary statistics such as the mean, variance and median (or, more general, quantiles).

We give a list of some common methods for a distribution object named dist_name. We start with common functions associated with a probability distribution.

  • dist_name.pdf(x): Value f(x) where f is the pdf of the distribution.
  • dist_name.cdf(x): Value F(x) where F is the cdf of the distribution.
  • dist_name.sf(x): Value S(x) where S is the survival function (1 - F) of the distribution.
  • dist_name.ppf(alpha): Returns x so that F^{-1}(x) = \alpha where \alpha \in (0,1).
x = 1
print(dist_norm.pdf(x))
0.24197072451914337
alpha = 1/2
print(dist_norm.ppf(alpha))
0.0

All the above functions are vectorized, in the sense that they can also handle higher-dimensional arrays as input. This is convenient, e.g., for visualizing these functions as the example below illustrates.

Show code generating the plot below
import matplotlib.pyplot as plt

# Define the x range of x-values
x = np.linspace(-10,10,600)

# Function values of pdf
y = dist_norm.pdf(x) # Computes pdf values of all elements in x

#Create the figure
plt.figure()

# Create the plot
plt.plot(x, y)

# Create labels for axes
plt.xlabel('x')
plt.ylabel('Function value')

# Fix the range of the axes
plt.xlim(-10,10)
plt.ylim(0,1)

# Add title to the plot
plt.title('Pdf of Normal distribution with $\mu=0$ and $\sigma=1$')

# Add grid to the background
plt.grid()

# Show the plot
plt.show()

We can also access various summary statistics:

  • dist_name.mean(): Returns mean of the distribution
  • dist_name.var(): Returns variance of the distribution
  • dist_name.median(): Returns median of the distribution
dist_norm = scipy.stats.norm(loc=0,scale=2)

mean =  dist_norm.mean()
variance = dist_norm.var()
median = dist_norm.median()

print("Mean of the distribution is", mean)
print("Variance of the distribution is", variance)
print("Median of the distribution is", median)
Mean of the distribution is 0.0
Variance of the distribution is 4.0
Median of the distribution is 0.0

Note that in the example above the standard deviation equals \sigma = 2; the variance is then \sigma^2 = 4.

Finally it is also possible to to access the support upper and lower bound of a distribution.

  • dist_name.support() : Returns values a,b of (smallest) interval [a,b] for which all probability mass is contained in it.

Let us first consider the uniform distribution.

Documentation of the uniform distribution

Here the scale and loc parameters result in a uniform distribution on the interval [loc, loc + scale].

#Uniform distribution on [3,3+4] = [3,7]
dist_unif = scipy.stats.uniform(3,4) 

a,b = dist_unif.support()

print(f"The distribution is supported on the interval [{a},{b}]")
The distribution is supported on the interval [3.0,7.0]

Some distributions have an unbounded support. In the case of the Normal distribution, we have a = -\infty and b = \infty. For the Exponential Distribution, we have a = 0 and b = \infty.

dist_norm = scipy.stats.norm(loc=0, scale=1)

a,b = dist_norm.support()

print(f"The Normal distribution is \
supported on the interval [{a},{b}]")
The Normal distribution is supported on the interval [-inf,inf]
dist_exp = scipy.stats.expon(loc=0,scale=1)

a,b = dist_exp.support()

print(f"The Exponential distribution is \
supported on the interval [{a},{b}]")
The Exponential distribution is supported on the interval [0.0,inf]

The value inf that b has in the example above is in fact a number within NumPy, namely np.inf. Let us check that the upper bound of the exponential distribution is indeed np.inf.

statement = (b == np.inf)

print(statement)
True

8.2.1 Distributions as input arguments

Distributional objects can also serve as input arguments of a function. In that case, you can access the methods of the object inside the function.

Suppose we want to write a function that outputs a message saying whether or not the mean or the median of a distribution is larger. We can do this as follows for an arbitrary distribution.

def mean_median(dist):
    if dist.mean() < dist.median():
        return "The mean is smaller than the median"
    elif dist.mean() == dist.median():
        return "The mean and median are equal."
    elif dist.mean() > dist.median():
        return "The mean is larger than the median."

In the function above, the input argument dist is a distribution object whose methods mean() and median() we access within the function.

dist_norm = scipy.stats.norm(loc=1,scale=2)

comparison  = mean_median(dist_norm)
print(comparison)
The mean and median are equal.