Mathematical Background

Explore the key mathematical tools underlying data structures: logarithms and exponential functions, big-Oh notation for algorithm complexity, factorials, binomial coefficients, and probability theory with expected values. This lesson equips you to analyze the efficiency of data structure operations in Python.

In this lesson, we review some mathematical notations and tools used throughout this course, including logarithms, big-Oh notation, and probability theory. This review will be brief and is not intended as an introduction.

Exponentials and logarithms

The expression $b^x$ denotes the number $b$ raised to the power of $x$.

If $x$ is a positive integer, then this is just the value of $b$ multiplied by itself $x-1$ times:

b^x = \underbrace{b \times b \times \cdots \times b}_{x}

When $x$ is a negative integer, $b^x = 1/b^{-x}$. When $x = 0$, $b^x = 1$. When $x$ is not an integer, we can still define exponentiation in terms of the exponential function $e^x$ (see below), which is itself defined in terms of the exponential series, but this is best left to a calculus text.

In this course, the expression $\log_b k$ denotes the base-$b$ logarithm of $k$. That is, the unique value $x$ that satisfies

b^x = k

Most of the logarithms in this course are base 2 (binary logarithms). For these, we omit the base, so that $\log k$ is shorthand for $\log_2 k$.

An informal, but useful, way to think about logarithms is to think of $\log_b k$ as the number of times we have to divide $k$ by $b$ before the result is less than or equal to $1$. For example, when one does binary search, each comparison reduces the number of possible answers by a factor of $2$. This is repeated until there is at most one possible answer. Therefore, the number of comparisons done by binary search when there are initially at most $n+1$ possible answers is at most $\lceil \log_2(n+1) \rceil$.
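To make the repeated-division intuition concrete, here is a minimal sketch (not from the course; the class and method names are mine) that counts how many times a value can be divided by a base before it drops to 1 or below, alongside the corresponding logarithm:

```java
public class LogIntuition {
    // Counts how many times k must be divided by b before the result is <= 1.
    // For b = 2 this mirrors the "number of halvings" view of the binary logarithm.
    static int divisionsUntilOne(double k, double b) {
        int count = 0;
        while (k > 1) {
            k /= b;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        for (int k : new int[]{2, 8, 100, 1000}) {
            System.out.println("k = " + k
                    + ": halvings = " + divisionsUntilOne(k, 2)
                    + ", log2(k) = " + Math.log(k) / Math.log(2));
        }
    }
}
```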

Another logarithm that comes up several times in this course is the natural logarithm. Here we use the notation $\ln k$ to denote $\log_e k$, where $e$ (Euler's constant) is given by

e = \lim_{n\to\infty}\left(1+\frac{1}{n}\right)^n \approx 2.71828

The natural logarithm comes up frequently because it is the value of a particularly common integral:

\int_{1}^{k} \frac{1}{x}\,dx = \ln k

Two of the most common manipulations we do with logarithms are removing them from an exponent:

b^{\log_{b}k} = k

and changing the base of a logarithm:

\log_{b}k = \frac{\log_{a} k}{\log_{a} b}

For example, we can use these two manipulations to compare the natural and binary logarithms.

\ln k = \frac{\log k}{\log e} = \frac{\log k}{(\ln e)/(\ln 2)} = (\ln 2)(\log k) \approx 0.693147 \log k
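In code, the change-of-base rule is exactly how binary logarithms are computed when a library offers only the natural logarithm. A minimal sketch, assuming Java's Math.log (the natural logarithm):

```java
public class LogBase {
    // Change of base: log_b(k) = ln(k) / ln(b).
    static double logBase(double b, double k) {
        return Math.log(k) / Math.log(b);
    }

    public static void main(String[] args) {
        double k = 1024;
        System.out.println("log2(k)          = " + logBase(2, k));            // ~10
        System.out.println("ln(k)            = " + Math.log(k));              // ~6.93
        System.out.println("0.693147*log2(k) = " + 0.693147 * logBase(2, k)); // ~6.93, matching the identity above
    }
}
```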

Factorials

In one or two places in this course, the factorial function is used. For a non-negative integer $n$, the notation $n!$ (pronounced “$n$ factorial”) is defined to mean

n! = 1 \cdot 2 \cdot 3 \cdot \cdots \cdot n

Factorials appear because $n!$ counts the number of distinct permutations, i.e., orderings, of $n$ distinct elements. For the special case $n = 0$, $0!$ is defined as $1$.

The quantity $n!$ can be approximated using Stirling’s Approximation:

n! = \sqrt{2\pi n}\left(\frac{n}{e}\right)^n e^{\alpha(n)}

where

\frac{1}{12n + 1} < \alpha(n) < \frac{1}{12n}

Stirling’s Approximation also approximates $\ln(n!)$:

\ln(n!) = n \ln n - n + \frac{1}{2}\ln(2\pi n) + \alpha(n)

Note: In fact, Stirling’s Approximation is most easily proven by approximating $\ln(n!) = \ln 1 + \ln 2 + \cdots + \ln n$ by the integral $\int_{1}^{n} \ln n\, dn = n \ln n - n + 1$.
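To get a feel for how tight the approximation is, the following sketch (illustrative only; not part of the course material) compares $n!$ with $\sqrt{2\pi n}(n/e)^n$ for small $n$. The ratio stays between $e^{1/(12n+1)}$ and $e^{1/(12n)}$, consistent with the bounds on $\alpha(n)$:

```java
public class StirlingDemo {
    // Exact factorial for small n (fits in a long up to n = 20).
    static long factorial(int n) {
        long f = 1;
        for (int i = 2; i <= n; i++) f *= i;
        return f;
    }

    // Stirling's formula without the e^{alpha(n)} correction factor.
    static double stirling(int n) {
        return Math.sqrt(2 * Math.PI * n) * Math.pow(n / Math.E, n);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            double ratio = factorial(n) / stirling(n);
            double lower = Math.exp(1.0 / (12 * n + 1));
            double upper = Math.exp(1.0 / (12 * n));
            System.out.printf("n=%2d  ratio=%.6f  bounds=(%.6f, %.6f)%n",
                    n, ratio, lower, upper);
        }
    }
}
```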

Related to the factorial function are the binomial coefficients. For a non-negative integer $n$ and an integer $k \in \{0,...,n\}$, the notation $\binom{n}{k}$ denotes:

\binom{n}{k} = \frac{n!}{k!(n-k)!}

The binomial coefficient $\binom{n}{k}$ (pronounced "$n$ choose $k$") counts the number of subsets of an $n$-element set that have size $k$, i.e., the number of ways of choosing $k$ distinct integers from the set $\{1,...,n\}$.
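For illustration (not from the course), binomial coefficients are usually computed from a multiplicative rearrangement of the formula above rather than from three separate factorials, which would overflow much sooner. A minimal sketch:

```java
public class Binomial {
    // Computes n choose k as prod_{i=1..k} (n - k + i) / i, evaluated left to right.
    // After step i the running product equals C(n-k+i, i), so every division is exact.
    static long choose(int n, int k) {
        if (k < 0 || k > n) return 0;
        k = Math.min(k, n - k);                 // symmetry: C(n, k) = C(n, n - k)
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(choose(5, 2));   // 10
        System.out.println(choose(52, 5));  // 2598960
    }
}
```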

Asymptotic notation

When analyzing data structures in this course, we want to talk about the running times of various operations. The exact running times will, of course, vary from computer to computer and even from run to run on an individual computer.

When we talk about the running time of an operation we are referring to the number of computer instructions performed during the operation. Even for simple code, this quantity can be difficult to compute exactly. Therefore, instead of analyzing running times exactly, we will use the so-called big-Oh notation: for a function $f(n)$, $O(f(n))$ denotes a set of functions,

O(f(n)) = \left\{ g(n) : \text{there exist $c > 0$ and $n_0$ such that } g(n) \le c \cdot f(n) \text{ for all } n \ge n_0 \right\}

Thinking graphically, this set consists of the functions $g(n)$ where $c \cdot f(n)$ starts to dominate $g(n)$ when $n$ is sufficiently large.

We generally use asymptotic notation to simplify functions. For example, in place of $5n\log n + 8n - 200$ we can write $O(n\log n)$. This is proven as follows:

\begin{split} 5n\log n + 8n - 200 & \le 5n\log n + 8n \\ & \le 5n\log n + 8n\log n \qquad \text{for $n \ge 2$ (so that $\log n \ge 1$)} \\ & \le 13n\log n \end{split}

This demonstrates that the function $f(n) = 5n\log n + 8n - 200$ is in the set $O(n\log n)$ using the constants $c = 13$ and $n_0 = 2$.
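As an informal numerical spot check of those constants (not a substitute for the proof above), the loop below verifies that $5n\log n + 8n - 200 \le 13n\log n$ for a few values of $n \ge 2$:

```java
public class BigOhCheck {
    public static void main(String[] args) {
        for (int n = 2; n <= 1_000_000; n *= 10) {
            double logn = Math.log(n) / Math.log(2);      // binary logarithm
            double g = 5 * n * logn + 8 * n - 200;        // g(n) = 5 n log n + 8 n - 200
            double cf = 13 * n * logn;                    // c * f(n) with c = 13
            System.out.println("n=" + n + "  g(n)=" + g + "  13 n log n=" + cf
                    + "  g <= cf: " + (g <= cf));
        }
    }
}
```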

A number of useful shortcuts can be applied when using asymptotic notation. First:

O(n^{c_1}) \subset O(n^{c_2})

for any $c_1 < c_2$. Second: for any constants $a, b, c > 0$,

O(a) \subset O(\log n) \subset O(n^b) \subset O(c^n)

These inclusion relations can be multiplied by any positive value, and they still hold. For example, multiplying by $n$ yields:

O(n) \subset O(n\log n) \subset O(n^{1+b}) \subset O(nc^n)

We will use $f_1(n) = O(f(n))$ and $f_1(n) \in O(f(n))$ synonymously. So, we will write $f_1(n) = O(f(n))$ when what we really mean is $f_1(n) \in O(f(n))$. We will also make statements like “the running time of this operation is $O(f(n))$” when this statement should be “the running time of this operation is a member of $O(f(n))$.” These shortcuts are mainly to avoid awkward language and to make it easier to use asymptotic notation within strings of equations.

A particularly strange example of this occurs when we write statements like

T(n) = 2\log n + O(1)

Again, this would be more correctly written as

T(n) \le 2\log n + [\text{some member of } O(1)]

The expression $O(1)$ also brings up another issue. Since there is no variable in this expression, it may not be clear which variable is getting arbitrarily large. Without context, there is no way to tell. In the example above, since the only variable in the rest of the equation is $n$, we can assume that this should be read as $T(n) = 2\log n + O(f(n))$, where $f(n) = 1$.

Big-Oh notation is not new or unique to computer science. It was used by the number theorist Paul Bachmann as early as 1894, and is immensely useful for describing the running times of computer algorithms.
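The operation counts listed below refer to a method, not shown in this lesson, that consists of a single for loop storing the value i into a[i] for i from 0 to n − 1. A reconstruction of the kind of method being described (the names Example, snippet, a, and n are assumptions):

```java
public class Example {
    int[] a;   // assumed instance fields referenced by the loop
    int n;

    // The single-loop method whose operations are counted below.
    void snippet() {
        for (int i = 0; i < n; i++)
            a[i] = i;
    }
}
```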

One execution of this method involves

  • $1$ assignment (int i = 0),
  • $n+1$ comparisons (i < n),
  • $n$ increments (i++),
  • $n$ array offset calculations (a[i]), and
  • $n$ indirect assignments (a[i] = i)

So we could write this running time as

T(n) = a + b(n+1) + cn + dn + en

where $a, b, c, d,$ and $e$ are constants that depend on the machine running the code and represent the time to perform assignments, comparisons, increment operations, array offset calculations, and indirect assignments, respectively. However, if this expression represents the running time of two lines of code, then clearly this kind of analysis will not be tractable for complicated code or algorithms. Using big-Oh notation, the running time can be simplified to

T(n) = O(n)

Not only is this more compact, but it also gives nearly as much information. The fact that the running time depends on the constants $a, b, c, d,$ and $e$ in the above example means that, in general, it will not be possible to compare two running times to know which is faster without knowing the values of these constants. Even if we make an effort to determine these constants (say, through timing tests), then our conclusion will only be valid for the machine we run our tests on.

Big-Oh notation allows us to reason at a much higher level, making it possible to analyze more complicated functions. If two algorithms have the same big-Oh running time, then we won’t know which is faster, and there may not be a clear winner. One may be faster on one machine, and the other may be faster on a different machine. However, if the two algorithms have demonstrably different big-Oh running times, then we can be certain that the one with the smaller running time will be faster for large enough values of nn.

An example of how big-Oh notation allows us to compare two different functions is shown below, which compares the rate of growth of $f_1(n) = 15n$ versus $f_2(n) = 2n\log n$. It might be that $f_1(n)$ is the running time of a complicated linear-time algorithm while $f_2(n)$ is the running time of a considerably simpler algorithm based on the divide-and-conquer paradigm. This illustrates that, although $f_1(n)$ is greater than $f_2(n)$ for small values of $n$, the opposite is true for large values of $n$. Eventually $f_1(n)$ wins out, by an increasingly wide margin. Analysis using big-Oh notation told us that this would happen, since $O(n) \subset O(n\log n)$.

Figure: plots of $15n$ versus $2n\log n$.
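The crossover point in that comparison is easy to locate numerically. A small sketch (illustrative, not from the course) finds the first $n$ at which $2n\log n$ exceeds $15n$, which happens once $\log_2 n > 7.5$:

```java
public class Crossover {
    public static void main(String[] args) {
        // f1(n) = 15 n, f2(n) = 2 n log2(n); find the first n with f2(n) > f1(n).
        for (int n = 2; ; n++) {
            double f1 = 15.0 * n;
            double f2 = 2.0 * n * (Math.log(n) / Math.log(2));
            if (f2 > f1) {
                System.out.println("2 n log n first exceeds 15 n at n = " + n); // n = 182
                break;
            }
        }
    }
}
```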

In a few cases, we will use asymptotic notation on functions with more than one variable. There seems to be no standard for this, but for our purposes, the following definition is sufficient:

O(f(n_1,...,n_k)) = \left\{ g(n_1,...,n_k) : \text{there exist $c > 0$ and $z$ such that } g(n_1,...,n_k) \le c \cdot f(n_1,...,n_k) \text{ for all } n_1,...,n_k \text{ such that } g(n_1,...,n_k) \ge z \right\}

This definition captures the situation we really care about: when the arguments $n_1,...,n_k$ make $g$ take on large values. This definition also agrees with the univariate definition of $O(f(n))$ when $f(n)$ is an increasing function of $n$. The reader should be warned that, although this works for our purposes, other texts may treat multivariate functions and asymptotic notation differently.

Randomization and probability

Some of the data structures presented in this course are randomized; they make random choices that are independent of the data being stored in them or the operations being performed on them. For this reason, performing the same set of operations more than once using these structures could result in different running times. When analyzing these data structures we are interested in their average or expected running times.

Formally, the running time of an operation on a randomized data structure is a random variable, and we want to study its expected value. For a discrete random variable $X$ taking on values in some countable universe $U$, the expected value of $X$, denoted by $E[X]$, is given by the formula

E[X] = \sum_{x \in U} x \cdot \Pr\{X = x\}

Here $\Pr\{\varepsilon\}$ denotes the probability that the event $\varepsilon$ occurs. In all of the examples in this course, these probabilities are only with respect to the random choices made by the randomized data structure; there is no assumption that the data stored in the structure, nor the sequence of operations performed on the data structure, is random.

One of the most important properties of expected values is linearity of expectation. For any two random variables $X$ and $Y$,

E[X + Y] = E[X] + E[Y]

More generally, for any random variables $X_1,...,X_k$,

E\left[\sum_{i=1}^{k} X_i \right] = \sum_{i=1}^{k} E[X_i]

Linearity of expectation allows us to break down complicated random variables (like the left-hand sides of the above equations) into sums of simpler random variables (the right-hand sides).

A useful trick, that we will use repeatedly, is defining indicator random variables. These binary variables are useful when we want to count something and are best illustrated by an example. Suppose we toss a fair coin $k$ times and we want to know the expected number of times the coin turns up as heads. Intuitively, we know the answer is $k/2$, but if we try to prove it using the definition of expected value, we get

\begin{split} E[X] & = \sum_{i=0}^{k} i \cdot \Pr\{X=i\} \\ & = \sum_{i=0}^{k} i \cdot \binom{k}{i} / 2^k \\ & = k \cdot \sum_{i=0}^{k-1} \binom{k-1}{i} / 2^k \\ & = k/2 \end{split}

This requires that we know enough to calculate that $\Pr\{X=i\} = \binom{k}{i}/2^k$, and that we know the binomial identities $i\binom{k}{i} = k\binom{k-1}{i}$ and $\sum_{i=0}^{k}\binom{k}{i} = 2^k$.

Using indicator variables and linearity of expectation makes things much easier. For each $i \in \{1,...,k\}$, define the indicator random variable

I_i = \begin{cases} 1 & \text{if the $i$th coin toss is heads} \\ 0 & \text{otherwise} \end{cases}

Then

E[I_i] = (1/2)\cdot 1 + (1/2)\cdot 0 = 1/2

Now, $X = \sum_{i=1}^{k} I_i$, so

\begin{split} E[X] & = E\left[\sum_{i=1}^{k} I_i \right] \\ & = \sum_{i=1}^{k} E[I_i] \\ & = \sum_{i=1}^{k} 1/2 \\ & = k/2 \end{split}

This is a bit more long-winded, but doesn’t require that we know any magical identities or compute any non-trivial probabilities. Even better, it agrees with the intuition that we expect half the coins to turn up as heads precisely because each individual coin turns up as heads with a probability of $1/2$.
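To close out the coin example, here is a small simulation (a sketch, not course code) that sums indicator variables over many trials; the empirical average number of heads approaches $k/2$, as linearity of expectation predicts.

```java
import java.util.Random;

public class CoinToss {
    public static void main(String[] args) {
        final int k = 20;            // tosses per trial
        final int trials = 100_000;
        Random rng = new Random(1);  // fixed seed so the run is reproducible
        long totalHeads = 0;

        for (int t = 0; t < trials; t++) {
            int x = 0;               // X = I_1 + ... + I_k
            for (int i = 1; i <= k; i++) {
                int indicator = rng.nextBoolean() ? 1 : 0;  // I_i: 1 if toss i is heads
                x += indicator;
            }
            totalHeads += x;
        }

        // The empirical mean should be close to E[X] = k/2 = 10.
        System.out.println("average heads per trial = " + (double) totalHeads / trials);
    }
}
```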