Elements of Statistical Learning

I am reading the book Elements of Statistical Learning by Hastie, Tibshirani and Friedman. What I want to get out of this book is both the practical and the theoretical aspects of function approximation. This is somewhat tied to my effort to gain more understanding of other branches of mathematics and their applications, such as representation theory, harmonic analysis, and time series analysis. There are already plenty of people writing up notes on this book; I will incorporate my own notes and solutions while cross-checking theirs. This post will be my workspace for reading this book.

This post uses the same notation as the book. The input variable is denoted by X. If X is a vector, its j^{th} component is denoted by X_j. Quantitative outputs are denoted by Y, and qualitative outputs by G. These uppercase letters refer to the generic aspect of a variable. Observed values are written in lowercase, so the j^{th} observation of X is x_j. Matrices are represented by bold uppercase letters, so a set of N input p-vectors x_i, i = 1,\dots,N is represented by the N\times p matrix \mathbf{X}. The j^{th} column of the matrix \mathbf{X} is denoted by \mathbf{x}_j. Therefore x_j and \mathbf{x}_j are distinguished by their dimension.

Overview of Supervised Learning

Two classical methods for supervised learning are introduced: the least-squares method and the nearest-neighbor method. Both methods can be derived from the framework of statistical decision theory. We will develop the theory first, then go into each of these two methods.

Given a random input vector X \in \mathcal{X}, a real-valued random output variable Y\in\mathcal{Y}, and a joint probability distribution P_{X, Y}(x,y), we want to find a function f: \mathcal{X} \longrightarrow \mathcal{Y} from a general function space \mathcal{F} that relates Y to X. The simplest example is to set \mathcal{X} to be the p-dimensional Euclidean space \mathbb{R}^p and \mathcal{Y} to be \mathbb{R}. The framework can also accommodate classification problems by setting \mathcal{Y} to be \{-1, +1\}.

If the function space \mathcal{F} is unrestricted, then one can find functions whose output matches the training outputs y exactly but which will probably perform horribly out-of-sample. Therefore we put some restrictions on the space \mathcal{F}. For example, we can work exclusively in the space of linear functions or the space of quadratic functions.

Functions from the space of linear functions are not expected to fit the output data exactly even when the underlying relationship is linear, because there are still errors associated with the data. Even supposing that there are no errors, if the underlying relationship is quadratic while we restrict our function space to the space of all linear functions, then no function from this space will fit the observed outputs y exactly. Choosing a function that “best” relates the input and output requires a notion of how well a function fits the data. This is measured through a loss function L:\mathcal{Y} \times \mathcal{Y} \longrightarrow \mathbb{R}^+. A common choice is L(Y, f(X)) = (Y - f(X))^2. Given these, the criterion for choosing f \in \mathcal{F} is that it minimizes the expected prediction error (EPE), \mathbb{E}\left[L(Y, f(X))\right]. For example, with L(Y, f(X)) = (Y - f(X))^2, the expected prediction error is

    \begin{equation*} \mathrm{EPE}(f) = \int (y - f(x))^2 dP_{X,Y}(x,y) \end{equation*}

and by conditioning on X, we have

    \begin{equation*} \mathrm{EPE}(f) = \mathbb{E}_X \left[ \mathbb{E}_{Y\vert X} \left[ L(Y, f(X)) \,\vert\, X \right] \right] \end{equation*}

and to find the f that minimizes the EPE, it suffices to minimize pointwise: given any X=x,

    \begin{equation*} f(x) = {\underset{c}{\mathrm{argmin}}}\> \mathbb{E}_{Y\vert X}\left[L(Y, c) \vert X = x\right] \end{equation*}

Therefore the function f can be found by minimizing the EPE pointwise, and the solution at X=x is

    \begin{equation*} f(x) = \mathbb{E}\left[Y \vert X = x\right] \end{equation*}

the conditional expectation. We may not want to use this particular loss function. If we instead use L(Y,f(X)) = \lvert Y - f(X) \rvert, the solution for f becomes

    \begin{equation*} f(x) = \mathrm{median}\left[Y\vert X=x\right] \end{equation*}
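
As a quick sanity check of these two facts, here is a small numerical sketch in Julia (my own, not from the book): for a fixed sample standing in for the conditional distribution of Y given X = x, a brute-force search over constants c recovers the sample mean under squared loss and the sample median under absolute loss.

using Statistics, Random

Random.seed!(1)
y = randn(10_000) .+ 2.0           # stand-in sample for Y given X = x
cs = range(0.0, 4.0, length=4001)  # candidate constants c

sq_loss(c)  = mean((y .- c) .^ 2)
abs_loss(c) = mean(abs.(y .- c))

println((cs[argmin(sq_loss.(cs))], mean(y)))    # squared-loss minimizer ≈ sample mean
println((cs[argmin(abs_loss.(cs))], median(y))) # absolute-loss minimizer ≈ sample median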

The function f is sometimes called the decision rule, and I think this is where the term “statistical decision theory” comes from. Another commonly heard term is statistical learning theory, which is in a sense a superset of statistical decision theory. Roughly speaking, statistical decision theory deals with cases where we assume some amount of knowledge about the distribution P_{X, Y}(x,y). Statistical learning theory, on the other hand, deals with cases where our knowledge of the joint probability distribution P_{X, Y}(x,y) is limited. In the latter case, the decision rule f is based only on the data, rather than on an assumed, predetermined probability distribution.

We are now ready to get back to the least-square method and the nearest-neighbor method.

Linear Models and Least Squares Methods

In a linear model, one predicts the output Y from a vector of inputs X^T = (X_1, X_2, \dots, X_p) via

    \begin{equation*} f(X) = X^T\beta \end{equation*}

The most popular method of fitting the linear model is the method of least squares, which chooses \beta to minimize the residual sum of squares

    \begin{equation*} RSS(\beta) = \sum_{i=1}^{N}(y_i - x_i^T\beta)^2 \end{equation*}

which has solution

    \begin{equation*} \hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{equation*}

We can see that this method can be derived from the statistical decision framework by setting the function space \mathcal{F} to be the space of all linear functions \left\{f:\mathbb{R}^p\longrightarrow\mathbb{R}, f(X) = X^T\beta, \beta \in \mathbb{R}^p\right\} and the loss function L(Y, f(X)) to be (Y-f(X))^2. Here we assume that \mathcal{Y} = \mathbb{R}, but this need not be the case. The choice of \mathcal{Y} can also be the set \{0,1\} that numerically codes some qualitative variable with two classes.
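
As a small illustration (my own sketch with simulated data, not an example from the book), the closed-form estimate \hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} can be computed directly in Julia and compared with the built-in least-squares solver:

using Random, LinearAlgebra

Random.seed!(1)
N, p = 200, 3
X = [ones(N) randn(N, p - 1)]      # design matrix with an intercept column
beta_true = [1.0, 2.0, -0.5]       # coefficients used to simulate the data
y = X * beta_true + 0.1 * randn(N)

beta_normal = (X' * X) \ (X' * y)  # normal equations, as in the formula above
beta_qr = X \ y                    # QR-based solve, numerically preferable

println(beta_normal)
println(beta_qr)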

Nearest-Neighbor Methods

To Be Continued…

Multiple Time Series Analysis

I am reading the book New Introduction to Multiple Time Series Analysis by Lütkepohl to refresh my memory of vector autoregressive models, error correction models, and state space models (which are, in a suitable sense, all equivalent). I will build some computational aspects of these models into this post using Python or Julia. This post serves as my workbook while reading the book.

To Be Continued…

Representation Theory of Finite Group

I am reading the book Representation Theory of Finite Groups by Benjamin Steinberg due to my interest in probability theory on groups and other algebraic structures. This is related to my earlier post on random walks on n-gons, which I found to be very interesting. This post will be my workspace for reading through this book and other related material, notably Group Representations in Probability and Statistics by Persi Diaconis.

Representation theory studies how a group acts on vector spaces. Another way to think about it: for any group G, what are the ways to embed or map it into the general linear group GL(V)? We first review what a group action is. An action of a group G on a set X is a homomorphism \varphi:G\longrightarrow S_X, assigning to each group element g \in G some permutation \sigma of the set X in a way that is compatible with the structure of the group G. This is always possible because one can do it trivially (the trivial action): \varphi(g) = \sigma_1 for all g \in G, mapping every element of G to the identity permutation. If we take X = G, then a non-trivial (in fact faithful) action is also always possible because of Cayley’s Theorem, which states that every group is isomorphic to a group of permutations, i.e., a subgroup of the symmetric group S_{\lvert G \rvert}. The set X can have additional structure, and if we take X to be a vector space over a field \mathcal{F} and require the map \varphi to respect that vector space structure as well, then the group action is called a group representation.

Definition (Representation)

A representation \varphi of a group G is a homomorphism \varphi:G\longrightarrow GL(V) from G to the general linear group of a vector space V. The degree of \varphi is the dimension of V.

We write the linear map \varphi(g) as \varphi_g, and write \varphi_g(v), sometimes \varphi_g v, for the action of \varphi_g on v \in V. Below are some examples of representations of S_3.

Example (Representation of S_3)

Immediately from the definition we see the analogue of the trivial action as a representation: the trivial representation \varphi:G\longrightarrow\mathbb{C}^* given by \varphi(g)=1 for all g \in G. We also see that the degree of the trivial representation is 1. It is interesting to look at some other representations of a small group such as S_3, the group of all permutations of 3 elements. Aside from the trivial representation, we can compute two others: the alternating representation and the standard (or permutation) representation.

Let \sigma \in S_3 be a permutation of 3 elements. The alternating representation is given by \varphi_\sigma v = \mathrm{sgn}(\sigma) v, where \mathrm{sgn}(\sigma) = +1 if \sigma can be written as a product of an even number of transpositions, and -1 otherwise. More precisely, define the signum of \sigma by

    \begin{equation*} \mathrm{sgn}(\sigma)=\prod_{1\leq i < j \leq n}\frac{\sigma(i)-\sigma(j)}{i-j} \end{equation*}

then the map \sigma\longmapsto\mathrm{sgn}(\sigma) is a group homomorphism from S_n to the subgroup \left\{1,-1\right\} of the multiplicative group \mathbf{R}^*=\mathbf{R}\setminus\{0\}. In particular, the number of transpositions in any decomposition of \sigma is either always even or always odd. We can check that this is a representation: for g, h \in S_n we have

    \begin{align*} \varphi_{gh}v &= \mathrm{sgn}(gh) v \\               &= \left(\mathrm{sgn}(g)\mathrm{sgn}(h)\right) v \\               &= \left(\varphi_{g} \varphi_{h}\right) v \\               &= \varphi_{g} \left(\varphi_{h} v\right) \end{align*}

so \varphi_{gh} = \varphi_g \circ \varphi_h. This representation has degree 1.
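
Here is a small Julia sketch (my own) of the signum formula above, checking that \mathrm{sgn} is multiplicative on S_3, so that \varphi_\sigma v = \mathrm{sgn}(\sigma)v is indeed a representation:

# the six permutations of {1,2,3}, written as vectors of images [σ(1), σ(2), σ(3)]
S3 = [[1,2,3], [2,1,3], [1,3,2], [3,2,1], [2,3,1], [3,1,2]]

# sgn via the product formula; exact rational arithmetic gives ±1
sgn(σ) = Int(prod((σ[i] - σ[j]) // (i - j) for i in 1:length(σ)-1 for j in i+1:length(σ)))

compose(σ, τ) = σ[τ]   # (σ∘τ)(i) = σ(τ(i))

for σ in S3, τ in S3
    @assert sgn(compose(σ, τ)) == sgn(σ) * sgn(τ)   # sgn(gh) = sgn(g)sgn(h)
end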

The standard representation \varphi: S_3 \longrightarrow GL_3(\mathbb{C}) takes each permutation \sigma \in S_3 to the matrix in GL_3(\mathbb{C}) that permutes the standard basis vectors according to \sigma; in other words, \varphi_\sigma(e_i) = e_{\sigma(i)}, so column i of the matrix is e_{\sigma(i)}. For example

    \begin{equation*} \varphi_{(1 2)} = \left[ \begin{array}{rrr} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1  \end{array} \right], \qquad \varphi_{(1 2 3)} = \left[ \begin{array}{rrr} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0  \end{array} \right] \end{equation*}

The standard representation has degree 3.
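
The standard representation is also easy to compute with. Below is my own sketch in Julia: build the permutation matrix whose i^{th} column is e_{\sigma(i)} and check the homomorphism property \varphi_{\sigma\tau} = \varphi_\sigma\varphi_\tau.

S3 = [[1,2,3], [2,1,3], [1,3,2], [3,2,1], [2,3,1], [3,1,2]]  # σ as [σ(1), σ(2), σ(3)]

function perm_matrix(σ::Vector{Int})
    n = length(σ)
    M = zeros(Int, n, n)
    for i in 1:n
        M[σ[i], i] = 1        # column i of φ_σ is e_{σ(i)}
    end
    return M
end

compose(σ, τ) = σ[τ]          # (σ∘τ)(i) = σ(τ(i))

for σ in S3, τ in S3
    @assert perm_matrix(compose(σ, τ)) == perm_matrix(σ) * perm_matrix(τ)
end

perm_matrix([2, 3, 1])        # the matrix of φ_{(1 2 3)} shown above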

If we start with a representation \varphi: G \longrightarrow GL_n(\mathbb{C}), written with respect to a basis B of \mathbb{C}^n, and S:\mathbb{C}^n \longrightarrow \mathbb{C}^n is an isomorphism taking B to another basis B', then \varphi_g' = S \varphi_g S^{-1} is another representation. These two representations are intuitively the “same”. More generally, we have the definition of equivalence.

Definition (Equivalence)

Two representations \varphi: G \longrightarrow GL(V) and \psi: G \longrightarrow GL(W) are said to be equivalent if there exists an isomorphism T:V \longrightarrow W such that \psi_g = T\varphi_gT^{-1} for all g \in G, or equivalently \psi_g T = T \varphi_g; in pictures, the following diagram commutes:

    \begin{equation*} \xymatrix{V\ar[r]^{\varphi_g}\ar[d]_{T} & V\ar[d]^{T}\\W\ar[r]^{\psi_g}& W} \end{equation*}

If \varphi is equivalent to \psi, we write \varphi \sim \psi.

An example of equivalent representations is given in Steinberg as follows:

Example (Equivalence)

Define \psi: \mathbb{Z}/n\mathbb{Z} \longrightarrow GL_2(\mathbb{C}) and \varphi: \mathbb{Z}/n\mathbb{Z} \longrightarrow GL_2(\mathbb{C}) by

    \begin{equation*} \psi_{[m]} = \left[  \begin{array}{cc} e^{\frac{2\pi mi}{n}} & 0 \\ 0 & e^{-\frac{2\pi mi}{n}} \end{array} \right], \quad \varphi_{[m]} = \left[ \begin{array}{lr} \cos\left(\frac{2\pi m}{n}\right) & -\sin\left(\frac{2\pi m}{n}\right) \\ \sin\left(\frac{2\pi m}{n}\right) & \cos\left(\frac{2\pi m}{n}\right) \end{array} \right] \end{equation*}

To show that \varphi \sim \psi, we need to find an invertible matrix P such that \psi = P^{-1} \varphi P. This matrix P also represents a simple change of basis in \mathbb{C}^2. Let us find the eigenvectors of the two operators. The eigenvectors of \varphi_{[m]} are (i, 1) and (-i, 1), while those of \psi_{[m]} are the standard basis vectors (1, 0) and (0, 1). The eigenvalues of \varphi_{[m]} are \cos(2\pi m/n)+i\sin(2\pi m/n) and \cos(2\pi m/n)-i\sin(2\pi m/n), and those of \psi_{[m]} are \exp(2\pi i m/n) and \exp(-2\pi i m/n). By Euler’s formula, the two sets of eigenvalues are the same, so we can take the change of basis matrix to be the matrix whose columns are the eigenvectors of \varphi, namely,

    \begin{equation*} P=\left[ \begin{array}{cc} i &-i \\ 1 & 1 \end{array} \right] \end{equation*}

Indeed, P^{-1}\varphi_{[m]}P=\psi_{[m]} after some calculation.
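
The calculation can also be checked numerically; here is a quick Julia sketch (mine) for a particular choice of n and m:

using LinearAlgebra

n, m = 5, 2
θ = 2π * m / n

φ = [cos(θ) -sin(θ); sin(θ) cos(θ)]
ψ = [exp(im * θ) 0; 0 exp(-im * θ)]
P = [im -im; 1 1]

maximum(abs.(inv(P) * φ * P - ψ))   # ≈ 0 up to floating-point error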

The notion of equivalent representations is the same as matrix similarity, except that the former describes the equivalence between operators and the latter the equivalence between their matrix representations. Also note that in the example above there are two eigenspaces for \psi_{[m]}, namely E_1 = \mathbb{C}\mathbf{e}_1 and E_2 = \mathbb{C}\mathbf{e}_2. For v \in E_1, we have \psi_{[m]}(v) \in E_1 for all m \in \mathbb{Z}/n\mathbb{Z} as well. This motivates the definition of a G-invariant subspace.

Definition (G-Invariant Subspace)

Let \varphi:G\longrightarrow GL(V) be a representation. A subspace W\le V is G-invariant if, for all g\in G and w \in W, one has \varphi_g w \in W.

We can use this definition to find S_3-invariant subspaces of the standard representation \varphi: S_3 \longrightarrow GL_3(\mathbb{C}). Note that the 1-dimensional space W spanned by \mathbf{e}_1 + \mathbf{e}_2 + \mathbf{e}_3 is an S_3-invariant subspace, since for any \sigma \in S_3,

(1)   \begin{align*} \varphi_{\sigma}(\mathbf{e}_1 + \mathbf{e}_2 + \mathbf{e}_3) &= \mathbf{e}_{\sigma(1)} + \mathbf{e}_{\sigma(2)} + \mathbf{e}_{\sigma(3)} \\ &= \mathbf{e}_1 + \mathbf{e}_2 + \mathbf{e}_3 \end{align*}

is again in W. Its complement W^\perp = \left\{(x_1, x_2, x_3) : x_1 + x_2 + x_3 = 0\right\} is also S_3-invariant. Indeed, if \varphi: G \longrightarrow GL(V) is a representation and W < V is a G-invariant subspace, then restricting \varphi to W yields a subrepresentation \varphi\vert_W:G\longrightarrow GL(W). The following computation of the restriction of the standard representation to W^\perp follows that of Diaconis.

Computation of the 2-Dimensional Representation of S_3

Let w_1 = \mathbf{e}_1 - \mathbf{e}_2 and w_2 = \mathbf{e}_2 - \mathbf{e}_3, and let v = a\mathbf{e}_1 + b\mathbf{e}_2 + c\mathbf{e}_3 \in W^\perp. Then v = a\mathbf{e}_1 + b\mathbf{e}_2 + (-a-b)\mathbf{e}_3, since the components of v add to 0. Writing v = a(\mathbf{e}_1 - \mathbf{e}_2) + (b+a)\mathbf{e}_2 + (-b-a)\mathbf{e}_3 yields v = a(\mathbf{e}_1 - \mathbf{e}_2) + (b+a)(\mathbf{e}_2 - \mathbf{e}_3) = aw_1 + (b+a)w_2. Therefore w_1 and w_2 form a basis of W^\perp. Let us consider the action of each \sigma on this basis.

    \begin{equation*} \begin{array}{cccc} \sigma & \varphi(\sigma)w_1 & \varphi(\sigma)w_2 & \varphi(\sigma) \\ \hline \mathrm{id} & w_1 & w_2 & \left[\begin{array}{p{0.5cm}p{0.5cm}} 1 & 0 \\ 0 & 1\end{array}\right] \\ (1 2) & -w_1 & w_1 + w_2 & \left[\begin{array}{p{0.5cm}p{0.5cm}}-1 & 1 \\ 0 & 1\end{array}\right] \\ (2 3) & w_1 + w_2 & -w_2 & \left[\begin{array}{p{0.5cm}p{0.5cm}}1 & 0 \\ 1 & -1\end{array}\right] \\ (1 3) & -w_2 & -w_1 & \left[\begin{array}{p{0.5cm}p{0.5cm}}0 & -1 \\ -1 & 0\end{array}\right] \\ (1 2 3) & w_2 & -(w_1+w_2) & \left[\begin{array}{p{0.5cm}p{0.5cm}}0 & -1 \\ 1 & -1\end{array}\right] \\ (1 3 2) & -(w_1 + w_2) & w_1 & \left[\begin{array}{p{0.5cm}p{0.5cm}}-1 & 1 \\ -1 & 0\end{array}\right] \\ \hline \end{array}  \end{equation*}

This gives the 2-dimensional representation of S_3.
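
The table can also be reproduced mechanically. Below is a Julia sketch (my own) that writes each \varphi_\sigma w_j in the basis \{w_1, w_2\} of W^\perp by solving a small linear system.

using LinearAlgebra

S3 = Dict("id" => [1,2,3], "(1 2)" => [2,1,3], "(2 3)" => [1,3,2],
          "(1 3)" => [3,2,1], "(1 2 3)" => [2,3,1], "(1 3 2)" => [3,1,2])

function perm_matrix(σ::Vector{Int})
    n = length(σ)
    M = zeros(Int, n, n)
    for i in 1:n
        M[σ[i], i] = 1                        # φ_σ e_i = e_{σ(i)}
    end
    return M
end

B = [1 0; -1 1; 0 -1]                         # columns are w_1 = e_1 - e_2 and w_2 = e_2 - e_3

for (name, σ) in S3
    M = round.(Int, B \ (perm_matrix(σ) * B)) # 2×2 matrix of φ_σ restricted to W^perp
    println(name, " => ", M)
end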

This 2-dimensional representation of S_3 is also irreducible, in the sense of the following definition.

Definition: Irreducible Representation

A non-zero representation \varphi:G\longrightarrow GL(V) of a group G is said to be irreducible if the only G-invariant subspaces of V are \{0\} and V.

To show this, take a non-zero vector (x, y, z) \in W^\perp (say x \neq 0; the other cases are symmetric) and let W_1 be any S_3-invariant subspace of W^\perp containing it; we show that W_1 = W^\perp. Scaling, we may assume (1, y', z') \in W_1. Then \varphi_{(1 2)}(1, y', z') = (y', 1, z') \in W_1, and so the difference (1-y', y'-1, 0) is also in W_1. If y' \neq 1, then \mathbf{e}_1 - \mathbf{e}_2 = w_1 \in W_1. Since \varphi_{(1 3)}(w_1) = -w_2, w_2 must be in W_1 as well, so W_1 = W^\perp. If y'=1, then (1, 1, -2) \in W_1 and \varphi_{(2 3)}(1, 1, -2) = (1, -2, 1) \in W_1. Taking the difference gives (0, 3, -3), so w_2 \in W_1. Then \varphi_{(1 3 2)}w_2 = w_1 must also be in W_1, so again W_1 = W^\perp. This irreducible representation is 2-dimensional. We will see that the restriction of the permutation representation of S_n to W^\perp = \{x\in \mathbb{R}^n:\sum x_i = 0\} is an irreducible (n-1)-dimensional representation.

Since we can form direct sums of vector spaces, we can similarly define a direct sum of representations. We want to do this because, if V_1 and V_2 are two G-invariant subspaces of a representation \varphi, it makes sense to talk about a representation \varphi\vert_{V_1} \oplus \varphi\vert_{V_2} that maps G to GL(V_1 \oplus V_2).

Definition: Direct Sum of Representations

Let \varphi^{(1)}:G\longrightarrow GL(V_1) and \varphi^{(2)}:G\longrightarrow GL(V_2) be two representations, then their external direct sum

    \begin{equation*} \varphi^{(1)}\oplus\varphi^{(2)}:G\longrightarrow GL(V_1 \oplus V_2) \end{equation*}

is defined by

    \begin{equation*} (\varphi^{(1)}\oplus\varphi^{(2)})_g(v_1, v_2) = \left(\varphi^{(1)}_g(v_1), \varphi^{(2)}_g(v_2)\right) \end{equation*}

and has dimension \mathrm{dim}(V_1) + \mathrm{dim}(V_2).

If we restrict ourselves to matrices, then suppose \varphi^{(1)}:G\longrightarrow GL_m(\mathbb{C}) and \varphi^{(2)}:G\longrightarrow GL_n(\mathbb{C}). Then

    \begin{equation*} \varphi^{(1)} \oplus \varphi^{(2)}:G \longrightarrow GL_{m+n}(\mathbb{C}) \end{equation*}

has the block matrix form

    \begin{equation*}(\varphi^{(1)} \oplus \varphi^{(2)})_g =  \left[ \begin{array}{cc} \varphi^{(1)}_g & 0 \\ 0 & \varphi^{(2)}_g \end{array} \right] \end{equation*}

The representation \psi in the example about equivalence is a good demonstration of how to form a direct sum of representations. It should also be clear that if a representation can be written as a direct sum of non-zero proper subrepresentations, then it is not irreducible. Therefore, the representations in our equivalence example are not irreducible.
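
A tiny Julia sketch (mine) of the block-diagonal construction; the last line rebuilds \psi_{[m]} from the equivalence example as a direct sum of two degree-1 representations.

function direct_sum(A::AbstractMatrix, B::AbstractMatrix)
    Z12 = zeros(eltype(A), size(A, 1), size(B, 2))
    Z21 = zeros(eltype(B), size(B, 1), size(A, 2))
    return [A Z12; Z21 B]                  # block matrix [A 0; 0 B]
end

# direct sum of the trivial and alternating representations evaluated at (1 2)
direct_sum(fill(1, 1, 1), fill(-1, 1, 1))  # gives [1 0; 0 -1]

# ψ_[m] as a direct sum of [m] ↦ e^{2πim/n} and [m] ↦ e^{-2πim/n}
n, m = 5, 2
direct_sum(fill(exp(2π * im * m / n), 1, 1), fill(exp(-2π * im * m / n), 1, 1))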

Related to the notion of irreducible representation are the decomposable and completely reducible representations.

To Be Continued…

From Line to Circle

Sometimes a problem involving a finite sequence can be changed slightly to yield other interesting results. One of my favourite tweaks is to make the sequence into a circle, so that the last element is adjacent to the first element of the sequence, and try to solve the same problem on this circular sequence. Sometimes the problem is already in this circular form. Take the following problem from the 1988 Austrian Mathematics Olympiad, for example:

Determine the number of all sequences (x_1,\dots,x_n), with x_i \in \{a,b,c\} for 1 \leq i \leq n, that satisfy x_1 = x_n = a and x_i \neq x_{i+1} for 1 \leq i \leq n-1.


One of the solutions is from the book From Erdös to Kiev by Ross Honsberger. The other solution is from Crux Mathematicorum, due to Curtis Cooper.

Solution 1

First let us get our hands dirty. Let f_n denote the number of sequences of length n satisfying the conditions in the question. The table below gives the first few terms.

n   sequences                                   f_n
1   a                                           1
2   (none)                                      0
3   aba, aca                                    2
4   abca, acba                                  2
5   ababa, abaca, abcba, acaba, acaca, acbca    6

Now we need a general rule to generate this table. Looking at n=5, note that we can generate row 5 from the sequences in rows 4 and 3 in the following way. For each sequence in row 3, we can append either ‘ba’ or ‘ca’ at the end; the total number of sequences constructed this way is 2f_{n-2}. Also, for each sequence in row 4, we can insert one and only one letter before the final ‘a’ that differs from both the second-to-last letter and ‘a’. This can be done in exactly one way for each sequence in row 4, so the total number of ways is f_{n-1}. Combining these two gives us the recursive relation:

(1)   \begin{equation*} f_n = f_{n-1} + 2f_{n-2} \end{equation*}

which has the characteristic equation

    \begin{equation*} g(x) = x^2-x-2 \end{equation*}

that has roots at 2 and -1. Therefore the general solution to Equation 1 is

    \begin{equation*} f_n = A2^n+B(-1)^n \end{equation*}

Given that f_1=1, f_2=0, we have A = 1/6 and B = -2/3. The general solution for f_n is

    \begin{equation*} f_n = \frac{1}{6}2^n-\frac{2}{3}(-1)^{n}=\frac{2}{3}[2^{n-2}+(-1)^{n+1}] \end{equation*}
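
A quick Julia check (my own sketch) that the recurrence, the closed form, and a brute-force enumeration all agree for small n:

function f_rec(n::Int)
    n == 1 && return 1
    n == 2 && return 0
    return f_rec(n - 1) + 2 * f_rec(n - 2)
end

# f_n = (1/6)2^n - (2/3)(-1)^n, written with integer arithmetic
f_closed(n::Int) = (2^n - 4 * (-1)^n) ÷ 6

# brute force: sequences over {1,2,3} (playing the roles of a, b, c) with
# x_1 = x_n = 1 and no two adjacent letters equal
function f_brute(n::Int)
    n == 1 && return 1
    count(Iterators.product(fill(1:3, n - 2)...)) do mid
        seq = (1, mid..., 1)
        all(seq[i] != seq[i+1] for i in 1:n-1)
    end
end

@assert all(f_rec(n) == f_closed(n) == f_brute(n) for n in 1:12)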

Solution 2

This solution is based on the observation that the number of sequences satisfying the conditions but ending with b or c instead of a can be used to construct the sequences of length n+1 that satisfy the conditions of the problem. Let A_n be the set of sequences of length n that satisfy the x_i \neq x_{i+1} condition and the x_1 = x_n = a condition, and let a_n = \lvert A_n \rvert. Similarly, let B_n and C_n be the sets of sequences of length n that satisfy the x_i \neq x_{i+1} condition but instead have x_1 = a, x_n = b and x_1 = a, x_n = c, respectively, and let b_n = \lvert B_n \rvert and c_n = \lvert C_n \rvert. Observe that for each sequence in B_n or C_n, we can append an a at the end to create a sequence of length n+1 that satisfies x_1=x_{n+1}=a and x_i \neq x_{i+1}, and which is therefore in A_{n+1}. Moreover, none of the sequences in A_n can be used to create sequences in A_{n+1}, since appending an a would repeat the final letter. Therefore a_{n+1} = b_n+c_n. Similarly, c_{n+1} = a_{n} + b_{n} and b_{n+1} = a_n + c_n. We now have the following system of equations

    \begin{equation*} \left( \begin{array}{c} a_{n+1}\\ b_{n+1}\\ c_{n+1}\\ \end{array} \right) = \left( \begin{array}{ccc} 0 & 1 & 1\\ 1 & 0 & 1\\ 1 & 1 & 0\\ \end{array} \right) \left( \begin{array}{c} a_{n}\\ b_{n}\\ c_{n}\\ \end{array} \right) \end{equation*}

and we write this as

(2)   \begin{equation*} \mathbf{f_n} = \mathbf{A}\mathbf{f_{n-1}} \end{equation*}

The solution for the homogeneous difference equation is the “guess”

    \begin{equation*} \mathbf{f_n} = \sum_{i=1}^{N}k_i\lambda_i^n\mathbf{c}_i \end{equation*}

where the sum runs over N linearly independent eigenvectors \mathbf{c}_i of the matrix \mathbf{A}, \lambda_i is the eigenvalue corresponding to \mathbf{c}_i, and k_i \in \mathbb{R} (here \mathbf{A} is symmetric, hence diagonalizable, and N = 3). This is a solution because if we plug it into Equation 2 we get

    \begin{align*} \sum_{i=1}^{N}k_i\lambda_i^n\mathbf{c}_i &= \sum_{i=1}^{N}k_i\lambda_i^{n-1}\mathbf{A}\mathbf{c}_i\\ &= \sum_{i=1}^{N}k_i\lambda_i^{n}\mathbf{c}_i \end{align*}

since \mathbf{A}\mathbf{c}_i = \lambda_i\mathbf{c}_i.

The matrix \mathbf{A} has eigenvalue 2 with multiplicity 1 and eigenvalue -1 with multiplicity 2. The eigenvectors are:

    \begin{equation*} \mathbf{c}_1 = \left(\begin{array}{c} 1 \\ 1 \\ 1 \end{array}\right) \quad \mathbf{c}_2 = \left(\begin{array}{c} -1 \\ 1 \\ 0 \end{array}\right) \quad \mathbf{c}_3 = \left(\begin{array}{c} -1 \\ 0 \\ 1 \end{array}\right) \end{equation*}

The generic solution for a_n is then a_n = A2^n + B(-1)^n. Solving with the boundary conditions a_1 = 1 and a_2 = 0 gives the same solution as above.
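
The same answer can also be computed directly from the matrix recursion; here is a short Julia sketch (mine), starting from the length-1 counts (a_1, b_1, c_1) = (1, 0, 0):

using LinearAlgebra

A = [0 1 1; 1 0 1; 1 1 0]

a(n) = (A^(n - 1) * [1, 0, 0])[1]       # a_n via f_n = A f_{n-1}
a_closed(n) = (2^n - 4 * (-1)^n) ÷ 6    # the closed form from Solution 1

@assert all(a(n) == a_closed(n) for n in 1:12)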

Make a Dollar, Pave a Road

A popular combinatorics problem is the Making a Dollar problem:

Making a Dollar Given an unlimited number of coins of the 1, 5, 10, 25, and 50-cent denominations, how many ways are there to make a dollar?  \blacksquare

The number of ways to mix the coins into a dollar can be counted in two mutually exclusive ways: 1) the number of ways where the biggest denomination is in the coin mix, and 2) the number of ways where the biggest denomination is not in the coin mix. In case (1), there is at least one coin of the biggest denomination in the mix, so let us use one coin of the biggest denomination and count the number of ways to make the remaining amount using coins of all denominations. In case (2), let us count the number of ways to make a dollar using all the other denominations. Let f(n,S) be the number of ways to make n cents using coins with denominations in S = \{d_1, \dots, d_k\}, where d_k is the largest; then

f(n, S) = f(n - d_k, S) + f(n, S \setminus \{d_k\})

This recursive relation needs several boundary conditions: when the amount to make is equal to zero, when the amount is less than zero, and when the set of denominations is empty. The first boundary condition means that we have just found a way to make the exact change, because the remaining amount is zero. The second says that we have failed to make exact change because the denomination of the coin we just used is too large. The last says that we have used up all possible denominations and still cannot make the change. These statements translate respectively to f(0, S) = 1, f(n, S) = 0 for all negative n, and f(n, \varnothing) = 0 for n > 0.

I am picking up a new programming language called Julia, so I decided to code this up as an exercise. The procedure described above can be written as follows.

function const_comb(amount::Int, parts::AbstractArray{Int,1})
    if amount == 0                 # found an exact change
        return 1
    end
    if amount < 0 || length(parts) < 1
        return 0                   # overshot, or ran out of denominations
    end
    # either use one coin of the largest denomination, or drop that denomination
    return const_comb(amount - parts[end], parts) +
           const_comb(amount, parts[1:end-1])
end

This problem can be generalised in several ways. Suppose it matters in what order the coins are returned to make the dollar; for example, we want to treat \{1, 3, 1, 1\} as different from \{1, 1, 1, 3\}. Then how many ways can we make a dollar? It may seem necessary to keep track of each sequence of coins and count each sequence separately. However, this is not too hard to do, and we can recycle the algorithm above. Instead of returning 1 when we have the exact change, we return the number of permutations of the coins used to make the change, such as the number of permutations of \{1, 1, 1, 3\} that make 6 cents. Such a permutation with repeating elements is called a permutation of a multi-set. A multi-set is a collection of objects that need not be distinct, i.e. a multi-set M =\{a_1 \cdot d_1, \dots, a_k \cdot d_k\}, where k and the a_i are non-negative integers and the d_i are distinct objects, contains a_i copies of d_i. From combinatorics, the number of permutations of a multi-set M with \sum_{i=1}^{k} a_i = a is \frac{a!}{a_1!\dots a_k!}. Sounds like a plan.

Let us first code up the above expression to calculate the number of multi-set permutations. Since it involves a few factorials, special care should be taken to avoid integer overflow. (If you ever wonder why the function lfact, the natural log of the factorial function, needs to exist in Julia’s built-in math library, you will like this post, which shows why a function calculating the hypotenuse of a right triangle needs to exist in the cmath library.)

function nPr(n::Int, r::AbstractArray{Int,1})
    if any(i -> i > n, r)
        return 0
    end
    if n == 0 || all(i -> i == 0, r)
        return 1
    end
    # multinomial coefficient n!/(r_1! r_2! ...), computed on the log scale to avoid
    # overflow; lfact was a Base function in older Julia versions
    return exp(lfact(n) - sum(lfact.(r)))
end

Now we simply need to keep track of the multi-set of coins of each denomination that makes up the exact change, and count the number of multi-set permutations. The fact that each object d_i is distinct makes a dictionary a natural choice in our code: the multi-set can be modelled as a dictionary whose keys are the distinct objects and whose values are their counts.

function const_comb_rep(amount::Int, parts::AbstractArray{Int,1}, multiset::Dict)
    if amount == 0
        # exact change found: count the permutations of the multi-set of coins used
        counts = collect(values(multiset))
        return nPr(sum(counts), counts)
    end
    if amount < 0 || length(parts) < 1
        return 0
    end
    # use one more coin of the largest denomination...
    sub_multiset = copy(multiset)
    if haskey(sub_multiset, parts[end])
        sub_multiset[parts[end]] += 1
    else
        sub_multiset[parts[end]] = 1
    end
    # ...or drop the largest denomination entirely
    return const_comb_rep(amount - parts[end], parts, sub_multiset) +
           const_comb_rep(amount, parts[1:end-1], multiset)
end

Running it with some test inputs, we get

const_comb_rep(7, [1, 2, 3], Dict())
[const_comb_rep(n,1:n, Dict{Int,Int}()) for n = 1:10]
10-element Array{Union{Float64,Int64},1}:

As we recurse, the dictionary keeps track of the number and denomination of the coins used. Finally, when we have an exact change, the number of ways to arrange them is computed and returned.

Note that the number of ways to make the integer n using numbers from 1, \dots, n is 2^{n-1}: each such ordered sum (a composition of n) corresponds to choosing, in a row of n ones, which of the n-1 gaps to split at.

Of course, in real life one does not care in which order the change is returned to make up the right amount. The ordered version of the problem is perhaps better described by the Pave a Road problem:

Paving a Road Find the number of ways to pave a 1 \times 7 stretch of road using 1 \times 1, 1 \times 2, and 1 \times 3 blocks, assuming blocks of the same size are indistinguishable.  \blacksquare

The above algorithm gives 44 ways, and one can actually print out each way by adding a print command inside the if amount == 0 branch.

Random Walk Down n-gon

I found this problem to be quite interesting.

Random Walk on a Square Consider a particle walking randomly along the edges of a square. From any vertex, the probability of moving to either of the two adjacent vertices is 0.5. Suppose the walk stops as soon as it has visited all the vertices and returned to the starting vertex. What is the expected path length?


The solution I saw involves breaking the problem down into several possibilities, finding the expected path length and the probability of each possibility, then summing them up. This method seems to work when the number of nodes is small, but perhaps it is more interesting to find a generalised solution for a regular n-gon.

Indeed, someone has already thought of that and written a paper about it. What I particularly like about this paper is that it transforms the problem into a gambler’s ruin problem, which was my first thought when I read the problem. Let’s recap and solve the Gambler’s Ruin Problem first.

Gambler’s Ruin Problem

To recap, the Gambler’s Ruin Problem is extracted from the paper and stated below.

Gambler’s Ruin Problem A gambler plays a series of independent bets, wagering one dollar on each bet. She either wins one dollar (and gets back her wager) with probability p, or loses her wager with probability q=1-p. Initially she has i dollars. She must quit broke when her fortune reaches 0 (no credit is allowed). Also, she has adopted a policy of quitting a winner when her fortune reaches N dollars. Here, N \ge i is a predetermined integer that remains fixed throughout.

  • What is the probability that the gambler quits a winner with N dollars rather than goes broke?
  • What is the expected number of bets she plays until the game ends?


One can see the connection between these two problems: the random walk on an n-gon is like the gambler’s ruin problem wrapped around a circle. Let us solve the simpler problem first, analytically and programmatically.

Probability of Quitting a Winner

Let P_{i,N} be the probability of winning this game with initial wealth i and quitting wealth N. Then the following recursive relationship holds

    \begin{equation*} P_{i,N} = pP_{i+1,N}+qP_{i-1,N} \end{equation*}

Using the fact that P_{i,N} = pP_{i,N}+qP_{i,N} (since p+q=1), the relationship can also be written as

(1)   \begin{equation*} P_{i+1,N}-P_{i,N} = \frac{q}{p}(P_{i,N}-P_{i-1,N}) \end{equation*}

with boundary conditions P_{N,N}=1 and P_{0,N}=0. Consider the value P_{1,N}: with only 1 dollar of wealth, the probability of immediate ruin is q, while the probability of making one more dollar is p. Using the formula above, we have P_{1,N} = pP_{2,N}, or P_{2,N} - P_{1,N} = \frac{q}{p}P_{1,N}, by the boundary condition P_{0,N} = 0. Now consider the value P_{2,N}, which can also be written as P_{2,N} = qP_{1,N} + pP_{3,N}, or P_{3,N}-P_{2,N} = \frac{q}{p}(P_{2,N}-P_{1,N}). There are two equivalent ways of proceeding from here to find P_{i,N}:

  • we can start with Equation 1 for i=1,\dots,N-1, and substitute back to solve for P_{i,N}. This is the approach in Ross’s Introduction to Probability Models. Doing so yields the system of equations

    (2)   \begin{align*} P_{2,N}-P_{1,N} &= \frac{q}{p}(P_{1,N}-P_{0,N})=\frac{q}{p}P_{1,N} \\ P_{3,N}-P_{2,N} &= \frac{q}{p}(P_{2,N}-P_{1,N}) \\ &\vdots \\ P_{N-1,N}-P_{N-2,N} &= \frac{q}{p}(P_{N-2,N}-P_{N-3,N}) \\ P_{N,N}-P_{N-1,N} &= \frac{q}{p}(P_{N-1,N}-P_{N-2,N})  \end{align*}

    To get an expression for P_{i,N}, we can sum the first i-1 equations to get

        \begin{align*} P_{i,N}-P_{1,N}&=P_{1,N}\left(\frac{q}{p}+\frac{q^2}{p^2}+\dots+\frac{q^{i-1}}{p^{i-1}}\right)\\ \implies P_{i,N}&=\begin{cases} \displaystyle\frac{1-(q/p)^i}{1-(q/p)}P_{1,N} & \mbox{if } \displaystyle\frac{q}{p} \neq 1 \\ iP_{1,N} & \mbox{if } \displaystyle\frac{q}{p}=1 \end{cases} \end{align*}

    and to solve for P_{1,N}, use the fact that P_{N,N}=1; then

        \begin{align*}          P_{N,N}&=\displaystyle\frac{1-(q/p)^N}{1-(q/p)}P_{1,N} \\ \implies P_{1,N}&=   \begin{cases}      \displaystyle\frac{1-(q/p)}{1-(q/p)^N} & \mbox{if } \displaystyle p \neq \frac{1}{2} \\     \displaystyle\frac{1}{N} & \mbox{if } \displaystyle p=\frac{1}{2}   \end{cases} \end{align*}

    We can substitute the expression for P_{1,N} into the expression for P_{i,N} to get

        \begin{equation*} P_{i,N} =\begin{cases} \displaystyle\frac{1-(q/p)^i}{1-(q/p)^N} & \mbox{if } \displaystyle\frac{q}{p} \neq 1 \\ \displaystyle\frac{i}{N} & \mbox{if } \displaystyle\frac{q}{p}=1 \end{cases} \end{equation*}

    Now if we let N\rightarrow\infty and denote the probability of winning this game by P_i = P_{i,\infty}, we obtain

    (3)   \begin{equation*} P_i =\begin{cases} \displaystyle 1-\left(\frac{q}{p}\right)^i & \mbox{if } \displaystyle p>\frac{1}{2} \\ 0 & \mbox{if } \displaystyle p \leq \frac{1}{2} \end{cases} \end{equation*}

  • Alternatively, let’s first solve a simpler problem. Assume that our gambler is playing against a house with infinite wealth, and the game goes on as long as our gambler has not lost all her money; in other words, assume N = \infty. This is almost like working backward from the first approach. Let Q_i be the probability that our gambler goes broke starting with initial wealth i. To solve for Q_1, note that Q_1 = q \times 1 + pQ_2, because with probability q she loses her last dollar and is ruined, and with probability p she reaches wealth 2, from which the probability of ruin is Q_2. In general,

    (4)   \begin{equation*} Q_i = qQ_{i-1}+pQ_{i+1} \end{equation*}

    which is just another way of writing the recursion behind Equation 1. To calculate Q_2, note that starting with a wealth of 2, she has to first lose 1 dollar (not necessarily in 1 step) to end up at wealth 1, and from there has to lose another dollar (also not necessarily in 1 step) to reach ruin. Therefore Q_2 = Q_1 \times Q_1 = Q_1^2. Substituting this into the expression for Q_1:

        \begin{equation*} Q_1 = q + pQ_1^2 \implies pQ_1^2 - Q_1 + (1-p) = 0 \end{equation*}

    which yields

        \begin{equation*} Q_1 = \frac{1\pm\sqrt{1-4p(1-p)}}{2p} = \frac{1+(2p-1)}{2p}=1 \mbox{ or } \frac{1-(2p-1)}{2p} = \frac{1-p}{p} \end{equation*}

    and we can then find the solution P_1 = 1-Q_1 = 1-\frac{q}{p} (taking the non-trivial root \frac{1-p}{p}, which is the relevant one when p > 1/2).
    This checks out with Equation 3 when i = 1. To find the general solution, note that we already have Q_1 = q/p and Q_2 = q^2/p^2. We can prove that Q_n = (q/p)^n by induction using Equation 4, and the result implies P_n = 1-(q/p)^n.

    We have only solved the simpler problem, for the case N = \infty. The probability that our gambler quits a winner at N is the probability of never going broke with initial capital i divided by the probability of never going broke with initial capital N. This is because P_i = P_{i,N}P_N: the probability of never going broke starting from i equals the probability of reaching N dollars starting from i times the probability of never going broke once at N dollars. A small Monte Carlo check of the closed-form expression for P_{i,N} is sketched below.
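
Here is the promised check: a small Monte Carlo sketch in Julia (my own, not from the paper) that simulates the gambler’s ruin and compares the empirical winning frequency with the closed-form expression for P_{i,N} derived above.

function win_probability(i::Int, N::Int, p::Float64; trials::Int = 100_000)
    wins = 0
    for _ in 1:trials
        wealth = i
        while 0 < wealth < N
            wealth += rand() < p ? 1 : -1   # win a dollar with probability p
        end
        wins += (wealth == N)               # did she quit a winner?
    end
    return wins / trials
end

# closed form for P_{i,N}
exact(i, N, p) = p == 0.5 ? i / N : (1 - ((1 - p) / p)^i) / (1 - ((1 - p) / p)^N)

i, N, p = 3, 10, 0.55
println((win_probability(i, N, p), exact(i, N, p)))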

Expected Number of Bets

We can find the expected number of bets similarly, using a recursion of the same form. Let X_{i,N} denote the number of bets until the game finishes, either by ruin or by reaching N dollars, starting from wealth i. Then

(5)   \begin{equation*} \mathbb{E}[X_{i,N}] = 1+p\mathbb{E}[X_{i+1,N}]+q\mathbb{E}[X_{i-1,N}] \end{equation*}

where we add 1 on the RHS because whether she wins or loses this round, she has used up one bet. Coupling this with the boundary conditions \mathbb{E}[X_{0,N}]=0 and \mathbb{E}[X_{N,N}]=0, we can solve the system of equations similarly.

To Be Continued…

The 100 Hats Problem

I am starting to keep a list of things I learned throughout the day. Nothing horrifies me more than realizing at the end of a day that nothing has been learned, not even the trivial things. Keeping track of them will probably alleviate this kind of anxiety. Today I learned that there are generalizations to the 100 hats puzzle.

I read about the 100 hat puzzle from the NYT a while back. Today in a conversation I learned that the problem has been generalized, perhaps several times, and that there are quite a few variants of it. The problem goes like this, copied from the NYT article:

One hundred persons will be lined up single file, facing north. Each person will be assigned either a red hat or a blue hat. No one can see the color of his or her own hat. However, each person is able to see the color of the hat worn by every person in front of him or her. That is, for example, the last person in line can see the color of the hat on 99 persons in front of him or her; and the first person, who is at the front of the line, cannot see the color of any hat.

Beginning with the last person in line, and then moving to the 99th person, the 98th, etc., each will be asked to name the color of his or her own hat. If the color is correctly named, the person lives; if incorrectly named, the person is shot dead on the spot. Everyone in line is able to hear every response as well as hear the gunshot; also, everyone in line is able to remember all that needs to be remembered and is able to compute all that needs to be computed.

Before being lined up, the 100 persons are allowed to discuss strategy, with an eye toward developing a plan that will allow as many of them as possible to name the correct color of his or her own hat (and thus survive). They know all of the preceding information in this problem. Once lined up, each person is allowed only to say “Red” or “Blue” when his or her turn arrives, beginning with the last person in line.

Your assignment: Develop a plan that allows as many people as possible to live. (Do not waste time attempting to evade the stated bounds of the problem — there’s no trick to the answer.)

There are countless papers (too many), presentations, master’s theses, and blogs (such as this) that talk about the solutions to this set of problems. I also learned a nice application of the axiom of choice to this problem.