Elements of Statistical Learning

I am reading the book Elements of Statistical Learning by Hastie, Tibshirani and Friedman. What I wanted to get out of this book is both the practical aspect and the theoretical aspect of functional approximation. This is somewhat tied to my effort to gain more understanding in other branches of mathematics and applications, such as representation theory, harmonic analysis, and time series analysis. There are already plenty of people writing up notes on this book. I will incorporate my own notes and solution while cross-checking theirs. This post will be my workspace for reading this book. This post uses the same notation as the book. The input variable is denoted by X. If X is a vector, its j^{th} component is denoted by X_j. Quantitative outputs is denoted by Y, and qualitative outputs by G. These uppercase letters are used to refer to the generic aspect of a variable. Observed values are written in lowercase, so the j^{th} observation of X is x_j. Matrices are represented by bold uppercase, so a set of N input of p-vectors x_i, i = 1,\hdots,N would be represented by N\times p matrix \mathbf{X}. The j^{th} component from the matrix X is denoted by \mathbf{x}_j. Therefore x_j and \mathbf{x}_j are distinguished by their dimension.

Overview of Supervised Learning

Two of the classical methods for supervised learning are introduced: The least-square method and the nearest-neighbor method. Both methods can be derived from the framework of statistical decision theory. We will develop the theory first then go into each of these two methods.

Given some random input vectors X \in \mathcal{X} and Y\in\mathcal{Y} is a real-valued random output variable, and a joint probability distribution P_{X, Y}(x,y), we want to find a function f: \mathcal{X} \longrightarrow \mathcal{Y} from a general function space \mathcal{F} that relates Y to X. The simplest example is to set \mathcal{X} to be a p dimensional Euclidean space \mathbb{R}^p and \mathcal{Y} to be \mathbb{R}. It can also accommodate classification problem by setting \mathcal{Y} to be \{-1, +1\}.

If the function space \mathcal{F} is unrestricted, then one can find functions whose output matches the training outputs y exactly but will probably perform horribly out-of-sample. Therefore we can put some restriction on the space \mathcal{F}. For example, we can be working exclusively in the space of linear functions or the space of quadratic functions.

Functions from the space of linear functions are not expected to fit the output data exactly even when the underlying relationship is linear. This is because there are still errors associated with the data. Supposing that there is no errors, but the underlying relationship is a quadratic relationship while we restrict our function space to the space of all linear functions, then any function from this space will not fit the observed output y exactly. The way to choose a function that “best” relate the input and output requires a notion of how well a function fit the data. This is measured through loss function L:\mathcal{Y} \times \mathcal{Y} \longrightarrow \mathbb{R}^+. A common choice of L is L(Y, f(X)) = (Y - f(X))^2. Given these we can have a criteria for choosing f \in \mathcal{F} such that it minimizes the expected prediction error (EPE), \mathbb{E}\left[L(Y, f(X))\right]. For example, let L(Y, f(X)) = (Y - f(X))^2, the expected prediction error is

    \begin{equation*} \mathrm{EPE}(f) = \int (y - f(x))^2 dP_{X,Y}(x,y) \end{equation*}

and by conditioning on X, we have

    \begin{equation*} \mathrm{EPE}(f) = \mathbb{E}_X \left[ \mathbb{E}_{Y\vert X} \left[ L(Y, f(X)) \right] \vert X \right] \end{equation*}

and to find the f that minimizes the EPE, we have, given any X=x,

    \begin{equation*} f(x) = {\underset{c}{\mathrm{argmin}}}\> \mathbb{E}_{Y\vert X}\left[L(Y, c) \vert X = x\right] \end{equation*}

Therefore the function f can be found by minimizing the EPE pointwise, and the solution at X=x is

    \begin{equation*} f(x) = \mathbb{E}\left[Y \vert X = x\right] \end{equation*}

the conditional expectation. We may not want to use this particular loss function. Let us replace the loss function with L(Y,f(X)) = \lvert Y - f(X) \rvert. The solution of f using this loss function is

    \begin{equation*} f(x) = \mathrm{median}\left[Y\vert X=x\right] \end{equation*}

The function f is sometimes called the decision rule, and I think this is where the term “statistical decision theory” comes from. Another commonly heard term is the Statistical Learning Theory, which is in a sense a superset of the statistical decision theory. Roughly speaking, the statistical decision theory deals with the cases where we assume some amount of knowledge about the distribution P_{X, Y}(x,y). The statistical learning theory, on the other hand, deals with the cases where our knowledge of the joint probability distribution P_{X, Y}(x,y) is limited. In the latter case, the decision rule f itself is based only on the data, rather than the assumed, predetermined probability distribution.

We are now ready to get back to the least-square method and the nearest-neighbor method.

Linear Models and Least Squares Methods

In a linear model, one predicts the output Y from a vector of inputs X^T = (X_1, X_2, /hdots, X_P) via

    \begin{equation*} f(X) = X^T\beta \end{equation*}

The most popular method of fitting the linear model is to use the method of least squares, which choose \beta to minimize the residual sum of squares

    \begin{equation*} RSS(\beta) = \sum_{i=1}^{N}(y_i - x_i^T\beta)^2 \end{equation*}

which has solution

    \begin{equation*} \hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{equation*}

We can see that this method can be derived from the statistical decision framework by setting the function space \mathcal{F} to be the space of all linear functions \left\{f:\mathbb{R}^p\longrightarrow\mathbb{R}, f(X) = X^T\beta, \beta \in \mathbb{R}^p\right\} and with the loss function L(Y, f(X)) to be (Y-f(x))^2. Here we assume that \mathcal{Y} = \mathbb{R}, but this need not be the case. The choice of \mathcal{Y} in this case can also be the set \{0,1\} that numerically codes some qualitative variable with two classes.

Nearest-Neighbor Methods

To Be Continued…