I am reading The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. What I want to get out of this book is both the practical and the theoretical aspects of function approximation. This is tied to my effort to better understand other branches of mathematics and their applications, such as representation theory, harmonic analysis, and time series analysis. Plenty of people have already written up notes on this book; I will write my own notes and solutions while cross-checking theirs. This post will be my workspace for reading the book.
This post uses the same notation as the book. The input variable is denoted by $X$. If $X$ is a vector, its $j$-th component is denoted by $X_j$. Quantitative outputs are denoted by $Y$, and qualitative outputs by $G$. These uppercase letters refer to the generic aspect of a variable. Observed values are written in lowercase, so the $i$-th observation of $X$ is $x_i$. Matrices are represented by bold uppercase letters, so a set of $N$ input $p$-vectors $x_i$, $i = 1, \dots, N$, would be represented by the $N \times p$ matrix $\mathbf{X}$. The $j$-th column of the matrix, an $N$-vector, is denoted by $\mathbf{x}_j$. Therefore $x_i$ and $\mathbf{x}_j$ are distinguished by their dimension.
Overview of Supervised Learning
Two of the classical methods for supervised learning are introduced: the least-squares method and the nearest-neighbor method. Both methods can be derived from the framework of statistical decision theory. We will develop the theory first and then go into each of these two methods.
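Given a random input vector $X \in \mathcal{X}$, a real-valued random output variable $Y \in \mathcal{Y}$, and a joint probability distribution $\Pr(X, Y)$, we want to find a function $f: \mathcal{X} \to \mathcal{Y}$ from a general function space $\mathcal{F}$ that relates $X$ to $Y$. The simplest example is to set $\mathcal{X}$ to be a $p$-dimensional Euclidean space $\mathbb{R}^p$ and $\mathcal{Y}$ to be $\mathbb{R}$. The framework can also accommodate classification problems by setting $\mathcal{Y}$ to be a finite set of class labels.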
If the function space $\mathcal{F}$ is unrestricted, then one can find functions whose output matches the training outputs exactly but that will probably perform horribly out-of-sample. Therefore we can put some restriction on the space $\mathcal{F}$. For example, we can work exclusively in the space of linear functions or the space of quadratic functions.
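To make the point about unrestricted function spaces concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration and not something from the book: the toy data (a noisy line), the sample sizes, and the hypothetical `memorizer` function, which reproduces the training outputs exactly and returns 0 everywhere else. It is compared against a function restricted to the space of linear functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Assumed toy data: y = 2x + 1 plus small Gaussian noise.
    x = rng.uniform(0.0, 1.0, size=n)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=n)
    return x, y

x_train, y_train = sample(50)
x_test, y_test = sample(1000)

# Unrestricted choice: a lookup table that is exact on the training inputs
# and returns 0 for any input it has not seen.
table = dict(zip(x_train.tolist(), y_train.tolist()))

def memorizer(x):
    return np.array([table.get(v, 0.0) for v in np.asarray(x).tolist()])

# Restricted choice: the least-squares line (np.polyfit returns the
# highest-degree coefficient first).
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def linear(x):
    return slope * x + intercept

def mse(f, x, y):
    # Mean squared error of predictor f on the sample (x, y).
    return float(np.mean((y - f(x)) ** 2))

mem_train, mem_test = mse(memorizer, x_train, y_train), mse(memorizer, x_test, y_test)
lin_train, lin_test = mse(linear, x_train, y_train), mse(linear, x_test, y_test)

print(f"memorizer: train MSE {mem_train:.4f}, test MSE {mem_test:.4f}")
print(f"linear:    train MSE {lin_train:.4f}, test MSE {lin_test:.4f}")
```

The memorizer has zero training error by construction, yet it should do far worse than the line on fresh data, which is the behaviour the restriction on the function space is meant to prevent.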
Functions from the space of linear functions are not expected to fit the output data exactly, even when the underlying relationship is linear, because there are still errors associated with the data. Conversely, suppose there are no errors but the underlying relationship is quadratic while we restrict our function space to the space of all linear functions; then no function from this space will fit the observed outputs exactly. Choosing the function that “best” relates the input and output requires a notion of how well a function fits the data. This is measured through a loss function $L(Y, f(X))$. A common choice of $L$ is the squared error loss. Given these, we have a criterion for choosing $f$: it should minimize the expected prediction error (EPE), $\mathrm{EPE}(f) = E[L(Y, f(X))]$. For example, let $L(Y, f(X)) = (Y - f(X))^2$; the expected prediction error is
$$\mathrm{EPE}(f) = E(Y - f(X))^2 = \int [y - f(x)]^2 \Pr(dx, dy),$$
and by conditioning on $X$, we have
$$\mathrm{EPE}(f) = E_X E_{Y|X}\left([Y - f(X)]^2 \mid X\right),$$
and to find the $f$ that minimizes the EPE, we have, given any $x$,
$$f(x) = \mathrm{argmin}_c \, E_{Y|X}\left([Y - c]^2 \mid X = x\right).$$
Therefore the function $f$ can be found by minimizing the EPE pointwise, and the solution at $x$ is
$$f(x) = E(Y \mid X = x),$$
the conditional expectation. We may not want to use this particular loss function. Let us replace the loss function with the absolute error $L(Y, f(X)) = |Y - f(X)|$. The solution using this loss function is
$$\hat{f}(x) = \mathrm{median}(Y \mid X = x).$$
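As a quick numerical check of the pointwise argument, the sketch below fixes one input value and looks only at the distribution of the output there. It searches over constant predictions $c$ and compares the minimizer of the average squared loss with the sample mean, and the minimizer of the average absolute loss with the sample median. The exponential distribution, the sample size and the grid are all assumptions chosen so that the mean and the median differ visibly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed skewed distribution for Y at one fixed x, so mean and median differ.
y = rng.exponential(scale=1.0, size=5001)

# Candidate constant predictions c at this x (grid step 0.001).
grid = np.linspace(0.0, 3.0, 3001)

# Empirical risk of each candidate under the two loss functions.
sq_risk = np.array([np.mean((y - c) ** 2) for c in grid])
abs_risk = np.array([np.mean(np.abs(y - c)) for c in grid])

c_sq = grid[np.argmin(sq_risk)]    # minimizer under squared error loss
c_abs = grid[np.argmin(abs_risk)]  # minimizer under absolute error loss

print(f"squared loss minimizer  {c_sq:.3f}  vs sample mean   {y.mean():.3f}")
print(f"absolute loss minimizer {c_abs:.3f}  vs sample median {np.median(y):.3f}")
```

Up to the grid resolution, the two minimizers should agree with the sample mean and the sample median respectively.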
The function $f$ is sometimes called the decision rule, and I think this is where the term “statistical decision theory” comes from. Another commonly heard term is statistical learning theory, which is in a sense a superset of statistical decision theory. Roughly speaking, statistical decision theory deals with the cases where we assume some amount of knowledge about the distribution $\Pr(X, Y)$. Statistical learning theory, on the other hand, deals with the cases where our knowledge of the joint probability distribution is limited. In the latter case, the decision rule itself is based only on the data, rather than on an assumed, predetermined probability distribution.
We are now ready to get back to the least-squares method and the nearest-neighbor method.
Linear Models and Least Squares Methods
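In a linear model, one predicts the output $Y$ from a vector of inputs $X^T = (X_1, X_2, \dots, X_p)$ via
$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j.$$
Including the constant $1$ in $X$ absorbs the intercept $\hat{\beta}_0$ into the coefficient vector, so that $\hat{Y} = X^T \hat{\beta}$. The most popular method of fitting the linear model is the method of least squares, which chooses $\beta$ to minimize the residual sum of squares
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2,$$
which, when $\mathbf{X}^T \mathbf{X}$ is nonsingular, has the solution
$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y},$$
where $\mathbf{y}$ is the $N$-vector of observed outputs.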
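For reference, here is a short sketch of where the least-squares solution comes from. It assumes the matrix form of the residual sum of squares, with $\mathbf{X}$ the matrix whose rows are the input vectors $x_i^T$ and $\mathbf{y}$ the vector of observed outputs, and it assumes that $\mathbf{X}^T \mathbf{X}$ is nonsingular.

```latex
\begin{align*}
\mathrm{RSS}(\beta)
  &= (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta) \\
\frac{\partial\, \mathrm{RSS}}{\partial \beta}
  &= -2\, \mathbf{X}^T (\mathbf{y} - \mathbf{X}\beta) \\
\frac{\partial^2\, \mathrm{RSS}}{\partial \beta\, \partial \beta^T}
  &= 2\, \mathbf{X}^T \mathbf{X} \\
\mathbf{X}^T \mathbf{X}\, \hat{\beta}
  &= \mathbf{X}^T \mathbf{y}
  && \text{(first derivative set to zero)} \\
\hat{\beta}
  &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\end{align*}
```

Since $2\,\mathbf{X}^T\mathbf{X}$ is positive semidefinite, and positive definite when $\mathbf{X}$ has full column rank, the stationary point is a minimum.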
We can see that this method can be derived from the statistical decision framework by setting the function space $\mathcal{F}$ to be the space of all linear functions and the loss function to be $L(Y, f(X)) = (Y - f(X))^2$. Here we assume that $\mathcal{Y} = \mathbb{R}$, but this need not be the case. The choice of $\mathcal{Y}$ in this case can also be the set $\{0, 1\}$ that numerically codes some qualitative variable with two classes.
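The following is a minimal NumPy sketch of the least-squares fit, not a definitive implementation. The simulated data, the coefficient vector `beta_true` and all variable names are assumptions made for illustration. It solves the normal equations directly, cross-checks the result against `np.linalg.lstsq`, and then reuses the same fit on an output coded as 0/1. Thresholding the fitted value at 0.5 to get a class is an additional assumption that the text above does not spell out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated inputs: N observations of p variables (assumed for illustration).
N, p = 200, 3
X_raw = rng.normal(size=(N, p))
X = np.column_stack([np.ones(N), X_raw])     # constant 1 absorbs the intercept

beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Solve the normal equations (X^T X) beta = X^T y without forming an inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

def rss(beta):
    # Residual sum of squares for a coefficient vector beta.
    return float(np.sum((y - X @ beta) ** 2))

print("beta_hat:", np.round(beta_hat, 3))
print("RSS at beta_hat:", round(rss(beta_hat), 3))

# Two-class output coded as 0/1, fitted with exactly the same machinery.
g = (X_raw[:, 0] > 0).astype(float)
beta_cls = np.linalg.solve(X.T @ X, X.T @ g)
pred = (X @ beta_cls > 0.5).astype(float)    # assumed 0.5 threshold
accuracy = float(np.mean(pred == g))
print("training accuracy with 0/1 coding:", accuracy)
```

Using `np.linalg.solve` on the normal equations mirrors the closed-form solution; for ill-conditioned inputs, `np.linalg.lstsq` is usually the safer choice numerically.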
To Be Continued…