I'll show the result for any multiple linear regression, whether the regressors are polynomials of $X_t$ or not. In fact, this shows a little more than what you asked: each LOOCV residual is identical to the corresponding leverage-weighted residual from the full regression, not just that you can obtain the LOOCV error as in (5.2) (the averages could in principle agree even if the individual terms in them did not).
Let me take the liberty of using slightly adapted notation.
We first show that
$$\hat\beta-\hat\beta_{(t)}=\left(\frac{\hat u_t}{1-h_t}\right)(X'X)^{-1}X_t',\tag{A}$$
where $\hat\beta$ is the estimate using all data and $\hat\beta_{(t)}$ the estimate obtained when leaving out observation $t$. Here, $X_t$ is the row vector of regressors for observation $t$, so that $\hat y_t=X_t\hat\beta$; $X_{(t)}$ and $y_{(t)}$ denote the regressor matrix and outcome vector with observation $t$ removed; $\hat u_t=y_t-X_t\hat\beta$ are the full-sample residuals; and $h_t=X_t(X'X)^{-1}X_t'$ is the leverage of observation $t$.
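Before proving (A), here is a minimal numerical sketch of it in Python/numpy; the simulated data, the sample size $T=50$, and the dropped index $t=7$ are illustrative choices of mine, not part of the argument.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 50, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                    # full-sample OLS estimate
u_hat = y - X @ beta_hat                        # full-sample residuals
h = np.einsum('ti,ij,tj->t', X, XtX_inv, X)     # leverages h_t

t = 7                                           # drop an arbitrary observation
keep = np.arange(T) != t
beta_t = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]   # leave-one-out fit

lhs = beta_hat - beta_t
rhs = (u_hat[t] / (1 - h[t])) * (XtX_inv @ X[t])
print(np.allclose(lhs, rhs))                    # True: (A) holds numerically
```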
The proof uses the following matrix algebraic result.
Let $A$ be a nonsingular matrix, $b$ a vector and $\lambda$ a scalar. If
$$\lambda\neq\frac{-1}{b'A^{-1}b},$$
then
$$(A+\lambda bb')^{-1}=A^{-1}-\left(\frac{\lambda}{1+\lambda b'A^{-1}b}\right)A^{-1}bb'A^{-1}.\tag{B}$$
The proof of (B) follows immediately from verifying
$$\left\{A^{-1}-\left(\frac{\lambda}{1+\lambda b'A^{-1}b}\right)A^{-1}bb'A^{-1}\right\}(A+\lambda bb')=I.$$
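Identity (B) is the Sherman-Morrison formula; a quick numeric check, with arbitrary illustrative choices of $A$, $b$ and $\lambda$, might look like:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n)) + n * np.eye(n)   # diagonal shift keeps A nonsingular
b = rng.normal(size=(n, 1))
lam = 0.7                                     # any scalar with lam != -1/(b'A^{-1}b)

A_inv = np.linalg.inv(A)
scale = lam / (1 + lam * float(b.T @ A_inv @ b))
rhs = A_inv - scale * (A_inv @ b @ b.T @ A_inv)
lhs = np.linalg.inv(A + lam * (b @ b.T))
print(np.allclose(lhs, rhs))                  # True: (B) holds numerically
```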
The following result is helpful in proving (A):
$$(X_{(t)}'X_{(t)})^{-1}X_t'=\left(\frac{1}{1-h_t}\right)(X'X)^{-1}X_t'.\tag{C}$$
Proof of (C): Using $\sum_{t=1}^T X_t'X_t=X'X$ and applying (B) with $A=X'X$, $b=X_t'$ and $\lambda=-1$, we have
$$(X_{(t)}'X_{(t)})^{-1}=(X'X-X_t'X_t)^{-1}=(X'X)^{-1}+\frac{(X'X)^{-1}X_t'X_t(X'X)^{-1}}{1-X_t(X'X)^{-1}X_t'}.$$
So we find
$$(X_{(t)}'X_{(t)})^{-1}X_t'=(X'X)^{-1}X_t'+(X'X)^{-1}X_t'\left(\frac{X_t(X'X)^{-1}X_t'}{1-X_t(X'X)^{-1}X_t'}\right)=\left(\frac{1}{1-h_t}\right)(X'X)^{-1}X_t'.$$
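A small self-contained sketch checking (C) numerically; the design matrix and the dropped index are again illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
T, t = 30, 5
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
X_drop = np.delete(X, t, axis=0)              # X_(t): X with row t removed

XtX_inv = np.linalg.inv(X.T @ X)
h_t = X[t] @ XtX_inv @ X[t]                   # leverage of observation t
lhs = np.linalg.inv(X_drop.T @ X_drop) @ X[t]
rhs = (XtX_inv @ X[t]) / (1 - h_t)
print(np.allclose(lhs, rhs))                  # True: (C) holds numerically
```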
The proof of (A) now follows from (C): as
$$X'X\hat\beta=X'y,$$
we have
$$(X_{(t)}'X_{(t)}+X_t'X_t)\hat\beta=X_{(t)}'y_{(t)}+X_t'y_t,$$
or, premultiplying by $(X_{(t)}'X_{(t)})^{-1}$ and using $y_t=X_t\hat\beta+\hat u_t$ together with $\hat\beta_{(t)}=(X_{(t)}'X_{(t)})^{-1}X_{(t)}'y_{(t)}$,
$$\left\{I_k+(X_{(t)}'X_{(t)})^{-1}X_t'X_t\right\}\hat\beta=\hat\beta_{(t)}+(X_{(t)}'X_{(t)})^{-1}X_t'(X_t\hat\beta+\hat u_t).$$
Cancelling the common term $(X_{(t)}'X_{(t)})^{-1}X_t'X_t\hat\beta$ on both sides gives
$$\hat\beta=\hat\beta_{(t)}+(X_{(t)}'X_{(t)})^{-1}X_t'\hat u_t=\hat\beta_{(t)}+\frac{(X'X)^{-1}X_t'\hat u_t}{1-h_t},$$
where the last equality follows from (C).
Now multiply through in (A) by $X_t$ and note that $X_t\hat\beta-X_t\hat\beta_{(t)}=(y_t-X_t\hat\beta_{(t)})-(y_t-X_t\hat\beta)$ to get, with $\hat u_{(t)}=y_t-X_t\hat\beta_{(t)}$ the residual resulting from using $\hat\beta_{(t)}$,
$$\hat u_{(t)}=\hat u_t+\left(\frac{\hat u_t}{1-h_t}\right)h_t,$$
or
$$\hat u_{(t)}=\frac{\hat u_t(1-h_t)+\hat u_t h_t}{1-h_t}=\frac{\hat u_t}{1-h_t}.$$
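To close the loop, here is a sketch that brute-forces all $T$ leave-one-out fits and confirms the termwise identity, so the LOOCV error as in (5.2) can be computed from a single full-sample fit; the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 40
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
u_hat = y - X @ (XtX_inv @ X.T @ y)               # full-sample residuals
h = np.einsum('ti,ij,tj->t', X, XtX_inv, X)       # leverages h_t

# Brute force: refit T times, each time leaving out observation t
u_loo = np.empty(T)
for t in range(T):
    keep = np.arange(T) != t
    beta_t = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    u_loo[t] = y[t] - X[t] @ beta_t

print(np.allclose(u_loo, u_hat / (1 - h)))        # True: termwise identity
loocv_mse = np.mean((u_hat / (1 - h)) ** 2)       # LOOCV error from one fit
```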