Why is using Newton's method to optimize logistic regression called iteratively reweighted least squares?

This is not clear to me, because the logistic loss and the least squares loss are completely different things.

— Haitao Du

I don't think they are the same. IRLS is Newton-Raphson with the expected Hessian rather than the observed Hessian.

— Dimitriy V. Masterov

@DimitriyV.Masterov thanks, could you tell me more about the expected vs. observed Hessian? Also, what do you think of this explanation

— Haitao Du

See also stats.stackexchange.com/questions/236676/…

— kjetil b halvorsen

Summary: GLMs are fit via Fisher scoring which, as Dimitriy V. Masterov notes, is Newton-Raphson with the expected Hessian instead (i.e. we use an estimate of the Fisher information instead of the observed information). If we are using the canonical link function, it turns out that the observed Hessian equals the expected Hessian, so NR and Fisher scoring are the same in that case. Either way, we'll see that Fisher scoring is actually fitting a weighted least squares linear model, and the coefficient estimates from this converge* on a maximum of the logistic regression likelihood. Aside from reducing the fitting of a logistic regression to an already solved problem, we also get the benefit of being able to use linear regression diagnostics on the final WLS fit to learn about our logistic regression.

I'm going to keep this focused on logistic regression, but for a more general perspective on maximum likelihood in GLMs I recommend section 15.3 of this chapter, which goes through this and derives IRLS in a more general setting (I think it's from John Fox's Applied Regression Analysis and Generalized Linear Models).

$^*$ see the comments at the end

The likelihood and score function

We will be fitting our GLM by iterating something of the form

$b^{(m+1)} = b^{(m)} - J^{-1}_{(m)}\nabla \ell(b^{(m)})$

where $\ell$ is the log likelihood and $J_{(m)}$ will be either the observed or expected Hessian of the log likelihood.

Our link function is a function $g$ that maps the conditional mean $\mu_i = E(y_i | x_i)$ to our linear predictor, so our model for the mean is $g(\mu_i) = x_i^T\beta$. Let $h$ be the inverse link function mapping the linear predictor to the mean.
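As a small illustration of the link $g$ and inverse link $h$, here is a sketch using base R's logistic and normal quantile/distribution functions, plus the `linkfun` and `linkinv` components of a `binomial()` family object. The values in `mu` are made up for the example.

```r
# Link g maps the mean to the linear predictor; inverse link h maps it back.
mu <- c(0.1, 0.5, 0.9)             # illustrative conditional means

eta_logit  <- qlogis(mu)           # g = logit
eta_probit <- qnorm(mu)            # g = probit

mu_back_logit  <- plogis(eta_logit)   # h = inverse logit
mu_back_probit <- pnorm(eta_probit)   # h = standard normal CDF

# The same pair is stored on a family object
fam     <- binomial(link = "logit")
eta_fam <- fam$linkfun(mu)
mu_fam  <- fam$linkinv(eta_fam)
```

In each case applying $h$ after $g$ recovers the original means.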

For a logistic regression we have a Bernoulli likelihood with independent observations, so

$\ell(b; y) = \sum_{i=1}^n y_i\log h(x_i^T b) + (1 - y_i) \log(1 - h(x_i^Tb)).$

Taking derivatives,

$\frac{\partial \ell}{\partial b_j} = \sum_{i=1}^n \frac{y_i}{h(x_i^T b)} h'(x_i^T b) x_{ij} - \frac{1 - y_i}{1 - h(x_i^T b)} h'(x_i^T b) x_{ij}$

$= \sum_{i=1}^n x_{ij} h'(x_i^T b) \left(\frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right)$

$= \sum_i x_{ij} \frac{h'(x_i^T b)}{h(x_i^T b)(1 - h(x_i^T b))}(y_i - h(x_i^T b)).$
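One way to sanity-check the score formula is to compare it with central finite differences of the log likelihood. The sketch below does this on made-up simulated data with a probit inverse link; the data, the evaluation point `b0` and the helper names are all assumptions for illustration.

```r
# Compare the analytic score with a numerical gradient of the log likelihood.
set.seed(1)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
h       <- function(eta) pnorm(eta)   # probit inverse link (illustrative choice)
h_prime <- function(eta) dnorm(eta)
y <- rbinom(n, 1, h(X %*% c(0.5, -1, 0.25)))

loglik <- function(b) {
  mu <- drop(h(X %*% b))
  sum(y * log(mu) + (1 - y) * log(1 - mu))
}

# analytic score: sum_i x_ij * h' / (h (1 - h)) * (y_i - h)
score <- function(b) {
  eta <- drop(X %*% b)
  mu  <- h(eta)
  drop(t(X) %*% (h_prime(eta) / (mu * (1 - mu)) * (y - mu)))
}

# central finite differences, one coordinate at a time
num_grad <- function(f, b, eps = 1e-6) {
  sapply(seq_along(b), function(j) {
    e <- replace(numeric(length(b)), j, eps)
    (f(b + e) - f(b - e)) / (2 * eps)
  })
}

b0 <- c(0.1, -0.2, 0.3)
max_abs_diff <- max(abs(score(b0) - num_grad(loglik, b0)))
```

If the derivation is right, `max_abs_diff` should be tiny relative to the size of the score itself.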

Using the canonical link

Now let's suppose we're using the canonical link function $g_c = \text{logit}$. Then $g^{-1}_c(x) := h_c(x) = \frac{1}{1+e^{-x}}$ so $h_c' = h_c \cdot (1-h_c)$, which means this simplifies to

$\frac{\partial \ell}{\partial b_j} = \sum_i x_{ij} (y_i - h_c(x_i^T b))$

so

$\nabla \ell (b; y) = X^T (y - \hat y).$

Furthermore, still using $h_c$,

$\frac{\partial^2 \ell}{\partial b_k \partial b_j} = - \sum_i x_{ij} \frac{\partial}{\partial b_k} h_c(x_i^T b) = - \sum_i x_{ij}x_{ik} \left[h_c(x_i^T b) (1 - h_c(x_i^T b))\right].$

Let

$W = \text{diag}\left(h_c(x_1^T b)(1 - h_c(x_1^T b)), \dots, h_c(x_n^T b)(1 - h_c(x_n^T b))\right) = \text{diag}\left(\hat y_1(1 - \hat y_1), \dots, \hat y_n (1 - \hat y_n)\right).$

Then we have

$H = -X^TWX$

and note how this no longer has any $y_i$ in it, so $E(H) = H$ (we're viewing this as a function of $b$, so the only random thing is $y$ itself). Thus we've shown that Fisher scoring is equivalent to Newton-Raphson when we use the canonical link in logistic regression. Also, by virtue of $\hat y_i \in (0,1)$, $-X^TWX$ will always be strictly negative definite (given that $X$ has full column rank), although numerically if $\hat y_i$ gets too close to $0$ or $1$ then we may have weights round to $0$, which can make $H$ negative semidefinite and therefore computationally singular.
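The closed form $H = -X^TWX$ for the canonical link can be checked numerically by differentiating the score $X^T(y - \hat y)$ with finite differences. This is a sketch on made-up simulated data; the helper names and the point `b0` are assumptions for illustration.

```r
# Check H = -X^T W X against a finite-difference Jacobian of the score.
set.seed(2)
n <- 200; p <- 3
X  <- matrix(rnorm(n * p), n, p)
hc <- function(eta) 1 / (1 + exp(-eta))          # inverse canonical link
y  <- rbinom(n, 1, hc(X %*% c(1, -0.5, 0.25)))

score_c <- function(b) drop(t(X) %*% (y - drop(hc(X %*% b))))  # X^T (y - y.hat)

hessian_c <- function(b) {
  y_hat <- drop(hc(X %*% b))
  # X * w scales row i of X by w_i, so this is -X^T W X without building diag(W)
  -t(X) %*% (X * (y_hat * (1 - y_hat)))
}

b0  <- c(0.2, 0.1, -0.3)
eps <- 1e-6
H_num <- sapply(1:p, function(k) {
  e <- replace(numeric(p), k, eps)
  (score_c(b0 + e) - score_c(b0 - e)) / (2 * eps)  # column k of the Hessian
})

H <- hessian_c(b0)
max_abs_diff <- max(abs(H - H_num))
eigs <- eigen(H, symmetric = TRUE)$values          # all negative if H is negative definite
```

Note that `hessian_c` never touches `y`, which is the reason $E(H) = H$ here.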

Now create the working response $z = W^{-1}(y - \hat y)$ and note that

$\nabla \ell = X^T(y - \hat y) = X^T W z.$

All together this means that we can optimize the log likelihood by iterating

$b^{(m+1)} = b^{(m)} + (X^T W_{(m)} X)^{-1}X^T W_{(m)} z_{(m)}$

and $(X^T W_{(m)} X)^{-1}X^T W_{(m)} z_{(m)}$ is exactly $\hat \beta$ for a weighted least squares regression of $z_{(m)}$ on $X$.

Checking this in R:

set.seed(123)
p <- 5
n <- 500
x <- matrix(rnorm(n * p), n, p)
betas <- runif(p, -2, 2)
hc <- function(x) 1 /(1 + exp(-x)) # inverse canonical link
p.true <- hc(x %*% betas)
y <- rbinom(n, 1, p.true)

# fitting with our procedure
my_IRLS_canonical <- function(x, y, b.init, hc, tol=1e-8) {
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    y.hat <- hc(eta)
    h.prime_eta <- y.hat * (1 - y.hat)
    z <- (y - y.hat) / h.prime_eta

    b.new <- b.old + lm(z ~ x - 1, weights = h.prime_eta)$coef  # WLS regression
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

my_IRLS_canonical(x, y, rep(1,p), hc)
# x1         x2         x3         x4         x5 
# -1.1149687  2.1897992  1.0271298  0.8702975 -1.2074851

glm(y ~ x - 1, family=binomial())$coef
# x1         x2         x3         x4         x5 
# -1.1149687  2.1897992  1.0271298  0.8702975 -1.2074851

and they agree.
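The WLS step that `lm` carries out inside the loop above can also be written straight from the normal equations $(X^TWX)^{-1}X^TWz$. The sketch below compares the two for a single step on freshly simulated data; the function names `irls_step_direct` and `irls_step_lm` are made up for this example.

```r
# One IRLS step two ways: explicit normal equations vs. lm() with weights.
set.seed(3)
n <- 300; p <- 4
X  <- matrix(rnorm(n * p), n, p)
hc <- function(eta) 1 / (1 + exp(-eta))
y  <- rbinom(n, 1, hc(X %*% c(-1, 2, 1, 0.5)))

irls_step_direct <- function(X, y, b) {
  y_hat <- drop(hc(X %*% b))
  w   <- y_hat * (1 - y_hat)        # diagonal of W
  z   <- (y - y_hat) / w            # working response W^{-1}(y - y.hat)
  XtW <- t(X * w)                   # X^T W
  b + drop(solve(XtW %*% X, XtW %*% z))
}

irls_step_lm <- function(X, y, b) {
  y_hat <- drop(hc(X %*% b))
  w <- y_hat * (1 - y_hat)
  z <- (y - y_hat) / w
  b + unname(coef(lm(z ~ X - 1, weights = w)))
}

b0 <- rep(0, p)
step_diff <- max(abs(irls_step_direct(X, y, b0) - irls_step_lm(X, y, b0)))
```

`lm` solves the problem through a QR decomposition rather than by forming $X^TWX$, which is generally the better-conditioned route; the explicit form is shown only to make the correspondence with the update formula visible.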

Non-canonical link functions

Now we no longer get the simplification $\frac{h'}{h(1-h)} = 1$ in $\nabla \ell$, so $H$ becomes more complicated, and using $E(H)$ in its place in Fisher scoring makes a real difference. We already worked out the general $\nabla \ell$, so the Hessian is the main difficulty. We need

$\frac{\partial^2 \ell}{\partial b_k \partial b_j} = \sum_i x_{ij} \frac{\partial}{\partial b_k}\left[h'(x_i^T b) \left(\frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right)\right]$

$= \sum_i x_{ij}x_{ik} \left[h''(x_i^T b) \left(\frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2\left(\frac{y_i}{h(x_i^T b)^2} + \frac{1-y_i}{(1-h(x_i^T b))^2} \right)\right]$

Via the linearity of expectation, all we need to do to get $E(H)$ is replace each occurrence of $y_i$ with its mean under our model, which is $\mu_i=h(x_i^T\beta)$. Each term in the summand will therefore contain a factor of the form

$h''(x_i^T b) \left(\frac{h(x_i^T \beta)}{h(x_i^T b)} - \frac{1 - h(x_i^T \beta)}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2\left(\frac{h(x_i^T \beta)}{h(x_i^T b)^2} + \frac{1-h(x_i^T \beta)}{(1-h(x_i^T b))^2} \right).$

But to actually do our optimization we'll need to estimate each $\beta$, and at step $m$, $b^{(m)}$ is the best guess we have. This means that this will reduce to

$h''(x_i^T b) \left(\frac{h(x_i^T b)}{h(x_i^T b)} - \frac{1 - h(x_i^T b)}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2\left(\frac{h(x_i^T b)}{h(x_i^T b)^2} + \frac{1-h(x_i^T b)}{(1-h(x_i^T b))^2} \right)$

$= - h'(x_i^T b)^2\left(\frac{1}{h(x_i^T b)} + \frac{1}{1-h(x_i^T b)} \right)$

$= -\frac{h'(x_i^T b)^2}{h(x_i^T b)(1-h(x_i^T b))}.$

This means we will use $J$ with

$J_{jk} = -\sum_i x_{ij}x_{ik} \frac{h'(x_i^T b)^2}{h(x_i^T b)(1-h(x_i^T b))}.$

Now let

$W^* = \text{diag}\left(\frac{h'(x_1^T b)^2}{h(x_1^T b)(1-h(x_1^T b))} ,\dots, \frac{h'(x_n^T b)^2}{h(x_n^T b)(1-h(x_n^T b))}\right)$

and note how under the canonical link $h_c' = h_c \cdot (1-h_c)$ reduces $W^*$ to $W$ from the previous section. This lets us write

$J = -X^TW^*X$

except this is now $\hat E(H)$ rather than necessarily being $H$ itself, so this can differ from Newton-Raphson. For all $i$, $W_{ii}^* > 0$, so aside from numerical issues $J$ will be negative definite.
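To see that $J = -X^TW^*X$ really can differ from the observed Hessian under a non-canonical link, the sketch below evaluates both for a probit model at an arbitrary point. The data are simulated for this example only, and `h2` uses the fact that the derivative of the standard normal density $\phi(x)$ is $-x\phi(x)$.

```r
# Observed Hessian vs. expected Hessian J = -X^T W* X for a probit link.
set.seed(4)
n <- 200; p <- 3
X  <- matrix(rnorm(n * p), n, p)
h  <- function(eta) pnorm(eta)            # inverse link
h1 <- function(eta) dnorm(eta)            # h'
h2 <- function(eta) -eta * dnorm(eta)     # h''
y  <- rbinom(n, 1, h(X %*% c(0.5, -1, 0.25)))

observed_hessian <- function(b) {
  eta <- drop(X %*% b); mu <- h(eta)
  # bracketed factor from the second-derivative formula
  a <- h2(eta) * (y / mu - (1 - y) / (1 - mu)) -
       h1(eta)^2 * (y / mu^2 + (1 - y) / (1 - mu)^2)
  t(X) %*% (X * a)
}

expected_hessian <- function(b) {
  eta <- drop(X %*% b); mu <- h(eta)
  w_star <- h1(eta)^2 / (mu * (1 - mu))   # diagonal of W*
  -t(X) %*% (X * w_star)
}

b0     <- c(0.2, 0.1, -0.3)
H_obs  <- observed_hessian(b0)
J_exp  <- expected_hessian(b0)
gap    <- max(abs(H_obs - J_exp))                    # nonzero for probit
J_eigs <- eigen(J_exp, symmetric = TRUE)$values      # negative when J is negative definite
```

The observed Hessian depends on `y` while the expected one does not, which is where the gap comes from.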

We have

$\frac{\partial \ell}{\partial b_j} = \sum_i x_{ij} \frac{h'(x_i^T b)}{h(x_i^T b)(1 - h(x_i^T b))}(y_i - h(x_i^T b))$

so letting our new working response be $z^* = D^{-1}(y-\hat y)$ with $D=\text{diag}\left(h'(x_1^T b), \dots, h'(x_n^T b)\right)$, we have $\nabla \ell = X^TW^*z^*$.

All together we are iterating

$b^{(m+1)} = b^{(m)} + (X^T W_{(m)}^* X)^{-1}X^T W_{(m)}^* z_{(m)}^*$

so this is still a sequence of WLS regressions except now it's not necessarily Newton-Raphson.

I've written it out this way to emphasize the connection to Newton-Raphson, but frequently people will factor the updates so that each new point $b^{(m+1)}$ is itself the WLS solution, rather than a WLS solution added to the current point $b^{(m)}$. If we wanted to do this, we can do the following:

$b^{(m+1)} = b^{(m)} + (X^T W_{(m)}^* X)^{-1}X^T W_{(m)}^* z_{(m)}^*$

$= (X^T W_{(m)}^* X)^{-1}\left(X^T W_{(m)}^* Xb^{(m)}+ X^TW^*_{(m)}z_{(m)}^* \right)$

$= (X^T W_{(m)}^* X)^{-1}X^TW_{(m)}^*\left(Xb^{(m)}+ z_{(m)}^* \right)$

so if we're going this way you'll see the working response take the form $\eta^{(m)} + D^{-1}_{(m)}(y - \hat y^{(m)})$, where $\eta^{(m)} = Xb^{(m)}$ is the current linear predictor, but it's the same thing.
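Here is a minimal sketch of that factored form, where each iterate is itself the WLS fit to the working response $\eta + D^{-1}(y - \hat y)$. It uses a probit link on made-up simulated data and, purely for simplicity, a fixed number of iterations instead of a convergence test.

```r
# IRLS in "factored" form: b^(m+1) is the WLS solution, not an increment.
set.seed(5)
n <- 300; p <- 3
X <- matrix(rnorm(n * p), n, p)
h       <- function(eta) pnorm(eta)
h_prime <- function(eta) dnorm(eta)
y <- rbinom(n, 1, h(X %*% c(0.5, -1, 0.25)))

b <- rep(0, p)
for (m in 1:25) {                          # fixed iteration count (an assumption)
  eta    <- drop(X %*% b)                  # current linear predictor
  y_hat  <- h(eta)
  d      <- h_prime(eta)                   # diagonal of D
  w_star <- d^2 / (y_hat * (1 - y_hat))    # diagonal of W*
  z_work <- eta + (y - y_hat) / d          # eta + D^{-1}(y - y.hat)
  b <- unname(coef(lm(z_work ~ X - 1, weights = w_star)))
}

b_glm    <- unname(coef(glm(y ~ X - 1, family = binomial(link = "probit"))))
fit_diff <- max(abs(b - b_glm))
```

On well-behaved data like this, `b` should land very close to the `glm` estimate, with small differences attributable to `glm`'s own stopping rule.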

Let's confirm that this works by using it to perform a probit regression on the same simulated data as before (and this is not the canonical link, so we need this more general form of IRLS).

my_IRLS_general <- function(x, y, b.init, h, h.prime, tol=1e-8) {
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    y.hat <- h(eta)
    h.prime_eta <- h.prime(eta)
    w_star <- h.prime_eta^2 / (y.hat * (1 - y.hat))
    z_star <- (y - y.hat) / h.prime_eta

    b.new <- b.old + lm(z_star ~ x - 1, weights = w_star)$coef  # WLS

    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

# probit inverse link and derivative
h_probit <- function(x) pnorm(x, 0, 1)
h.prime_probit <- function(x) dnorm(x, 0, 1)

my_IRLS_general(x, y, rep(0,p), h_probit, h.prime_probit)
# x1         x2         x3         x4         x5 
# -0.6456508  1.2520266  0.5820856  0.4982678 -0.6768585 

glm(y~x-1, family=binomial(link="probit"))$coef
# x1         x2         x3         x4         x5 
# -0.6456490  1.2520241  0.5820835  0.4982663 -0.6768581

and again the two agree.

Comments on convergence

Finally, a few quick comments on convergence (I'll keep this brief as this is getting really long and I'm no expert at optimization). Even though theoretically each $J_{(m)}$ is negative definite, bad initial conditions can still prevent this algorithm from converging. In the probit example above, changing the initial conditions to b.init=rep(1,p) results in exactly this kind of failure, and that doesn't even look like a suspicious initial condition. If you step through the IRLS procedure with that initialization and these simulated data, by the second time through the loop there are some $\hat y_i$ that round to exactly $1$ and so the weights become undefined. If we're using the canonical link in the algorithm I gave we won't ever be dividing by $\hat y_i (1 - \hat y_i)$ to get undefined weights, but if we've got a situation where some $\hat y_i$ are approaching $0$ or $1$, such as in the case of perfect separation, then we'll still get non-convergence as the gradient dies without us reaching anything.
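The rounding problem is easy to reproduce in isolation. The sketch below uses a few made-up linear predictor values to show a probit fitted value rounding to exactly $1$, which makes the corresponding $W^*$ entry infinite, and the analogous logit case where the weight rounds to $0$.

```r
# How fitted values that round to 1 break the IRLS weights.
eta   <- c(2, 5, 9)                        # illustrative linear predictors
y_hat <- pnorm(eta)                        # probit fitted values
w_star <- dnorm(eta)^2 / (y_hat * (1 - y_hat))
# y_hat[3] is exactly 1 in double precision, so w_star[3] is a division by zero

hc <- function(x) 1 / (1 + exp(-x))
y_hat_logit <- hc(40)                      # also rounds to exactly 1
w_logit <- y_hat_logit * (1 - y_hat_logit) # weight rounds to 0
z_logit <- (1 - y_hat_logit) / w_logit     # working response for y = 1 is 0/0
```

A production fitter would need some safeguard here (for instance capping iterations or bounding the fitted values away from $0$ and $1$); which safeguard is appropriate is a design choice that this sketch does not settle.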

— jld

+1. I love how detailed your answers often are.

— amoeba says Reinstate Monica

You stated "the coefficient estimates from this converge on a maximum of the logistic regression likelihood." Is that necessarily so, from any initial values?

— Mark L. Stone

@MarkL.Stone ah I was being too casual there, didn't mean to offend the optimization people :) I'll add some more details (and would appreciate your thoughts on them when I do)

— 2018

any chance you watched the link I posted? It seems that video is talking from a machine learning perspective, just optimizing the logistic loss, without talking about the expected Hessian?

— Haitao Du

@hxd1011 in that PDF I linked to (link again: sagepub.com/sites/default/files/upm-binaries/… ), on page 24 the author goes into the theory and explains exactly what makes a link function canonical. I found that PDF extremely helpful when I first came across it (although it took me a while to get through).

— 2018