How to prove that the radial basis function is a kernel?

35

How can one prove that the radial basis function $k(x, y) = \exp\left(-\frac{||x-y||^2}{2\sigma^2}\right)$ is a kernel? As far as I understand, in order to prove this we have to prove either of the following:

1. For any set of vectors $x_1, x_2, ..., x_n$, the matrix $K(x_1, x_2, ..., x_n) = (k(x_i, x_j))_{n \times n}$ is positive semidefinite.
2. A mapping $\Phi$ can be presented such that $k(x, y) = \langle\Phi(x), \Phi(y)\rangle$.

Any help?

svm kernel-trick

— Leo

1

Just to link it more obviously: the feature map is also discussed in this question, particularly Marc Claesen's answer based on Taylor series and mine which discusses both the RKHS and the general version of the $L_2$ embedding given by Douglas below.

— Dougal

26

Zen used method 1. Here is method 2: Map $x$ to a spherically symmetric Gaussian distribution centered at $x$ in the Hilbert space $L^2$. The standard deviation and a constant factor have to be tweaked for this to work exactly. For example, in one dimension,

$$\int_{-\infty}^\infty \frac{\exp[-(x-z)^2/(2\sigma^2)]}{\sqrt{2 \pi} \sigma} \, \frac{\exp[-(y-z)^2/(2 \sigma^2)]}{\sqrt{2 \pi} \sigma} \, dz = \frac{\exp [-(x-y)^2/(4 \sigma^2)]}{2 \sqrt \pi \sigma}.$$

So, use a standard deviation of $\sigma/\sqrt 2$ and scale the Gaussian distribution to get $k(x,y) = \langle \Phi(x), \Phi(y)\rangle$. This last rescaling occurs because the $L^2$ norm of a normal distribution is not $1$ in general.
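As a rough numerical sanity check of this construction in one dimension, the sketch below approximates the $L^2$ inner product with a Riemann sum on a grid. The function names, the grid limits, and the normalising constant $\sqrt{2\sqrt\pi\, s}$ with $s = \sigma/\sqrt 2$ are my own choices worked out from the integral above, not something stated in the answer.

```python
import numpy as np

def rbf(x, y, sigma):
    """Gaussian RBF kernel for scalars: exp(-(x - y)^2 / (2 sigma^2))."""
    return float(np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2)))

def feature_map(x, z, sigma):
    """Phi(x) sampled on the grid z: a rescaled Gaussian density centered at x.

    Assumption: standard deviation s = sigma / sqrt(2), and a constant
    factor chosen so that <Phi(x), Phi(x)> = 1.
    """
    s = sigma / np.sqrt(2.0)
    density = np.exp(-(z - x) ** 2 / (2.0 * s ** 2)) / (np.sqrt(2.0 * np.pi) * s)
    # The density has squared L2 norm 1 / (2 sqrt(pi) s), so rescale by its inverse root.
    return np.sqrt(2.0 * np.sqrt(np.pi) * s) * density

def l2_inner(f, g, dz):
    """Riemann-sum approximation of the L2 inner product of two sampled functions."""
    return float(np.sum(f * g) * dz)

def check_l2_embedding(x, y, sigma, half_width=20.0, num=40001):
    """Return (approximate <Phi(x), Phi(y)>, exact k(x, y))."""
    z = np.linspace(-half_width, half_width, num)
    dz = z[1] - z[0]
    inner = l2_inner(feature_map(x, z, sigma), feature_map(y, z, sigma), dz)
    return inner, rbf(x, y, sigma)

if __name__ == "__main__":
    approx, exact = check_l2_embedding(0.3, -1.1, sigma=0.8)
    print(approx, exact)
```

The two printed numbers should agree to many digits as long as the grid comfortably covers both Gaussian bumps; a grid that is too narrow or too coarse would break the agreement.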

— Douglas Zare

2

@Zen, Douglas Zare: thank you for your great answers. How am I supposed to select the official answer now?

— Leo

23

I will use method 1. Check Douglas Zare's answer for a proof using method 2.

I will prove the case when $x,y$ are real numbers, so $k(x,y)=\exp(-(x-y)^2/(2\sigma^2))$. The general case follows mutatis mutandis from the same argument, and is worth doing.

Without loss of generality, suppose that $\sigma^2=1$.

Write $k(x,y)=h(x-y)$, where

$$h(t)=\exp\left(-\frac{t^2}{2}\right)=\mathrm{E}\left[e^{itZ}\right]$$

is the characteristic function of a random variable $Z$ with $N(0,1)$ distribution.

For real numbers $x_1,\dots,x_n$ and $a_1,\dots,a_n$, we have

$$\sum_{j,k=1}^n a_j\,a_k\,h(x_j-x_k) = \sum_{j,k=1}^n a_j\,a_k\,\mathrm{E} \left[ e^{i(x_j-x_k)Z}\right] = \mathrm{E} \left[ \sum_{j,k=1}^n a_j\,e^{i x_j Z}\,a_k\,e^{-i x_k Z}\right] = \mathrm{E}\left[ \left| \sum_{j=1}^n a_j\,e^{i x_j Z}\right|^2\right] \geq 0 \, ,$$

which entails that $k$ is a positive semidefinite function, aka a kernel.

To understand this result in greater generality, check out Bochner's Theorem: http://en.wikipedia.org/wiki/Positive-definite_function
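For vector inputs, the same argument appears to go through with $Z \sim N(0, I_d)$, since $\mathrm{E}\left[e^{i t^T Z}\right] = \exp\left(-\lVert t \rVert^2/2\right)$ for $t \in \mathbb R^d$. The sketch below checks this by Monte Carlo with $\sigma^2 = 1$: it averages $e^{i x_j^T Z} e^{-i x_k^T Z}$ over samples of $Z$ and compares the result with the exact kernel matrix. The function names, sample size, and seed are illustrative assumptions, and the agreement is only up to sampling noise.

```python
import numpy as np

def empirical_kernel(X, Z):
    """Monte Carlo estimate of E[exp(i (x_j - x_k)^T Z)] for all pairs of rows of X."""
    proj = X @ Z.T                # shape (n, M): entries x_j^T z_m
    feats = np.exp(1j * proj)     # complex features e^{i x_j^T z_m}
    # Average of e^{i x_j^T z} * conj(e^{i x_k^T z}) over the samples; keep the real part.
    return (feats @ feats.conj().T).real / Z.shape[0]

def exact_kernel(X):
    """RBF Gram matrix with sigma^2 = 1: exp(-||x_j - x_k||^2 / 2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / 2.0)

def compare(n=5, d=3, num_samples=200_000, seed=0):
    """Return (max abs error of the estimate, min eigenvalue of the estimate)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    Z = rng.normal(size=(num_samples, d))   # samples of Z ~ N(0, I_d)
    K_hat = empirical_kernel(X, Z)
    K = exact_kernel(X)
    return float(np.abs(K_hat - K).max()), float(np.linalg.eigvalsh(K_hat).min())

if __name__ == "__main__":
    print(compare())
```

The estimated matrix is a Gram matrix of explicit features, so it is positive semidefinite by construction; this is essentially the idea behind random Fourier features.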

— Zen

2

This is a good start, in the right direction, with two caveats: (a) $h(t)$ is not equal to the expectation shown (check the sign in the exponent) and (b) this appears to restrict attention to the case where $x$ and $y$ are scalars and not vectors. I've upvoted in the meantime, because the exposition is nice and clean and I'm sure you'll quickly plug these small gaps. :-)

— cardinal

1

Tks! I'm in a hurry here. :-)

— Zen

1

Excuse me, I really don't see how you manage the mutatis mutandis here. If you develop the norm before passing to the $h$ form, then you get products and you can't swap products and sum. And I simply don't see how to develop the norm after passing to the $h$ form to obtain a nice expression. Can you lead me a bit there? :)

— Alburkerk

23

I'll add a third method, just for variety: building up the kernel from a sequence of general steps known to create pd kernels. Let $\mathcal X$ denote the domain of the kernels below and $\varphi$ the feature maps.

Scalings: If $\kappa$ is a pd kernel, so is $\gamma \kappa$ for any constant $\gamma > 0$.

Proof: if $\varphi$ is the feature map for $\kappa$, $\sqrt\gamma \varphi$ is a valid feature map for $\gamma \kappa$.

Sums: If $\kappa_1$ and $\kappa_2$ are pd kernels, so is $\kappa_1 + \kappa_2$.

Proof: Concatenate the feature maps $\varphi_1$ and $\varphi_2$, to get $x \mapsto \begin{bmatrix}\varphi_1(x) \\ \varphi_2(x)\end{bmatrix}$.

Limits: If $\kappa_1, \kappa_2, \dots$ are pd kernels, and $\kappa(x, y) := \lim_{n \to \infty} \kappa_n(x, y)$ exists for all $x, y$, then $\kappa$ is pd.

Proof: For each $m, n \ge 1$ and every $\{ (x_i, c_i) \}_{i=1}^m \subseteq \mathcal{X} \times \mathbb R$ we have that $\sum_{i,j=1}^m c_i \kappa_n(x_i, x_j) c_j \ge 0$. Taking the limit as $n \to \infty$ gives the same property for $\kappa$.

Products: If $\kappa_1$ and $\kappa_2$ are pd kernels, so is $g(x, y) = \kappa_1(x, y) \, \kappa_2(x, y)$.

Proof: It follows immediately from the Schur product theorem, but Schölkopf and Smola (2002) give the following nice, elementary proof. Let

$$(V_1, \dots, V_m) \sim \mathcal{N}\left( 0, \left[ \kappa_1(x_i, x_j) \right]_{ij} \right), \qquad (W_1, \dots, W_m) \sim \mathcal{N}\left( 0, \left[ \kappa_2(x_i, x_j) \right]_{ij} \right)$$

be independent. Thus

$$\mathrm{Cov}(V_i W_i, V_j W_j) = \mathrm{Cov}(V_i, V_j) \,\mathrm{Cov}(W_i, W_j) = \kappa_1(x_i, x_j) \, \kappa_2(x_i, x_j).$$

Covariance matrices must be psd, so considering the covariance matrix of $(V_1 W_1, \dots, V_m W_m)$ proves it.

Powers: If $\kappa$ is a pd kernel, so is $\kappa^n(x, y) := \kappa(x, y)^n$ for any positive integer $n$.

Proof: immediate from the "products" property.

Exponents: If $\kappa$ is a pd kernel, so is $e^\kappa(x, y) := \exp(\kappa(x, y))$.

Proof: We have $e^\kappa(x, y) = \lim_{N \to \infty} \sum_{n=0}^N \frac{1}{n!} \kappa(x, y)^n$; use the "powers", "scalings", "sums", and "limits" properties. (The $n = 0$ term is the constant kernel $1$, which is pd via the constant feature map $x \mapsto 1$.)

Functions: If $\kappa$ is a pd kernel and $f : \mathcal X \to \mathbb R$, $g(x, y) := f(x) \kappa(x, y) f(y)$ is as well.

Proof: Use the feature map $x \mapsto f(x) \varphi(x)$.
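The closure properties above can be spot-checked numerically on Gram matrices. The sketch below builds two random psd matrices as stand-ins for $[\kappa_1(x_i, x_j)]_{ij}$ and $[\kappa_2(x_i, x_j)]_{ij}$ and tests each construction via its smallest eigenvalue. The helper names, matrix sizes, and tolerance are assumptions for illustration, and a numerical check on one example is of course not a proof.

```python
import numpy as np

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(K).min())

def is_psd(K, tol=1e-9):
    """True if K is symmetric and psd up to a relative numerical tolerance."""
    scale = max(1.0, float(np.abs(K).max()))
    return bool(np.allclose(K, K.T) and min_eig(K) >= -tol * scale)

def demo_closure(seed=0, m=6):
    """Check each closure property on one pair of random psd Gram matrices."""
    rng = np.random.default_rng(seed)
    A = 0.5 * rng.normal(size=(m, 4))
    B = 0.5 * rng.normal(size=(m, 2))
    K1, K2 = A @ A.T, B @ B.T          # Gram matrices of linear feature maps
    f = rng.normal(size=m)             # values f(x_1), ..., f(x_m)
    return {
        "scalings": is_psd(3.0 * K1),
        "sums": is_psd(K1 + K2),
        "products": is_psd(K1 * K2),   # elementwise (Schur) product
        "powers": is_psd(K1 ** 3),     # elementwise power
        "exponents": is_psd(np.exp(K1)),  # elementwise exponential
        "functions": is_psd(f[:, None] * K1 * f[None, :]),
    }

if __name__ == "__main__":
    print(demo_closure())
```

Note that `*`, `**`, and `np.exp` act entrywise here, which is what the "products", "powers", and "exponents" properties refer to; the matrix product or matrix exponential would be a different operation.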

Now, note that

$$\begin{aligned} k(x, y) &= \exp\left( - \tfrac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right) \\ &= \exp\left( - \tfrac{1}{2 \sigma^2} \lVert x \rVert^2 \right) \exp\left( \tfrac{1}{\sigma^2} x^T y \right) \exp\left( - \tfrac{1}{2 \sigma^2} \lVert y \rVert^2 \right) . \end{aligned}$$

Start with the linear kernel $\kappa(x, y) = x^T y$, apply "scalings" with $\frac{1}{\sigma^2}$, apply "exponents", and apply "functions" with $x \mapsto \exp\left( - \tfrac{1}{2 \sigma^2} \lVert x \rVert^2 \right)$.
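To see the chain of steps concretely, the sketch below builds the RBF Gram matrix in the order just described (linear kernel, then "scalings", "exponents", "functions") and compares it with the direct formula. The function names and test data are illustrative assumptions.

```python
import numpy as np

def rbf_gram_direct(X, sigma):
    """Gram matrix of exp(-||x - y||^2 / (2 sigma^2)) computed from pairwise distances."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def rbf_gram_constructed(X, sigma):
    """Same Gram matrix, assembled through the kernel-building steps."""
    linear = X @ X.T                    # linear kernel x^T y
    scaled = linear / sigma ** 2        # "scalings" with 1 / sigma^2
    exped = np.exp(scaled)              # "exponents" (elementwise exp)
    f = np.exp(-np.sum(X ** 2, axis=1) / (2.0 * sigma ** 2))
    return f[:, None] * exped * f[None, :]   # "functions" with f(x) = exp(-||x||^2 / (2 sigma^2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 3))
    K_direct = rbf_gram_direct(X, sigma=1.3)
    K_built = rbf_gram_constructed(X, sigma=1.3)
    print(np.allclose(K_direct, K_built))
    print(np.linalg.eigvalsh(K_built).min())
```

For inputs with large norms, the intermediate `np.exp(scaled)` can overflow even though the final kernel values lie in $(0, 1]$, so the direct formula is the safer one to use in practice.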

— Dougal