Information gain, mutual information and related measures


Andrew More defines information gain as:

$IG(Y|X) = H(Y) - H(Y|X)$

where $H(Y|X)$ is the conditional entropy. However, Wikipedia calls the above quantity mutual information.

On the other hand, Wikipedia defines information gain as the Kullback–Leibler divergence (also known as information divergence or relative entropy) between two random variables:

$D_{KL}(P||Q) = H(P,Q) - H(P)$

where $H(P,Q)$ is defined as the cross-entropy.

These two definitions seem to be inconsistent with each other.

I have also seen other authors talk about two additional related concepts, namely differential entropy and relative information gain.

What is the precise definition of, or relationship between, these quantities? Is there a good textbook that covers them all?

Information gain
Mutual information
Cross-entropy
Conditional entropy
Differential entropy
Relative information gain

information-theory

— Amelio Vazquez-Reina

To add to the confusion, note that the notation you used for cross-entropy is also the same as the one used for joint entropy. I have used $H^x(P, Q)$ for cross-entropy to avoid confusing myself, but that is for my own benefit and I have never seen this notation anywhere else.

— Michael McGowan


I think that calling the Kullback–Leibler divergence "information gain" is non-standard.

The first definition is standard.

EDIT: However, $H(Y)-H(Y|X)$ can also be called mutual information.
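For what it's worth, here is a minimal sketch of how that first definition is typically estimated from paired samples, for instance when scoring a decision-tree split. The toy attribute values and labels and the helper names `entropy` and `information_gain` are made up for illustration and are not taken from any particular library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Empirical Shannon entropy (in bits) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(xs, ys):
    """Estimate IG(Y|X) = H(Y) - H(Y|X) from paired samples (xs[i], ys[i])."""
    n = len(ys)
    # H(Y|X) = sum over x of p(x) * H(Y | X = x)
    h_y_given_x = 0.0
    for x in set(xs):
        ys_for_x = [y for xi, y in zip(xs, ys) if xi == x]
        h_y_given_x += (len(ys_for_x) / n) * entropy(ys_for_x)
    return entropy(ys) - h_y_given_x


# Hypothetical toy data: X is an attribute, Y is a yes/no label.
xs = ["sunny", "sunny", "rainy", "rainy"]

# Case 1: X fully determines Y, so knowing X removes all uncertainty about Y.
ig_perfect = information_gain(xs, ["yes", "yes", "no", "no"])

# Case 2: Y is split evenly within each value of X, so X tells us nothing.
ig_useless = information_gain(xs, ["yes", "no", "yes", "no"])

print(ig_perfect, ig_useless)
```

In the first case the gain equals the full entropy of the label (one bit), and in the second it is zero, which is exactly the behaviour one expects of mutual information as well.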

Note that I don't think you will find any scientific discipline that really has a standardized, precise and consistent naming scheme. So you will always have to look at the formulas, because they will generally give you a better idea.

Textbooks: see "Good introduction into different kinds of entropy".

Also: Cosma Shalizi: Methods and Techniques of Complex Systems Science: An Overview, Chapter 1 (pp. 33–114) in Thomas S. Deisboeck and J. Yasha Kresh (eds.), Complex Systems Science in Biomedicine http://arxiv.org/abs/nlin.AO/0307015

Robert M. Gray: Entropy and Information Theory http://ee.stanford.edu/~gray/it.html

David MacKay: Information Theory, Inference, and Learning Algorithms http://www.inference.phy.cam.ac.uk/mackay/itila/book.html

Also, "What is 'entropy and information gain'?"

— wolf.rauch

Thanks @wolf. I'm inclined to accept this answer. If the first definition is standard, how would you define mutual information?

— Amelio Vazquez-Reina


Sorry. The first quantity, $IG(Y|X)=H(Y)-H(Y|X)$, is also often called mutual information. This is a case of inconsistent naming. As I said, I don't think there is any consistent, unambiguous, one-to-one correspondence between concepts and names. For example, "mutual information" or "information gain" is a special case of the KL divergence, so that Wikipedia article is not that far off.

— wolf.rauch


The Kullback–Leibler divergence between $P(X,Y)$ and $P(X)P(Y)$ is the same as the mutual information, which can easily be derived:

$\begin{aligned} I(X; Y) &= H(Y) - H(Y \mid X)\\ &= - \sum_y p(y) \log p(y) + \sum_{x,y} p(x) p(y\mid x) \log p(y\mid x)\\ &= \sum_{x,y} p(x, y) \log{p(y\mid x)} - \sum_{y} \left(\sum_{x}p(x,y)\right) \log p(y)\\ &= \sum_{x,y} p(x, y) \log{p(y\mid x)} - \sum_{x,y}p(x, y) \log p(y)\\ &= \sum_{x,y} p(x, y) \log \frac{p(y\mid x)}{p(y)}\\ &= \sum_{x,y} p(x, y) \log \frac{p(y\mid x)p(x)}{p(y)p(x)}\\ &= \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(y)p(x)}\\ &= D_{KL} (P(X,Y)\,||\,P(X)P(Y)) \end{aligned}$

Note: $p(y) = \sum_x p(x,y)$
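The identity can also be checked numerically. The snippet below is only a sanity check on one small joint distribution whose probabilities are arbitrary values chosen for illustration; it computes $H(Y) - H(Y \mid X)$ and the KL divergence between the joint and the product of marginals separately and compares them.

```python
from math import log2

# Hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals: p(x) = sum_y p(x, y) and p(y) = sum_x p(x, y).
p_x = {}
p_y = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# H(Y) = -sum_y p(y) log p(y)
h_y = -sum(p * log2(p) for p in p_y.values())

# H(Y|X) = -sum_{x,y} p(x, y) log p(y|x), with p(y|x) = p(x, y) / p(x)
h_y_given_x = -sum(p * log2(p / p_x[x]) for (x, y), p in joint.items())

# Left-hand side of the derivation: I(X;Y) = H(Y) - H(Y|X)
mi_from_entropies = h_y - h_y_given_x

# Right-hand side: D_KL( p(x, y) || p(x) p(y) )
kl_joint_vs_product = sum(
    p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items()
)

print(mi_from_entropies, kl_joint_vs_product)
```

The two printed values agree up to floating-point rounding. All probabilities here are strictly positive, so no special handling of $0 \log 0$ terms is needed in this sketch.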

— chris elgoog

Mutual information can be defined using the Kullback–Leibler divergence as

$\begin{aligned} I(X;Y) = D_{KL}(p(x,y)\,||\,p(x)p(y)). \end{aligned}$

— yters

An applied example: extracting mutual information from textual datasets as a feature for training a machine learning model (the task was to predict the age, gender and personality of bloggers).

— Krebto

Both definitions are correct and consistent. I'm not sure what you find unclear, since you raise several points that might need clarification.

Firstly: $MI_{MutualInformation}\equiv IG_{InformationGain}\equiv I_{Information}$ are all different names for the same thing. In different contexts one of these names may be preferable; from here on I will call it Information.

The second point is the relation between the Kullback–Leibler divergence, $D_{KL}$, and Information. The Kullback–Leibler divergence is simply a measure of dissimilarity between two distributions. Information can be defined in terms of this dissimilarity between distributions (see yters' response). So Information is a special case of $D_{KL}$, where $D_{KL}$ is applied to measure the difference between the actual joint distribution of two variables (which captures their dependence) and the hypothetical joint distribution of the same variables, were they independent. We call that quantity Information.
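A small worked example may help here (the two fair bits are my own illustrative choice, with logarithms taken base 2). Let $X$ be a fair bit and $Y = X$, so $p(0,0)=p(1,1)=\tfrac12$ while $p(x)p(y)=\tfrac14$ for every pair. Terms with $p(x,y)=0$ contribute nothing, by the usual convention $0\log 0 = 0$:

$\begin{aligned} D_{KL}(p(x,y)\,||\,p(x)p(y)) &= \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)}\\ &= \tfrac12 \log_2 \frac{1/2}{1/4} + \tfrac12 \log_2 \frac{1/2}{1/4}\\ &= 1 \text{ bit} \end{aligned}$

If instead $X$ and $Y$ are independent, then $p(x,y)=p(x)p(y)$, every log-ratio is $\log_2 1 = 0$, and the divergence (hence the Information) is zero.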

The third point to clarify is the inconsistent, though standard, notation being used: $\operatorname{H}(X,Y)$ denotes both joint entropy and cross-entropy.

So, for example, in the definition of Information:

$\begin{aligned}\operatorname{I}(X;Y)&\equiv \operatorname{H}(X)-\operatorname{H}(X|Y)\\&\equiv \operatorname{H}(Y)-\operatorname{H}(Y|X)\\&\equiv \operatorname{H}(X)+\operatorname{H}(Y)-\operatorname{H}(X,Y)\\&\equiv \operatorname{H}(X,Y)-\operatorname{H}(X|Y)-\operatorname{H}(Y|X)\end{aligned}$

in the last two lines, $\operatorname{H}(X,Y)$ is the joint entropy. This may seem inconsistent with the definition on the Information gain page, however:

$D_{KL}(P||Q)=H(P,Q)-H(P)$

but you did not fail to quote the important clarification: $\operatorname{H}(P,Q)$ is being used there as the cross-entropy (as is also the case on the cross-entropy page).

Joint-entropy and Cross-entropy are NOT the same.
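To make the difference concrete, here is a rough numerical sketch. The distributions `p` and `q` are arbitrary example values, and treating them as the marginals of two independent variables when building the joint is an assumption made purely for illustration. Cross-entropy compares two distributions over the same outcomes, whereas joint entropy is the ordinary entropy of a single distribution over pairs of outcomes.

```python
from math import log2

# Hypothetical example distributions over the same two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]


def entropy(dist):
    """Shannon entropy in bits of one distribution."""
    return -sum(pi * log2(pi) for pi in dist if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy -sum_i p_i log q_i: two distributions, one variable."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Joint entropy: ONE distribution over pairs (x, y).
# Assumption for illustration: X ~ p and Y ~ q are independent,
# so the joint is the product of the marginals.
joint = [px * qy for px in p for qy in q]
joint_entropy = entropy(joint)

ce = cross_entropy(p, q)
kl = kl_divergence(p, q)

print("cross-entropy:", ce)
print("joint entropy:", joint_entropy)
print("KL:", kl, "cross-entropy minus H(p):", ce - entropy(p))
```

The relation $D_{KL}(P||Q) = H(P,Q) - H(P)$ holds only when $H(P,Q)$ is read as the cross-entropy; the joint entropy computed above is a different number.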

Check out this and this, where this ambiguous notation is addressed and a distinct notation for cross-entropy is offered: $H_q(p)$

I would hope to see this notation accepted and the wiki-pages updated.

— אלימלך שרייבר

I wonder why the equations are not displayed properly.

— Shaohua Li