Why is there a -1 in the density function of the beta distribution?

The beta distribution appears under two parameterizations:

$f(x) \propto x^{\alpha} (1-x)^{\beta} \tag{1}$

or the one that seems to be used more commonly

$f(x) \propto x^{\alpha-1} (1-x)^{\beta-1} \tag{2}$

But why exactly is there a "$-1$" in the second formula?

The first formulation intuitively seems to correspond more directly to the binomial distribution

$g(k) \propto p^k (1-p)^{n-k} \tag{3}$

but "seen" from the perspective of $p$. This is especially clear in the beta-binomial model, where $\alpha$ can be understood as a prior number of successes and $\beta$ as a prior number of failures.

So why exactly did the second form gain popularity, and what is the rationale behind it? What are the consequences of using either parameterization (for example, for the connection with the binomial distribution)?

It would be great if someone could also point to the origins of this choice and the initial arguments for it, but that is not a necessity for me.

— Tim

A deep reason is hinted at in this answer: $f$ equals $x^\alpha(1-x)^\beta$ relative to the measure $d\mu=dx/(x(1-x))$. That reduces your question to "why that particular measure?" Recognizing that this measure is $d\mu=d\left(\log\left(\frac{x}{1-x}\right)\right)$ suggests that the "right" way to understand these distributions is to apply the logistic transformation: the "$-1$" terms will then disappear.
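As a rough numerical check of this remark (a sketch, not part of the original comment): if $X$ has the conventional density $(2)$ and $Y=\log(X/(1-X))$, then $dx = x(1-x)\,dy$, so the density of $Y$ is proportional to $\sigma(y)^{\alpha}(1-\sigma(y))^{\beta}$ with $\sigma(y)=1/(1+e^{-y})$, with no "$-1$" in the exponents. The helper names and the values $\alpha=2$, $\beta=3$ below are illustrative assumptions. The code integrates that kernel over $y$ and compares the result with $B(\alpha,\beta)=\Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$.

```python
import math

def logit_scale_kernel(y, alpha, beta):
    """Kernel of Y = logit(X) for X ~ Beta(alpha, beta): sigma^alpha * (1 - sigma)^beta."""
    s = 1.0 / (1.0 + math.exp(-y))
    return s ** alpha * (1.0 - s) ** beta

def integrate_midpoint(func, lo, hi, steps):
    """Plain midpoint rule; adequate for this smooth, fast-decaying integrand."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))

def beta_function(alpha, beta):
    """Euler's Beta function expressed through the Gamma function."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

alpha, beta = 2.0, 3.0  # illustrative values
numeric = integrate_midpoint(
    lambda y: logit_scale_kernel(y, alpha, beta), -40.0, 40.0, 20000
)
# The two printed numbers should agree closely.
print(numeric, beta_function(alpha, beta))
```

The agreement indicates that the normalizing constant on the logit scale is the ordinary Beta function, with the parameters appearing directly as exponents.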

— whuber

I think the real reason it happened is historical: it appears that way in the beta function for which the distribution is named. As for why that has $-1$ in the power, I expect it is ultimately connected to the reason mentioned above (even though historically it had nothing to do with measure or even probability).

— Glen_b -Reinstate Monica

@Glen_b It's more than historical: there are profound reasons. They are due to the intimate connection between the Beta and Gamma functions, reducing the question to why the exponent in $\Gamma(s)=\int_0^\infty t^{s-1}e^{-t}dt$ is $s-1$ and not $s$. That is because $\Gamma$ is a Gauss sum. Equivalently, it is "right" to view $\Gamma$ as an integral of a multiplicative homomorphism $t\to t^s$ times an additive character $t\to e^{-t}$ against the Haar measure $dt/t$ on the multiplicative group $\mathbb{R}^{\times}$.

— whuber

@wh That's a good reason why the gamma function should be chosen that way (and I already suggested above that such a reason existed, and I accept that some form of reasoning akin to that, though necessarily with a different formalism, went into Euler's choice); correspondingly compelling reasons arise with the density. But that doesn't establish that this was in fact the reason for the choice (why the form was chosen as it was), only that it's a good reason to do so. The form of the gamma function ... ctd

— Glen_b -Reinstate Monica

ctd ... on its own could easily be reason enough to choose that form for the density, and for others to follow suit. [Often choices are made for simpler reasons than the ones we can identify afterward, and then it often takes compelling reasons to do anything else. Do we know that's why it was chosen in the first place?] You explain clearly that there is a reason why we should choose the density to be that way, rather than why it is that way. The latter involves a sequence of people making choices (to use it that way, and to follow suit), and their reasons at the time they chose.

— Glen_b -Reinstate Monica

Answers:

This is a story about degrees of freedom and statistical parameters, and why it is nice that the two have a simple, direct connection.

Historically, the "$-1$" terms appear in Euler's studies of the Beta function. He was using that parameterization by 1763, and so was Adrien-Marie Legendre: their usage established the subsequent mathematical convention. This work antedates all known statistical applications.

Modern mathematical theory provides ample indications, through the wealth of applications in analysis, number theory, and geometry, that the "$-1$" terms actually have some meaning. I have sketched some of those reasons in comments to the question.

Of more interest is what the "right" statistical parameterization ought to be. That is not quite as clear, and it does not have to be the same as the mathematical convention. There is a huge web of well-known, interrelated families of probability distributions. Thus, the conventions used to name (that is, parameterize) one family typically imply related conventions for naming related families. Change one parameterization and you will want to change them all. We might therefore look at these relationships for clues.

Few people would disagree that the most important distribution families derive from the Normal family. Recall that a random variable $X$ is said to be "Normally distributed" when $(X-\mu)/\sigma$ has a probability density $f(x)$ proportional to $\exp(-x^2/2)$. When $\sigma=1$ and $\mu=0$, $X$ is said to have a standard Normal distribution.

Many datasets $x_1, x_2, \ldots, x_n$ are studied using relatively simple statistics involving rational combinations of the data and low powers (typically squares). When those data are modeled as random samples from a Normal distribution (so that each $x_i$ is viewed as a realization of a Normal variable $X_i$, all the $X_i$ share a common distribution, and they are independent), the distributions of those statistics are determined by that Normal distribution. The ones that arise most often in practice are

1. $t_\nu$, the Student $t$ distribution with $\nu = n-1$ "degrees of freedom." This is the distribution of the statistic $t = \frac{\bar X}{\operatorname{se}(X)}$, where $\bar X = (X_1 + X_2 + \cdots + X_n)/n$ models the mean of the data and $\operatorname{se}(X) = (1/\sqrt{n})\sqrt{(X_1^2+X_2^2 + \cdots + X_n^2 - n\bar X^2)/(n-1)}$ is the standard error of the mean. The division by $n-1$ shows that $n$ must be $2$ or greater, whence $\nu$ is an integer $1$ or greater. The formula, although apparently a little complicated, is the square root of a rational function of the data of degree two: it is relatively simple.
2. $\chi^2_\nu$, the $\chi^2$ (chi-squared) distribution with $\nu$ "degrees of freedom" (d.f.). This is the distribution of the sum of squares of $\nu$ independent standard Normal variables. The distribution of the mean of the squares of these variables will therefore be a $\chi^2$ distribution scaled by $1/\nu$: I will refer to this as a "normalized" $\chi^2$ distribution.
3. $F_{\nu_1, \nu_2}$, the $F$ ratio distribution with parameters $(\nu_1, \nu_2)$, is the distribution of the ratio of two independent normalized $\chi^2$ variables with $\nu_1$ and $\nu_2$ degrees of freedom.

Mathematical calculations show that all three of these distributions have densities. Importantly, the density of the $\chi^2_\nu$ distribution is proportional to the integrand in Euler's integral definition of the Gamma ( $\Gamma$ ) function. Let's compare them:

$f_{\chi^2_\nu}(2x) \propto x^{\nu/2 - 1}e^{-x};\quad f_{\Gamma(\nu)}(x) \propto x^{\nu-1}e^{-x}.$

This shows that one-half of a $\chi^2_\nu$ variable has a Gamma distribution with parameter $\nu/2$. The factor of one-half is bothersome enough, but subtracting $1$ would make the relationship much worse. This already supplies a compelling answer to the question: if we want the parameter of a $\chi^2$ distribution to count the number of squared Normal variables that produce it (up to a factor of $1/2$), then the exponent in its density function must be one less than half that count.
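The comparison of densities can be checked numerically. The following is a minimal sketch using only the standard library, with function names chosen here for illustration. If $Y\sim\chi^2_\nu$, the change of variables $X=Y/2$ gives the density $2f_{\chi^2_\nu}(2x)$, which should coincide with the unit-scale Gamma$(\nu/2)$ density.

```python
import math

def chi2_pdf(y, nu):
    """Density of a chi-squared variable with nu degrees of freedom."""
    k = nu / 2.0
    return y ** (k - 1.0) * math.exp(-y / 2.0) / (2.0 ** k * math.gamma(k))

def gamma_pdf(x, shape):
    """Density of a Gamma(shape) variable with unit scale."""
    return x ** (shape - 1.0) * math.exp(-x) / math.gamma(shape)

def half_chi2_pdf(x, nu):
    """Density of Y / 2 when Y ~ chi-squared(nu), by change of variables."""
    return 2.0 * chi2_pdf(2.0 * x, nu)

# Each row prints the two densities side by side; they should match.
for nu in (1, 2, 5):
    for x in (0.3, 1.0, 2.5):
        print(nu, x, half_chi2_pdf(x, nu), gamma_pdf(x, nu / 2.0))
```

Note that the Gamma parameter is $\nu/2$, while the exponent of $x$ inside `gamma_pdf` is one less than that.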

Why is the factor of $1/2$ less troublesome than a difference of $1$ ? The reason is that the factor will remain consistent when we add things up. If the sum of squares of $n$ independent standard Normals is proportional to a Gamma distribution with parameter $n$ (times some factor), then the sum of squares of $m$ independent standard Normals is proportional to a Gamma distribution with parameter $m$ (times the same factor), whence the sum of squares of all $n+m$ variables is proportional to a Gamma distribution with parameter $m+n$ (still times the same factor). The fact that adding the parameters so closely emulates adding the counts is very helpful.
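The additivity described here can be illustrated numerically. The sketch below (standard library only; the function names, shapes, and step count are illustrative choices) convolves the Gamma$(a)$ and Gamma$(b)$ densities in the conventional parameterization and compares the result with the Gamma$(a+b)$ density.

```python
import math

def gamma_pdf(x, shape):
    """Conventional unit-scale Gamma density: x^(shape-1) e^(-x) / Gamma(shape)."""
    return x ** (shape - 1.0) * math.exp(-x) / math.gamma(shape)

def convolved_pdf(x, a, b, steps=20000):
    """Density of U + V at x for independent U ~ Gamma(a), V ~ Gamma(b) (midpoint rule)."""
    h = x / steps
    return h * sum(
        gamma_pdf((i + 0.5) * h, a) * gamma_pdf(x - (i + 0.5) * h, b)
        for i in range(steps)
    )

a, b = 1.5, 2.0  # shapes nu/2 for nu = 3 and nu = 4 degrees of freedom
for x in (0.5, 2.0, 6.0):
    # The convolution should agree with the Gamma(a + b) density up to quadrature error.
    print(x, convolved_pdf(x, a, b), gamma_pdf(x, a + b))
```

The parameters add ($a+b$) even though the exponents do not: $(a-1)+(b-1)\neq(a+b)-1$.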

If, however, we were to remove that pesky-looking "$-1$" from the mathematical formulas, these nice relationships would become more complicated. For example, suppose we changed the parameterization of Gamma distributions to refer to the actual power of $x$ in the formula. Suppressing the factor of $1/2$ as in the previous paragraph, a $\chi^2_1$ distribution would then be related to a "Gamma$(0)$" distribution (since the power of $x$ in its PDF would be $1-1=0$), while the sum of three $\chi^2_1$ distributions would have to be called a "Gamma$(2)$" distribution (since $3-1=2$) rather than the "Gamma$(0)$" that adding the parameters would suggest. In short, the close additive relationship between degrees of freedom and the parameter in Gamma distributions would be lost by removing the $-1$ from the formula and absorbing it in the parameter.

Similarly, the probability density function of an $F$ ratio distribution is closely related to Beta distributions. Indeed, when $Y$ has an $F$ ratio distribution with parameters $(\nu_1, \nu_2)$, the variable $Z=\nu_1 Y/(\nu_1 Y + \nu_2)$ has a Beta$(\nu_1/2, \nu_2/2)$ distribution. Its density function is proportional to

$f_Z(z) \propto z^{\nu_1/2 - 1}(1-z)^{\nu_2/2-1}.$
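A small simulation can illustrate this transformation. This is a sketch under assumed values of the sample size, seed, and degrees of freedom, using only the standard library; the function names are illustrative. It builds $F$ ratios from sums of squared standard Normals, maps them to $Z$, and compares the sample mean of $Z$ with the Beta$(\nu_1/2,\nu_2/2)$ mean, which is $\nu_1/(\nu_1+\nu_2)$.

```python
import random

def chi2_draw(rng, nu):
    """Sum of squares of nu independent standard Normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))

def f_ratio_draw(rng, nu1, nu2):
    """Ratio of two independent normalized chi-squared draws."""
    return (chi2_draw(rng, nu1) / nu1) / (chi2_draw(rng, nu2) / nu2)

def to_beta_scale(y, nu1, nu2):
    """Map an F(nu1, nu2) value y to Z = nu1*y / (nu1*y + nu2)."""
    return nu1 * y / (nu1 * y + nu2)

def simulate_mean_z(nu1, nu2, n_draws=20000, seed=0):
    """Monte Carlo estimate of E[Z]; n_draws and seed are arbitrary choices."""
    rng = random.Random(seed)
    draws = (
        to_beta_scale(f_ratio_draw(rng, nu1, nu2), nu1, nu2)
        for _ in range(n_draws)
    )
    return sum(draws) / n_draws

nu1, nu2 = 3, 5  # illustrative degrees of freedom
# Simulated mean next to the theoretical Beta mean.
print(simulate_mean_z(nu1, nu2), nu1 / (nu1 + nu2))
```

The Beta parameters here are half the degrees of freedom, so the counts of squared Normals carry through to the parameters without any shift by one.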

Furthermore--taking these ideas full circle--the square of a Student $t$ distribution with $\nu$ d.f. has an $F$ ratio distribution with parameters $(1,\nu)$ . Once more it is apparent that keeping the conventional parameterization maintains a clear relationship with the underlying counts that contribute to the degrees of freedom.

From a statistical point of view, then, it would be most natural and simplest to use a variation of the conventional mathematical parameterizations of $\Gamma$ and Beta distributions: we should prefer calling a $\Gamma(\alpha)$ distribution a " $\Gamma(2\alpha)$ distribution" and the Beta $(\alpha, \beta)$ distribution ought to be called a "Beta $(2\alpha, 2\beta)$ distribution." In fact, we have already done that: this is precisely why we continue to use the names "Chi-squared" and " $F$ Ratio" distribution instead of "Gamma" and "Beta". Regardless, in no case would we want to remove the " $-1$ " terms that appear in the mathematical formulas for their densities. If we did that, we would lose the direct connection between the parameters in the densities and the data counts with which they are associated: we would always be off by one.

— whuber

Thanks for your answer (I +1d already). I have just a small follow-up question: maybe I'm missing something, but aren't we sacrificing the direct relation with binomial by using the -1 parametrization?

— Tim

I'm not sure which "direct relation with binomial" you're referring to, Tim. For instance, when the Beta$(a,b)$ distribution is used as a conjugate prior for a Binomial sample, clearly the parameters are exactly the right ones to use: you add $a$ (not $a-1$) to the number of successes and $b$ (not $b-1$) to the number of failures.
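To make the bookkeeping concrete, here is a minimal sketch of the conjugate update in the conventional parameterization $(2)$. The function names and the example counts are illustrative assumptions, not taken from the thread. It also checks pointwise that the prior kernel times the Binomial likelihood kernel equals the kernel of the updated Beta distribution.

```python
def beta_kernel(x, a, b):
    """Unnormalized Beta(a, b) density in the conventional form x^(a-1) (1-x)^(b-1)."""
    return x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)

def binomial_kernel(x, successes, failures):
    """Binomial likelihood as a function of x, dropping the binomial coefficient."""
    return x ** successes * (1.0 - x) ** failures

def posterior_params(a, b, successes, failures):
    """Conjugate update: add the observed counts directly to the prior parameters."""
    return a + successes, b + failures

a, b = 2.0, 3.0              # illustrative prior parameters
successes, failures = 7, 3   # illustrative data
post_a, post_b = posterior_params(a, b, successes, failures)
print(post_a, post_b)

# Prior kernel times likelihood kernel should equal the posterior kernel.
for x in (0.2, 0.5, 0.9):
    lhs = beta_kernel(x, a, b) * binomial_kernel(x, successes, failures)
    print(x, lhs, beta_kernel(x, post_a, post_b))
```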

— whuber

The notation is misleading you. There is a "hidden $-1$ " in your formula $(1)$ , because in $(1)$ , $\alpha$ and $\beta$ must be bigger than $-1$ (the second link you provided in your question says this explicitly). The $\alpha$ 's and $\beta$ 's in the two formulas are not the same parameters; they have different ranges: in $(1)$ , $\alpha,\beta>-1$ , and in $(2)$ , $\alpha,\beta>0$ . These ranges for $\alpha$ and $\beta$ are necessary to guarantee that the integral of the density doesn't diverge. To see this, consider in $(1)$ the case $\alpha=-1$ (or less) and $\beta=0$ , then try to integrate the (kernel of the) density between $0$ and $1$ . Equivalently, try the same in $(2)$ for $\alpha=0$ (or less) and $\beta=1$ .
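The divergence is easy to see numerically. The sketch below uses an illustrative function name and the closed-form antiderivative rather than numerical quadrature. It evaluates $\int_\varepsilon^1 x^{a-1}\,dx$, which is the kernel of $(2)$ with $\alpha=a$ and $\beta=1$, as $\varepsilon\to 0$.

```python
import math

def truncated_integral(a, eps):
    """Integral of x^(a-1) over [eps, 1], i.e. the kernel of (2) with beta = 1."""
    if a == 0:
        return -math.log(eps)
    return (1.0 - eps ** a) / a

for a in (0.5, 0.0, -0.5):
    # Shrink the lower limit and watch whether the integral settles down.
    values = [truncated_integral(a, 10.0 ** (-k)) for k in (2, 4, 8)]
    print(a, values)
```

For $a=0.5$ the values approach $1/a=2$, whereas for $a=0$ and $a=-0.5$ they grow without bound as $\varepsilon$ shrinks.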

— Zen

The issue of a range of definition for $\alpha$ and $\beta$ seems to go away when the integral is interpreted, as Pochhammer did in 1890, as a specific contour integral. In that case it can be equated to an expression that determines an analytic function for all values of $\alpha$ and $\beta$, including all complex ones. This throws light on the concern in the question: why exactly has this specific parameterization been adopted, given there are many other possible parameterizations that seem like they might serve equally well?

— whuber

To me, the OP's doubt seems to be much more basic. He's kind of confused about the "-1" in (2), but not in (1) (not true, of course). It seems that your comment is answering a different question (much more interesting, by the way).

— Zen

Thanks for your effort and answer, but it still does not address my main concern: why was -1 chosen? Following your logic, basically any value could be chosen by changing the arbitrary lower bound to something else. I can't see why -1 or 0 would be a better or worse lower bound for the parameter values, besides the fact that 0 is an "aesthetically" nicer bound. On the other hand, Beta(0, 0) would be a nice "default" for the uniform distribution when using the first form. Yes, those are very subjective comments, but that is my main point: are there any non-arbitrary reasons for such a choice?

— Tim

Zen, I agree there was a question of how to interpret the original post. Thank you, Tim, for your clarifications.

— whuber

Hi, Tim! I don't see any definitive reason, although it makes more direct the connection with the fact that for $\alpha,\beta>0$, if $U\sim\mathrm{Gamma}(\alpha,1)$ and $V\sim\mathrm{Gamma}(\beta,1)$ are independent, then $X=U/(U+V)$ is $\mathrm{Beta}(\alpha,\beta)$, and the density of $X$ is proportional to $x^{\alpha-1}(1-x)^{\beta-1}$. But then you can question the parameterization of the gamma distribution...
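This Gamma-ratio construction can be tried out with the standard library's `random.gammavariate`. The sketch below compares the sample mean and variance of $X=U/(U+V)$ with the usual Beta$(\alpha,\beta)$ moments; the function names, seed, sample size, and parameter values are illustrative assumptions.

```python
import random

def beta_draw_from_gammas(rng, alpha, beta):
    """Draw X = U / (U + V) with independent U ~ Gamma(alpha, 1), V ~ Gamma(beta, 1)."""
    u = rng.gammavariate(alpha, 1.0)
    v = rng.gammavariate(beta, 1.0)
    return u / (u + v)

def sample_moments(alpha, beta, n_draws=20000, seed=1):
    """Sample mean and variance of the ratio; n_draws and seed are arbitrary choices."""
    rng = random.Random(seed)
    xs = [beta_draw_from_gammas(rng, alpha, beta) for _ in range(n_draws)]
    mean = sum(xs) / n_draws
    var = sum((x - mean) ** 2 for x in xs) / (n_draws - 1)
    return mean, var

alpha, beta = 2.0, 5.0  # illustrative parameters
mean, var = sample_moments(alpha, beta)
# Simulated moments next to the theoretical Beta(alpha, beta) moments.
print(mean, alpha / (alpha + beta))
print(var, alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0)))
```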

— Zen

For me, the existence of -1 in the exponent is related to the development of the Gamma function. The motivation for the Gamma function is to find a smooth curve connecting the points of the factorial $x!$. Since it is not possible to compute $x!$ directly if $x$ is not an integer, the idea was to find a function for any $x \geq 0$ that satisfies the recurrence relation defined by the factorial, namely

$f(1)=1\\ f(x+1)=x \cdot f(x).$

The solution came by means of a convergent integral. For the function defined as

$f(x+1) = \displaystyle\int_{0}^{\infty} t^{x}e^{-t} dt,$

integration by parts provides the following (for $x>0$):

$\begin{aligned} f(x+1) & = \displaystyle\int_{0}^{\infty} t^{x}e^{-t} dt \\ & = \Big[-t^{x}e^{-t} \Big]^{\infty}_{0} + \displaystyle\int_{0}^{\infty} x\cdot t^{x-1}e^{-t} dt \\ &= \lim_{t \to \infty} (-t^{x}e^{-t}) + 0^{x} \cdot e^{-0} + x\cdot \displaystyle\int_{0}^{\infty} t^{x-1}e^{-t} dt \\ &= 0 + 0 + x\cdot \displaystyle\int_{0}^{\infty} t^{x-1}e^{-t} dt \\ &= x \cdot f(x) . \end{aligned}$

So, the function above satisfies this property, and the -1 in the exponent derives from the procedure of integration by parts. See the Wikipedia article https://en.wikipedia.org/wiki/Gamma_function .
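The recurrence can be checked numerically with the standard library's `math.gamma`. This is a small sketch; the helper name and the test points are illustrative.

```python
import math

def recurrence_gap(x):
    """Relative difference between Gamma(x + 1) and x * Gamma(x)."""
    lhs = math.gamma(x + 1.0)
    rhs = x * math.gamma(x)
    return abs(lhs - rhs) / abs(lhs)

# The gap should be at the level of floating-point rounding.
for x in (0.5, 1.0, 2.7, 6.0):
    print(x, recurrence_gap(x))

# At positive integers Gamma(n) equals (n - 1)!, which is where the shift by one shows up.
for n in range(1, 8):
    print(n, math.gamma(n), math.factorial(n - 1))
```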

Edit: I apologise if my post is not fully clear; I am just trying to point that, in my idea, the existence of -1 in the beta distribution comes from the generalisation of the factorial by means of the Gamma function. There are two conditions: $f(1)=1$ and $f(x+1)=x \cdot f(x)$ . We have $\Gamma(x) = (x-1)!$ , therefore it satisfies $\Gamma(x+1) = x \cdot \Gamma(x) = x \cdot (x-1)! = x!$ . In addition, we have $\Gamma(1) = (1-1)! = 0! = 1$ . As for the beta distribution with parameters $\alpha, \beta$ , generalisation of the Binomial coefficient is $\dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \cdot \Gamma(\beta)} = \dfrac{(\alpha + \beta - 1)!}{(\alpha-1)! \cdot (\beta-1)!}$ . There we have the -1 in the denominator, for both parameters.

— aatr

This makes no sense because the recurrence function satisfied by the factorial is not what you state:

$(x+1)! \ne x \cdot x!.$

— whuber

The function $f(x)$ satisfying the recurrence relation is the Gamma: $\Gamma(x+1) = x \cdot \Gamma(x)$. This is how it is defined.

— aatr

Yes: but your stated motivation is based on the factorial function, not the Gamma.

— whuber

It is important to recall the relation between Gamma and factorial: $\Gamma(x) = (x-1)!$.

— aatr

Unfortunately, that's circular logic: you start off with the factorial, characterize Gamma as interpolating it, and then conclude that's why there's a -1. In fact, your post exhibits the -1 as if it fell out mistakenly by confusing Gamma with the factorial. Few will find that either illuminating or convincing.

— whuber