Analysis of balls and bins in the $m \gg n$ regime

Suppose we throw $m$ balls into $n$ bins, where $m \gg n$. Let $X_i$ be the number of balls ending up in bin $i$, $X_\max$ the load of the heaviest bin, $X_\min$ the load of the lightest bin, and $X_{\mathrm{sec-max}}$ the load of the second-heaviest bin. Roughly speaking, $X_i - X_j \sim N(0,2m/n)$, and so we expect $|X_i - X_j| = \Theta(\sqrt{m/n})$ for any two fixed $i,j$. Using a union bound, we expect $X_{\max} - X_{\min} = O(\sqrt{m\log n/n})$; presumably, we can get a matching lower bound by considering $n/2$ disjoint pairs of bins. This (not completely formal) argument leads us to expect that the gap between $X_{\max}$ and $X_{\min}$ is $\Theta(\sqrt{m\log n/n})$ with high probability.
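The heuristic above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `range_ratio`, the seed, and the choice $m = 1000n$ are illustrative assumptions, not part of the question. If the heuristic is right, the printed ratio should stay bounded as $n$ grows.

```python
import numpy as np

def range_ratio(m: int, n: int, trials: int, rng: np.random.Generator) -> float:
    """Average of (X_max - X_min) / sqrt(m log n / n) over simulated throws."""
    # Each row is one experiment: the loads of n equiprobable bins after m balls.
    loads = rng.multinomial(m, np.full(n, 1.0 / n), size=trials)
    spread = loads.max(axis=1) - loads.min(axis=1)
    return float(np.mean(spread) / np.sqrt(m * np.log(n) / n))

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
for n in (10, 100, 1000):
    # m = 1000 * n is an assumed stand-in for the m >> n regime.
    print(n, round(range_ratio(m=1000 * n, n=n, trials=200, rng=rng), 2))
```

A simulation like this only suggests the order of growth; it does not replace the union-bound argument.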

I am interested in the gap between $X_\max$ and $X_{\mathrm{sec-max}}$. The argument outlined above shows that $X_\max - X_{\mathrm{sec-max}} = O(\sqrt{m\log n/n})$ with high probability, but the $\sqrt{\log n}$ factor seems extraneous. Is anything known about the distribution of $X_\max - X_{\mathrm{sec-max}}$?

More generally, suppose that each ball is associated with a non-negative score for each bin, and we are interested in the total score of each bin after throwing $m$ balls. The usual scenario corresponds to scores of the form $(0,\ldots,0,1,0,\ldots,0)$. Suppose that the probability distribution of the scores is invariant under permutation of the bins (in the usual scenario, this corresponds to the fact that all bins are equiprobable). Given the distribution of the scores, we can use the method of the first paragraph to get a good bound on $X_{\max} - X_{\min}$. The bound will contain a factor of $\sqrt{\log n}$ that comes from a union bound (through the tail probabilities of a normal variable). Can this factor be reduced if we are interested in bounding $X_{\max} - X_{\mathrm{sec-max}}$?

reference-request pr.probability

— Yuval Filmus

Is each score in $[0,1]$?

— Neal Young

It doesn't really matter; you can always rescale it so that it lies in $[0,1]$.

— Yuval Filmus

Answers:

Answer: $\Theta\left(\sqrt{\frac{m}{n\log n}}\right)$.

Applying a multidimensional version of the central limit theorem, we get that the vector $(X_1,\dots, X_n)$ has an asymptotically multivariate Gaussian distribution with

$\mathrm{Var}[X_i] = m\left(\frac{1}{n} - \frac{1}{n^2}\right),$

and, for $i \ne j$,

$\mathrm{Cov}(X_i, X_j) = -m/n^2.$

We will assume below that $X$ is a Gaussian vector (and not only approximately a Gaussian vector). Let us add a Gaussian random variable $Z$ with variance $m/n^2$ to all $X_i$ ($Z$ is independent of all $X_i$). That is, let

$\begin{pmatrix} Y_1\\Y_2\\ \vdots\\Y_n \end{pmatrix} = \begin{pmatrix} X_1+Z\\X_2+Z\\ \vdots\\X_n +Z \end{pmatrix}.$

We get a Gaussian vector $(Y_1, \dots, Y_n)$. Now each $Y_i$ has variance $m/n$:

$\mathrm{Var}[Y_i] = \mathrm{Var}[X_i] + \underbrace{2\mathrm{Cov}(X_i,Z)}_{=\, 0}+\mathrm{Var}[Z] = m/n,$

and all $Y_i$ are independent, since for $i \ne j$

$\mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(X_i, X_j) + \underbrace{\mathrm{Cov}(X_i,Z) + \mathrm{Cov}(X_j,Z)}_{=\, 0} +\mathrm{Cov}(Z, Z) = 0.$
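The cancellation can also be checked at the level of covariance matrices. This is a small sketch assuming NumPy; the function name `decorrelated_covariance` and the sample values of $m$ and $n$ are illustrative assumptions. It builds the covariance matrix of $X$ from the two formulas above, adds $\mathrm{Var}[Z] = m/n^2$ to every entry (which is what adding the same independent $Z$ to every coordinate does), and confirms that the result is $(m/n)$ times the identity.

```python
import numpy as np

def decorrelated_covariance(m: int, n: int) -> np.ndarray:
    """Covariance matrix of Y_i = X_i + Z, with Z independent of X and Var[Z] = m / n**2."""
    # Cov(X): m(1/n - 1/n^2) on the diagonal, -m/n^2 off the diagonal.
    cov_x = (m / n) * np.eye(n) - (m / n**2) * np.ones((n, n))
    # Adding the same independent Z to every coordinate adds Var[Z] to every entry.
    return cov_x + (m / n**2) * np.ones((n, n))

m, n = 10_000, 8  # arbitrary illustrative values
cov_y = decorrelated_covariance(m, n)
print(np.allclose(cov_y, (m / n) * np.eye(n)))
```

For a Gaussian vector, a diagonal covariance matrix is exactly independence of the coordinates, which is the property the argument relies on.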

Note that $Y_i - Y_j = X_i - X_j$. Thus our original problem is equivalent to the problem of finding $Y_{\mathrm{max}} - Y_{\mathrm{sec-max}}$. For simplicity, let us first analyze the case where all $Y_i$ have variance $1$.

Problem. We are given $n$ independent Gaussian random variables $\gamma_1,\dots, \gamma_n$ with mean $\mu$ and variance $1$. Estimate the expectation of $\gamma_{\mathrm{max}} - \gamma_{\mathrm{sec-max}}$.

Answer: $\Theta\left(\frac{1}{\sqrt{\log n}}\right)$.

Informal Proof. Here is an informal solution to this problem (it's not hard to make it formal). Since the answer does not depend on the mean, we assume that $\mu = 0$. Let $\bar\Phi(t) = \Pr[\gamma > t]$, where $\gamma\sim{\cal N}(0,1)$. We have (for moderately large $t$),

$\bar\Phi(t)\approx \frac{1}{\sqrt{2\pi}t} e^{-\frac{1}{2}t^2}.$

Note that

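$\bar\Phi(\gamma_i)$ are uniformly and independently distributed on $[0,1]$,
$\bar\Phi(\gamma_{\mathrm{max}})$ is the smallest among $\bar\Phi(\gamma_i)$,
$\bar\Phi(\gamma_{\mathrm{sec-max}})$ is the second smallest among $\bar\Phi(\gamma_i)$.

Thus $\bar\Phi(\gamma_{\mathrm{max}})$ is close to $1/n$ and $\bar\Phi(\gamma_{\mathrm{sec-max}})$ is close to $2/n$ (the two smallest of $n$ independent uniform variables have expectations $1/(n+1)$ and $2/(n+1)$). Using the formula for $\bar\Phi(t)$, we get that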

$2\approx \bar\Phi(\gamma_{\mathrm{sec-max}})\left/\bar\Phi(\gamma_{\mathrm{max}})\right. \approx e^{\frac{1}{2}\left(\gamma_{\mathrm{max}}^2 - \gamma_{\mathrm{sec-max}}^2\right)}.$

Thus $\gamma_{\mathrm{max}}^2 - \gamma_{\mathrm{sec-max}}^2$ is $\Theta(1)$ w.h.p. Note that $\gamma_{\mathrm{max}}\approx \gamma_{\mathrm{sec-max}} = \Theta(\sqrt{\log n})$. We have,

$\gamma_{\mathrm{max}} - \gamma_{\mathrm{sec-max}}\approx \frac{\Theta(1)}{\gamma_{\mathrm{max}} + \gamma_{\mathrm{sec-max}}} \approx \frac{\Theta(1)}{\sqrt{\log n}}.$

QED
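The $\Theta(1/\sqrt{\log n})$ estimate is easy to probe by Monte Carlo. Here is a sketch, assuming NumPy; the function name `mean_top_gap`, the seed, the trial count, and the values of $n$ are illustrative choices. If the estimate is right, the gap multiplied by $\sqrt{\log n}$ should stay roughly constant while the gap itself shrinks.

```python
import numpy as np

def mean_top_gap(n: int, trials: int, rng: np.random.Generator) -> float:
    """Monte Carlo estimate of E[gamma_max - gamma_sec-max] for n i.i.d. N(0, 1) variables."""
    samples = rng.standard_normal((trials, n))
    # Partitioning at index n - 2 leaves the second largest value of each row
    # at column n - 2 and the largest at column n - 1.
    top_two = np.partition(samples, n - 2, axis=1)[:, -2:]
    return float(np.mean(top_two[:, 1] - top_two[:, 0]))

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
for n in (64, 512, 4096):
    gap = mean_top_gap(n, trials=2000, rng=rng)
    print(n, round(gap, 3), round(gap * np.sqrt(np.log(n)), 3))
```

The rescaled column only illustrates the order of growth; the constants hidden in $\Theta(\cdot)$ converge slowly in $n$.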

We get that

$\begin{aligned} \mathbb{E}[{X_{\mathrm{max}} - X_{\mathrm{sec-max}}}] &= \mathbb{E}[{Y_{\mathrm{max}} - Y_{\mathrm{sec-max}}}] \\ &= \sqrt{\mathrm{Var}[Y_i]} \times\mathbb{E}[{\gamma_{\mathrm{max}} - \gamma_{\mathrm{sec-max}}}] = \Theta\left(\sqrt{\frac{m}{n\log n}}\right). \end{aligned}$

The same argument goes through when we have arbitrary scores. It shows that
$\mathbb{E}[X_{\mathrm{max}}- X_{\mathrm{sec-max}}] = c\, \left. \mathbb{E}[X_{\mathrm{max}}- X_{\mathrm{min}}]\right/\log n.$
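For the usual balls-and-bins scenario, the prediction can be compared with a direct simulation that does not use the Gaussian approximation at all. This is a sketch assuming NumPy; the function name `top_gap_ratio`, the seed, and the parameter values are illustrative assumptions. It reports the average of $X_{\mathrm{max}} - X_{\mathrm{sec-max}}$ divided by $\sqrt{m/(n\log n)}$, which should stay bounded away from $0$ and $\infty$ if the $\Theta$ estimate holds.

```python
import numpy as np

def top_gap_ratio(m: int, n: int, trials: int, rng: np.random.Generator) -> float:
    """Average of (X_max - X_sec-max) / sqrt(m / (n log n)) over simulated throws."""
    # Each row holds the loads of n equiprobable bins after m balls.
    loads = rng.multinomial(m, np.full(n, 1.0 / n), size=trials)
    # Columns n - 2 and n - 1 hold the second largest and largest load of each row.
    top_two = np.partition(loads, n - 2, axis=1)[:, -2:]
    gap = np.mean(top_two[:, 1] - top_two[:, 0])
    return float(gap / np.sqrt(m / (n * np.log(n))))

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
for n in (10, 100, 1000):
    # m = 1000 * n is an assumed stand-in for the m >> n regime.
    print(n, round(top_gap_ratio(m=1000 * n, n=n, trials=500, rng=rng), 2))
```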

— Yury

Thanks! I'll remember to try the multivariate Gaussian approximation next time.

— Yuval Filmus

Yury, you wrote "Let us add a Gaussian vector $Z$ with variance $m/n^2$ to all $X_i$. We get a Gaussian vector $(Y_1, \dots, Y_n)$. Now each $Y_i$ has variance $m/n$ and all $Y_i$ are not correlated... Note that $Y_i - Y_j = X_i - X_j$." Can you expand on this part? Is $Z_i = Z_j$? If the $X_i$'s are dependent, and the $Z_i$'s are independent (or uniformly the same), how can the $Y_i$'s be independent? (Seems like a neat trick but I don't understand it.) Thanks.

— Neal Young

@NealYoung, yes, if we have variables $X_1,\dots,X_n$ with negative pairwise correlation and all covariances $\mathrm{Cov}(X_i,X_j)$ are equal, then we can add a single new random variable $Z$ to all $X_i$ such that the sums are independent. Also, if the variables have positive correlation and again all covariances $\mathrm{Cov}(X_i,X_j)$ are equal, then we can subtract a single r.v. $Z$ from all of them so that all the differences are independent; but now $Z$ is not independent of $X_i$ but rather $Z=\alpha (X_1 + \dots+X_n)$ for some scaling parameter $\alpha$.

— Yury

Ah, I see. At least algebraically, all it rests on is the pairwise independence of $Z$ and each $X_i$. Very cool.

— Suresh Venkat

This argument now appears (with attribution) in an EC'14 paper: dl.acm.org/citation.cfm?id=2602829.

— Yuval Filmus

For your first question, I think you can show that w.h.p. $X_{\max}-X_{\textrm{sec-max}}$ is

$o\left(\sqrt{\frac{m}{n}\frac{\log^2\log n}{\log n}}\right).$

Note that this is $o(\sqrt{m/n})$.

Compare your random experiment to the following alternative: Let $X_1$ be the maximum load of any of the first $n/2$ buckets. Let $X_2$ be the maximum load of any of the last $n/2$ buckets.

On consideration, $|X_1-X_2|$ is an upper bound on $X_{\max}-X_{\mathrm{sec-max}}$. Also, with probability at least one half, $|X_1-X_2| = X_{\max}-X_{\mathrm{sec-max}}$. So, speaking roughly, $X_{\max}-X_{\mathrm{sec-max}}$ is distributed similarly to $|X_1-X_2|$.

To study $|X_1-X_2|$, note that with high probability $m/2\pm O(\sqrt m)$ balls are thrown into the first $n/2$ bins, and likewise for the last $n/2$ bins. So $X_1$ and $X_2$ are each distributed essentially like the maximum load when throwing $m' = m/2\pm o(m)$ balls into $n' = n/2$ bins.

This distribution is well-studied and, luckily for this argument, is tightly concentrated around its mean. For example, if $m' \gg n\log^3 n$, then with high probability $X_1$ differs from its expectation by at most the quantity displayed at the top of this answer [Thm. 1]. (Note: this upper bound is, I think, loose, given Yury's answer.) Thus, with high probability $X_1$ and $X_2$ also differ by at most this much, and so $X_{\max}$ and $X_{\mathrm{sec-max}}$ differ by at most this much.

Conversely, for a (somewhat weaker) lower bound, if, for any $t$, say, $\Pr[|X_1-X_2| \ge t] \ge 3/4$, then $\Pr[X_{\max}-X_{\textrm{sec-max}} \ge t]$ is at least

$\Pr\big[|X_1-X_2| \ge t ~\wedge~ X_{\max}-X_{\textrm{sec-max}} = |X_1-X_2|\big]$

which (by the naive union bound) is at least

$1 - (1/4) - (1/2) = 1/4.$

I think this should give you (for example) the expectation of $X_{\max}-X_{\textrm{sec-max}}$ within a constant factor.
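The two facts this comparison rests on can be checked by simulation: $|X_1 - X_2|$ always dominates $X_{\max}-X_{\textrm{sec-max}}$, and the two coincide in roughly half of the experiments or more. Below is a sketch assuming NumPy and an even number of bins; the function name `half_max_comparison`, the seed, and the parameter values are illustrative assumptions.

```python
import numpy as np

def half_max_comparison(m: int, n: int, trials: int, rng: np.random.Generator):
    """Return (does |X1 - X2| always dominate the top gap?, fraction of trials where they are equal)."""
    loads = rng.multinomial(m, np.full(n, 1.0 / n), size=trials)
    half = n // 2  # n is assumed to be even
    x1 = loads[:, :half].max(axis=1)   # maximum load among the first n/2 bins
    x2 = loads[:, half:].max(axis=1)   # maximum load among the last n/2 bins
    half_gap = np.abs(x1 - x2)
    # Columns n - 2 and n - 1 hold the second largest and largest load of each row.
    top_two = np.partition(loads, n - 2, axis=1)[:, -2:]
    gap = top_two[:, 1] - top_two[:, 0]
    return bool(np.all(gap <= half_gap)), float(np.mean(gap == half_gap))

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
print(half_max_comparison(m=100_000, n=100, trials=1000, rng=rng))
```

Equality holds whenever the two heaviest bins fall in different halves, which is where the "probability at least one half" comes from.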

— Neal Young

Looking at Thm. 1, the difference from the expectation is $O(\sqrt{(m/n) \log \log n})$, and not what you wrote. That's still much better than $O(\sqrt{(m/n) \log n})$.

— Yuval Filmus

By Thm. 1 (its 3rd case), for any $\epsilon>0$, with probability $1-o(1)$, the maximum in any bin ($m$ balls in $n$ bins) is

$\frac{m}{n} +\sqrt{\frac{2m\log n}{n}}\sqrt{1-(1\pm\epsilon) \frac{\log\log n}{2\log n}}.$

By my math (using $\sqrt{1-\delta} = 1 - O(\delta)$), the $\pm\epsilon$ term expands to an additive absolute term of

$O(\epsilon) \sqrt{\frac{m\log n}{n}}~\frac{\log\log n}{\log n} ~=~O(\epsilon) \sqrt{\frac{m}{n}~\frac{\log^2\log n}{\log n}}.$

What am I doing wrong?

— Neal Young

Ah - I guess you're right. I subtracted inside the square root and that's how I got my figure.

— Yuval Filmus