TAOCP 3.4.2 Exercise 11
Let $M$ be the number of elements that are placed into the reservoir during the first pass of Algorithm R, when a reservoir of size $n$ is used on a file of $N$ records.
Section 3.4.2: Random Sampling and Shuffling
Exercise 11. ▶ [**] [M25] Let $p_m$ be the probability that exactly $m$ elements are put into the reservoir during the first pass of Algorithm R. Determine the generating function $G(z) = \sum_m p_m z^m$, and find the mean and standard deviation. (Use the ideas of Section 1.2.10.)
Verified: yes
Solve time: 2m19s
Let $M$ be the number of elements that are placed into the reservoir during the first pass of Algorithm R, when a reservoir of size $n$ is used on a file of $N$ records.
The first $n$ records are always placed into the reservoir. For each subsequent record $t$ ($n+1 \le t \le N$), Algorithm R generates a random integer in ${1,\dots,t}$; the new record is placed into the reservoir if and only if the chosen integer is at most $n$. Hence
$$ \Pr(\text{record }t\text{ is placed into the reservoir})=\frac{n}{t}. $$
Let
$$ I_t= \begin{cases} 1,&\text{if record }t\text{ is placed into the reservoir},\ 0,&\text{otherwise}, \end{cases} \qquad (n+1\le t\le N). $$
The random choices at different times are independent, so the variables $I_t$ are independent Bernoulli variables with
$$ \Pr(I_t=1)=\frac nt. $$
Therefore
$$ M=n+\sum_{t=n+1}^{N} I_t . $$
Generating function
For a Bernoulli variable $I_t$,
$$ E(z^{I_t}) =\left(1-\frac nt\right)+\frac nt,z =1+\frac nt (z-1). $$
Using the product rule for probability generating functions of independent random variables, as in Section 1.2.10,
$$ G(z)=E(z^M) =z^n\prod_{t=n+1}^{N} \left(1+\frac nt (z-1)\right). $$
Hence
$$ \boxed{ G(z) = z^n \prod_{t=n+1}^{N} \left(1+\frac nt (z-1)\right) }. $$
This is the required generating function.
Mean
Since
$$ M=n+\sum_{t=n+1}^{N} I_t, $$
linearity of expectation gives
$$ E(M) = n+\sum_{t=n+1}^{N}\frac nt = n+n\sum_{t=n+1}^{N}\frac1t. $$
Therefore
$$ \boxed{ E(M) = n\bigl(1+H_N-H_n\bigr) }, $$
where
$$ H_k=\sum_{j=1}^{k}\frac1j $$
is the $k$th harmonic number.
Variance and standard deviation
Because the $I_t$ are independent,
$$ \operatorname{Var}(M) = \sum_{t=n+1}^{N} \operatorname{Var}(I_t). $$
For a Bernoulli variable with parameter $n/t$,
$$ \operatorname{Var}(I_t) = \frac nt\left(1-\frac nt\right). $$
Hence
$$ \operatorname{Var}(M) = \sum_{t=n+1}^{N} \frac nt\left(1-\frac nt\right) = n(H_N-H_n) - n^2\sum_{t=n+1}^{N}\frac1{t^2}. $$
If
$$ H_k^{(2)} = \sum_{j=1}^{k}\frac1{j^2}, $$
then
$$ \boxed{ \operatorname{Var}(M) = n(H_N-H_n) - n^2\bigl(H_N^{(2)}-H_n^{(2)}\bigr) }. $$
Therefore the standard deviation is
$$ \boxed{ \sigma = \sqrt{ ,n(H_N-H_n) - n^2\bigl(H_N^{(2)}-H_n^{(2)}\bigr) } }. $$
Thus
$$ \boxed{ G(z) = z^n \prod_{t=n+1}^{N} \left(1+\frac nt (z-1)\right) }, $$
$$ \boxed{ E(M)=n(1+H_N-H_n) }, $$
and
$$ \boxed{ \sigma = \sqrt{ ,n(H_N-H_n) - n^2\bigl(H_N^{(2)}-H_n^{(2)}\bigr) } }. $$