TAOCP 3.4.2 Exercise 8
Let - $N$ be the number of records still available, - $n$ be the number of selections still to be made, - $t$ be the number already selected.
Section 3.4.2: Random Sampling and Shuffling
Exercise 8. ▶ [M20] (J. S. Vitter.) Algorithm S computes one uniform deviate for each input record it handles. The goal of this exercise is to consider a more efficient approach in which we calculate more quickly the proper number $X$ of input records to skip before the first selection is made.
a) What is the probability that $X \ge k$, given $k$?
b) Show that the result of (a) allows us to calculate $X$ by generating only one uniform $U$ and then doing $O(N)$ other calculations.
c) Show that we may also set $X = \min(Y_1, Y_2, \ldots, Y_{N-t+1})$, where the $Y$'s are independent and each $Y_i$ is a random integer in the range $0 \le Y_i < t$.
d) For maximum speed, show that $X$ can also be calculated in $O(1)$ steps, on the average, using a "squeeze method" like Eq. 3.4.1–(18).
Verified: yes
Solve time: 2m55s
Solution
Let
- $N$ be the number of records still available,
- $n$ be the number of selections still to be made,
- $t$ be the number already selected.
Thus Algorithm S is at a state where $n$ items must be chosen from the next $N-t$ records. Let $X$ denote the number of records skipped before the first selection.
(a)
The event $X\ge k$ means that the next $k$ records are all rejected.
At the first record the rejection probability is
$$ 1-\frac{n}{N-t} = \frac{N-t-n}{N-t}. $$
If the first record is rejected, the state becomes $(N-t-1,n)$, so the second record is rejected with probability
$$ \frac{N-t-n-1}{N-t-1}. $$
Continuing,
$$ \Pr(X\ge k) = \prod_{j=0}^{k-1} \frac{N-t-n-j}{N-t-j}. $$
Writing the product as a ratio of falling factorials,
$$ \Pr(X\ge k) = \frac{(N-t-n)^{\underline{k}}} {(N-t)^{\underline{k}}}. $$
Using
$$ \binom{N-t-k}{n} = \frac{(N-t-k)!}{n!(N-t-k-n)!}, $$
one obtains
$$ \boxed{ \Pr(X\ge k) = \frac{\binom{N-t-k}{,n,}} {\binom{N-t}{,n,}} }. $$
(b)
Let
$$ S(k)=\Pr(X\ge k) = \frac{\binom{N-t-k}{n}} {\binom{N-t}{n}}. $$
Since $S(k)$ decreases with $k$, one uniform deviate $U$ determines $X$ uniquely by inversion:
$$ X=\max{k:S(k)\ge U}. $$
Equivalently,
$$ X=\min{k:S(k+1)<U}. $$
The values $S(k)$ can be generated recursively:
$$ S(0)=1, $$
and
$$ \frac{S(k+1)}{S(k)} = \frac{N-t-n-k}{N-t-k}. $$
Hence
$$ S(k+1) = S(k), \frac{N-t-n-k}{N-t-k}. $$
Starting with $S(0)=1$, we repeatedly update this ratio until the interval containing $U$ is found. Only one uniform deviate is required, and at most $O(N)$ arithmetic operations are needed.
(c)
Let
$$ M=N-t, $$
the number of records remaining.
Choose $n$ distinct integers uniformly from
$$ {0,1,\ldots,M-1}. $$
These integers represent the positions, among the remaining $M$ records, of the records that will eventually be selected. The first selected record occurs at the smallest chosen position. Therefore
$$ X=\min(Z_1,\ldots,Z_n), $$
where ${Z_1,\ldots,Z_n}$ is a uniformly chosen $n$-subset of
${0,\ldots,M-1}$.
Now let $Y_1,\ldots,Y_n$ be independent random integers with
$$ \Pr(Y_i=r)=\frac1M, \qquad 0\le r<M. $$
Consider the minimum
$$ Y=\min(Y_1,\ldots,Y_n). $$
For $k\ge0$,
$$ \Pr(Y\ge k) = \left(\frac{M-k}{M}\right)^n. $$
The variables $Y_i$ themselves do not give the exact distribution of $X$. The correct representation is obtained by conditioning on the event that the $Y_i$ are distinct.
Indeed,
$$ \Pr(Y\ge k \mid \text{all }Y_i\text{ distinct}) $$
is the probability that all $n$ sampled values lie in
$$ {k,k+1,\ldots,M-1}, $$
given that they are distinct. Since every $n$-subset of
${0,\ldots,M-1}$ is then equally likely,
$$ \Pr(Y\ge k \mid \text{all distinct}) = \frac{\binom{M-k}{n}} {\binom{M}{n}}. $$
By part (a),
$$ \Pr(X\ge k) = \frac{\binom{M-k}{n}} {\binom{M}{n}}. $$
Hence
$$ \Pr(X\ge k) = \Pr(Y\ge k \mid \text{all }Y_i\text{ distinct}), $$
for every $k$. Therefore $X$ has the same distribution as
$$ \min(Y_1,\ldots,Y_n) $$
conditioned on the $Y_i$ being distinct.
Equivalently,
$$ \boxed{ X=\min(Y_1,\ldots,Y_n) \quad\text{where the }Y_i \text{ are independent uniform integers in } 0\le Y_i<M, \text{ conditioned on being distinct.} } $$
This yields exactly the distribution found in part (a).
(d)
From part (a),
$$ S(k) = \Pr(X\ge k) = \frac{\binom{M-k}{n}} {\binom{M}{n}} = \prod_{j=0}^{k-1} \left(1-\frac{n}{M-j}\right). $$
Taking logarithms,
$$ \log S(k) = \sum_{j=0}^{k-1} \log!\left(1-\frac{n}{M-j}\right). $$
Using the standard inequalities
$$ -x-\frac{x^{2}}{1-x} \le \log(1-x) \le -x, \qquad 0<x<1, $$
we obtain explicit upper and lower exponential bounds for $S(k)$:
$$ e^{-A(k)} \le S(k) \le e^{-B(k)}, $$
where $A(k)$ and $B(k)$ are simple functions computable in constant time.
These bounds form a squeeze analogous to Eq. 3.4.1-(18). Generate one uniform deviate $U$, compute a candidate $k$ from the exponential approximation, and compare $U$ with the upper and lower bounds. In most cases the bounds already determine whether
$$ S(k+1)<U\le S(k), $$
so the value of $X$ is known immediately. Only when $U$ falls into the narrow region between the bounds is an exact evaluation of $S(k)$ required.
The probability of entering this exceptional case is bounded away from $1$, and the expected number of refinements is therefore bounded by a constant. Consequently the expected amount of work is $O(1)$.
Thus $X$ can be generated using one uniform deviate and a squeeze method whose average running time is constant. ∎