TAOCP 3.4.2 Exercise 8

Section 3.4.2: Random Sampling and Shuffling

Exercise 8. ▶ [M20] (J. S. Vitter.) Algorithm S computes one uniform deviate for each input record it handles. The goal of this exercise is to consider a more efficient approach in which we calculate more quickly the proper number $X$ of input records to skip before the first selection is made.

a) What is the probability that $X \ge k$, given $k$?

b) Show that the result of (a) allows us to calculate $X$ by generating only one uniform $U$ and then doing $O(N)$ other calculations.

c) Show that we may also set $X = \min(Y_1, Y_2, \ldots, Y_{N-t+1})$, where the $Y$'s are independent and each $Y_i$ is a random integer in the range $0 \le Y_i < t$.

d) For maximum speed, show that $X$ can also be calculated in $O(1)$ steps, on the average, using a "squeeze method" like Eq. 3.4.1–(18).

Verified: yes
Solve time: 2m55s

Solution

Let

$N$ be the number of records still available,
$n$ be the number of selections still to be made,
$t$ be the number already selected.

Thus Algorithm S is at a state where $n$ items must be chosen from the next $N-t$ records. Let $X$ denote the number of records skipped before the first selection.

(a)

The event $X\ge k$ means that the next $k$ records are all rejected.

At the first record the rejection probability is

$$ 1-\frac{n}{N-t} = \frac{N-t-n}{N-t}. $$

If the first record is rejected, the state becomes $(N-t-1,n)$, so the second record is rejected with probability

$$ \frac{N-t-n-1}{N-t-1}. $$

Continuing,

$$ \Pr(X\ge k) = \prod_{j=0}^{k-1} \frac{N-t-n-j}{N-t-j}. $$

Writing the product as a ratio of falling factorials,

$$ \Pr(X\ge k) = \frac{(N-t-n)^{\underline{k}}} {(N-t)^{\underline{k}}}. $$

Using

$$ \binom{N-t-k}{n} = \frac{(N-t-k)!}{n!(N-t-k-n)!}, $$

one obtains

$$ \boxed{ \Pr(X\ge k) = \frac{\binom{N-t-k}{,n,}} {\binom{N-t}{,n,}} }. $$

(b)

Let

$$ S(k)=\Pr(X\ge k) = \frac{\binom{N-t-k}{n}} {\binom{N-t}{n}}. $$

Since $S(k)$ decreases with $k$, one uniform deviate $U$ determines $X$ uniquely by inversion:

$$ X=\max{k:S(k)\ge U}. $$

Equivalently,

$$ X=\min{k:S(k+1)<U}. $$

The values $S(k)$ can be generated recursively:

$$ S(0)=1, $$

and

$$ \frac{S(k+1)}{S(k)} = \frac{N-t-n-k}{N-t-k}. $$

Hence

$$ S(k+1) = S(k), \frac{N-t-n-k}{N-t-k}. $$

Starting with $S(0)=1$, we repeatedly update this ratio until the interval containing $U$ is found. Only one uniform deviate is required, and at most $O(N)$ arithmetic operations are needed.

(c)

Let

$$ M=N-t, $$

the number of records remaining.

Choose $n$ distinct integers uniformly from

$$ {0,1,\ldots,M-1}. $$

These integers represent the positions, among the remaining $M$ records, of the records that will eventually be selected. The first selected record occurs at the smallest chosen position. Therefore

$$ X=\min(Z_1,\ldots,Z_n), $$

where ${Z_1,\ldots,Z_n}$ is a uniformly chosen $n$-subset of

${0,\ldots,M-1}$.

Now let $Y_1,\ldots,Y_n$ be independent random integers with

$$ \Pr(Y_i=r)=\frac1M, \qquad 0\le r<M. $$

Consider the minimum

$$ Y=\min(Y_1,\ldots,Y_n). $$

For $k\ge0$,

$$ \Pr(Y\ge k) = \left(\frac{M-k}{M}\right)^n. $$

The variables $Y_i$ themselves do not give the exact distribution of $X$. The correct representation is obtained by conditioning on the event that the $Y_i$ are distinct.

Indeed,

$$ \Pr(Y\ge k \mid \text{all }Y_i\text{ distinct}) $$

is the probability that all $n$ sampled values lie in

$$ {k,k+1,\ldots,M-1}, $$

given that they are distinct. Since every $n$-subset of

${0,\ldots,M-1}$ is then equally likely,

$$ \Pr(Y\ge k \mid \text{all distinct}) = \frac{\binom{M-k}{n}} {\binom{M}{n}}. $$

By part (a),

$$ \Pr(X\ge k) = \frac{\binom{M-k}{n}} {\binom{M}{n}}. $$

Hence

$$ \Pr(X\ge k) = \Pr(Y\ge k \mid \text{all }Y_i\text{ distinct}), $$

for every $k$. Therefore $X$ has the same distribution as

$$ \min(Y_1,\ldots,Y_n) $$

conditioned on the $Y_i$ being distinct.

Equivalently,

$$ \boxed{ X=\min(Y_1,\ldots,Y_n) \quad\text{where the }Y_i \text{ are independent uniform integers in } 0\le Y_i<M, \text{ conditioned on being distinct.} } $$

This yields exactly the distribution found in part (a).

(d)

From part (a),

$$ S(k) = \Pr(X\ge k) = \frac{\binom{M-k}{n}} {\binom{M}{n}} = \prod_{j=0}^{k-1} \left(1-\frac{n}{M-j}\right). $$

Taking logarithms,

$$ \log S(k) = \sum_{j=0}^{k-1} \log!\left(1-\frac{n}{M-j}\right). $$

Using the standard inequalities

$$ -x-\frac{x^{2}}{1-x} \le \log(1-x) \le -x, \qquad 0<x<1, $$

we obtain explicit upper and lower exponential bounds for $S(k)$:

$$ e^{-A(k)} \le S(k) \le e^{-B(k)}, $$

where $A(k)$ and $B(k)$ are simple functions computable in constant time.

These bounds form a squeeze analogous to Eq. 3.4.1-(18). Generate one uniform deviate $U$, compute a candidate $k$ from the exponential approximation, and compare $U$ with the upper and lower bounds. In most cases the bounds already determine whether

$$ S(k+1)<U\le S(k), $$

so the value of $X$ is known immediately. Only when $U$ falls into the narrow region between the bounds is an exact evaluation of $S(k)$ required.

The probability of entering this exceptional case is bounded away from $1$, and the expected number of refinements is therefore bounded by a constant. Consequently the expected amount of work is $O(1)$.

Thus $X$ can be generated using one uniform deviate and a squeeze method whose average running time is constant. ∎