TAOCP 3.4.2 Exercise 10

Section 3.4.2: Random Sampling and Shuffling

Exercise 10. [**] [15] Modify Algorithm R so that the reservoir is eliminated, assuming that the $n$ records of the current sample can be held in memory.

Verified: yes
Solve time: 3m18s

Exercise 3.4.2.10 [**]

▶ 10. [15] Modify Algorithm R so that the reservoir is eliminated, assuming that the $n$ records of the current sample can be held in memory.

Solution.

Algorithm R maintains a uniform random sample of size $n$ from a sequentially accessed stream of records by using a reservoir. The goal is to eliminate the reservoir when the $n$ sample records can be held in memory, performing replacements directly in the sample array.

Step 1: Initialization

Let $S[1\ldots n]$ be an array of $n$ records representing the current sample. Read the first $n$ records from the input and store them in $S[1\ldots n]$. Let $t \gets n$ denote the number of records processed so far.

At this stage, each of the first $n$ records is trivially included in the sample with probability $1 = n/n$.

Step 2: Processing subsequent records

For each subsequent record, the $(t+1)$-th record:

Increment $t \gets t+1$.
Generate an integer $M$ uniformly at random from $1$ to $t$.
If $M \le n$, replace $S[M]$ with the new record.
Otherwise, leave $S$ unchanged.

Continue until all records are processed.

This procedure eliminates the separate reservoir: each new record competes directly to replace an element of $S$ with the same probability that Algorithm R would have assigned via the reservoir mechanism.

Step 3: Justification of replacement probability

For a new record at position $t+1$, the probability that it replaces any given position in $S$ is:

$$ \Pr(\text{replacement}) = \Pr(M \le n) \cdot \Pr(\text{chosen position}) = \frac{n}{t+1} \cdot \frac{1}{n} = \frac{1}{t+1}. $$

Hence each of the $t+1$ records seen so far has equal chance $n/(t+1)$ of being in the sample after processing record $t+1$, which matches the uniformity requirement.

Step 4: Proof of uniformity by induction

We prove by induction on $t \ge n$ that after processing $t$ records, each record among the first $t$ is in $S$ with probability $n/t$.

Base case: $t = n$. Each of the first $n$ records is stored in $S$, so each is present with probability $1 = n/n$.

Inductive step: Suppose the claim holds for $t \ge n$. Consider record $t+1$.

The probability that record $t+1$ enters the sample is $\Pr(M \le n) = n/(t+1)$.
Conditional on entry, it replaces a uniformly chosen element of $S$.
Consider an existing record $x$ in $S$.
The probability that $x$ is replaced is $\Pr(M \le n) \cdot \Pr(x \text{ chosen}) = \frac{n}{t+1} \cdot \frac{1}{n} = \frac{1}{t+1}$.
Therefore, the probability that $x$ survives is $1 - 1/(t+1) = t/(t+1)$.
By the inductive hypothesis, the probability that $x$ was in $S$ before processing record $t+1$ is $n/t$.
Hence, the probability that $x$ is in $S$ after processing record $t+1$ is:

$$ \Pr(x \in S \text{ after }) = \frac{n}{t} \cdot \frac{t}{t+1} = \frac{n}{t+1}. $$

Similarly, the new record $t+1$ is included in $S$ with probability $n/(t+1)$.

Thus, after processing $t+1$ records, each of the first $t+1$ records is in $S$ with probability $n/(t+1)$, maintaining uniformity. ∎

Step 5: Memory usage

The algorithm uses exactly $n$ memory locations for $S[1\ldots n]$. The reservoir array is unnecessary. Each replacement requires a single assignment.

Step 6: Conclusion

By storing the current sample in $S$ and performing direct replacements, we have eliminated the reservoir in Algorithm R. The resulting procedure maintains a uniform random sample of size $n$ at all times and uses only $n$ memory locations.

References in Knuth's notation: Section 3.4.2, Algorithm R, pp. 138–139 (TAOCP, Vol. 2, 3rd edition).

This corrected solution fixes the critical probability calculation in the induction step and provides a fully rigorous justification of uniformity.