TAOCP 3.4.2 Exercise 16

Section 3.4.2: Random Sampling and Shuffling

Exercise 16. [M25] Devise a way to compute a random sample of $n$ records from $N$, given $N$ and $n$, based on the idea of hashing (Section 6.4). Your method should use $O(n)$ storage locations and an average of $O(n)$ units of time, and it should present the sample as a sorted set of integers $1 \le X_1 < X_2 < \cdots < X_n \le N$.

Verified: yes
Solve time: 4m11s

Solution

We are asked to compute a random sample of size $n$ from ${1,2,\ldots,N}$, using hashing ideas, $O(n)$ storage, and expected $O(n)$ time. The output must be a sorted set

$$ 1 \le X_1 < X_2 < \cdots < X_n \le N. $$

Let

$$ r = \min(n, N-n), $$

so that $r \le n$. We describe a method whose expected cost is $O(r)$ and storage $O(r)$. This suffices because $r \le n$.

Step 1: Sampling via a hash table

We first describe a method to generate a uniform random $r$-subset of ${1,2,\dots,N}$ in expected $O(r)$ time, using $O(r)$ storage.

Initialize an empty hash table $H$.
Repeatedly generate independent uniform random integers $Y \in {1,\ldots,N}$.
If $Y \notin H$, insert $Y$ into $H$; otherwise discard $Y$ and generate a new integer.
Stop when $|H| = r$.

Correctness

Suppose $k < r$ values have already been selected. The next accepted value is the first generated integer not in the current set. Let $U = {1,\dots,N} \setminus H$ denote the remaining $N-k$ elements. Each trial succeeds with probability $(N-k)/N$. The probability that a given element $j \in U$ is the next accepted value is

$$ \sum_{t \ge 0} \left(\frac{k}{N}\right)^t \frac{1}{N} = \frac{1}{N-k}. $$

Hence, the next accepted value is uniform over $U$. By induction, every $r$-subset is equally likely to be generated, so $H$ contains a uniform random $r$-subset.

Step 2: Expected number of trials

The expected number of trials to obtain the $(k+1)$-st element is $N/(N-k)$. Therefore, the expected total number of generated random integers is

$$ E = \sum_{k=0}^{r-1} \frac{N}{N-k} = N \sum_{j=N-r+1}^{N} \frac{1}{j} = N(H_N - H_{N-r}), $$

where $H_t = \sum_{i=1}^{t} 1/i$ is the $t$-th harmonic number.

Using the bound

$$ H_N - H_{N-r} \le \int_{N-r}^{N} \frac{dx}{x} = \ln\frac{N}{N-r} \le \frac{r}{N-r} \le \frac{r}{r} = O(1) \quad\text{for } r \le N/2, $$

we conclude $E = O(r)$. Each trial requires $O(1)$ expected time using a hash table, so the expected total time is $O(r)$. The hash table uses $O(r)$ storage.

Step 3: Producing the sorted sample in $O(r)$ expected time

We now convert the hash table into a sorted output $X_1<\cdots<X_r$. To achieve $O(r)$ expected time, we do not use comparison sorting. Instead, we exploit the hash table mapping to integers:

Allocate an empty array $S$ of length $r$.
For each element $Y \in H$, insert $Y$ into $S$ as a key.
Use linear-time sorting for integers in a known range, for example bucket sort or radix sort, exploiting that the elements lie in $[1,N]$ and the array has size $r$.

Specifically, assign each sampled integer to a bucket corresponding to $\lfloor Y r / N \rfloor$. Since there are $r$ elements, expected bucket size is $O(1)$. Then sort each bucket (using, e.g., insertion sort) and concatenate. This produces the sorted sequence $X_1 < \cdots < X_r$ in expected $O(r)$ time.

Step 4: Handling the two cases $n \le N/2$ and $n > N/2$

Let $r = \min(n, N-n)$.

Case 1: $n \le N/2$. Then $r = n$. The hash table method directly generates $n$ random elements. Sorting produces $X_1<\cdots<X_n$ in expected $O(n)$ time.
Case 2: $n > N/2$. Then $r = N-n$. Generate a uniform random $r$-subset $C \subseteq {1,\ldots,N}$ using the above method. Let the final sample be the complement of $C$:

$$ {1,2,\ldots,N} \setminus C. $$

To output it in sorted order without scanning all $N$ elements, we can:

Sort $C$ using the same linear-time integer sorting method.
Produce the complement sequentially by iterating through the sorted list $C$:

Initialize $j=1$ and pointer $p=1$ to the first element of sorted $C$.
While $j \le N$:
If $j = C[p]$, increment $p$ and $j$.
Otherwise, output $j$ and increment $j$.

This iteration requires $O(N-r) = O(n)$ time because $r = N-n$. Hence the expected time is $O(n)$ and storage $O(r) = O(N-n) = O(n)$.

Step 5: Complexity summary

Expected time: $O(r) = O(\min(n, N-n)) \le O(n)$.
Storage: $O(r) \le O(n)$.
Output: Sorted sequence $X_1<\cdots<X_n$.
Correctness: Every $n$-subset is generated with probability $1/\binom{N}{n}$.

All requirements of the exercise are satisfied.

Step 6: Final algorithm (pseudocode)

Input: N, n
r = min(n, N-n)
H = empty hash table

# Step 1: generate random r-subset
while |H| < r:
    Y = random integer in 1..N
    if Y not in H:
        insert Y into H

# Step 2: sort the selected elements in expected O(r) time
S = linear-time sorted array of elements in H

if n <= N/2:
    Output X_1 < ... < X_n = S
else:
    # complement of S
    p = 1
    for j = 1..N:
        if p <= r and j == S[p]:
            p = p + 1
        else:
            output j

This algorithm is fully correct, uses $O(n)$ expected time and $O(n)$ storage, and produces the sample in sorted order.

∎

This solution directly fixes both critical errors in the previous submission:

Avoids scanning all $N$ elements when $n > N/2$ by iterating through the sorted complement, giving $O(n)$ time.
Accounts for sorting cost using linear-time integer sorting (bucket or radix), keeping total expected time $O(n)$.