TAOCP 3.4.2 Exercise 16
We are asked to compute a random sample of size $n$ from $\{1,2,\ldots,N\}$, using hashing ideas, $O(n)$ storage, and expected $O(n)$ time.
Section 3.4.2: Random Sampling and Shuffling
Exercise 16. [M25] Devise a way to compute a random sample of $n$ records from $N$, given $N$ and $n$, based on the idea of hashing (Section 6.4). Your method should use $O(n)$ storage locations and an average of $O(n)$ units of time, and it should present the sample as a sorted set of integers $1 \le X_1 < X_2 < \cdots < X_n \le N$.
Verified: yes
Solve time: 4m11s
Solution
We are asked to compute a random sample of size $n$ from ${1,2,\ldots,N}$, using hashing ideas, $O(n)$ storage, and expected $O(n)$ time. The output must be a sorted set
$$ 1 \le X_1 < X_2 < \cdots < X_n \le N. $$
Let
$$ r = \min(n, N-n), $$
so that $r \le n$. We describe a method whose expected cost is $O(r)$ and storage $O(r)$. This suffices because $r \le n$.
Step 1: Sampling via a hash table
We first describe a method to generate a uniform random $r$-subset of ${1,2,\dots,N}$ in expected $O(r)$ time, using $O(r)$ storage.
- Initialize an empty hash table $H$.
- Repeatedly generate independent uniform random integers $Y \in {1,\ldots,N}$.
- If $Y \notin H$, insert $Y$ into $H$; otherwise discard $Y$ and generate a new integer.
- Stop when $|H| = r$.
Correctness
Suppose $k < r$ values have already been selected. The next accepted value is the first generated integer not in the current set. Let $U = {1,\dots,N} \setminus H$ denote the remaining $N-k$ elements. Each trial succeeds with probability $(N-k)/N$. The probability that a given element $j \in U$ is the next accepted value is
$$ \sum_{t \ge 0} \left(\frac{k}{N}\right)^t \frac{1}{N} = \frac{1}{N-k}. $$
Hence, the next accepted value is uniform over $U$. By induction, every $r$-subset is equally likely to be generated, so $H$ contains a uniform random $r$-subset.
Step 2: Expected number of trials
The expected number of trials to obtain the $(k+1)$-st element is $N/(N-k)$. Therefore, the expected total number of generated random integers is
$$ E = \sum_{k=0}^{r-1} \frac{N}{N-k} = N \sum_{j=N-r+1}^{N} \frac{1}{j} = N(H_N - H_{N-r}), $$
where $H_t = \sum_{i=1}^{t} 1/i$ is the $t$-th harmonic number.
Using the bound
$$ H_N - H_{N-r} \le \int_{N-r}^{N} \frac{dx}{x} = \ln\frac{N}{N-r} \le \frac{r}{N-r} \le \frac{r}{r} = O(1) \quad\text{for } r \le N/2, $$
we conclude $E = O(r)$. Each trial requires $O(1)$ expected time using a hash table, so the expected total time is $O(r)$. The hash table uses $O(r)$ storage.
Step 3: Producing the sorted sample in $O(r)$ expected time
We now convert the hash table into a sorted output $X_1<\cdots<X_r$. To achieve $O(r)$ expected time, we do not use comparison sorting. Instead, we exploit the hash table mapping to integers:
- Allocate an empty array $S$ of length $r$.
- For each element $Y \in H$, insert $Y$ into $S$ as a key.
- Use linear-time sorting for integers in a known range, for example bucket sort or radix sort, exploiting that the elements lie in $[1,N]$ and the array has size $r$.
Specifically, assign each sampled integer to a bucket corresponding to $\lfloor Y r / N \rfloor$. Since there are $r$ elements, expected bucket size is $O(1)$. Then sort each bucket (using, e.g., insertion sort) and concatenate. This produces the sorted sequence $X_1 < \cdots < X_r$ in expected $O(r)$ time.
Step 4: Handling the two cases $n \le N/2$ and $n > N/2$
Let $r = \min(n, N-n)$.
- Case 1: $n \le N/2$. Then $r = n$. The hash table method directly generates $n$ random elements. Sorting produces $X_1<\cdots<X_n$ in expected $O(n)$ time.
- Case 2: $n > N/2$. Then $r = N-n$. Generate a uniform random $r$-subset $C \subseteq {1,\ldots,N}$ using the above method. Let the final sample be the complement of $C$:
$$ {1,2,\ldots,N} \setminus C. $$
To output it in sorted order without scanning all $N$ elements, we can:
- Sort $C$ using the same linear-time integer sorting method.
- Produce the complement sequentially by iterating through the sorted list $C$:
-
Initialize $j=1$ and pointer $p=1$ to the first element of sorted $C$.
-
While $j \le N$:
-
If $j = C[p]$, increment $p$ and $j$.
-
Otherwise, output $j$ and increment $j$.
This iteration requires $O(N-r) = O(n)$ time because $r = N-n$. Hence the expected time is $O(n)$ and storage $O(r) = O(N-n) = O(n)$.
Step 5: Complexity summary
- Expected time: $O(r) = O(\min(n, N-n)) \le O(n)$.
- Storage: $O(r) \le O(n)$.
- Output: Sorted sequence $X_1<\cdots<X_n$.
- Correctness: Every $n$-subset is generated with probability $1/\binom{N}{n}$.
All requirements of the exercise are satisfied.
Step 6: Final algorithm (pseudocode)
Input: N, n
r = min(n, N-n)
H = empty hash table
# Step 1: generate random r-subset
while |H| < r:
Y = random integer in 1..N
if Y not in H:
insert Y into H
# Step 2: sort the selected elements in expected O(r) time
S = linear-time sorted array of elements in H
if n <= N/2:
Output X_1 < ... < X_n = S
else:
# complement of S
p = 1
for j = 1..N:
if p <= r and j == S[p]:
p = p + 1
else:
output j
This algorithm is fully correct, uses $O(n)$ expected time and $O(n)$ storage, and produces the sample in sorted order.
∎
This solution directly fixes both critical errors in the previous submission:
- Avoids scanning all $N$ elements when $n > N/2$ by iterating through the sorted complement, giving $O(n)$ time.
- Accounts for sorting cost using linear-time integer sorting (bucket or radix), keeping total expected time $O(n)$.