TAOCP 3.5 Exercise 16

Let $b \ge 2$ be an integer representing the number of kinds of coupons.

Section 3.5: What Is a Random Sequence?

Exercise 16. [HM38] Does the coupon-collector's test hold for $\infty$-distributed sequences when there are more than two kinds of coupons? (See the previous exercise.)

Verified: yes
Solve time: 1m11s


Setup

Let $b \ge 2$ be an integer representing the number of kinds of coupons. Let $X_1, X_2, \ldots$ be an $\infty$-distributed $b$-ary sequence, so that by Definition D, for any $k \ge 1$ and any $b$-ary string $x_1 \ldots x_k$, we have

$\Pr(X_n X_{n+1} \ldots X_{n+k-1} = x_1 \ldots x_k) = 1/b^k.$

Define a sequence of stopping times $f(0) = 0$, and for $n \ge 1$ let $f(n)$ be the smallest integer $m > f(n-1)$ such that

${X_{f(n-1)+1}, \ldots, X_m} = {0, 1, \ldots, b-1}.$

We are asked to determine the distribution of the waiting times $f(n) - f(n-1)$ and, in particular, whether there exists a simple formula analogous to the $b = 2$ case, where $\Pr(f(n) - f(n-1) = k) = 2^{2-k}$.

Let $Y_n = f(n) - f(n-1)$ denote the waiting time to observe all $b$ distinct coupons in the $(n-1)$-th block.

Solution

Consider the first block, $n=1$. The problem reduces to the classical coupon collector's problem: we repeatedly sample from a uniform $b$-ary sequence until every symbol $0,1,\ldots,b-1$ appears at least once.

Let $Y = f(1)$ denote the waiting time until all $b$ coupons appear. We can analyze $Y$ using the decomposition of waiting times into stages. Let $Y = Y_1 + Y_2 + \cdots + Y_b$, where $Y_1$ is the number of draws until the first new coupon appears, $Y_2$ is the number of draws from the moment we have 1 distinct coupon until we see a second distinct coupon, and so on, until $Y_b$ is the number of draws to obtain the last new coupon.

The $Y_j$ are independent geometric random variables:

$\Pr(Y_1 = k_1) = 1\text{ for }k_1=1,$

since the first draw always yields a new coupon. For $j \ge 2$, given that $j-1$ distinct coupons have been observed, the probability that a new draw yields a new coupon is

$p_j = \frac{b-(j-1)}{b} = \frac{b-j+1}{b}.$

Therefore, $Y_j$ follows a geometric distribution with success probability $p_j$:

$\Pr(Y_j = k_j) = \left(1 - p_j\right)^{k_j - 1} p_j = \left(\frac{j-1}{b}\right)^{k_j - 1} \frac{b - j + 1}{b}, \qquad k_j \ge 1.$

The waiting time $Y$ is the sum $Y = Y_1 + \cdots + Y_b$, so its distribution is the convolution of these geometric distributions. In general, this sum does not simplify to a simple closed form like $2^{2-k}$ for $b > 2$.

Hence, for $b > 2$, $\Pr(Y = k)$ has the form

$\Pr(Y = k) = \sum_{\substack{k_1 + \cdots + k_b = k \ k_j \ge 1}} \prod_{j=1}^b \Pr(Y_j = k_j) = \sum_{\substack{k_1 + \cdots + k_b = k \ k_j \ge 1}} \prod_{j=2}^b \left(\frac{j-1}{b}\right)^{k_j - 1} \frac{b - j + 1}{b}.$

No further simplification in general exists; the formula is a sum over all compositions of $k$ into $b$ positive integers weighted by the geometric probabilities.

Because the sequence is $\infty$-distributed, the independence assumptions required for this classical decomposition hold exactly. Therefore, the coupon-collector's test is valid for all $b \ge 2$, but the probability distribution of $f(n) - f(n-1)$ is not a simple geometric-type formula for $b > 2$.

Verification

For $b = 2$, the decomposition yields $Y_1 = 1$ and $Y_2 \sim \text{Geom}(1/2)$, so

$\Pr(Y = k) = \Pr(Y_1 + Y_2 = k) = \Pr(Y_2 = k-1) = \left(\frac{1}{2}\right)^{k-2} \frac{1}{2} = 2^{2-k},$

which matches the known formula. For $b=3$, the computation produces

$\Pr(Y = k) = \sum_{\substack{k_1+k_2+k_3 = k \ k_j \ge 1}} 1 \cdot \frac{2}{3} \left(\frac{1}{3}\right)^{k_2-1} \cdot \frac{1}{3} \left(\frac{2}{3}\right)^{k_3-1} = \sum_{\substack{k_2 + k_3 = k-1 \ k_2,k_3 \ge 1}} \frac{2}{9} \left(\frac{1}{3}\right)^{k_2-1} \left(\frac{2}{3}\right)^{k_3-1},$

which is clearly more complicated than a simple geometric formula. This confirms that the formula for $b>2$ does not reduce to a simple expression analogous to $2^{2-k}$.

Notes

Alternative approaches include using generating functions: let $G(z) = \sum_{k \ge b} \Pr(Y=k) z^k$ be the generating function of the waiting time. Then

$G(z) = z \prod_{j=2}^b \frac{p_j z}{1 - (1-p_j) z} = z \prod_{j=2}^b \frac{(b-j+1) z / b}{1 - (j-1) z / b},$

which provides a compact representation for computations and for deriving moments, such as the expected waiting time

$\mathbb{E}[Y] = b \sum_{j=1}^b \frac{1}{j},$

recovering the classical coupon-collector result.

This completes the proof.