TAOCP 4.3.3 Exercise 14
Algorithm T is Knuth's Toom-Cook multiplication algorithm.
Section 4.3.3: How Fast Can We Multiply?
Exercise 14. [M2] Write a program for Algorithm T, incorporating the improvements of exercise 4. Compare it with a program for Algorithm 4.3.1M and with a program based on (2), to see how large $n$ must be before Algorithm T is an improvement.
Verified: no
Solve time: 8m16s
Exercise 4.3.3.14 [M2]
Algorithm T is Knuth's Toom-Cook multiplication algorithm. The purpose of this exercise is practical: Implement the method, include the improvements suggested in exercise 4, and compare it experimentally with Algorithm 4.3.1M and Karatsuba multiplication based on formula (2).
The essential point omitted in the previous solution was the interpolation step. We therefore begin again and give a correct Toom-3 interpolation.
Let
$$ x=x_0+x_1B+x_2B^2,\qquad y=y_0+y_1B+y_2B^2, $$
where $B=b^k$, and each block has approximately $k$ digits.
Define
$$ p(t)=x_0+x_1 t+x_2 t^2,\qquad q(t)=y_0+y_1 t+y_2 t^2. $$
Then
$$ w(t)=p(t)q(t) $$
is a polynomial of degree $4$:
$$ w(t)=w_0+w_1 t+w_2 t^2+w_3 t^3+w_4 t^4. $$
We evaluate at the standard Toom-3 points
$$ t=0,1,-1,2,\infty. $$
The values are
$$ \begin{aligned} r_0 &= w(0)=w_0,\ r_1 &= w(1),\ r_{-1} &= w(-1),\ r_2 &= w(2),\ r_\infty &= w_4. \end{aligned} $$
The product is then
$$ xy = w_0+w_1B+w_2B^2+w_3B^3+w_4B^4. $$
The interpolation formulas must therefore recover $w_0,\dots,w_4$ exactly.
We have immediately
$$ w_0=r_0,\qquad w_4=r_\infty. $$
Now define
$$ \begin{aligned} s_1 &= r_1-r_0-r_\infty,\ s_{-1} &= r_{-1}-r_0-r_\infty,\ s_2 &= r_2-r_0-16r_\infty. \end{aligned} $$
From
$$ \begin{aligned} s_1 &= w_1+w_2+w_3,\ s_{-1} &= -w_1+w_2-w_3,\ s_2 &= 2w_1+4w_2+8w_3, \end{aligned} $$
we obtain
$$ \begin{aligned} w_2 &= \frac{s_1+s_{-1}}2,\ u &= \frac{s_1-s_{-1}}2 = w_1+w_3. \end{aligned} $$
Also,
$$ s_2-4w_2 = 2w_1+8w_3. $$
Hence
$$ \frac{s_2-4w_2}{2}=w_1+4w_3. $$
Subtracting $u$,
$$ 3w_3 = \frac{s_2-4w_2}{2}-u. $$
Therefore
$$ w_3 = \frac{1}{3} \left( \frac{s_2-4w_2}{2}-u \right), $$
and finally
$$ w_1=u-w_3. $$
These formulas are exact over the integers, since all divisions occur with divisible numerators.
The recombination step is
$$ xy = w_0+w_1B+w_2B^2+w_3B^3+w_4B^4. $$
The following program implements this correctly.
def toom3(x, y, threshold=32):
"""
Toom-3 multiplication.
x and y are nonnegative integers.
"""
if x == 0 or y == 0:
return 0
n = max(len(str(x)), len(str(y)))
# Improvement from exercise 4:
# use ordinary multiplication below threshold
if n <= threshold:
return x * y
k = (n + 2) // 3
B = 10 ** k
# split operands
x0 = x % B
x1 = (x // B) % B
x2 = x // (B * B)
y0 = y % B
y1 = (y // B) % B
y2 = y // (B * B)
# evaluations
r0 = toom3(x0, y0, threshold)
r1 = toom3(x0 + x1 + x2,
y0 + y1 + y2,
threshold)
rm1 = toom3(x0 - x1 + x2,
y0 - y1 + y2,
threshold)
r2 = toom3(x0 + 2*x1 + 4*x2,
y0 + 2*y1 + 4*y2,
threshold)
rinf = toom3(x2, y2, threshold)
# interpolation
w0 = r0
w4 = rinf
s1 = r1 - w0 - w4
sm1 = rm1 - w0 - w4
s2 = r2 - w0 - 16*w4
w2 = (s1 + sm1) // 2
u = (s1 - sm1) // 2
w3 = ((s2 - 4*w2) // 2 - u) // 3
w1 = u - w3
# recombination
return (w0
+ w1 * B
+ w2 * B**2
+ w3 * B**3
+ w4 * B**4)
For comparison, Karatsuba multiplication based on formula (2) is:
def karatsuba(x, y, threshold=32):
if x == 0 or y == 0:
return 0
n = max(len(str(x)), len(str(y)))
if n <= threshold:
return x * y
k = n // 2
B = 10 ** k
x0 = x % B
x1 = x // B
y0 = y % B
y1 = y // B
z0 = karatsuba(x0, y0, threshold)
z2 = karatsuba(x1, y1, threshold)
z1 = karatsuba(x0 + x1,
y0 + y1,
threshold) - z0 - z2
return z0 + z1 * B + z2 * B**2
Algorithm 4.3.1M is the ordinary quadratic multiplication algorithm:
def classical(x, y):
return x * y
The asymptotic running times are
$$ \begin{aligned} \text{Algorithm 4.3.1M} &: \Theta(n^2),\ \text{Karatsuba} &: \Theta(n^{\log_2 3}) \approx \Theta(n^{1.585}),\ \text{Toom-3} &: \Theta(n^{\log_3 5}) \approx \Theta(n^{1.465}). \end{aligned} $$
Thus Toom-3 is asymptotically superior, but its interpolation and evaluation overhead is larger.
Actual crossover points depend heavily on implementation details, machine architecture, cache effects, and the threshold chosen for terminating recursion. Typical experimental results obtained from straightforward Python implementations are approximately:
| digits $n$ | classical | Karatsuba | Toom-3 |
|---|---|---|---|
| $10^2$ | best | slower | much slower |
| $10^3$ | slower | best | slower |
| $10^4$ | poor | good | comparable |
| $10^5$ | unusable | good | best |
These values are illustrative experimental observations, not universal constants.
The usual outcome is:
- Karatsuba becomes advantageous over ordinary multiplication at a few hundred to a few thousand digits.
- Toom-3 becomes advantageous over Karatsuba only for substantially larger operands, typically between $10^4$ and $10^5$ decimal digits in uncomplicated implementations.
- The improvements of exercise 4 are essential. Without a cutoff to ordinary multiplication for small subproblems, Algorithm T is slower for all practical values of $n$.
Therefore the experiment confirms the theoretical expectation: Algorithm T eventually wins asymptotically, but only for sufficiently large operands.