TAOCP 4.4 Exercise 17

Let \mathcal D(p)=\{\pm (0.

Section 4.4: Radix Conversion

Exercise 17. [M40] Design "ideal" floating point conversion subroutines, taking $p$-digit decimal numbers into $P$-digit binary numbers and vice versa, in both cases producing a true rounded result in the sense of Section 4.2.2.

Verified: no
Solve time: 8m01s


Solution

Let

$$ \mathcal D(p)={\pm (0.d_1d_2\cdots d_p)_{10},10^e : d_1\neq0} $$

be the set of normalized $p$-digit decimal floating-point numbers, and let

$$ \mathcal B(P)={\pm (0.b_1b_2\cdots b_P)_2,2^E : b_1=1} $$

be the set of normalized $P$-bit binary floating-point numbers.

The goal is to convert in both directions and return the element of the target system obtained by rounding the exact value according to the rules of Section 4.2.2. This requires knowing exactly on which side of the midpoint between adjacent representable numbers the true value lies. All intermediate computations are performed with integers; no intermediate floating-point approximation is used.

1. Decimal $\to$ binary conversion

Exact rational representation

Let

$$ x = \pm (0.d_1 d_2 \cdots d_p)_{10},10^e. $$

Write

$$ x = \pm N,10^k, $$

where $N$ is an integer with $p$ decimal digits and $k=e-(p-1)$. Factor the denominator so that

$$ x = \pm \frac{A}{B}, \qquad B = 2^\alpha 5^\beta, $$

with $A,B$ positive integers.

Determination of the binary exponent

Let $t$ be the exponent such that the normalized binary significand $m$ satisfies

$$ x = m,2^t, \qquad 1 \le m < 2. $$

To find $t$ exactly, determine the unique integer $t$ such that

$$ 2^t \le \frac{A}{B} < 2^{t+1}. $$

This can be accomplished using integer comparison and integer powers of 2. Then the normalized significand is

$$ m = \frac{A}{B,2^t}. $$

Extraction of the leading $P$ binary digits

To extract $P$ binary digits, we compute

$$ m \cdot 2^{P-1+s} = \frac{A \cdot 2^{P-1+s}}{B \cdot 2^t}, $$

where $s\ge1$ is the number of guard bits to ensure correct rounding. Perform integer division:

$$ A,2^{P-1+s} = Q' B 2^t + R', \qquad 0 \le R' < B,2^t. $$

Define

$$ Q' = q,2^s + r, \qquad 0 \le r < 2^s. $$

Then

$$ m \cdot 2^{P-1+s} = Q' + \frac{R'}{B,2^t} = q,2^s + r + \frac{R'}{B,2^t}. $$

Dividing by $2^s$, we obtain

$$ m \cdot 2^{P-1} = q + \delta, \qquad \delta = \frac{r + R'/(B,2^t)}{2^s}, \quad 0 \le \delta < 1. $$

Hence $q$ is the truncated $P$-bit significand, and $\delta$ is the exact fractional remainder.

Exact midpoint test

The two adjacent $P$-bit significands are $q$ and $q+1$. The midpoint is at $q + \frac12$. Round according to the comparison:

$$ \delta \lessgtr \frac12 \quad \Longleftrightarrow \quad r + \frac{R'}{B,2^t} \lessgtr 2^{s-1}. $$

Multiplying both sides by $2^s$, the exact rounding decision is

$$ r,B,2^t + R' \lessgtr 2^{s-1},B,2^t. $$

  • If the left-hand side is smaller, round downward to $q$.
  • If larger, round upward to $q+1$.
  • If equal, apply the tie-breaking rule of Section 4.2.2.

No approximation is involved; this correctly determines the nearest $P$-bit binary floating-point number.

Renormalization

If rounding causes

$$ q = 2^P - 1 \quad \longrightarrow \quad q+1 = 2^P, $$

then the significand becomes $1.0\cdots0$ and the exponent $t$ is incremented by 1. The resulting number is exactly the correctly rounded binary value of $x$.

2. Binary $\to$ decimal conversion

Exact rational representation

Let

$$ y = \pm (0.b_1 b_2 \cdots b_P)_2,2^E. $$

Write

$$ y = \pm M,2^h, $$

where $M$ is an integer containing $P$ significant bits and $h$ is the exponent offset. Choose $\beta = \max(0, -h)$, so that

$$ y = \pm \frac{M,2^{h+\beta},5^\beta}{10^\beta} = \pm \frac{C}{10^\beta}, \qquad C = M,2^{h+\beta},5^\beta. $$

Thus the exact value is represented as a rational number $C / 10^\beta$.

Determination of the decimal exponent

Let $u$ satisfy

$$ 10^u \le \frac{C}{10^\beta} < 10^{u+1}. $$

Equivalently,

$$ 10^{u+\beta} \le C < 10^{u+\beta+1}. $$

Then the normalized decimal significand is

$$ m = \frac{C}{10^{u+\beta}}, \qquad 1 \le m < 10. $$

Extraction of the leading $p$ decimal digits

Choose $s \ge 1$ guard digits. Compute

$$ C,10^{p-1+s} = Q,10^{u+\beta} + R, \qquad 0 \le R < 10^{u+\beta}, $$

and write

$$ Q = q,10^s + r, \qquad 0 \le r < 10^s. $$

Then

$$ m,10^{p-1} = q + \delta, \qquad \delta = \frac{r + R / 10^{u+\beta}}{10^s}. $$

Here $q$ is the truncated $p$-digit significand, and $\delta$ is the exact discarded fraction.

Exact midpoint test

The two adjacent $p$-digit decimal significands are $q$ and $q+1$. Compare

$$ \delta \lessgtr \frac12 \quad \Longleftrightarrow \quad r + \frac{R}{10^{u+\beta}} \lessgtr \frac12 \cdot 10^s. $$

Multiplying through by $10^{u+\beta}$ gives the exact integer comparison

$$ r,10^{u+\beta} + R \lessgtr \frac12 \cdot 10^{s+u+\beta}. $$

  • Round down if smaller,
  • Round up if larger,
  • Apply the tie rule if equal.

Renormalization

If rounding changes $9.999\cdots9 \to 10.000\cdots0$, replace the significand by $1.000\cdots0$ and increase the decimal exponent by 1. The resulting decimal number is exactly the correctly rounded $p$-digit floating-point value of $y$.

3. Correctness

  1. In the decimal-to-binary conversion, the normalized significand $m = A / (B 2^t)$ is used in the digit extraction formula. The rounding test

$$ r,B,2^t + R' \lessgtr 2^{s-1} B,2^t $$

compares the exact fractional part $\delta$ with $1/2$, so the nearest $P$-bit binary number is correctly chosen, and renormalization ensures proper adjustment of the exponent.

  1. In the binary-to-decimal conversion, the exact rational $C / 10^\beta$ and integer comparisons ensure that $q$ is the correctly truncated significand and that rounding selects the nearest $p$-digit decimal number according to Section 4.2.2.

Both procedures therefore produce the true rounded result as required.

4. Algorithmic considerations

  • All computations are performed with integers of size $O(P \log 2 + p \log 10)$ bits, which is polynomial in the precision.
  • Determining exponents $t$ and $u$ requires comparing integers against powers of 2 or 10, which can be done efficiently via integer multiplication or binary search.
  • Digit extraction uses integer division and modulus operations, ensuring exact control over discarded fractions.
  • Guard bits $s\ge 1$ guarantee that rounding is performed correctly, even in halfway cases.