TAOCP 4.2.3 Exercise 6

Section 4.2.3: Double-Precision Calculations

Exercise 6. [**] [23] Assume that the double-precision subroutines of this section and the single-precision subroutines of Section 4.2.1 are being used in the same main program. Write a subroutine that converts a single-precision floating point number into double-precision form (1), and write another subroutine that converts a double-precision floating point

Let the single-precision format of Section 4.2.1 be

$$ (\sigma\mid e\mid f_1f_2f_3f_4f_5), $$

where $\sigma$ is the sign bit, $e$ is the characteristic, and $f_1,\ldots,f_5$ are the five fraction bytes.

Let the double-precision format (1) of Section 4.2.3 be

$$ \bigl(\sigma\mid e\mid f_1f_2f_3f_4f_5,\; f_6f_7f_8f_9f_{10}\bigr). $$

The value represented by the single-precision number is

$$ (-1)^\sigma\,(.f_1f_2f_3f_4f_5)_b\; b^{\,e-b/2}, $$

while the value represented by the double-precision number is

$$ (-1)^\sigma\,(.f_1f_2f_3f_4f_5f_6f_7f_8f_9f_{10})_b\; b^{\,e-b/2}. $$
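The two value formulas can be checked with exact rational arithmetic. The following is a minimal sketch, assuming the layout stated above (one characteristic byte, excess $b/2$, radix-$b$ fraction digits); the name `float_value` and the representation of a number as a sign, a characteristic, and a digit list are illustrative choices, not part of the original text.

```python
from fractions import Fraction

def float_value(sign, e, digits, b):
    """Exact value of (sign | e | digits): (-1)^sign * (.d1 d2 ...)_b * b^(e - b/2).

    Hypothetical helper: `digits` holds 5 digits for single precision
    or 10 digits for double precision; b is the (even) byte size.
    """
    frac = Fraction(0)
    for k, d in enumerate(digits, start=1):
        frac += Fraction(d, b ** k)          # digit d_k contributes d_k * b^(-k)
    return (-1) ** sign * frac * Fraction(b) ** (e - b // 2)

# Same leading digits, with and without five trailing zero digits.
single = float_value(0, 6, [1, 2, 5, 0, 0], 10)
double = float_value(0, 6, [1, 2, 5, 0, 0, 0, 0, 0, 0, 0], 10)
```

With $b=10$ the characteristic 6 stands for the exponent $6-5=1$, so both calls evaluate $.125\times 10$.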

Single precision $\rightarrow$ double precision

The conversion is exact. Copy the sign, characteristic, and five fraction bytes into the first word of the double-precision number and set the second word equal to zero.

$$ (\sigma\mid e\mid f_1f_2f_3f_4f_5) \longmapsto \bigl(\sigma\mid e\mid f_1f_2f_3f_4f_5,\; 0\,0\,0\,0\,0\bigr). $$

Since the appended fraction bytes are all zero,

$$ (.f_1f_2f_3f_4f_5)_b = (.f_1f_2f_3f_4f_5\,0\,0\,0\,0\,0)_b, $$

hence the numerical value is unchanged.

A suitable subroutine is:

$$ \begin{array}{l} \text{SUBROUTINE S\_TO\_D}(S,D_1,D_2)\\[2mm] D_1 \leftarrow S;\\ D_2 \leftarrow 0;\\ \text{RETURN}. \end{array} $$

Here $S$ is the single-precision word and $(D_1,D_2)$ is the resulting double-precision number.
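The same conversion can be sketched in Python. This is only an illustration under the format assumed above: a number is modeled as a tuple `(sign, e, digits)`, and the name `s_to_d` is hypothetical.

```python
def s_to_d(s):
    """Widen a single-precision number (sign, e, 5 digits) to double precision.

    The sign, characteristic, and five fraction digits are copied
    unchanged; the second word contributes five zero digits.
    """
    sign, e, digits = s
    assert len(digits) == 5, "single precision carries five fraction digits"
    return sign, e, tuple(digits) + (0, 0, 0, 0, 0)

d = s_to_d((1, 7, (3, 1, 4, 1, 5)))
```

Because only zero digits are appended, no rounding or exponent adjustment is needed in this direction.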

Double precision $\rightarrow$ single precision

Let

$$ D= \bigl(\sigma\mid e\mid f_1f_2f_3f_4f_5,\; f_6f_7f_8f_9f_{10}\bigr). $$

The simplest conversion retains the first word and discards the second:

$$ D \longmapsto (\sigma\mid e\mid f_1f_2f_3f_4f_5). $$

This truncates the low-order five fraction bytes.
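A minimal sketch of the truncating conversion follows, again modeling a number as `(sign, e, digits)`; the names `d_to_s_truncate` and `fraction_of` are illustrative. The helper shows that the fraction shrinks by less than one unit in the last retained place, $b^{-5}$.

```python
from fractions import Fraction

def d_to_s_truncate(d):
    """Keep the first word of a double-precision number, drop the second."""
    sign, e, digits = d
    assert len(digits) == 10, "double precision carries ten fraction digits"
    return sign, e, tuple(digits[:5])

def fraction_of(digits, b):
    """Exact value of the radix-b fraction (.d1 d2 ...)_b."""
    return sum(Fraction(x, b ** k) for k, x in enumerate(digits, start=1))

d = (0, 5, (1, 2, 3, 4, 5, 9, 9, 9, 9, 9))
s = d_to_s_truncate(d)
# Amount removed from the fraction by truncation (never negative).
err = fraction_of(d[2], 10) - fraction_of(s[2], 10)
```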

If rounding to the nearest single-precision number is desired, proceed as follows.

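Let

$$ T=(f_6f_7f_8f_9f_{10})_b, \qquad 0\le T<b^5, $$

be the discarded digits read as an integer, so that one unit in the last retained place corresponds to $T=b^5$ and half a unit to $T=\frac12 b^5$.

1. If $T<\frac12 b^5$, leave $f_1,\ldots,f_5$ unchanged.
2. If $T>\frac12 b^5$, add one unit in the last retained place to the fraction $f_1\cdots f_5$.
3. If $T=\frac12 b^5$, apply the tie-breaking rule of Section 4.2.1 (step N5 of Algorithm N): of the two candidates, choose the fraction $f'$ for which $b^5f'+\frac12 b$ is odd. When $\frac12 b$ is odd (for example $b=10$), this is the candidate whose last retained digit is even.

Any carry generated by step 2 or 3 is propagated through the retained fraction. If it runs off the left end, the fraction has become 1; replace it by $.10000$ and increase the characteristic $e$ by 1, reporting exponent overflow if $e$ no longer fits in its byte.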

A suitable subroutine is therefore:

$$ \begin{array}{l} \text{SUBROUTINE D\_TO\_S}(D_1,D_2,S)\\[2mm] S \leftarrow D_1;\\ \text{if rounding is desired, use } D_2 \text{ to round the fraction in } S \text{ according to the rules above};\\ \text{RETURN}. \end{array} $$
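The rounding variant can be sketched as follows. This is an illustration under stated assumptions, not a transcription of a MIX routine: numbers are modeled as `(sign, e, digits)`, the byte size `b` is even and passed explicitly, and the discarded digits are compared as the integer $(f_6\ldots f_{10})_b$ against $b^5/2$. On an exact tie the sketch keeps the candidate $F$ for which $F+b/2$ is odd, which is how step N5 of Algorithm N in Section 4.2.1 is read here; for $b=10$ this is ordinary round-half-to-even. The name `d_to_s_round` is hypothetical.

```python
def d_to_s_round(d, b):
    """Round a double-precision number (sign, e, 10 digits) to single precision."""
    sign, e, digits = d
    assert len(digits) == 10 and b % 2 == 0

    F = 0                                # retained digits f1..f5 as an integer
    for x in digits[:5]:
        F = F * b + x
    T = 0                                # discarded digits f6..f10 as an integer
    for x in digits[5:]:
        T = T * b + x

    half = b ** 5 // 2                   # half a unit in the last retained place
    if T > half or (T == half and (F + b // 2) % 2 == 0):
        F += 1                           # round up (or break the tie upward)

    if F == b ** 5:                      # carry ran off the left end: fraction is 1
        F = b ** 4                       # renormalize to .10000
        e += 1
        if e >= b:                       # characteristic no longer fits in one byte
            raise OverflowError("exponent overflow")

    out = []
    for _ in range(5):                   # unpack F back into five radix-b digits
        F, r = divmod(F, b)
        out.append(r)
    return sign, e, tuple(reversed(out))

rounded = d_to_s_round((0, 5, (1, 2, 3, 4, 5, 5, 0, 0, 0, 1)), 10)
```

The sign is carried along untouched, so rounding is performed on the magnitude and is symmetric about zero.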

Thus single $\rightarrow$ double conversion is exact, while double $\rightarrow$ single conversion consists of retaining the first word, with optional rounding based on the discarded second word.