TAOCP 4.2.3 Exercise 6
Section 4.2.3: Double-Precision Calculations
Exercise 6. [**] [23] Assume that the double-precision subroutines of this section and the single-precision subroutines of Section 4.2.1 are being used in the same main program. Write a subroutine that converts a single-precision floating point number into double-precision form (1), and write another subroutine that converts a double-precision floating point
Verified: yes
Solve time: 3m47s
Let the single-precision format of Section 4.2.1 be
$$ (\sigma\mid e\mid f_1f_2f_3f_4f_5), $$
where $\sigma$ is the sign bit, $e$ is the characteristic, and $f_1,\ldots,f_5$ are the five fraction bytes.
Let the double-precision format (1) of Section 4.2.3 be
$$ \bigl(\sigma\mid e\mid f_1f_2f_3f_4f_5\,;\, f_6f_7f_8f_9f_{10}\bigr). $$
The value represented by the single-precision number is
$$ (-1)^\sigma (.f_1f_2f_3f_4f_5)_b\, b^{\,e-b/2}, $$
while the value represented by the double-precision number is
$$ (-1)^\sigma (.f_1f_2f_3f_4f_5f_6f_7f_8f_9f_{10})_b\, b^{\,e-b/2}. $$
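The two value formulas can be checked with exact rational arithmetic. The following is a minimal Python sketch, not a MIX program: the byte size `B = 64`, the function names, and the representation of a number as a sign bit, a characteristic, and tuples of fraction bytes are all assumptions made for illustration.

```python
from fractions import Fraction

B = 64  # assumed byte size (radix b); the formulas use the excess b/2


def fraction_value(digits, b=B):
    """Return (.d1 d2 ... dn)_b as an exact Fraction."""
    total = Fraction(0)
    for k, d in enumerate(digits, start=1):
        total += Fraction(d, b ** k)
    return total


def single_value(sign, e, f, b=B):
    """Value of (sign | e | f1..f5): (-1)^sign * (.f1..f5)_b * b^(e - b/2)."""
    return (-1) ** sign * fraction_value(f, b) * Fraction(b) ** (e - b // 2)


def double_value(sign, e, f_hi, f_lo, b=B):
    """Value of (sign | e | f1..f5 ; f6..f10) with a ten-byte fraction."""
    digits = tuple(f_hi) + tuple(f_lo)
    return (-1) ** sign * fraction_value(digits, b) * Fraction(b) ** (e - b // 2)
```

For example, with `B = 64` the word with `sign = 0`, `e = 33`, and fraction bytes `(32, 0, 0, 0, 0)` has fraction 1/2 and exponent 33 - 32 = 1, so `single_value` returns 32.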
Single precision $\rightarrow$ double precision
The conversion is exact. Copy the sign, characteristic, and five fraction bytes into the first word of the double-precision number and set the second word equal to zero.
$$ (\sigma\mid e\mid f_1f_2f_3f_4f_5) \longmapsto \bigl(\sigma\mid e\mid f_1f_2f_3f_4f_5\,;\, 0\,0\,0\,0\,0\bigr). $$
Since the appended fraction bytes are all zero,
$$ (.f_1f_2f_3f_4f_5)_b = (.f_1f_2f_3f_4f_5\,0\,0\,0\,0\,0)_b, $$
hence the numerical value is unchanged.
A suitable subroutine is:
$$ \begin{array}{l} \text{SUBROUTINE S\_TO\_D}(S,D_1,D_2)\\[2mm] D_1 \leftarrow S,\quad D_2 \leftarrow 0,\quad \text{RETURN}. \end{array} $$
Here $S$ is the single-precision word and $(D_1,D_2)$ is the resulting double-precision number.
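The single-to-double conversion above can be sketched in Python. This is an illustration under the format described in this solution, not a MIX subroutine: a single-precision word is modeled as a hypothetical tuple `(sign, e, f)` with `f` a tuple of five fraction bytes, and the second double-precision word as a tuple of five bytes. The helper `frac` and the byte size 64 used in the check are likewise assumptions.

```python
from fractions import Fraction


def s_to_d(s):
    """Single -> double: D1 <- S, D2 <- 0. No information is lost."""
    sign, e, f = s
    d1 = (sign, e, tuple(f))   # first word: sign, characteristic, f1..f5
    d2 = (0, 0, 0, 0, 0)       # second word: f6..f10, all zero
    return d1, d2


def frac(digits, b):
    """(.d1 d2 ... dn)_b as an exact Fraction."""
    return sum(Fraction(d, b ** k) for k, d in enumerate(digits, start=1))


# Usage: the ten-byte fraction equals the original five-byte fraction.
s = (1, 40, (1, 2, 3, 4, 5))
d1, d2 = s_to_d(s)
same_fraction = frac(s[2], 64) == frac(d1[2] + d2, 64)
```

Since the sign and characteristic are copied unchanged and the fraction values agree, the represented value is the same.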
Double precision $\rightarrow$ single precision
Let
$$ D= \bigl(\sigma\mid e\mid f_1f_2f_3f_4f_5\,;\, f_6f_7f_8f_9f_{10}\bigr). $$
The simplest conversion retains the first word and discards the second:
$$ D \longmapsto (\sigma\mid e\mid f_1f_2f_3f_4f_5). $$
This truncates the low-order five fraction bytes.
If rounding to the nearest single-precision number is desired, proceed as follows.
Let
$$ T=(.f_6f_7f_8f_9f_{10})_b $$
be the discarded part, measured in units of the last retained place, so that $0\le T<1$.
- If $T<\frac12$, leave $f_1,\ldots,f_5$ unchanged.
- If $T>\frac12$, add one unit in the last retained place to the fraction $f_1\cdots f_5$.
- If $T=\frac12$ (possible only when $b$ is even), apply the tie-breaking rule of Section 4.2.1, namely choose the representable single-precision number whose last retained digit is even.
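Any carry generated in the second or third case is propagated through the retained fraction. If the carry runs out of $f_1$, the fraction becomes $(.1\,0\,0\,0\,0)_b$ and the characteristic $e$ is increased by 1, which may in turn cause exponent overflow.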
A suitable subroutine is therefore:
$$ \begin{array}{l} \text{SUBROUTINE D\_TO\_S}(D_1,D_2,S)\\[2mm] S \leftarrow D_1;\\ \text{if rounding is desired, use } D_2 \text{ to round the fraction in } S \text{ according to the rules above};\\ \text{RETURN}. \end{array} $$
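The double-to-single conversion, with truncation or with rounding and carry propagation, can be sketched in Python as follows. This is an illustration under stated assumptions rather than a MIX subroutine: the tuple representation `(sign, e, f)`, the default byte size `b = 64`, the `rounding` flag, and the choice to signal exponent overflow by raising `OverflowError` when the characteristic no longer fits in one byte are all hypothetical. The discarded word is compared with one half of a unit in the last retained place using integers only.

```python
def d_to_s(d1, d2, b=64, rounding=True):
    """Double -> single: keep the first word, optionally round using d2."""
    sign, e, f = d1
    f = list(f)
    if rounding:
        # t is the integer (f6 f7 f8 f9 f10)_b; comparing 2*t with b^5
        # decides whether the discarded part is below, at, or above 1/2.
        t = 0
        for digit in d2:
            t = t * b + digit
        twice, full = 2 * t, b ** len(d2)
        # Round up when above one half, or on a tie if f5 is odd (round to even).
        if twice > full or (twice == full and f[-1] % 2 == 1):
            i = len(f) - 1
            while i >= 0:
                f[i] += 1
                if f[i] < b:
                    break          # no further carry
                f[i] = 0           # digit wrapped; carry into the next byte
                i -= 1
            else:
                # Carry ran out of f1: fraction becomes (.1 0 0 0 0)_b, e + 1.
                f[0] = 1
                e += 1
                if e >= b:         # assumed: characteristic occupies one byte
                    raise OverflowError("exponent overflow")
    return (sign, e, tuple(f))
```

With `rounding=False` the function simply returns the first word, which is the truncating conversion. With rounding, a discarded word of `(32, 0, 0, 0, 0)` is an exact tie for `b = 64`, so the last retained byte changes only when it is odd.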
Thus single $\rightarrow$ double conversion is exact, while double $\rightarrow$ single conversion consists of retaining the first word, with optional rounding based on the discarded second word.