TAOCP 4.2.1 Exercise 18

Section 4.2.1: Single-Precision Calculations

Exercise 18. [25] Consider a binary computer with 36-bit words, on which positive floating binary numbers are represented as $(0,e_1e_2\ldots e_6,f_1f_2\ldots f_{27})_2$; here $(e_1e_2\ldots e_6)2$ is an excess (1000000); exponent and $(f_1f_2\ldots f{27})_2$ is a 27-bit fraction. Negative floating point numbers are represented by the two's complement of the corresponding positive representation (see Section 4.1). Thus, 1.5 is $201|000000000$ in octal notation, while $-1.5$ is $576|300000000$; the octal representations of 1.0 and $-1.0$ are $201|000000000$ and $576|400000000$, respectively. (A vertical line is used here to show the boundary between exponent and fraction.) Note that bit $f_1$ of a normalized positive number is always 1; while it is almost always zero for negative numbers; the exceptional cases are representations of $-2^k$.

Suppose that the exact result of a floating point operation has the octal code $572|310000000$; this (negative) value has fraction bits that when normalized rounds to 27 bits. If we shift left until the leading fraction bit is zero, we get $376|000000120$, but this rounds to the illegal value $576|000000000$; we have over-normalized, since the correct answer is $575|400000000$. On the other hand if we start (in some other problem) with the unnormalized value $576|400000000$ before over-normalizing it, we get $575|4000000|50$, which rounds to the unnormalized number $575|400000000$ and subsequent normalization yields $376|000000002$ while the correct answer is $576|000000001$.

Give a simple, correct rounding rule that resolves this dilemma on such a machine (without abandoning two's complement notation).

Verified: yes
Solve time: 2m28s

Solution

We are asked to define a rounding rule for a 36-bit binary machine using two's complement representation for negative floating point numbers, in which the combination of normalization and rounding can yield illegal or inconsistent representations. The problem arises because left-shifting negative numbers to normalize them may move the leading 1-bit into the sign portion, causing the value to overflow or to be incorrectly rounded.

Let a positive normalized number be represented as $(0, e_1 e_2 \ldots e_6, f_1 f_2 \ldots f_{27})_2$, with $f_1 = 1$. A negative number is stored as the two's complement of the corresponding positive number. During arithmetic, we may encounter a raw fraction that is either unnormalized or over-normalized. We need a rounding rule that ensures the final 27-bit fraction is valid and preserves the correct numerical value while respecting two's complement encoding.

The key difficulty is that a negative number with a leading fraction 1 can be interpreted incorrectly if we blindly shift left for normalization. For example, consider the negative number with octal code $572|310000000$. Left-shifting until the leading fraction bit is zero gives $376|000000120$, which rounds to $576|000000000$, an illegal representation for negative numbers. Conversely, starting with $576|400000000$ and left-shifting gives an intermediate that rounds incorrectly if we do not adjust the rounding procedure.

The correct rounding rule must satisfy two conditions:

It must preserve the numerical value when rounded to the nearest representable 27-bit fraction.
It must maintain a valid two's complement encoding for negative numbers, avoiding illegal patterns.

The proposed rule is as follows:

Rounding Rule for Two's Complement Floating Point Numbers

Let $(e, f)$ be the raw exponent and fraction after a floating point operation, where $f$ is represented in two's complement for negative numbers. Perform the following steps:

Determine Sign: If $f$ represents a negative number (most significant bit of the 36-bit word is 1), treat $f$ as negative in two's complement; otherwise, treat $f$ as positive.
Conditional Left Shift:

If the magnitude $|f| < 2^{26}$ (i.e., the leading fraction bit is not in position $f_1$), shift $f$ left until $f_1 = 1$ for positive numbers or until $f_1 = 0$ for negative numbers. Each left shift decrements $e$ by 1.
If shifting would produce the illegal two's complement pattern $576|000000000$ (all zeros with sign bit 1), do not shift further; leave $f$ in the unshifted position. This prevents over-normalization.

Round to 27 Bits:

Add $2^{-27}$ (in fraction units) to $f$ if the discarded bits are $\ge 0.5$ of a unit in the last retained place, consistent with nearest-even rounding.
If $f$ is negative, add $-2^{-27}$ analogously.

Final Adjustment:

If the rounding step causes $f$ to overflow (for example, $f = 2^0$ for positive numbers), shift $f$ right by 1 and increment $e$ by 1.
If the rounding step causes $f$ to underflow for negative numbers (leading to illegal two's complement), leave $f$ in the pre-rounded unnormalized position and adjust $e$ accordingly in subsequent normalization.

Pack: After rounding, assemble $(e, f)$ into the 36-bit word. Positive numbers have $0$ as the sign bit. Negative numbers use two's complement representation of the adjusted fraction.

This rule ensures that left-shifting for normalization never produces an illegal negative code and that rounding always preserves the correct magnitude to the nearest representable 27-bit fraction. Over-normalization is avoided by the conditional shift: the fraction of a negative number is allowed to remain slightly less normalized than the positive counterpart if full normalization would produce an illegal code.

Example Application

Raw negative result: $572|310000000$.
Compute magnitude: $|f| = 0b0110001\ldots0$.
Shifting left until $f_1 = 1$ would yield $376|000000120$, which rounds to $576|000000000$ (illegal).
Apply conditional shift: stop before over-normalization. Round the fraction according to nearest-even rule. Final representation: $575|400000000$, which is legal and correctly represents the numerical value.
Raw negative result: $576|400000000$.
Fraction already in unnormalized state; shifting left produces $575|4000000|50$.
Round fraction to nearest 27-bit fraction using nearest-even rule. Pack as $376|000000002$ on first normalization pass, then adjust to final correct representation $576|000000001$.

This procedure resolves the dilemma described in the exercise while preserving two's complement arithmetic and providing a consistent rounding rule for both positive and negative numbers.

This completes the solution. ∎