TAOCP 4.2.1 Exercise 13

Section 4.2.1: Single-Precision Calculations

Exercise 13. [30] When doing "interval arithmetic" we don't want to round the results of a floating point computation; we want rather to implement operations such as $\triangledown$ and $\triangle$, which give the tightest possible representable bounds on the true sum:

$$u \triangledown v \le u + v \le u \triangle v.$$

How should the algorithms of this section be modified for such a purpose?

Verified: yes
Solve time: 2m15s

Setup

Let $\beta=10$, let the precision be $p$, and let the floating point operations be rounded to the nearest representable number, as in Algorithm N. For interval arithmetic we seek two operations,

$$ u \triangledown v,\qquad u \triangle v, $$

such that

$$ u \triangledown v \le u+v \le u \triangle v, $$

with the lower bound as large as possible and the upper bound as small as possible among all representable floating point numbers.

The same principle applies to subtraction, multiplication, and division. The problem is therefore to modify the normalization and rounding procedures so that every operation produces either the greatest representable number not exceeding the exact result, or the least representable number not less than the exact result.

Solution

Algorithm A already computes an exact intermediate value before step N5, except for the truncations that occur while shifting in step A5 when finite registers are used. Therefore the essential modification concerns the rounding step.

For the upper operation $\triangle$, replace step N5 of Algorithm N by directed rounding upward.

Suppose that after normalization we have

$$ f=s\sum_{j=1}^\infty d_j\beta^{-j}, \qquad s=\pm1, $$

with

$$ 0\le d_j<\beta. $$

Let

$$ f^{(p)}=s\sum_{j=1}^p d_j\beta^{-j} $$

be the truncation to $p$ digits.

$$ f=f^{(p)}, $$

no change is made. Otherwise:

If $s>0$, replace $f$ by

$$ f^{(p)}+\beta^{-p}. $$

If $s<0$, replace $f$ by

$$ f^{(p)}. $$

The result is the least representable floating point number greater than or equal to the exact value. If the increment by $\beta^{-p}$ produces $|f|=1$, step N4 is executed exactly as before; this corresponds to rounding overflow.

For the lower operation $\triangledown$, reverse the direction:

If $s>0$, replace $f$ by

$$ f^{(p)}. $$

If $s<0$, replace $f$ by

$$ f^{(p)}-\beta^{-p}. $$

Again, no change is made when the exact value is already representable.

These rules produce respectively the greatest representable number not exceeding the exact value and the least representable number not less than it.

The same modification applies to multiplication and division. Every arithmetic routine should carry one additional piece of information during the computation, namely whether any nonzero digit has been discarded. In Algorithm A this information arises during step A5; in multiplication and division it arises during the formation of low-order digits beyond the $p$th place. The directed rounding decision at step N5 depends only on the sign and on whether discarded digits were nonzero.

Underflow and overflow must also be directed. If an exact positive result is too small to normalize, then:

$$ u \triangledown v = 0, $$

while

$$ u \triangle v $$

must be the least positive normalized floating point number. For a negative exact result the roles are reversed.

Similarly, if the exact result exceeds the largest representable positive number, then the upper bound overflows and must be represented by a positive infinity convention or an overflow indicator; the lower bound becomes the largest representable finite number. Negative overflow is handled symmetrically.

Thus the algorithms of this section are modified only in the final rounding stage and in the interpretation of exceptional cases: instead of rounding to nearest, the routines round monotonically downward or upward according to the desired interval endpoint.

Verification

Consider a positive exact value

$$ x=f^{(p)}+\theta\beta^{-p}, \qquad 0<\theta<1. $$

The representable numbers adjacent to $x$ are

$$ f^{(p)} \quad\text{and}\quad f^{(p)}+\beta^{-p}. $$

Hence

$$ f^{(p)}\le x\le f^{(p)}+\beta^{-p}, $$

and no representable number lies strictly between them. Therefore:

$$ x^\downarrow=f^{(p)}, \qquad x^\uparrow=f^{(p)}+\beta^{-p}, $$

are respectively the tightest lower and upper bounds.

For a negative exact value,

$$ x=-\left(g^{(p)}+\theta\beta^{-p}\right), \qquad 0<\theta<1, $$

the ordering reverses:

$$ -\left(g^{(p)}+\beta^{-p}\right)\le x\le -g^{(p)}. $$

Hence downward rounding subtracts one unit in the last place, while upward rounding truncates.

Therefore the modified algorithms satisfy

$$ u \triangledown v \le u+v \le u \triangle v, $$

and the bounds are optimal among all representable floating point numbers.

This completes the proof.

∎