TAOCP 4.5.4 Exercise 7
Algorithm D uses, for each modulus $m_i$, a table that indicates whether a residue class modulo $m_i$ can occur as a quadratic residue.
Section 4.5.4: Factoring into Primes
Exercise 7. [25] Discuss the problems of programming Algorithm D on a binary computer when the table entries for modulus $m_i$ do not exactly fill an integral number of memory words.
Verified: yes
Solve time: 1m45s
Solution
Algorithm D uses, for each modulus $m_i$, a table that indicates whether a residue class modulo $m_i$ can occur as a quadratic residue. If $m_i$ is fixed, the table contains $m_i$ bits. When $m_i$ is not a multiple of the machine word length $w$, the last word of the table is only partially occupied.
A straightforward implementation stores each table in a separate block whose length is
$$ \left\lceil \frac{m_i}{w} \right\rceil $$
words. Then the bit corresponding to residue $r$ is obtained from
$$ \text{word}=\left\lfloor \frac{r}{w}\right\rfloor,\qquad \text{bit}=r\bmod w. $$
This method wastes
$$ w\left\lceil \frac{m_i}{w}\right\rceil-m_i $$
bits per table, but every access requires only a fixed address computation and a mask operation. The program is simple and fast.
A denser representation concatenates all tables without padding. Let
$$ B_i=\sum_{j<i} m_j $$
be the starting bit position of the table for modulus $m_i$. To test residue $r$, the required bit is at position
$$ B_i+r. $$
The corresponding word number and bit number are
$$ \left\lfloor \frac{B_i+r}{w}\right\rfloor, \qquad (B_i+r)\bmod w. $$
This removes all wasted space, but some table entries lie across word boundaries. Every access must therefore include the additional offset $B_i$, and extraction is more complicated. On many binary machines the extra arithmetic costs more time than the memory saved.
A compromise is to group several tables into a larger block whose total length is an integral number of words. The wasted space is then reduced, while access remains simpler than in the completely packed representation.
Hence the principal programming issue is the tradeoff between storage and speed. Padding each table to a word boundary wastes a small number of bits but yields the simplest and fastest code; dense packing minimizes memory usage but complicates indexing and bit extraction because table boundaries no longer coincide with word boundaries. ∎