Fast Unrounded Scaling: Proof by Ivy
Russ Cox January 19, 2026 research.swtch.com/fp-proof
Posted on Monday, January 19, 2026. PDF
My post "Floating-Point Printing and Parsing Can Be Simple And Fast" depends on fast unrounded scaling, defined as:
⟨x⟩ = ⌊2x⌋ || (2x ≠ ⌊2x⌋)
uscale(x, e, p) = ⟨x·2^e·10^p⟩
The unrounded form of x ∈ ℝ, ⟨x⟩, is the integer value of ⌊x⌋ concatenated with two more bits: first, the "½ bit" from the binary representation of x (the bit representing 2^(−1); 1 if x − ⌊x⌋ ≥ ½; or equivalently, ⌊2x⌋ mod 2); and second, a "sticky bit" that is 1 if any bits beyond the ½ bit were 1.
These are all equivalent definitions, using the convention that a boolean condition is 1 for true, 0 for false:
⟨x⟩ = ⌊x⌋ || (x − ⌊x⌋ ≥ ½) || (x − ⌊x⌋ ∉ {0, ½})   ("||" is bit concatenation)
    = ⌊x⌋ || (x − ⌊x⌋ ≥ ½) || (2x ≠ ⌊2x⌋)
    = ⌊2x⌋ || (2x ≠ ⌊2x⌋)
    = ⌊4x⌋ | (2x ≠ ⌊2x⌋)   ("|" is bitwise OR)
    = ⌊4x⌋ | (4x ≠ ⌊4x⌋)
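For readers who want to poke at these definitions outside Ivy, here is a small Python sketch (mine, not from the post) that computes ⟨x⟩ from an exact rational and spot-checks the ⌊4x⌋ form against the ⌊2x⌋ form for a few non-negative values:

from fractions import Fraction
from math import floor

# unrounded(x) returns the integer ⟨x⟩ = ⌊2x⌋ || sticky, i.e. 2·⌊2x⌋ + sticky.
def unrounded(x):
    two_x = 2 * x
    sticky = 1 if two_x != floor(two_x) else 0
    return (floor(two_x) << 1) | sticky

# unrounded4(x) uses the last form above: ⌊4x⌋ | (4x ≠ ⌊4x⌋).
def unrounded4(x):
    four_x = 4 * x
    return floor(four_x) | (1 if four_x != floor(four_x) else 0)

for x in (Fraction(5, 2), Fraction(7, 3), Fraction(9, 8), Fraction(3)):
    assert unrounded(x) == unrounded4(x)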
The uscale operation computes the unrounded form of x·2^e·10^p, so it needs to compute the integer ⌊2·x·2^e·10^p⌋ and then also whether the floor truncated any bits. One approach would be to compute 2·x·2^e·10^p as an exact rational, but we want to avoid arbitrary-precision math. A faster approach is to use a floating-point approximation for 10^p: 10^p ≈ pm·2^pe, where pm is 128 bits. Assuming x < 2^64, this requires a single 64×128→192-bit multiplication, implemented by two full-width 64×64→128-bit multiplications on a 64-bit computer.
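To make the multiplication shape concrete, here is a Python sketch of the 64×128→192-bit product built from two full-width partial products (an illustration only; a real implementation would use a hardware primitive such as Go's math/bits.Mul64 for each 64×64→128-bit piece; the pm value here is the post's value for p = 200 and the x is arbitrary):

MASK64 = (1 << 64) - 1

# mul_64x128(x, pm) splits the 128-bit pm into two 64-bit halves and combines
# the two 128-bit partial products into the full 192-bit result.
def mul_64x128(x, pm):
    lo = x * (pm & MASK64)      # 64×64→128-bit partial product
    hi = x * (pm >> 64)         # 64×64→128-bit partial product, weight 2^64
    return (hi << 64) + lo

pm200 = 0xa738c6bebb12d16cb428f8ac016561dc
x = 0xffe389b3cdb6c3d0
assert mul_64x128(x, pm200) == x * pm200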
The algorithm, which we will call Scale, is given integers x, e, and p subject to certain constraints and operates as follows:
Scale(x,e,p):
- Let pe = −127 − ⌈log₂ 10^(−p)⌉.
- Let pm = ⌈10^p / 2^pe⌉, looked up in a table indexed by p.
- Let b = bits(x), the number of bits in the binary representation of x.
- Let m = −e − pe − b − 1.
- Let top = ⌊x·pm/2^m/2^b⌋, middle = ⌊x·pm/2^b⌋ mod 2^m, bottom = x·pm mod 2^b. Put another way, split x·pm into top || middle || bottom, where bottom is b bits, middle is m bits, and top is the remaining bits.
- Return ⟨(top || middle)/2^(m+1)⟩, computed as top || (middle ≠ 0) or as top || (middle ≥ 2).
The initial uscale implementation in the main post uses (middle ≠ 0) in its result, but an optimized version uses (middle ≥ 2).
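To make those steps concrete, here is a rough Python sketch of Scale (an illustration, not the post's implementation: it recomputes pm exactly instead of using a table and leans on Python's big integers), along with an exact uscale for comparison. The sample x, e, p values are hypothetical, chosen to fall inside FixedWidth's ranges:

from fractions import Fraction
from math import ceil, floor

# pe(p) = −127 − ⌈log₂ 10^(−p)⌉, computed exactly with integer bit lengths.
def pe(p):
    if p > 0:
        return (10 ** p).bit_length() - 128
    return -(127 + (10 ** -p - 1).bit_length())

# pm(p) = ⌈10^p / 2^pe⌉, the 128-bit mantissa.
def pm(p):
    return ceil(Fraction(10) ** p / Fraction(2) ** pe(p))

# scale(x, e, p) follows the steps listed above, returning top || (middle ≠ 0).
def scale(x, e, p):
    b = x.bit_length()
    m = -e - pe(p) - b - 1
    prod = x * pm(p)                   # the b×128-bit product
    top = prod >> (b + m)
    middle = (prod >> b) % (1 << m)
    return (top << 1) | (1 if middle != 0 else 0)

# uscale_exact(x, e, p) = ⟨x·2^e·10^p⟩ computed with exact rationals.
def uscale_exact(x, e, p):
    v2 = 2 * Fraction(x) * Fraction(2) ** e * Fraction(10) ** p
    return (floor(v2) << 1) | (1 if v2 != floor(v2) else 0)

for x in ((1 << 52) | 12345, (1 << 53) - 1, 1 << 52):   # 53-bit inputs
    assert scale(x, -52, 17) == uscale_exact(x, -52, 17)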
This post proves both versions of Scale correct for the x, e, and p needed by the three floating-point conversion algorithms in the main post. Those algorithms are:
FixedWidth converts floating-point to decimal. It needs to call Scale with a 53-bit x, e ∈ [−1137, 960], and p ∈ [−307, 341], chosen to produce a result r ∈ [0, 2·10^18), which is at most 61 bits (62-bit output; b = 53, m ≥ 128−62 = 66).
Short also converts floating-point to decimal. It needs to call Scale with a 55-bit x, e ∈ [−1137, 960], and p ∈ [−292, 324], chosen to produce a result r ∈ [0, 2·10^18), still at most 61 bits (62-bit output; b = 55, m ≥ 128−62 = 66).
Parse converts decimal to floating-point. It needs to call Scale with a 64-bit x and p ∈ [−343, 289], chosen to produce a result r ∈ [0, 2^54), which is at most 54 bits (55-bit output; b = 64, m ≥ 128−55 = 73).
The "output" bit counts include the ½ bit but not the sticky bit. Note that for a given x and p, the maximum result size determines a relatively narrow range of possible e.
To start the proof, consider a hypothetical algorithm Scaleℝ that is the same as Scale except using exact real numbers. (Technically, only rationals are required, so this could be implemented, but it is only a thought experiment.)
Scaleℝ(x, e, p):
- Let pe = −127 − ⌈log₂ 10^(−p)⌉.
- Let pmℝ = 10^p / 2^pe. (Note: pmℝ is an exact value, not a ceiling.)
- Let b = bits(x).
- Let m = −e − pe − b − 1.
- Let topℝ = ⌊x·pmℝ/2^m/2^b⌋, middleℝ = ⌊x·pmℝ/2^b⌋ mod 2^m, bottomℝ = x·pmℝ mod 2^b. (Note: topℝ and middleℝ are integers, but bottomℝ is an exact value that may not be an integer.)
- Return ⟨(topℝ || middleℝ || bottomℝ)/2^(b+m+1)⟩, computed as topℝ || (middleℝ || bottomℝ ≠ 0).
Using exact reals makes it straightforward to prove Scaleℝ correct.
Lemma 1. Scaleℝ computes uscale(x, e, p).
Proof. Expand the math in the final result:
topℝ = ⌊x·pmℝ/2^m/2^b⌋   [definition of topℝ]
     = ⌊x·10^p/2^pe/2^m/2^b⌋   [definition of pmℝ]
     = ⌊x·10^p/2^pe/2^(−e−pe−b−1)/2^b⌋   [definition of m]
     = ⌊x·10^p·2^(−pe+e+pe+b+1−b)⌋   [rearranging]
     = ⌊x·10^p·2^(e+1)⌋   [simplifying]
     = ⌊2·x·2^e·10^p⌋   [rearranging]
middleℝ ≠ 0 or bottomℝ ≠ 0
     = ⌊x·pmℝ/2^b⌋ mod 2^m ≠ 0 or x·pmℝ mod 2^b ≠ 0   [definition of middleℝ, bottomℝ]
     = x·pmℝ mod 2^(m+b) ≠ 0   [simplifying]
     = ⌊x·pmℝ/2^(m+b)⌋ ≠ x·pmℝ/2^(m+b)   [definition of floor and mod]
     = ⌊2·x·2^e·10^p⌋ ≠ 2·x·2^e·10^p   [reusing expansion of x·pmℝ/2^m/2^b above]
Scaleℝ = topℝ || (middleℝ ≠ 0 or bottomℝ ≠ 0)   [definition of Scaleℝ]
     = ⌊2·x·2^e·10^p⌋ || (⌊2·x·2^e·10^p⌋ ≠ 2·x·2^e·10^p)   [applying previous two expansions]
     = ⟨x·2^e·10^p⟩   [definition of ⟨...⟩]
     = uscale(x, e, p)   [definition of uscale]
So Scaleℝ(x, e, p) computes uscale(x, e, p). ∎
Next we can establish basic conditions that make Scale correct.
Lemma 2. If top = topℝ and (middle ≠ 0) ≡ (middleℝ || bottomℝ ≠ 0), then Scale computes uscale(x, e, p).
Proof. Scaleℝ(x, e, p) = topℝ || (middleℝ || bottomℝ ≠ 0), while Scale(x, e, p) = top || (middle ≠ 0). If top = topℝ and (middle ≠ 0) ≡ (middleℝ || bottomℝ ≠ 0), then these expressions are identical. Since Scaleℝ(x, e, p) computes uscale(x, e, p) (by Lemma 1), so does Scale(x, e, p). ∎
Now we need to show that top = topℝ and (middle ≠ 0) ≡ (middleℝ || bottomℝ ≠ 0) in all cases. We will also show that middle ≠ 1 to justify using middle ≥ 2 in place of middle ≠ 0 when that's convenient.
Note that pm = ⌈pmℝ⌉ = pmℝ + ε₀ for ε₀ ∈ [0, 1), and so:
x·pm = x·(pmℝ + ε₀) = x·pmℝ + ε₁,  ε₁ = x·ε₀ ∈ [0, 2^b)
top || middle || bottom = topℝ || middleℝ || bottomℝ + ε₁
The proof analyzes the effect of the addition of ε₁ to the ideal result topℝ || middleℝ || bottomℝ. Since bottomℝ is b bits and ε₁ is at most b bits, adding ε₁ > 0 always causes bottom ≠ bottomℝ. (Talking about the low b bits of a real number is unusual; we mean the low b integer bits followed by all the fractional bits: x·pmℝ mod 2^b.)
The question is whether that addition overflows and propagates a carry into middle or even top. There are two main cases: exact results (middleℝ || bottomℝ = 0) and inexact results (middleℝ || bottomℝ ≠ 0).
Exact Results
Exact results have no error, making them match Scaleβ exactly.
Lemma 3. For exact results, Scale computes uscale(x, e, p) and middle ≠ 1.
Proof. For an exact result, middleℝ || bottomℝ = 0, meaning 2·x·2^e·10^p is an integer and the sticky bit is 0. Since bottomℝ is b zero bits, adding ε₁ affects bottom but does not carry into middle or top. Therefore top = topℝ and middle = middleℝ = 0. The latter, combined with bottomℝ = 0, makes (middle ≠ 0) ≡ (middleℝ || bottomℝ ≠ 0) trivially true (both sides are false). By Lemma 2, Scale is correct. And since middle = 0, middle ≠ 1. ∎
Inexact results are more work. We will reduce the correctness to a few conditions on middle.
Lemma 4. For inexact results, if middle ≠ 0, then Scale(x, e, p) computes uscale(x, e, p).
Proof. For an inexact result, middleℝ || bottomℝ ≠ 0. The only possible change from middleℝ to middle is a carry from the error addition bottomℝ + ε₁ overflowing bottom. That carry is at most 1, so middle = (middleℝ + ε₂) mod 2^m for ε₂ ∈ [0, 1]. An overflow into top leaves middle = 0. If middle ≠ 0 then there can be no overflow, so top = topℝ. By Lemma 2, Scale computes uscale(x, e, p). ∎
For some cases, it will be more convenient to prove the range of middleℝ instead of the range of middle. For that we can use a variant of Lemma 4.
Lemma 5. For inexact results, if middleℝ ∈ [1, 2^m − 2] then Scale(x, e, p) computes uscale(x, e, p).
Proof. If middleℝ ∈ [1, 2^m − 2], then middleℝ + ε₂ ∈ [1, 2^m − 1], so the mod in middle = (middleℝ + ε₂) mod 2^m does nothing (there is no overflow and wraparound), so middle = middleℝ + ε₂ ≥ 1. By Lemma 4, Scale(x, e, p) computes uscale(x, e, p). ∎
A related lemma helps with middle ≠ 1.
Lemma 6. For inexact results, if middleℝ ∈ [2, 2^m − 2], then middle ≥ 2.
Proof. Again there is no overflow, so middle ≥ middleℝ ≥ 2. ∎
Now we need to prove either that middleℝ ∈ [2, 2^m − 2] or that middle ≠ 0 for all inexact results. We will consider four cases:
- [Small Positive Powers] p ∈ [0, 27] and b ≤ 64.
- [Small Negative Powers] p ∈ [−27, −1] and b ≤ 64.
- [Large Powers, Printing] p ∈ [−400, −28] ∪ [28, 400], b ≤ 55, m ≥ 66.
- [Large Powers, Parsing] p ∈ [−400, −28] ∪ [28, 400], b ≤ 64, m ≥ 73.
Small Positive Powers
Lemma 7. For inexact results and p ∈ [0, 27] and b ≤ 64, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1.
Proof. 5^p < 2^63, so the non-zero bits of pmℝ fit in the high 63 bits. That implies that the (b+128)-bit product x·pmℝ = topℝ || middleℝ || bottomℝ ends in 65 zero bits. Since b ≤ 64, that means bottomℝ = 0 and middleℝ's low bit is zero.
Because the result is inexact, middleℝ || bottomℝ ≠ 0, which implies middleℝ ≠ 0 (since bottomℝ = 0). Since middleℝ's low bit is zero, middleℝ ∈ [2, 2^m − 2]. By Lemma 5, Scale(x, e, p) computes uscale(x, e, p). By Lemma 6, middle ≠ 1. ∎
Lemma 8. For inexact results and p ∈ [−27, −1] and b ≤ 64, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1.
Proof. Scaling by 2^e cannot introduce inexactness, since it just adds or subtracts from the exponent. The only inexactness must come from 10^p, specifically the 5^p part. Since p < 0 and 1/5 is not exactly representable as a binary fraction, the result is inexact if and only if x mod 5^(−p) ≠ 0 (remember that −p is positive!).
Since pmℝ ∈ [2^127, 2^128) and 5^p < 2^(−2), pmℝ = 5^p·2^k for some k ≥ 130. Since m + b ≤ 128, topℝ = ⌊x·2^(k−(m+b))/5^(−p)⌋ and middleℝ || bottomℝ = 2^(m+b)·(x·2^(k−(m+b)) mod 5^(−p))/5^(−p). That is, middleℝ || bottomℝ encodes some non-zero binary fraction with denominator 5^(−p). Note also that x·pm is b+128 bits and the output is at most 64 bits, so m ≥ 64 and 2^m·2^(−63) ≥ 2.
That implies
middleℝ || bottomℝ ∈ 2^(m+b)·(5^(−27), 1 − 5^(−27)) ⊆ 2^(m+b)·(2^(−63), 1 − 2^(−63))
middleℝ ∈ 2^m·(2^(−63), 1 − 2^(−63)) ⊆ (2, 2^m − 2)
By Lemma 5 and Lemma 6, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1. ∎
That leaves p ∈ [−400, −28] ∪ [28, 400]. There are not many pm to check, under a thousand, but there are far too many x to exhaustively test that middle ≥ 2 for all of them. Instead, we will have to be a bit more clever.
It would be simplest if we could prove that all possible pm and all x ∈ [1, 2^64) result in a non-zero middle, but that turns out not to be the case.
For example, using p = −29, x = 0x8e151cee6e31e067 is a problem, which we can verify using the Ivy calculator:
# hex x is the hex formatting of x (as text)
op hex x = '#x' text x
# spaced adds spaces to s between sections of 16 characters
op spaced s = (count s) <= 18: s; (spaced -16 drop s), ' ', -16 take s
# pe returns the binary exponent for 10**p.
op pe p = -(127+ceil 2 log 10**-p)
# pm returns the 128-bit mantissa for 10**p.
op pm p = ceil (10**p) / 2**pe p
spaced hex (pm -29) * 0x8e151cee6e31e067
-- out --
0x7091bfc45568750f 0000000000000000 d81262b60aa6e8b7
We might perhaps think the problem is that 10^(−29) is too close to the small negative powers, but positive powers break too:
spaced hex (pm 31) * 0x93997b98618e62a1
-- out --
0x918b5cd9fd69fdc5 0000000000000000 6d00000000000000
We might yet hope that the zeros were not caused by an error carry; then as long as we force the inexact bit to 1, we could still use the high bits. And indeed, for both of the previous examples, the zeros are not caused by an error carry: middleℝ is all zeros. But that is not always the case. Here is a middle that is zero due to an error carry that overflowed into the top bits:
spaced hex (pm 62) * 0xd5bc71e52b31e483
spaced hex ((10**62) * 0xd5bc71e52b31e483) >> (pe 62)
-- out --
0xcfd352e73dc6ddc3 0000000000000000 774bd77b38816199
0xcfd352e73dc6ddc2 ffffffffffffffff e6fdb9b19804952a
Instead of proving the completely general case, we will have to pick our battles and focus on the specific cases we need for floating-point conversions.
We don't need to try every possible input width below the maximum b. Looking at Scale, it is clear that the inputs x and x·2^k have the same top and middle, and also that bottom(x·2^k) = bottom(x)·2^k. Since the middles are the same, the condition middle ≥ 2 has the same truth value for both inputs. So we can limit our analysis to maximum-width b-bit inputs in [2^(b−1), 2^b). Similarly, we can prove that middle ≥ 2 for m ≥ k by proving it for m = k: moving more bits from the low end of top to the high end of middle cannot make middle a smaller number.
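A quick numeric check of that width-reduction argument, in Python, using an arbitrary made-up 128-bit multiplier c in place of pm (the specific constants below are for the sketch only):

c = 0xf123456789abcdef0123456789abcdef   # any 128-bit multiplier
m = 66
x = (1 << 52) | 0x3fff                   # a 53-bit example input
for k in range(1, 12):
    b1, b2 = x.bit_length(), (x << k).bit_length()
    p1, p2 = x * c, (x << k) * c
    # top and middle are identical; bottom is just shifted left by k.
    assert p1 >> (b1 + m) == p2 >> (b2 + m)
    assert (p1 >> b1) % (1 << m) == (p2 >> b2) % (1 << m)
    assert p2 % (1 << b2) == (p1 % (1 << b1)) << k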
Proving that middle ≥ 2 for the cases we listed above means proving:
- [Printing] (b ≤ 55, m ≥ 66.) For all large p and all x ∈ [2^54, 2^55): x·pm mod 2^(55+66) = x·pm mod 2^121 ≥ 2^(55+1) = 2^56.
- [Parsing] (b ≤ 64, m ≥ 73.) For all large p and all x ∈ [2^63, 2^64): x·pm mod 2^(64+73) = x·pm mod 2^137 ≥ 2^(64+1) = 2^65.
To prove these two conditions, we are going to write an Ivy program to analyze each pm separately, proving that all relevant x satisfy the condition.
Ivy has arbitrary-precision rationals and lightweight syntax, making it a convenient tool for sketching and testing mathematical algorithms, in the spirit of Iverson's Turing Award lecture about APL, "Notation as a Tool of Thought." Like APL, Ivy uses strict right-to-left operator precedence: 1+2*3+4 means (1+(2*(3+4))), and floor 10 log f means floor (10 log f). Operators can be prefix unary like floor or infix binary like log. Each of the Ivy displays in this post is executable: you can edit the code and re-run them by clicking the Play button (▶️). A full introduction to Ivy is beyond the scope of this post; see the Ivy demo for more examples.
We've already started the proof program above by defining pm and pe. Let's continue by defining a few more helpers.
First let's define is, an assertion for basic testing of other functions:
# is asserts that x === y.
op x is y =
x === y: x=x
print x '≠' y
1 / 0
(1+2) is 3
(2+2) is 5
-- out --
4 ≠ 5
-- err --
input:1: division by zero
If the operands passed to is are not equal (the triple === does full value comparison), then is prints them out and divides by zero to halt execution.
Next, we will set Ivy's origin to 0 (instead of the default 1), meaning iota starts counting at 0 and array indexes start at 0, and then we will define seq x y, which returns the list of integers [x, y].
)origin 0
# seq x y = (x x+1 x+2 ... y)
op seq (x y) = x + iota 1+y-x
(seq -2 4) is -2 -1 0 1 2 3 4
Now we are ready to start attacking our problem, which is to prove that for a given pm, b, and m, for all x, middle || bottom = x·pm mod 2^(b+m) ≥ 2^(b+1), implying middle ≥ 2, at which point we can use Lemma 4.
We will proceed in two steps, loosely following an approach by Vern Paxson and Tim Peters (the "Related Work" section explains the differences). The first step is to solve the "modular search" problem of finding the minimum x ≥ 0 (the "first" x) such that x·c mod m ∈ [lo, hi]. The second step is to use that solution to solve the "modular minimum" problem of finding an x in a given range that minimizes x·c mod m.
Modular Search
Given constants c, m, lo, and hi, we want to find the minimum x ≥ 0 (the "first" x) such that x·c mod m ∈ [lo, hi]. This is an old programming contest problem, and I am not sure whether it has a formal name. There are multiple ways to derive a GCD-like efficient solution. The following explanation, based on one by David Wärn, is the simplest I am aware of.
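As a reference point (a Python sketch of mine, not part of the post's Ivy program), the problem can be solved by brute force in O(m) time, which is handy as an oracle when experimenting with the faster versions below:

# modfirst_slow returns the minimum x ≥ 0 with x·c mod m in [lo, hi], or -1.
def modfirst_slow(c, m, lo, hi):
    for x in range(m):                  # residues repeat after m steps
        if lo <= x * c % m <= hi:
            return x
    return -1

assert modfirst_slow(13, 256, 1, 5) == 20     # 20·13 mod 256 = 4
assert modfirst_slow(14, 256, 1, 1) == -1     # impossible: 14·x mod 256 is even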
Here is a correct O(m) iterative algorithm:
op modfirst (c m lo hi) =
xr x cx mx = 0 0 1 0
:while 1
# (A) xr ≤ hi but perhaps xr < lo.
:while xr < lo
xr x = xr x + c cx
:end
xr <= hi: x
# (B) xr - c < lo ≤ hi < xr
:while xr > hi
xr x = xr x + (-m) mx
:end
lo <= xr: x
# (C) xr < lo ≤ hi < xr + m
x >= m: -1
:end
The algorithm walks x forward from 1, maintaining xr = x·c mod m:
- When xr is too small, it adds c to xr and increments x (cx=1).
- When xr is too large, it subtracts m from xr and leaves x unchanged (mx=0).
- When x reaches m, it gives up: there is no answer.
This loop is easily verified to be correct:
- It starts with x=0 and considers successive x one at a time.
- While doing that, it maintains xr correctly:
  - If xr is too small, we must add a c (and increment x).
  - If xr is too large, we must subtract an m (and leave x alone).
- If xr ∈ [lo, hi], it notices and stops.
The only problem with this modfirst is that it is unbearably slow, but we can speed it up.
At (A), xr ≤ hi, established by the initial xr = 0 or by the end of the previous iteration.
At (B), xr − c < lo ≤ hi < xr. Because xr − c < lo, subtracting m ≥ c will make xr too small; that will always be followed by at least ⌊m/c⌋ additions of c. So we might as well replace m with m − c·⌊m/c⌋, speeding future trials. We will also have to update mx, to make sure x is maintained correctly.
At (C), xr < lo ≤ hi < xr + m, and by a similar argument, we might as well replace c with c − m·⌊c/m⌋, updating cx as well.
Making both changes to our code, we get:
op modfirst (c m lo hi) =
xr x cx mx = 0 0 1 0
:while 1
# (A) xr ≤ hi but perhaps xr < lo.
:while xr < lo
xr x = xr x + c cx
:end
xr <= hi: x
# (B) xr - c < lo ≤ hi < xr
m mx = m mx + (-c) cx * floor m/c
m == 0: -1
:while xr > hi
xr x = xr x + (-m) mx
:end
lo <= xr: x
c cx = c cx + (-m) mx * floor c/m
c == 0: -1
# (C) xr < lo ≤ hi < xr+m
:end
Notice that the loop is iterating (among other things) m = m mod c and c = c mod m, the same as Euclid's GCD algorithm, so O(log c) iterations will zero c or m. The old test for x ≥ m (made incorrect by modifying m) is replaced by checking for c or m becoming zero.
Finally, we should optimize away the small while loops by calculating how many times each will be executed:
op modfirst (c m lo hi) =
xr x cx mx = 0 0 1 0
:while 1
# (A) xr ≤ hi but perhaps xr < lo.
xr x = xr x + c cx * ceil (0 max lo-xr)/c
xr <= hi: x
# (B) xr - c < lo ≤ hi < xr
m mx = m mx + (-c) cx * floor m/c
m == 0: -1
xr x = xr x + (-m) mx * ceil (0 max xr-hi)/m
lo <= xr: x
c cx = c cx + (-m) mx * floor c/m
c == 0: -1
# (C) xr < lo ≤ hi < xr+m
:end
Each iteration of the outer while loop is now O(1), and the loop runs at most O(log c) times, giving a total time of O(log c), dramatically better than the old O(m).
We can reformat the code to highlight the regular structure:
op modfirst (c m lo hi) =
xr x cx mx = 0 0 1 0
:while 1
xr x = xr x + c cx * ceil (0 max lo-xr)/c ; xr <= hi : x
m mx = m mx + (-c) cx * floor m/c ; m == 0 : -1
xr x = xr x + (-m) mx * ceil (0 max xr-hi)/m ; lo <= xr : x
c cx = c cx + (-m) mx * floor c/m ; c == 0 : -1
:end
(modfirst 13 256 1 5) is 20 # 20*13 mod 256 = 4 ∈ [1, 5]
(modfirst 14 256 1 1) is -1 # impossible
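For experimenting outside Ivy, here is an unofficial Python transcription of that final modfirst (my sketch, with the same structure as the Ivy code), cross-checked against a brute-force search over small parameters:

def cdiv(a, b):                      # ⌈a/b⌉ for a ≥ 0, b > 0
    return -(-a // b)

def modfirst(c, m, lo, hi):
    xr, x, cx, mx = 0, 0, 1, 0
    while True:
        k = cdiv(max(0, lo - xr), c)         # (A) step up to lo
        xr, x = xr + c * k, x + cx * k
        if xr <= hi:
            return x
        q = m // c                           # (B) fold ⌊m/c⌋ adds of c into m
        m, mx = m - c * q, mx + cx * q
        if m == 0:
            return -1
        k = cdiv(max(0, xr - hi), m)         # step down to hi
        xr, x = xr - m * k, x + mx * k
        if lo <= xr:
            return x
        q = c // m                           # (C) fold ⌊c/m⌋ subtracts of m into c
        c, cx = c - m * q, cx + mx * q
        if c == 0:
            return -1

def slow(c, m, lo, hi):
    return next((x for x in range(m) if lo <= x * c % m <= hi), -1)

assert modfirst(13, 256, 1, 5) == 20 and modfirst(14, 256, 1, 1) == -1
for c in range(1, 50):
    for lo in range(6):
        for hi in range(lo, 6):
            assert modfirst(c, 64, lo, hi) == slow(c, 64, lo, hi)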
We can also check that modfirst finds the exact answer from case 2, namely powers of five zeroing out the middle.
(modfirst (pm -3) (2**128) 1 (2**64)) is 125
spaced hex 125 * (pm -3)
-- out --
0x40 0000000000000000 0000000000000042
Now we can solve the problem of finding the x ∈ [xmin, xmax] that minimizes x·c mod m.
Define the notation xR = x·c mod m (the "residue" of x). We can construct x ∈ [xmin, xmax] with minimal xR with the following greedy algorithm.
- Start with x = xmin.
- Find the first y ∈ [x+1, xmax] such that yR < xR.
- If no such y exists, return x.
- Set x=y.
- Go to step 2.
The algorithm finds the right answer, because it starts at xmin and then steps through every successively better answer along the way to xmax. The algorithm terminates because every search is finite and every step moves x forward by at least 1. The only detail remaining is how to implement step 2.
For any x and y, (xR − yR) mod m = (x − y)R, because multiplication distributes over subtraction. Call that the subtraction lemma.
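A quick numeric sanity check of the subtraction lemma in Python, reusing c=13, m=256 from the earlier example:

c, m = 13, 256
for x, y in ((20, 7), (100, 41), (5, 250)):
    assert (x * c % m - y * c % m) % m == (x - y) * c % m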
Finding the first y ∈ [x+1, xmax] with yR < xR is equivalent to finding the first d ∈ [1, xmax − x] with (x+d)R < xR. By the subtraction lemma, dR = ((x+d)R − xR) mod m, so we are looking for the first d ≥ 1 with dR ∈ [m − xR, m − 1]. That's what modfirst does, except it searches d ≥ 0. But 0R = 0 and we will only search for lo ≥ 1, so modfirst can safely start its search at 0.
Note that if dR ∈ [m − (x+d)R, m − 1], the next iteration will choose the same d: any better answer would have been an answer to the original search. So after finding d, we should add it to x as many times as we can.
The full algorithm is then:
- Start with x = xmin.
- Use modfirst to find the first d ≥ 0 such that dR ∈ [m − xR, m − 1].
- If no d exists or x + d > xmax, stop and return x. Otherwise continue.
- Let s = m − dR, the number we are effectively subtracting from xR.
- Let n be the smaller of ⌊(xmax − x)/d⌋ (the most times we can add d to x before exceeding our limit) and ⌊xR/s⌋ (the most times we can subtract s from xR before wrapping around).
- Set x = x + d·n.
- Go to step 2.
In Ivy, that algorithm is:
op modmin (xmin xmax c m) =
x = xmin
:while 1
xr = (x*c) mod m
d = modfirst c m, m - xr 1
(d < 0) or (x+d) > xmax: x
s = m - (d*c) mod m
x = x + d * floor ((xmax-x)/d) min xr/s
:end
(modmin 10 25 13 255) is 20
The running time of modmin depends on what limits n. If n is limited by (xmax − x)/d then the next iteration will not find a usable d, since any future d would have to be bigger than the one we just found, and there won't be room to add it. On the other hand, if n is limited by xR/s, then it means we reduced xR at least by half. That limits the number of iterations to log₂ m, and since modfirst is O(log m), modmin is O(log² m).
The subtraction lemma and modfirst let us build other useful operations too. One obvious variant of modmin is modmax, which finds the x ∈ [xmin, xmax] that maximizes xR and also runs in O(log² m).
We can extend modmin to minimize xR ≥ lo instead, by stepping to the first xR ≥ lo before looking for improvements:
op modminge (xmin xmax c m lo) =
x = xmin
:if (xr = (x*c) mod m) < lo
d = modfirst c m (lo-xr) ((m-1)-xr)
d < 0: :ret -1
x = x + d
:end
:while 1
xr = (x*c) mod m
d = modfirst c m (m-(xr-lo)) (m-1)
(d < 0) or (x+d) > xmax: x
s = m - (d*c) mod m
x = x + d * floor ((xmax-x)/d) min (xr-lo)/s
:end
op modmin (xmin xmax c m) = modminge xmin xmax c m 0
(modmin 10 25 13 255) is 20
(modminge 10 25 13 255 6) is 21
(modminge 1 20 13 255 6) is 1
(modminge 10 20 255 255 1) is -1
We can also invert the search to produce modmax and modmaxle:
op modmaxle (xmin xmax c m hi) =
x = xmin
:if (xr = (x*c) mod m) > hi
d = modfirst c m (m-xr) ((m-xr)+hi)
d < 0: :ret -1
x = x + d
:end
:while 1
xr = (x*c) mod m
d = modfirst c m 1 (hi-xr)
(d < 0) or (x+d) > xmax: x
s = (d*c) mod m
x = x + d * floor ((xmax-x)/d) min (hi-xr)/s
:end
op modmax (xmin xmax c m) = modmaxle xmin xmax c m (m-1)
(modmax 10 25 13 255) is 19
(modmaxle 10 25 13 255 200) is 15
Another variant is modfind, which finds the first x ∈ [xmin, xmax] such that xR ∈ [lo, hi]. It doesn't need a loop at all:
op modfind (xmin xmax c m lo hi) =
x = xmin
xr = (x*c) mod m
(lo <= xr) and xr <= hi: x
d = modfirst c m, (lo hi - xr) mod m
(d < 0) or (x+d) > xmax: -1
x+d
(modfind 21 100 13 256 1 10) is 40
We can also build modfindall, which finds all the x ∈ [xmin, xmax] such that xR ∈ [lo, hi]. Because there might be a very large number, it stops after finding the first 100.
op modfindall (xmin xmax c m lo hi) =
all = ()
:while 1
x = modfind xmin xmax c m lo hi
x < 0: all
all = all, x
(count all) >= 100: all
xmin = x+1
:end
(modfindall 21 100 13 256 1 10) is 40 79 99
Because modfind and modfindall both call modfirst O(1) times, they both run in O(log m) time.
Modular Proof
Now we are ready to analyze individual powers.
For a given pm, b, and m, we want to verify that for all x ∈ [2^(b−1), 2^b), we have middle ≥ 2, or equivalently, middle || bottom = x·pm mod 2^(b+m) ≥ 2^(b+1). We can use either modmin or modfind to do this. Let's use modmin, so we can show how close we came to failing.
We'll start with a function check1 to check a single power, and show1 to format its result:
# (b m) check1 p returns (p pm x middle fail) where pm is (pm p).
# If there is a counterexample to p, x is the first one,
# middle is (x*pm)'s middle bits, and fail is 1.
# If there is no counterexample, x middle fail are 0 0 0.
op (b m) check1 p =
x = modmin (2**b-1) ((2**b)-1) (pm p) (2**b+m)
middle = ((x * pm p) mod 2**b+m) >> b
p (pm p) x middle (middle < 2)
# show1 formats the result of check1.
op show1 (p pm x middle fail) =
p (hex pm) (hex x) (hex middle) ('.❌'[fail])
show1 64 64 check1 200
-- out --
200 0xa738c6bebb12d16cb428f8ac016561dc 0xffe389b3cdb6c3d0 0x34 .
For p=200, no 64-bit input produces a 64-bit middle less than 2.
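The same row is easy to replay outside Ivy (a quick Python sketch; the constants are copied from the output above, with b = m = 64):

pm200 = 0xa738c6bebb12d16cb428f8ac016561dc
x = 0xffe389b3cdb6c3d0
middle = (x * pm200 % (1 << 128)) >> 64
print(hex(middle))    # should print 0x34, matching the table row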
On the other hand, for p = −1, we can verify that check1 finds a multiple of 5 that zeroes the middle:
show1 64 64 check1 -1
-- out --
-1 0xcccccccccccccccccccccccccccccccd 0x8000000000000002 0x0 ❌
Let's check more than one power, gathering the results into a matrix:
op (b m) check ps = mix b m check1@ ps
op show table = mix show1@ table
show 64 64 check seq 25 35
-- out --
25 0x84595161401484a00000000000000000 0x8000000000000000 0x0 ❌
26 0xa56fa5b99019a5c80000000000000000 0x8000000000000000 0x0 ❌
27 0xcecb8f27f4200f3a0000000000000000 0x8000000000000000 0x0 ❌
28 0x813f3978f89409844000000000000000 0xec03c1a1aa24cc97 0x1 ❌
29 0xa18f07d736b90be55000000000000000 0xe06076f9cb96fe0d 0x5 .
30 0xc9f2c9cd04674edea400000000000000 0xfbd9be9d5bc8934e 0x1 ❌
31 0xfc6f7c40458122964d00000000000000 0x93997b98618e62a1 0x0 ❌
32 0x9dc5ada82b70b59df020000000000000 0xd0808609f474615a 0x2 .
33 0xc5371912364ce3056c28000000000000 0xc97002677c2de03f 0x0 ❌
34 0xf684df56c3e01bc6c732000000000000 0xc97002677c2de03f 0x0 ❌
35 0x9a130b963a6c115c3c7f400000000000 0xfd073be688a7dbaa 0x3 .
Now let's write code to show just the start and end of a table, to avoid very long outputs:
op short table =
(count table) <= 15: table
(5 take table) ,% ((count table[0]) rho box '...') ,% (-5 take table)
short show 64 64 check seq 25 95
-- out --
25 0x84595161401484a00000000000000000 0x8000000000000000 0x0 ❌
26 0xa56fa5b99019a5c80000000000000000 0x8000000000000000 0x0 ❌
27 0xcecb8f27f4200f3a0000000000000000 0x8000000000000000 0x0 ❌
28 0x813f3978f89409844000000000000000 0xec03c1a1aa24cc97 0x1 ❌
29 0xa18f07d736b90be55000000000000000 0xe06076f9cb96fe0d 0x5 .
... ... ... ... ...
91 0x9d174b2dcec0e47b62eb0d64283f9c77 0xde861b1f480d3b9e 0x1 ❌
92 0xc45d1df942711d9a3ba5d0bd324f8395 0xe621cb57290a897f 0x0 ❌
93 0xf5746577930d6500ca8f44ec7ee3647a 0xa6bc10ca9dd53eff 0x1 ❌
94 0x9968bf6abbe85f207e998b13cf4e1ecc 0xd011cce372153a65 0x0 ❌
95 0xbfc2ef456ae276e89e3fedd8c321a67f 0xa674a3e92810fb84 0x0 ❌
We can see that most powers are problematic for b=64, m=64, which is why we're not trying to prove that general case.
Let's make it possible to filter the table down to just the bad powers:
op sel x = x sel iota count x
op bad table = table[sel table[;4] != 0]
short show bad 64 64 check seq -400 400
-- out --
-400 0x95fe7e07c91efafa3931b850df08e739 0xe4036416c4b21bd6 0x0 ❌
-399 0xbb7e1d89bb66b9b8c77e266516cb2107 0xe4036416c4b21bd6 0x0 ❌
-398 0xea5da4ec2a406826f95daffe5c7de949 0xe4036416c4b21bd6 0x0 ❌
-397 0x927a87139a6841185bda8dfef9ceb1ce 0xfcdbd01bdf2d3eb2 0x0 ❌
-395 0xe4df730ea142e5b60f857dde6652f5d1 0x99535e222a18bc6d 0x0 ❌
... ... ... ... ...
395 0x8f2bd39f334827e8c5874cc0ec691ba0 0xa462c66df06d90e3 0x0 ❌
397 0xdfb47aa8c020be5bb4a367ed71643b2a 0x90ae62dc5a2282dd 0x0 ❌
398 0x8bd0cca9781476f950e620f466dea4fb 0xd0be819cb0f1092e 0x0 ❌
399 0xaec4ffd3d61994b7a51fa93180964e39 0xa6fece16f3f40758 0x0 ❌
400 0xda763fc8cb9ff9e58e67937de0bbe1c7 0x8598a4df299005e0 0x0 ❌
Now we have everything we need. Let's write a function to try to prove that Scale is correct for a given b and m.
op prove (b m) =
table = bad b m check (seq -400 -28), (seq 28 400)
what = 'b=', (text b), ' m=', (text m), ' t=', (text 127-m), '+½'
(count table) == 0: print '✅ proved ' what
print '❌ disproved' what
print short show table
This function builds a table of all the bad powers for b,m. If the table is empty, it prints that the settings have been proved. If not, it prints that the settings are unproven and prints some of the questionable powers.
If we try to prove 64,64, we get many unproven powers.
prove 64 64
-- out --
❌ disproved b=64 m=64 t=63+½
-400 0x95fe7e07c91efafa3931b850df08e739 0xe4036416c4b21bd6 0x0 ❌
-399 0xbb7e1d89bb66b9b8c77e266516cb2107 0xe4036416c4b21bd6 0x0 ❌
-398 0xea5da4ec2a406826f95daffe5c7de949 0xe4036416c4b21bd6 0x0 ❌
-397 0x927a87139a6841185bda8dfef9ceb1ce 0xfcdbd01bdf2d3eb2 0x0 ❌
-395 0xe4df730ea142e5b60f857dde6652f5d1 0x99535e222a18bc6d 0x0 ❌
... ... ... ... ...
395 0x8f2bd39f334827e8c5874cc0ec691ba0 0xa462c66df06d90e3 0x0 ❌
397 0xdfb47aa8c020be5bb4a367ed71643b2a 0x90ae62dc5a2282dd 0x0 ❌
398 0x8bd0cca9781476f950e620f466dea4fb 0xd0be819cb0f1092e 0x0 ❌
399 0xaec4ffd3d61994b7a51fa93180964e39 0xa6fece16f3f40758 0x0 ❌
400 0xda763fc8cb9ff9e58e67937de0bbe1c7 0x8598a4df299005e0 0x0 ❌
For printing, we need to prove b=55,m=66.
prove 55 66
-- out --
✅ proved b=55 m=66 t=61+½
It works! In fact we can shorten the middle to 64 bits before things get iffy:
prove 55 66
prove 55 65
prove 55 64
prove 55 63
prove 55 62
-- out --
✅ proved b=55 m=66 t=61+½
✅ proved b=55 m=65 t=62+½
✅ proved b=55 m=64 t=63+½
❌ disproved b=55 m=63 t=64+½
167 0xd910f7ff28069da41b2ba1518094da05 0x7b6e56a6b7fd53 0x0 ❌
❌ disproved b=55 m=62 t=65+½
167 0xd910f7ff28069da41b2ba1518094da05 0x7b6e56a6b7fd53 0x0 ❌
201 0xd106f86e69d785c7e13336d701beba53 0x68224666341b59 0x1 ❌
211 0xf356f7ebf83552fe0583f6b8c4124d44 0x69923a6ce74f07 0x0 ❌
Lemma 9. For p ∈ [−400, −28] ∪ [28, 400], b ≤ 55, and m ≥ 66, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1.
Proof. We calculated above that middle ≥ 2. By Lemma 4, Scale(x, e, p) computes uscale(x, e, p). ∎
For parsing, we want to prove b=64, m=73.
prove 64 73
-- out --
✅ proved b=64 m=73 t=54+½
It also works! But we're right on the edge. Shortening the middle by one bit breaks the proof:
prove 64 73
prove 64 72
-- out --
✅ proved b=64 m=73 t=54+½
❌ disproved b=64 m=72 t=55+½
-93 0x857fcae62d8493a56f70a4400c562ddc 0xf324bb0720dbe7fe 0x1 ❌
Lemma 10. For p ∈ [−400, −28] ∪ [28, 400], b ≤ 64, and m ≥ 73, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1.
Proof. We calculated above that middle ≥ 2. By Lemma 4, Scale(x, e, p) computes uscale(x, e, p). ∎
Bonus: 64-bit Input, 64-bit Output?
We don't need full 64-bit input and 64-bit output, but if we did, there is a way to make it work at only a minor performance cost. It turns out that for 64-bit input and 64-bit output, for any given p, considering all the inputs x that cause middle = 0, either all the middles overflowed or none of them did. So we can use a lookup table to decide how to interpret middle = 0.
The implementation steps would be:
- Note that the proofs remain valid without middle ≠ 1.
- Don't make use of the middle ≠ 1 optimization in the uscale implementation.
- When p is large, force the sticky bit to 1 instead of trying to compute it from middle.
- When middle = 0 for a large p, consult a table of hint bits indexed by p to decide whether middle has overflowed. If so, decrement top (see the sketch after this list).
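Here is a rough Python sketch of how that interpretation step might look (hypothetical code, not from the post; the hint table itself would be generated by the analysis below, one bit per large p):

# interpret(top, middle, p, hint) returns top' || sticky for the b=64, m=64 case.
# hint maps a large p to 1 if a zero middle means a carry overflowed into top.
def interpret(top, middle, p, hint):
    if -27 <= p <= 27:
        return (top << 1) | (1 if middle != 0 else 0)
    if middle == 0 and hint.get(p, 0):
        top -= 1               # the zero middle was an overflow carry; undo it
    return (top << 1) | 1      # large p: force the sticky bit to 1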
Here is a proof that this works.
First, define topdiff which computes the difference between top and topℝ for a given b, m, p.
# topdiff computes top - topR.
op (b m p) topdiff x =
top = (x * pm p) >> b+m
topR = (floor x * (10**p) / 2**pe p) >> b+m
top - topR
Next, define hint, which is like check1 in that it looks for counterexamples. When it finds counterexamples, it computes topdiff for each one and reports whether they all match, and if so what their value is.
# (b m) hint p returns (p pm x middle fail) where pm is (pm p).
# If there is a counterexample to p, x is the first one,
# middle is (x*pm)'s middle bits, and fail is 1, 2, or 3:
# 1 if all counterexamples have top = topR
# 2 if all counterexamples have top = topR+1
# 3 if both kinds of counterexamples exist or other counterexamples exist
# If there is no counterexample, x middle fail are 0 0 0.
op (b m) hint p =
x = modmin (2**b-1) ((2**b)-1) (pm p) (2**b+m)
middle = ((x * pm p) mod 2**b+m) >> b
middle >= 1: p (pm p) x middle 0
all = modfindall (2**b-1) ((2**b)-1) (pm p) (2**b+m) 0 ((2**b)-1)
diffs = b m p topdiff@ all
equal = or/ diffs == 0
carry = or/ diffs == 1
other = ((count all) >= 100) or or/ (diffs != 0) and (diffs != 1)
p (pm p) x middle ((1*equal)|(2*carry)|(3*other))
Finally, define hints, which is like show check. It gets the hint results for all large p and prints a summary of how many were in four categories: no hints needed, all hints 0, all hints 1, mixed hints.
op hints (b m) =
table = mix b m hint@ (seq -400 -28), (seq 28 400)
(box 'b=', (text b), ' m=', (text m)), (+/ mix table[;4] ==@ iota 4)
Now we can try b=64,m=64:
hints 64 64
-- out --
b=64 m=64 452 184 110 0
The output says that 452 powers don't need hints, 184 need a hint that topℝ = top, and 110 need a hint that topℝ = top − 1. Crucially, 0 need conflicting hints, so the hinted algorithm works for b=64,m=64.
Of course, that leaves 64 top bits, and since one bit is the ½ bit, this is technically only a 63-bit result. (If you only needed a truncated 64-bit result instead of a rounded one, you could use e−1 and read the ½ bit as the final bit of the truncated result.)
It turns out that hints are not enough to get a full 64 bits plus a ½ bit, which would leave a 63-bit middle. In that case, there turn out to be 63 powers where middle = 0 is ambiguous:
hints 64 63
-- out --
b=64 m=63 241 283 159 63
However, if you only have 63 bits of input, then you can have the full 64-bit output:
hints 63 64
-- out --
b=63 m=64 601 86 59 0
Theorem 1. For the cases used in the printing and parsing algorithms, namely p ∈ [−400, 400] with (printing) b ≤ 55, m ≥ 66 and (parsing) b ≤ 64, m ≥ 73, Scale is correct and middle ≠ 1.
Proof. We proved these five cases:
Lemma 3. For exact results, Scale computes uscale(x, e, p) and middle ≠ 1.
Lemma 7. For inexact results and p ∈ [0, 27] and b ≤ 64, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1.
Lemma 8. For inexact results, p ∈ [−27, −1], and b ≤ 64, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1.
Lemma 9. For p ∈ [−400, −28] ∪ [28, 400], b ≤ 55, and m ≥ 66, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1.
Lemma 10. For p ∈ [−400, −28] ∪ [28, 400], b ≤ 64, and m ≥ 73, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1.
The result follows directly from these. ∎
The proof we just finished is the most precise analysis we can do. It enables tricks like the hint table for 64-bit input and 64-bit output. However, for printing and parsing, we don't need to be quite that precise. We can reduce the four inexact cases to two by analyzing the exact pmℝ values instead of our rounded pm values. We will show that the spacing around the exact integer results is wide enough that all the inexact results can't have middles near 0 or 2^m. The idea of proving a minimal spacing around the exact integer results is due to Michel Hack, although the proof is new.
Specifically, we can use the same machinery we just built to prove that middleℝ ∈ [2, 2^m − 2] for inexact results with p ∈ [−400, 400], eliminating the need for Lemmas 7 and 8 by generalizing Lemmas 9 and 10. To do that, we analyze the non-zero results of x·pmℝ mod 2^(b+m). (If that expression is zero, the result is exact, and we are only analyzing the inexact case.)
Let's start by defining gcd and pmR, which returns pn pd such that pmℝ = pn/pd.
op x gcd y =
not x*y: x+y
x > y: (x mod y) gcd y
x <= y: x gcd (y mod x)
(15 gcd 40) is 5
op pmR p =
e = pe p
num = (10**(0 max p)) * (2**-(0 min e))
denom = (10**-(0 min p)) * (2**(0 max e))
num denom / num gcd denom
(pmR -5) is (2**139) (5**5)
Let's also define a helper zlog that is like log₂|x| except that zlog 0 is 0.
op zlog x =
x == 0: 0
2 log abs x
(zlog 4) is 2
(zlog 0) is 0
Now we can write an exact version of check1. We want to analyze x·pmℝ mod 2^(b+m), but to use integers, instead we analyze xR = x·pn mod pd·2^(b+m). We find the x ∈ [xmin, xmax] with minimal xR > 0 and the y ∈ [xmin, xmax] with maximal yR. Then we convert those to middle values by dividing by pd·2^b.
op (b m) check1R p =
pn pd = pmR p
xmin = 2**b-1
xmax = (2**b)-1
M = pd * 2**b+m
x = modminge xmin xmax pn M 1
x < 0: p 0 0 0 0
y = modmax xmin xmax pn M
xmiddle ymiddle = float ((x y * pn) mod M) / pd * 2**b
p x y xmiddle ((xmiddle < 2) or (ymiddle > M-2))
op (b m) checkR ps = mix b m check1R@ ps
show1 64 64 check1 200
show1 64 64 check1R 200
-- out --
200 0xa738c6bebb12d16cb428f8ac016561dc 0xffe389b3cdb6c3d0 0x34 .
200 0xffe389b3cdb6c3d0 0x8064104249b3c03e 0x1.9c2145p+05 .
op proveR (b m) =
table = bad b m checkR (seq -400 400)
what = 'b=', (text b), ' m=', (text m), ' t=', text ((127-1)-m), '+½'
(count table) == 0: print '✅ proved ' what
print '❌ disproved' what
print short show table
proveR 55 66
proveR 55 62
proveR 64 73
proveR 64 72
-- out --
✅ proved b=55 m=66 t=60+½
❌ disproved b=55 m=62 t=64+½
167 0x7b6e56a6b7fd53 0x463bc17af3f48e 0x1.817b1cp-02 ❌
201 0x68224666341b59 0x588220995c452a 0x1.8e0a91p-02 ❌
211 0x69923a6ce74f07 0x597216983bdc1a 0x1.14fbd3p-03 ❌
221 0x404a552daaaeea 0x50ad765f4fd461 0x1.de3812p+00 ❌
✅ proved b=64 m=73 t=53+½
❌ disproved b=64 m=72 t=54+½
-93 0xf324bb0720dbe7fe 0xc743006eaf2d0e4f 0x1.3a8eb6p+00 ❌
The failures that proveR finds mostly correspond to the failures that prove found, except that proveR is slightly more conservative: the reported failure for p=221 is a false positive.
prove 55 66
prove 55 62
prove 64 73
prove 64 72
-- out --
✅ proved b=55 m=66 t=61+½
❌ disproved b=55 m=62 t=65+½
167 0xd910f7ff28069da41b2ba1518094da05 0x7b6e56a6b7fd53 0x0 ❌
201 0xd106f86e69d785c7e13336d701beba53 0x68224666341b59 0x1 ❌
211 0xf356f7ebf83552fe0583f6b8c4124d44 0x69923a6ce74f07 0x0 ❌
✅ proved b=64 m=73 t=54+½
❌ disproved b=64 m=72 t=55+½
-93 0x857fcae62d8493a56f70a4400c562ddc 0xf324bb0720dbe7fe 0x1 ❌
Lemma 11. For p ∈ [−400, 400], b ≤ 55, and m ≥ 66, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1.
Proof. Our Ivy code proveR 55 66 confirmed that middleℝ ∈ [2, 2^m − 2]. By Lemma 5 and Lemma 6, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1. ∎
Lemma 12. For p ∈ [−400, 400], b ≤ 64, and m ≥ 73, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1.
Proof. Our Ivy code proveR 64 73 confirmed that middleℝ ∈ [2, 2^m − 2]. By Lemma 5 and Lemma 6, Scale(x, e, p) computes uscale(x, e, p) and middle ≠ 1. ∎
Theorem 2. For the cases used in the printing and parsing algorithms, namely p ∈ [−400, 400] with (printing) b ≤ 55, m ≥ 66 and (parsing) b ≤ 64, m ≥ 73, Scale is correct and middle ≠ 1.
Proof. Follows from Lemma 3, Lemma 11, and Lemma 12. ∎
Related Work
Parts of this proof have been put together in different ways for other purposes before, most notably to prove that exact truncated scaling can be implemented using 128-bit mantissas in floating-point parsing and printing algorithms. This section traces the history of the ideas as best I have been able to determine it. In these summaries, I am using the terminology and notation of this post, such as top, middle, bottom, xR and pmℝ, for consistency and ease of understanding. Those terms and notations do not appear in the actual related work.
This section is concerned with the proof methods in these papers and only touches on the actual algorithms to the extent that they are relevant to what was proved. The main post's related work discusses the algorithms in more detail.
Paxson 1991
- Vern Paxson, "A Program for Testing IEEE Decimal-Binary Conversion", class paper 1991.
The earliest work that I have found that linked modular minimization to floating-point conversions is Paxson's 1991 paper, already mentioned above and written for one of William Kahan's graduate classes. Paxson credits Tim Peters for the modular minimization algorithms, citing an email discussion on validgh!numeric-interest@uunet.uu.net in April 1991:
In the following section we derive a modular equation which if minimized produces especially difficult conversion inputs; those that lie as close as possible to exactly half way between two representable outputs. We then develop the theoretical framework for demonstrating the correctness of two algorithms developed by Tim Peters for solving such a modular minimization problem in O(log(N)) time.
I have been unable to find copies of the numeric-interest email discussion.
Peters broke down the minimization problem into a two-step process, which I followed in this proof. Using this post's notation (xR = x·c mod m), the two steps in Paxson's paper (with two algorithms each) are:
- FirstModBelow: Find the first x ≥ 0 with xR ≤ hi. FirstModAbove: Find the first x ≥ 0 with xR ≥ lo.
- ModMin: Find the x ∈ [xmin, xmax] that maximizes xR ≤ hi. ModMax: Find the x ∈ [xmin, xmax] that minimizes xR ≥ lo.
(The names ModMin and ModMax seem inverted from their definitions, but perhaps "Min" refers to finding something below a limit and "Max" to finding something above a limit. They are certainly inverted from this post's usage.)
In contrast, this post's algorithms are:
- modfirst: Find the first x ≥ 0 with xR ∈ [lo, hi].
- modfind: Find the first x ∈ [xmin, xmax] with xR ∈ [lo, hi].
- modmin: Find the x ∈ [xmin, xmax] that minimizes xR.
- modminge: Find the x ∈ [xmin, xmax] that minimizes xR ≥ lo.
- modmax: Find the x ∈ [xmin, xmax] that maximizes xR.
It is possible to use modfirst to implement Paxsonβs FirstModBelow and FirstModAbove, and vice versa, so they are equivalent in power.
In Paxson's paper, the implementation and correctness of FirstModBelow and FirstModAbove depend on computing the convergents of continued fractions of c/m and proving properties about them. Specifically, the result of FirstModBelow must be the denominator of a convergent or semiconvergent in the continued fraction for c/m, so it suffices to find the last even convergent p_2i/q_2i such that (q_2i)R > hi but (q_2(i+1))R < hi, and then compute the correct q_2i + k·q_(2i+1) by looking at how much (q_(2i+1))R subtracts from (q_2i)R and subtracting it just enough times. I had resigned myself to implementing this approach before I found David Wärn's simpler proof of the direct GCD-like approach in modfirst. The intermediate steps in GCD(p,q) are exactly the continued