SH4's FSRRA

Wed Aug 03, 2016 1:38 am

About a year ago, MetalliC asked me to look at some dumps of the FSRRA (Floating Point Square Reciprocal Approximate) instruction used by the SH4A architecture. The emulation-related motivation for it being that the custom CPU used by the Dreamcast/Naomi systems implements some instructions from the SH4A architecture (particularly FSRRA), and the current inadequate emulation of the instruction was thought to be causing some FPU-related problems in the Dreamcast emulation in Demul.

The FSRRA takes as input a single-precision IEEE 754 floating point number (basically a "float" in modern programming languages of the C family) and output another single-precision number approaching the reciprocal of the square root of the input (1/sqrt(x)). What's the difficulty of emulating this instruction? After all, a mathematical function is the same wherever you use it, so calculating it with whatever good math library we could put our hands on should be enough, right? Well, the issue with the FSRRA is the final "A" in his name; the SH4 is trading precision for speed; the instruction executes in just 1 clock cycle (wow!), but the resulting output is not the best approximation to the real function with the available precision given by a float, just a (pretty good) approximation to +-2 float-epsilons of that optimal approximation (an epsilon in this context is the numerical weight of the lowest bit of the mantisa of the floating point number); obtaining a bit-perfect emulation of the instruction requires understanding how the values are being calculated in order to match the (rather complex) pattern of approximation errors.

At this point, I'm almost a month within this issue, and while this is still a WIP of uncertain future, I have amassed a good amount of interesting results which show some light on the problem. As some people (particularly some devs of the team) seem to enjoy this kind of technical discussions, even if they are not interested in joining the fun, I will briefly sum up what I know at this point.

To start with, the dump provided by MetalliC contains the outputs for all the positive normalized numbers in single-precision. The first tests showed that, as expected, the instruction is managing different scales of input numbers coherently and, when two given inputs just differ in an even number of powers of two, the output will be identical but for the power-of-two exponent of the output. So we can, for the rest of this discussion, just focus on the [0.25,1) interval, which will produce results in the (1,2] interval. That interval is [0x3e800000, 0x3f800000) in IEEE 754 format. A IEEE 754 single-precision number comprises a bit sign (the highest, b31), 7 exponent bits (b23-b30) and 23 mantisa bits (b0-b22), with a 24th (the highest one) being implicit and derived from the exponent bits). The range we are interested in comprises two different exponent values:[0x3e800000-0x3effffff] ([0.25,0.5)) and [0x3f000000-0x3f7fffff] ([0.5,1)); in each of them the lowest mantisa bit has different weight. This justify why I study them separately in many tests (as you will see below). Besides, when I need to obtain a reference value for comparison purposes, I just obtain a double approximation with C++'s library (which, when compared with the single precision values, can be considered as the "real" value for all reasonable purposes); that's what I mean by "C++" in some of the graphs below. Enough preliminaries; let's jump to the interesting stuff.

How good is the approximation to the real function given by FSSRA; well, at a bird's glance, pretty good:

: results2.png (4.07 KiB) Viewed 172912 times

But what if we look at the errors? Here is a sample of them:

: results4.png (10.16 KiB) Viewed 172912 times

Now the problems start to be evident; in the reference interval of inputs [0.25,1), the output belongs to the [1,2] range, so, except in the 0.25 extreme, the epsilon of the output is 2^-23 or approximately 1.19e-7; an optimal approximation the the real value should give us errors in the [-5.6e-8, 5.6e-8] range, and that's clearly not what the FSRRA us giving us.

To get some insight on the distribution of errors, we could show the infinite-precision errors when calculated against the "real" values:

: results9.png (8.1 KiB) Viewed 172912 times

However, given that the difference with the optimal single-precision approximation is technically just a integer number of epsilons away from that optimal one, we could maybe get a better picture from the distribution of such deviations:

: results17.png (6.68 KiB) Viewed 172912 times

In a problem as this, trying to understand those errors by looking at individual inputs is hopeless; our hope of success rest on being able to understand the general picture of how those errors are distributed in order to understand how the calculation done by the hardware works. In order to continue gaining insight about the nature of those errors, I decided to look next at the difference of consecutive errors ("consecutive" here means for inputs just differing a single epsilon); the hope is to be able to catch some "jumps" which could give hints about how the data is being calculated.

So, to start with, what should be expecting? That is, if we had the optimal single-precision approximations, compared it with the real values, and calculated those differences, what should be obtain? Take a look to the distribution of the theoretical values:

: results14.png (7.04 KiB) Viewed 172912 times

For every set of data, you can see two branches, separated by a full epsilon (1.19e-7); why two? basically, you are comprising two full ranges of exponents in the input in just one range in the output, so sometimes, when going from a point to the next one, the output will be different (giving a point in the second branch, but on others, the output will be the same (with greater error in the second one, giving a point in the first branch). The outlier points at the extremes of the branches are just numerical artifacts due to the way in which I'm aggregating the data; just ignore them.

Ok, that was the expected optimal distribution; what about the FSRRA? Look:

: results11.png (7.2 KiB) Viewed 172912 times

Similar, but not exactly equal; given that the function is decreasing, and that we know the FSRRA makes errors, getting the third branch is not excessively surprising; if the FSRRA approximation is monotonically decreasing, it's to be expected that, when the function goes down one step (in single-point precision), it can happen that sometimes the function goes down or not in par with the optimal values; sometimes, the FSRRA approximation goes down when the optimal one does not, or the reverse. That should create {0,+1,-1} epsilons of difference in these kind of values, so getting three branches is not unexpected. However, just in the lower left corner, there are a couple of very interesting outliers. They correspond to the following pairs:

Code: Select all

3eb3ffff-3eb40000
3edbffff-3edc0000

Cool, that's interesting; clearly, something is happening at that power of two boundaries. That seems to suggest that, either at multiples of 0x40000 or at some finer binary subdivision, the way the data is calculated is changing. In order to confirm it, an autocorrelation analysis of the errors resulted to be determinant:

: results16.png (9.36 KiB) Viewed 172912 times

Finally, we had reach a nontrivial result; the graph show very clear peaks; there are 32 in each main interval, which, in the IEEE 754 format, correspond to the 0x40000 multiples the previous test had suggested. The alternation of signs of the peaks can look strange, but there are reasonable explanations for it. By instance, if you suppose that in each of those 32 subintervals, the function is approximation through interpolation. In most approximation formulae, differences of values are used, so you can perfectly have a (n approximation to) extreme value used in a subinterval which one sign, and the same value used in the next one with the contrary sign.

Digression: Up to this point, naively, I had expected for the calculation done by the SH4 to involve Newton iterations. For 1/sqrt(x), you would apply Newton's method to the function (for a fixed input x) f(y) = 1/(y*y*x)-1; the resulting Newton iteration is, then, y_(n+1) = y_n*(3/2-y*y*(x/2)). If, say, each of the subintervals is using a different initial value, then a couple of iterations is enough to give us single-precision reasonable values; if the initial iteration involves something more elaborate (a linear approximation maybe?) a single iteration should be enough. However, that guess was problematic; to start with, Newton iterations are just too good, so usually you will either get far from the real values, or you will nail it for big intervals, which wasn't the observed behaviour; besides, the errors given by Newton's iterations are consistently biased towards the positive (see the graph below for f(y) for x=0.5; Newton's method will shoot you from a certain approximated point in the tangent direction until reaching the x axis; you will always land on the left side of the zero), and that is not what the logged data show, so a correction should be done finally. All this is problematic when you take into account that a single Newton iteration requires 3 multiplications an a sum (remember the precision-speed tradeoff).

: y2x.png (4.24 KiB) Viewed 172912 times

So, let's suppose some kind of polynomial interpolation is being used in each subinterval. Which one would I choose if I were to design this approximation that way? If speed is a limiting factor (it is), it would make sense to search for the minimum-degree polynomial which can provide us with an error in the expected required precision. To start making some numbers, let's notice that for the [0.25,0.5) interval, each of the 32 subintervals correspond to a 2^-7 range. Can we use a quadratic polynomial (at the cost of two multiplications and two sums, and the multiplications could be simpler than normal ones as you only need to take into account the 18 lowest bits of the mantisa) to approximate the real function on it? In the literature, you can find some error formulae for different types of analytic interpolation and approximation polynomials. However, if you are serious about getting the best, you wouldn't accept anything but the minimax quadratic approximation (the quadratic polynomial which minimizes the maximum approximation error). Remez algorithm(https://en.wikipedia.org/wiki/Remez_algorithm) is a convenient way to calculate it, so I took the first subinterval just at 0.25 (where the derivatives are highest, so it's expected to be the most difficult to approximate) and tried it. The results would give some equioscillating errors (https://en.wikipedia.org/wiki/Equioscillation_theorem) of about 5e-7, slightly less than 5 epsilons. Good try, but not good enough (remember the +-2 epsilons errors of the FSRRA).

Should be start considering cubic polynomials at this point? Well, there was an intermediate approach, you could start calculating the reciprocal of a quartic root with a quadratic polynomial and then square it, at the cost of three multiplications and two sums. Why someone would even think on such strange approach? Because the reciprocal quartic indeed behaves more nicely in the reference [0.25,1) interval (his derivatives are lower), so the quadratic approximation works better on it. After giving it a try, Remez's algorithm produced a equioscillation error of about 1.5e-7 with quadratic polynomials; however, when you finally square the number, the relative error doubles, so you get an error of about +-3e-7. Close, but still not enough.

So cubics were at this point the kings of the hill in my mental evaluations. It doesn't really matter which flavor if cubic you use (Hermite interpolation in the subinterval? Taylor approximation from the center?), most/all cubics give error terms which would fall below the expected error range. Of course, that's only the errors due to the approximation of the function by the polynomial; you still must add the errors due to be using finite-precision operations (which can be significative, and a real headache to reverse engineer, I should add).

So, how to check if cubic interpolation was indeed being used? Most polynomial approximants create very characteristic error graphs in the corresponding interval, so, once discovered the subinterval structure, I decided to aggregate all the errors from different subintervals to see if a clear picture arised (the errors are indexed by the lower bits and aggregated in an attempt to smooth the graph):

: results19.png (14.29 KiB) Viewed 172912 times

Oops. Unexpected; even ignoring the high-frequency noise (probably due to finite-precision errors), that's not what you would expect from many cubic approximants. The fact that it seems to better approximate the function in the center could point to a Taylor polynomial expanded from the center of the subinterval. But, wait a minute, are we sure the number of errors is lower in the center; maybe there is just cancelation of positive and negative errors. Let's take a look this time to the aggregation of the absolute errors:

: results20.png (17.3 KiB) Viewed 172912 times

Funky. What was going on? A more detailed analysis of the errors (which I'm not showing here) shown that the data was clearly biased in each subinterval (the errors on some subintervals would be biased toward the positives; some others toward the negatives), so, OK, maybe aggregating the data from different subintervals wasn't a good idea, and it's adding to the confusion. I decided to focus in single subintervals then.

Let's focus in the first subinterval ([0x3e800000,0x3e840000), or [0.25,0,25+1/128)). What would be the errors expected in the optimal single-precision approximation. Obviously, all the errors should be in the [-5.6e-7, 5.6e-7] interval:

: results36_2.png (6.14 KiB) Viewed 172912 times

And what are the errors of FSRRA?

: results36.png (9.96 KiB) Viewed 172912 times

Can I match that behaviour with a cubic? If we forget about the high-frequency noise, sure:

: results36_3.png (5.84 KiB) Viewed 172912 times

Beyond the general shape, there are two noticeable observations. First, even if this can't be seen in the graph, there are lots of consecutive runs of inputs which show some alternating error behaviour (let's say, the even will give errors in a clear progression, while the odd will give a parallel progression separated by an epsilon). This behaviour, by itself, is not easily replicated (trust me, I tried

). Somehow, the lowest bit of the input is having a big role in the error for whatever reason. Still, segregating the inputs by the lowest bit, the behaviour is more normal, so we could study both separately and then, hopefully, see what's going on...

The second noticeable thing are those curious sawteeth that can be seen at the start of the interval. That seemed some kind of numerical artifact that could throw some light on the internal calculations, but I haven't had much success understanding why they happen.

However, I noticed the following fact: the next subinterval would show some kind of symmetric behaviour, with the sawteeth at the right and the general shape of the function inverted. Look at the first two intervals plotted side by side:

: results40.png (9.38 KiB) Viewed 172912 times

Indeed, that general behaviour is replicated in all the remaining subintervals, with the even ones having the sawteeth at the left and the odd ones at the right. Look at the four first subintervals:

: results41.png (9.53 KiB) Viewed 172912 times

These are the full 32 intervals in the [0.25,0.5) range:

: results42.png (11.21 KiB) Viewed 172912 times

That symmetry is too meaningful so as to be casual, so it makes sense to guess that, indeed, those subintervals are somehow coupled. Let's try to play with them... What if we let the first subinterval as is but invert the second subinterval and shift it to match the first:

: results43.png (9.7 KiB) Viewed 172912 times

Cool, but the sign inversion is difficult to justify. What if I just swap the position of both subintervals (plotting the second at the left) and shifting vertically to match both:

: results44.png (9.99 KiB) Viewed 172912 times

The graph isn't so nicely, but at this point in time, I suspect this is closest to the true. Why? Due to a couple of reasons:

First, the general shape of this is just what the error term of a cubic approximant should look like. Second, there is a clear interpretation of what is happening numerically: till seeing this, I was supposing that the subintervals were 2^18 points wide, using the lowest 18 bits (b0-b17); what's happening in my opinion is that, indeed, 19 bits are being used (so there are 16 subintervals per each of the reference intervals), and the 19th (b18) is being used/extended as a sign bit. So what I previously considered the second subinterval is just the negative part of this new wider subinterval. What happens with the sawteeth artifacts? In this interpretation, that is happening with the lowest numbers (in absolute value); my guess is that the existence of small numbers in the intermediate calculations coupled with a biased truncation could create some similar artifacts (even if I also guess that it could be difficult to match them).

And this is the point in which the research is. More info is expected to follow as history unfolds.

Wed Aug 03, 2016 9:34 pm

I have done some tests today, and there are new significant conclusions. First, when trying to combine the two first subintervals to obtain a cubic approximant like I said in the end of the first post, things didn't matched well. The problem is that, when joining the two chunks like I said, a discontinuity arise; even if you correct it, the discontinuity in the higher derivatives make the cubic approximant go astray. Even a 'forced oscillation cubic Chebyshev' approximant will make error of order 1e-5. So the idea collapses.

But then, I noticed something... I did various attempts trying different approximation strategies, in order to see if some of them were qualitatively similar to what had been observed in the logged data. The results of one of them simply didn't have much sense and, after a while wondering where the problem lurked, I discovered I had been compiling that and previous tests with the -O2 option in gcc, and some kind of floating-point optimization was affecting some of my attempts. When redoing them, I saw this in one of them:

: results46.png (17.68 KiB) Viewed 172871 times

As indicated, I was trying to use linear interpolation plus a single iteration of Newton's method. Yes, the general shape is not equal, but you can clearly see the sawteeth.

Then, I went back to a previous failed test in which I had tried to reverse a hypothetical last Newton iteration, and, after discovering that the results were also affected by the gcc optimizations, I got much better results than before.

So the new king of the hill is a linear approximation plus a single Newton's iteration. This give new life to an idea I discarded previously: the strange odd-even high frequency errors I observed in the data could be due to that Newton iteration usually uses x/2 as an input. So possibly the lowest bit is being discarded when applying the Newton iteration (but, I would guess, not when doing the linear approximation), which could (hopefully) explain that strange pattern.

Fri Nov 04, 2016 9:45 pm

Have you seen this algorithm?

https://en.wikipedia.org/wiki/Fast_inverse_square_root

-Frank

Tue Dec 27, 2016 12:50 pm

Sure. The FSRRA is far more precise, though.

I still think this is the result of linear interpolation plus a single Newton step, but the coefficient in the Newton step must be modified so as to try to minimize the error. An unmodified Newton step wouldn't create errors with both signs as we see in the FSRRA data.

Hopefully, I will find some time in the future to resume this work. Not in the short term, though.

Fri Mar 02, 2018 5:51 pm

Balrog pointed out there are a few Hitachi patents which may cover FSRRA and might explain how the algorithm works:

https://patents.google.com/patent/US5293558
https://patents.google.com/patent/US5631858
https://patents.google.com/patent/US5956263

LN

Mon Mar 19, 2018 1:25 pm

While we're necroing interesting posts... I've just found out Sega in model 1/2 uses for a fast inverse an approximate inverse and a newton-raphson step. The newton-raphson step needs 2/x and 1/(x*x), which are both provided by a rom. So, what if there's an internal rom that provides not one but *two* values for newton-raphson?

OG.

SH4's FSRRA

SH4's FSRRA

Re: SH4's FSRRA

Re: SH4's FSRRA

Re: SH4's FSRRA

Re: SH4's FSRRA

Re: SH4's FSRRA