The FSRRA takes as input a single-precision IEEE 754 floating point number (basically a "float" in modern programming languages of the C family) and output another single-precision number approaching the reciprocal of the square root of the input (1/sqrt(x)). What's the difficulty of emulating this instruction? After all, a mathematical function is the same wherever you use it, so calculating it with whatever good math library we could put our hands on should be enough, right? Well, the issue with the FSRRA is the final "A" in his name; the SH4 is trading precision for speed; the instruction executes in just 1 clock cycle (wow!), but the resulting output is not the best approximation to the real function with the available precision given by a float, just a (pretty good) approximation to +-2 float-epsilons of that optimal approximation (an epsilon in this context is the numerical weight of the lowest bit of the mantisa of the floating point number); obtaining a bit-perfect emulation of the instruction requires understanding how the values are being calculated in order to match the (rather complex) pattern of approximation errors.

At this point, I'm almost a month within this issue, and while this is still a WIP of uncertain future, I have amassed a good amount of interesting results which show some light on the problem. As some people (particularly some devs of the team) seem to enjoy this kind of technical discussions, even if they are not interested in joining the fun, I will briefly sum up what I know at this point.

To start with, the dump provided by MetalliC contains the outputs for all the positive normalized numbers in single-precision. The first tests showed that, as expected, the instruction is managing different scales of input numbers coherently and, when two given inputs just differ in an even number of powers of two, the output will be identical but for the power-of-two exponent of the output. So we can, for the rest of this discussion, just focus on the [0.25,1) interval, which will produce results in the (1,2] interval. That interval is [0x3e800000, 0x3f800000) in IEEE 754 format. A IEEE 754 single-precision number comprises a bit sign (the highest, b31), 7 exponent bits (b23-b30) and 23 mantisa bits (b0-b22), with a 24th (the highest one) being implicit and derived from the exponent bits). The range we are interested in comprises two different exponent values:[0x3e800000-0x3effffff] ([0.25,0.5)) and [0x3f000000-0x3f7fffff] ([0.5,1)); in each of them the lowest mantisa bit has different weight. This justify why I study them separately in many tests (as you will see below). Besides, when I need to obtain a reference value for comparison purposes, I just obtain a double approximation with C++'s library (which, when compared with the single precision values, can be considered as the "real" value for all reasonable purposes); that's what I mean by "C++" in some of the graphs below. Enough preliminaries; let's jump to the interesting stuff.

How good is the approximation to the real function given by FSSRA; well, at a bird's glance, pretty good: But what if we look at the errors? Here is a sample of them:

Now the problems start to be evident; in the reference interval of inputs [0.25,1), the output belongs to the [1,2] range, so, except in the 0.25 extreme, the epsilon of the output is 2^-23 or approximately 1.19e-7; an optimal approximation the the real value should give us errors in the [-5.6e-8, 5.6e-8] range, and that's clearly not what the FSRRA us giving us.

To get some insight on the distribution of errors, we could show the infinite-precision errors when calculated against the "real" values:

However, given that the difference with the optimal single-precision approximation is technically just a integer number of epsilons away from that optimal one, we could maybe get a better picture from the distribution of such deviations:

In a problem as this, trying to understand those errors by looking at individual inputs is hopeless; our hope of success rest on being able to understand the general picture of how those errors are distributed in order to understand how the calculation done by the hardware works. In order to continue gaining insight about the nature of those errors, I decided to look next at the difference of consecutive errors ("consecutive" here means for inputs just differing a single epsilon); the hope is to be able to catch some "jumps" which could give hints about how the data is being calculated.

So, to start with, what should be expecting? That is, if we had the optimal single-precision approximations, compared it with the real values, and calculated those differences, what should be obtain? Take a look to the distribution of the theoretical values: For every set of data, you can see two branches, separated by a full epsilon (1.19e-7); why two? basically, you are comprising two full ranges of exponents in the input in just one range in the output, so sometimes, when going from a point to the next one, the output will be different (giving a point in the second branch, but on others, the output will be the same (with greater error in the second one, giving a point in the first branch). The outlier points at the extremes of the branches are just numerical artifacts due to the way in which I'm aggregating the data; just ignore them.

Ok, that was the expected optimal distribution; what about the FSRRA? Look: Similar, but not exactly equal; given that the function is decreasing, and that we know the FSRRA makes errors, getting the third branch is not excessively surprising; if the FSRRA approximation is monotonically decreasing, it's to be expected that, when the function goes down one step (in single-point precision), it can happen that sometimes the function goes down or not in par with the optimal values; sometimes, the FSRRA approximation goes down when the optimal one does not, or the reverse. That should create {0,+1,-1} epsilons of difference in these kind of values, so getting three branches is not unexpected. However, just in the lower left corner, there are a couple of very interesting outliers. They correspond to the following pairs:

Code: Select all

```
3eb3ffff-3eb40000
3edbffff-3edc0000
```

Finally, we had reach a nontrivial result; the graph show very clear peaks; there are 32 in each main interval, which, in the IEEE 754 format, correspond to the 0x40000 multiples the previous test had suggested. The alternation of signs of the peaks can look strange, but there are reasonable explanations for it. By instance, if you suppose that in each of those 32 subintervals, the function is approximation through interpolation. In most approximation formulae, differences of values are used, so you can perfectly have a (n approximation to) extreme value used in a subinterval which one sign, and the same value used in the next one with the contrary sign.

Digression: Up to this point, naively, I had expected for the calculation done by the SH4 to involve Newton iterations. For 1/sqrt(x), you would apply Newton's method to the function (for a fixed input x) f(y) = 1/(y*y*x)-1; the resulting Newton iteration is, then, y_(n+1) = y_n*(3/2-y*y*(x/2)). If, say, each of the subintervals is using a different initial value, then a couple of iterations is enough to give us single-precision reasonable values; if the initial iteration involves something more elaborate (a linear approximation maybe?) a single iteration should be enough. However, that guess was problematic; to start with, Newton iterations are just too good, so usually you will either get far from the real values, or you will nail it for big intervals, which wasn't the observed behaviour; besides, the errors given by Newton's iterations are consistently biased towards the positive (see the graph below for f(y) for x=0.5; Newton's method will shoot you from a certain approximated point in the tangent direction until reaching the x axis; you will always land on the left side of the zero), and that is not what the logged data show, so a correction should be done finally. All this is problematic when you take into account that a single Newton iteration requires 3 multiplications an a sum (remember the precision-speed tradeoff).

So, let's suppose some kind of polynomial interpolation is being used in each subinterval. Which one would I choose if I were to design this approximation that way? If speed is a limiting factor (it is), it would make sense to search for the minimum-degree polynomial which can provide us with an error in the expected required precision. To start making some numbers, let's notice that for the [0.25,0.5) interval, each of the 32 subintervals correspond to a 2^-7 range. Can we use a quadratic polynomial (at the cost of two multiplications and two sums, and the multiplications could be simpler than normal ones as you only need to take into account the 18 lowest bits of the mantisa) to approximate the real function on it? In the literature, you can find some error formulae for different types of analytic interpolation and approximation polynomials. However, if you are serious about getting the best, you wouldn't accept anything but the minimax quadratic approximation (the quadratic polynomial which minimizes the maximum approximation error). Remez algorithm(https://en.wikipedia.org/wiki/Remez_algorithm) is a convenient way to calculate it, so I took the first subinterval just at 0.25 (where the derivatives are highest, so it's expected to be the most difficult to approximate) and tried it. The results would give some equioscillating errors (https://en.wikipedia.org/wiki/Equioscillation_theorem) of about 5e-7, slightly less than 5 epsilons. Good try, but not good enough (remember the +-2 epsilons errors of the FSRRA).

Should be start considering cubic polynomials at this point? Well, there was an intermediate approach, you could start calculating the reciprocal of a quartic root with a quadratic polynomial and then square it, at the cost of three multiplications and two sums. Why someone would even think on such strange approach? Because the reciprocal quartic indeed behaves more nicely in the reference [0.25,1) interval (his derivatives are lower), so the quadratic approximation works better on it. After giving it a try, Remez's algorithm produced a equioscillation error of about 1.5e-7 with quadratic polynomials; however, when you finally square the number, the relative error doubles, so you get an error of about +-3e-7. Close, but still not enough.

So cubics were at this point the kings of the hill in my mental evaluations. It doesn't really matter which flavor if cubic you use (Hermite interpolation in the subinterval? Taylor approximation from the center?), most/all cubics give error terms which would fall below the expected error range. Of course, that's only the errors due to the approximation of the function by the polynomial; you still must add the errors due to be using finite-precision operations (which can be significative, and a real headache to reverse engineer, I should add).

So, how to check if cubic interpolation was indeed being used? Most polynomial approximants create very characteristic error graphs in the corresponding interval, so, once discovered the subinterval structure, I decided to aggregate all the errors from different subintervals to see if a clear picture arised (the errors are indexed by the lower bits and aggregated in an attempt to smooth the graph): Oops. Unexpected; even ignoring the high-frequency noise (probably due to finite-precision errors), that's not what you would expect from many cubic approximants. The fact that it seems to better approximate the function in the center could point to a Taylor polynomial expanded from the center of the subinterval. But, wait a minute, are we sure the number of errors is lower in the center; maybe there is just cancelation of positive and negative errors. Let's take a look this time to the aggregation of the absolute errors:

Funky. What was going on? A more detailed analysis of the errors (which I'm not showing here) shown that the data was clearly biased in each subinterval (the errors on some subintervals would be biased toward the positives; some others toward the negatives), so, OK, maybe aggregating the data from different subintervals wasn't a good idea, and it's adding to the confusion. I decided to focus in single subintervals then.

Let's focus in the first subinterval ([0x3e800000,0x3e840000), or [0.25,0,25+1/128)). What would be the errors expected in the optimal single-precision approximation. Obviously, all the errors should be in the [-5.6e-7, 5.6e-7] interval:

And what are the errors of FSRRA? Can I match that behaviour with a cubic? If we forget about the high-frequency noise, sure:

Beyond the general shape, there are two noticeable observations. First, even if this can't be seen in the graph, there are lots of consecutive runs of inputs which show some alternating error behaviour (let's say, the even will give errors in a clear progression, while the odd will give a parallel progression separated by an epsilon). This behaviour, by itself, is not easily replicated (trust me, I tried ). Somehow, the lowest bit of the input is having a big role in the error for whatever reason. Still, segregating the inputs by the lowest bit, the behaviour is more normal, so we could study both separately and then, hopefully, see what's going on...

The second noticeable thing are those curious sawteeth that can be seen at the start of the interval. That seemed some kind of numerical artifact that could throw some light on the internal calculations, but I haven't had much success understanding why they happen.

However, I noticed the following fact: the next subinterval would show some kind of symmetric behaviour, with the sawteeth at the right and the general shape of the function inverted. Look at the first two intervals plotted side by side: Indeed, that general behaviour is replicated in all the remaining subintervals, with the even ones having the sawteeth at the left and the odd ones at the right. Look at the four first subintervals:

These are the full 32 intervals in the [0.25,0.5) range: That symmetry is too meaningful so as to be casual, so it makes sense to guess that, indeed, those subintervals are somehow coupled. Let's try to play with them... What if we let the first subinterval as is but invert the second subinterval and shift it to match the first:

Cool, but the sign inversion is difficult to justify. What if I just swap the position of both subintervals (plotting the second at the left) and shifting vertically to match both: The graph isn't so nicely, but at this point in time, I suspect this is closest to the true. Why? Due to a couple of reasons:

First, the general shape of this is just what the error term of a cubic approximant should look like. Second, there is a clear interpretation of what is happening numerically: till seeing this, I was supposing that the subintervals were 2^18 points wide, using the lowest 18 bits (b0-b17); what's happening in my opinion is that, indeed, 19 bits are being used (so there are 16 subintervals per each of the reference intervals), and the 19th (b18) is being used/extended as a sign bit. So what I previously considered the second subinterval is just the negative part of this new wider subinterval. What happens with the sawteeth artifacts? In this interpretation, that is happening with the lowest numbers (in absolute value); my guess is that the existence of small numbers in the intermediate calculations coupled with a biased truncation could create some similar artifacts (even if I also guess that it could be difficult to match them).

And this is the point in which the research is. More info is expected to follow as history unfolds.