Intro
First things first, but not necessarily in that order.

The C++ standard, for once, clearly states that a non-type template-parameter shall not be declared to have floating point, class, or void type. So, we have this annoying situation where compilers will happily transform & fold expressions involving floating points or templates but are forbidden to provide any coupling.

That's what this thing deals with: it provides various means to inject floating point values on one end as non-type template-parameters, perform some arithmetic and then lower the result back. Incidentally it also allows transforming floating point literals to their machine representation, something that generally involves run-time initialization on most compilers.
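
As a rough illustration of what "machine representation" means here, the sketch below packs a sign, a biased exponent and a 23-bit fraction into the 32-bit pattern of a single precision value as an integral constant, and only copies the bits into a float at run time. The names sketch::pack and sketch::from_bits are made up for this example and are not part of the package; it also assumes a C++11 compiler and a host float that is IEEE-754 binary32.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

namespace sketch {

// Hypothetical helper: the bit pattern is an integral constant expression,
// so it can be folded without any floating point being involved.
template<std::uint32_t Sign, std::uint32_t Exp, std::uint32_t Frac>
struct pack {
    static_assert(Sign <= 1u && Exp <= 0xFFu && Frac <= 0x7FFFFFu,
                  "field out of range");
    static const std::uint32_t value = (Sign << 31) | (Exp << 23) | Frac;
};

// Assumption: float is IEEE-754 binary32, hence exactly 32 bits wide.
inline float from_bits(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "unexpected float size");
    float f;
    std::memcpy(&f, &bits, sizeof f); // well-defined way to reinterpret bits
    return f;
}

} // namespace sketch
```

With these assumptions, sketch::pack<0, 127, 0>::value is the pattern of 1.0f and sketch::pack<1, 127, 0x400000>::value that of -1.5f.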

One of the design goals being to make the whole process transparent, operations are in strict compliance with IEEE-754 where/when possible; the result should match down to the last ulp. That said, some details were left out in this initial implementation: there's only one rounding mode, one precision, no subnormals and no exception flags; there are also signed zeroes and infinities but only one kind of NaN, which can be made signalling or not.

This initial release, licensed under the GPL, authored by Thierry Berger-Perrin <tbptbp@gmail.com> in days of grace of early 2008 and to be found here (dist), is known to work on g++, msvc8 and icc10.

Interface

float_t<sig, exp, sign, [cat]>
  sig   significand
  exp   biased exponent
  sign  set for negatives
  cat   class this value belongs to { normal, zero, inf, NaN }
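
To give an idea of how such a carrier type can be lowered back to a float, here is a minimal sketch. It is not the package's float_t: the name sketch::value, the cat_* enumerators, the choice of storing the 23-bit fraction without its hidden bit, and the default category (zero when the exponent is 0, normal otherwise) are all assumptions made for illustration. Lowering goes through std::ldexp, which scales the integer significand by a power of two.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

namespace sketch {

enum category { cat_normal, cat_zero, cat_inf, cat_nan };

// Hypothetical carrier: Sig is the 23-bit fraction, Exp the biased exponent.
template<unsigned Sig, unsigned Exp, unsigned Sign,
         category Cat = (Exp == 0u ? cat_zero : cat_normal)>
struct value {
    static float lower() {
        const float s = Sign ? -1.0f : 1.0f;
        switch (Cat) {
            case cat_zero: return s * 0.0f;  // keeps the sign of zero
            case cat_inf:  return s * std::numeric_limits<float>::infinity();
            case cat_nan:  return std::numeric_limits<float>::quiet_NaN();
            default:       break;
        }
        // Restore the hidden bit, then scale: (2^23 + Sig) * 2^(Exp - 127 - 23).
        return s * std::ldexp(static_cast<float>(0x800000u | Sig),
                              static_cast<int>(Exp) - 150);
    }
};

} // namespace sketch
```

Under these assumptions sketch::value<0, 127, 0>::lower() yields 1.0f and sketch::value<0, 0, 1>::lower() yields -0.0f.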
 
MF_FLOAT(integral, fraction)
  integral  signed integral constant
  fraction  unsigned integral constant; leading zeroes are significant (MF_FLOAT(0,01) stands for 0.01)

The first method is exact and allows easy access to special values by way of the 'cat' parameter; the second cannot be exact under all circumstances, but then it shouldn't stray too far from what you'd get out of std::atof.
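
Why leading zeroes can matter at all is easiest to see with a run-time stand-in. The sketch below is not how MF_FLOAT works (that macro produces a type at compile time); SKETCH_FLOAT and fraction_from_text are hypothetical names, and the trick of stringizing the fraction token is an assumption chosen because it preserves the digits as written, so that 01 is neither read as octal nor collapsed to 1.

```cpp
#include <cassert>

// Hypothetical helper: reads the digits written after the decimal point.
// "01" gives 1/100, "1" gives 1/10, "25" gives 25/100.
inline double fraction_from_text(const char* text) {
    double digits = 0.0, scale = 1.0;
    for (; *text >= '0' && *text <= '9'; ++text) {
        digits = digits * 10.0 + (*text - '0');
        scale *= 10.0;
    }
    return digits / scale;
}

// Hypothetical run-time counterpart of an (integral, fraction) pair.
// Limitation of this sketch: a negative value with a zero integral part
// cannot be written, since -0 is just the integer 0.
#define SKETCH_FLOAT(integral, fraction)                      \
    (static_cast<float>((integral) +                          \
        ((integral) < 0 ? -1.0 : 1.0) * fraction_from_text(#fraction)))
```

For instance SKETCH_FLOAT(0,25) is 0.25f while SKETCH_FLOAT(0,025) is close to 0.025f, which is the distinction the fraction argument has to carry.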

Operations

Binary
add, sub, mul, div
eq, gt, lt, gte, lte  usual comparison predicates.
copysign  copysign(x, y) returns x*sign(y), even in the presence of NaNs.
 
Unary
negate, abs
sqrt
trunc  succinctly, trunc(x) = float(int(x)).
floor  returns the largest integral value not greater than x.
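
The sign-only operations are the simplest to picture at the type level. The sketch below uses a hypothetical carrier sketch::fp and mimics the op<...>::type convention seen in the examples; it is an illustration under those assumptions, not the package's code. Note that same_bits compares representations and is therefore not the eq predicate (eq treats +0 and -0 as equal).

```cpp
#include <cassert>

namespace sketch {

// Hypothetical carrier: fraction, biased exponent, sign bit.
template<unsigned Sig, unsigned Exp, unsigned Sign>
struct fp { static const unsigned sig = Sig, exp = Exp, sign = Sign; };

// Each operation only rewrites the sign parameter.
template<typename X>
struct negate { typedef fp<X::sig, X::exp, X::sign ^ 1u> type; };

template<typename X>
struct abs { typedef fp<X::sig, X::exp, 0u> type; };

// Takes the magnitude of X and the sign of Y, whatever X and Y encode.
template<typename X, typename Y>
struct copysign { typedef fp<X::sig, X::exp, Y::sign> type; };

// Bitwise comparison of two encodings (not an IEEE-754 equality).
template<typename X, typename Y>
struct same_bits {
    static const bool value =
        X::sig == Y::sig && X::exp == Y::exp && X::sign == Y::sign;
};

} // namespace sketch
```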

After each sequence point floating points are categorized and their category enforced.
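
What categorizing and enforcing could look like is sketched below with C++11 constexpr functions rather than templates. Everything here is an assumption for illustration: the names, the rule that any value with a zero exponent is flushed to a signed zero (in line with "no subnormals"), and the choice of 0xFFC00000 as the single NaN pattern, which is the indefinite value SSE produces for invalid operations.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

enum category { cat_normal, cat_zero, cat_inf, cat_nan };

// Hypothetical field triple: 23-bit fraction, biased exponent, sign bit.
struct fields { std::uint32_t sig, exp, sign; };

constexpr category classify(fields f) {
    return f.exp == 0u   ? cat_zero    // no subnormals in this sketch
         : f.exp != 255u ? cat_normal
         : f.sig == 0u   ? cat_inf
         :                 cat_nan;
}

// Rewrites the fields into the one canonical encoding of their category.
constexpr fields enforce(fields f) {
    return classify(f) == cat_zero ? fields{0u, 0u, f.sign}
         : classify(f) == cat_inf  ? fields{0u, 255u, f.sign}
         : classify(f) == cat_nan  ? fields{0x400000u, 255u, 1u}
         :                           f;
}

} // namespace sketch
```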

Note: a silly C++ operator coating, and other associated syntactic sugar, is also provided for tinkering. It is not meant to be exercised seriously as it inherently defeats the purpose.

Gotchas

G++
MSVC
  • inane enumeration underlying type handling requiring a non standard extension*
    see also "n2347, strongly typed enums"
  • doesn't fold ldexp, a more fragile machine representation baking sequence has to be used*
  • many many false positive warnings*
  • weird bug involving 'operator +'*
ICC
  • crash*
  • extremely long compilation times
* kludged

Conformance

This package tries hard to match what you'd typically get out of a compiler with SSE codegen enabled, in single precision; that's why there are no denormals, no exception flags and only one kind of NaN (the so-called indeterminate).

If the WANT_SIGNALING_NANS macro is set, any NaN will, on validation, produce a compile-time error; the error will certainly be clearer if C++0x's static_assert is available. No distinction is ever made between the signalling and quiet varieties.
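
The general shape of such a compile-time trap can be sketched as follows, assuming static_assert is available. SKETCH_TRAP_NANS, sketch::fp, sketch::is_nan and sketch::validate are hypothetical names standing in for whatever the package does internally; only the idea of failing the build when a NaN shows up is taken from the text above.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical carrier: fraction, biased exponent, sign bit.
template<unsigned Sig, unsigned Exp, unsigned Sign>
struct fp { static const unsigned sig = Sig, exp = Exp, sign = Sign; };

// All-ones exponent with a non-zero fraction encodes a NaN.
template<typename X>
struct is_nan { static const bool value = X::exp == 255u && X::sig != 0u; };

// Passes X through; with the macro defined, a NaN stops the compilation
// with a readable message instead of propagating silently.
template<typename X>
struct validate {
#ifdef SKETCH_TRAP_NANS
    static_assert(!is_nan<X>::value, "a NaN was produced at compile time");
#endif
    typedef X type;
};

} // namespace sketch
```

Compiling with -DSKETCH_TRAP_NANS turns any sketch::validate instantiated on a NaN encoding into a hard error.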

Even if, as it is, floating points are wired to single precision (23-bit significand, 8-bit biased exponent, 1-bit sign), it would be easy to re-target for a narrower variant like fp16 (famous last words). On the other hand, it would be much more involved for wider variants, because intermediate results need twice as many bits as the representation while compilers natively handle at most 64-bit integers.
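
To make the width argument concrete: the product of two 24-bit significands (hidden bit included) has up to 48 bits, which fits a 64-bit integer, whereas two 53-bit double precision significands would need 106. The sketch below multiplies and rounds to nearest even at run time; the names are hypothetical and it is a plain illustration of the arithmetic, not the package's mul.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

struct product {
    std::uint32_t sig;  // 24-bit significand, hidden bit set
    int exp_adjust;     // to be added to the sum of the operand exponents
};

// Precondition: 2^23 <= a, b < 2^24 (normal values, hidden bit included).
inline product mul_significands(std::uint32_t a, std::uint32_t b) {
    const std::uint64_t wide = static_cast<std::uint64_t>(a) * b; // [2^46, 2^48)
    const int shift = (wide >> 47) ? 24 : 23;  // bits dropped to get back to 24
    int exp_adjust = shift - 23;
    std::uint64_t kept = wide >> shift;
    const std::uint64_t rest = wide & ((std::uint64_t(1) << shift) - 1);
    const std::uint64_t half = std::uint64_t(1) << (shift - 1);
    // Round to nearest, ties to even.
    if (rest > half || (rest == half && (kept & 1u))) ++kept;
    // Rounding may carry out of the 24 bits: renormalize.
    if (kept >> 24) { kept >>= 1; ++exp_adjust; }
    return product{ static_cast<std::uint32_t>(kept), exp_adjust };
}

} // namespace sketch
```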

The operations add, sub, mul, div and sqrt have been checked against the augmented UCB corpus, as found in IeeeCC754 (NaNs remapped to indeterminate, denormals and unavailable rounding modes filtered out), and show no failure in 2213 tests.

Other operations haven't been so thoroughly tested because of a blatant lack of motivation, and it didn't make much sense to begin with for such a hack as the MF_FLOAT macro.

Examples

//beware of collisions when injecting: float_t (tr1/cmath) and trunc (cmath)
using namespace ::metafloat;
using namespace ops;
using ::metafloat::float_t;

typedef float_t<0, 0, 0> zero_p; // +0.f
typedef float_t<0, 0, 1> zero_n; // -0.f

if (!eq<zero_p, zero_n>::value) abort(); // +0.f == -0.f
if (eq<div<zero_p, zero_n>::type, div<zero_p, zero_n>::type>::value) 
	abort(); // x != x iff x == NaN
if (!eq<mul<MF_FLOAT(25,0), MF_FLOAT(0,01)>::type, MF_FLOAT(0,25)>::value)
	abort(); // post rounding carry propagation

typedef MF_FLOAT(1,0) one;         // 1.f
typedef MF_FLOAT(1,1) one_dot_one; // 1.1f
if (!eq< add< sub<one_dot_one, ops::trunc<one_dot_one>::type >::type, one >::type, one_dot_one >::value)
	abort();

/*
 *
 * insert non stupid examples herein.
 *
 */
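
For comparison, the same properties can be checked at run time with native floats. This sketch does not use the package at all; the helper names are made up, and it assumes IEEE-754 single precision arithmetic without excess precision or fast-math style options.

```cpp
#include <cassert>
#include <cmath>

// Plain functions keep the operations explicit.
inline float divide(float a, float b)   { return a / b; }
inline float multiply(float a, float b) { return a * b; }

// x != x holds only when x is a NaN.
inline bool differs_from_itself(float x) { return x != x; }

// (x - trunc(x)) + 1, the expression used in the last check above.
inline float frac_plus_one(float x) {
    const float frac = x - std::trunc(x);
    return frac + 1.0f;
}
```

Under those assumptions 0.0f == -0.0f holds, differs_from_itself(divide(0.0f, -0.0f)) is true, multiply(25.0f, 0.01f) == 0.25f holds after rounding, and frac_plus_one(1.1f) gives back 1.1f exactly.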
				    

Extro
Not intended for use underwater.

There is little doubt the whole exercise is rather disgusting, dubious and in fact quite wrong. But it has some practical use, at least for me, and to be fair I wasn't even the first misguided fool to give it a try: for that, credit goes to Edward Rosten.

As with any initial release, it's a bit rough, incomplete and inconsistent: it lacks a few operations mandated or recommended by IEEE-754, notably rem and scalb, and I'm not satisfied with the current mechanism for 'instantiation' (of values).

So feel free to throw any and all your suggestions, insults and patches my way.
