I recently learnt quite a few things about floating point math, or, to be precise, about IEEE 754.

I thought I knew most things about floats. Then I read way too many bad sources (and some good ones as well!) and got quite confused. To get to the bottom of it, I did what I always do: try it out myself. Debug it!!

This is in the specific context of game development. So: fast, and perhaps not extremely precise. It is also x86/x64 specific, and I only know the option names (and that they exist) in VS2015.

Short version:
  • Make sure you have the SSE2 option turned on.
  • Floating Point Model:
    • For release: Fast (/fp:fast)
    • For debug: Strict (/fp:strict)
      For debug, also turn on floating point exceptions (/fp:except)
  • Paste this code into your init code somewhere:
_controlfp_s(0, _RC_CHOP, _MCW_RC);
_controlfp_s(0, _DN_FLUSH, _MCW_DN);

// In debug code, to find & fix broken data/math.

Longer version:
So, a float (I’ll only talk about floats this time; doubles are bigger, so it takes longer until they “break”, etc.) is made up of:

  • 1 bit Sign
  • 8 bits Exponent
  • 23 bits Mantissa

Go here and play with it http://www.h-schmidt.net/FloatConverter/IEEE754.html !

Floating point numbers work like this (for normalized numbers):
sign * 2^exponent * mantissa

  • sign is used here as -1 or 1 (sign bit set means -1).
  • the stored exponent value first has 127 subtracted from it (to cover the whole range from tiny to huge).
  • the mantissa has an invisible 1. in front of it.
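
To make the formula concrete, here is a minimal sketch that pulls the three fields out of a float and rebuilds the value by hand. The names `Fields`, `split` and `rebuild` are made up for this example, and it assumes `float` is a 32-bit IEEE 754 type (which it is on x86/x64).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper type for this example: the three raw bit fields.
struct Fields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, still biased by 127
    std::uint32_t mantissa;  // 23 bits, without the invisible leading 1
};

// Copy the bits out with memcpy and slice them up with shifts and masks.
Fields split(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}

// sign * 2^(exponent - 127) * 1.mantissa, valid for normalized numbers only.
float rebuild(const Fields& p) {
    const float sign = p.sign ? -1.0f : 1.0f;
    // 8388608 is 2^23; dividing by it turns the mantissa bits into a fraction.
    const float significand = 1.0f + static_cast<float>(p.mantissa) / 8388608.0f;
    return sign * std::ldexp(significand, static_cast<int>(p.exponent) - 127);
}
```

For example, -6.5f is -1.625 * 2^2, so `split(-6.5f)` gives sign 1, exponent 129 and mantissa 0x500000, and `rebuild` turns that back into -6.5f.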

Special cases:

  • exponent being 0xff.
    • mantissa being zero. This means it’s a positive or negative infinity.
    • mantissa being not zero. This is a NaN.
  • exponent being zero.
    • mantissa being zero: you have the number 0.0f. (Can also be -0.0f; the two have different bit patterns, even though == treats them as equal.)
    • mantissa being not zero. Denormalized number.
      In magnitude these go from 1.1754942E-38 down to 1.4E-45 (positive or negative).
      Main thing about them though: they have fewer significant bits than normal floats, so they mess up your floating point math, and they are very slow.

And the “cure” for the issues with inf, NaN and denormals:
Set the Floating Point Model to:

  • For release: Fast (/fp:fast). This might round some things the wrong way, but in my short little test it was almost 2x faster.
  • For debug: Strict (/fp:strict)
    For debug, also turn on floating point exceptions (/fp:except)

Tell your compiler to use SSE2 (the following control flags need this; it is implicit for x64).
Then add this code:

// Round toward zero, so an overflow clamps to the largest finite float instead of becoming infinite. _RC_DOWN only does that for positive overflow.
_controlfp_s(0, _RC_CHOP, _MCW_RC); 
// So you don't have to deal with denormals.
_controlfp_s(0, _DN_FLUSH, _MCW_DN);
#if defined(_DEBUG)
    // Catch any float exceptions. And fix the math/data.
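
The effect of the rounding mode on overflow can be checked without the MSVC-specific calls. The sketch below swaps in the standard `<cfenv>` functions (`std::fesetround`, `std::fegetround`) instead of `_controlfp_s`; `mul_rounded` is a made-up helper name. Note this is only about overflow: a division by zero still gives infinity whatever the rounding mode is.

```cpp
#include <cfenv>
#include <cfloat>

// Hypothetical helper: multiply a * b under the given <cfenv> rounding mode,
// then restore the previous mode. The volatile variables keep the compiler
// from doing the multiplication at compile time.
float mul_rounded(float a, float b, int mode) {
    const int previous = std::fegetround();
    std::fesetround(mode);
    volatile float va = a;
    volatile float vb = b;
    volatile float result = va * vb;
    std::fesetround(previous);
    return result;
}
```

With this, `mul_rounded(FLT_MAX, 2.0f, FE_TONEAREST)` overflows to infinity, while the same call with `FE_TOWARDZERO` stays at `FLT_MAX`. With `FE_DOWNWARD` the positive case stays finite, but `-FLT_MAX * 2.0f` still becomes negative infinity.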

One thing you can do with floats: compare them to each other as if they were integers. Mostly.
This might help, if nothing else as a way to play with floats:

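// Bit-field layout and reading a union member other than the one last written
// are compiler-specific; this relies on what MSVC does on x86/x64.
struct Float {
	struct FloatParts {
		unsigned int mantissa : 23;
		unsigned int exponent :  8;
		unsigned int sign     :  1;
	};
	union {
		float f;
		unsigned int i;
		FloatParts parts;
	};
};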