Short story: I did some profiling (I do a lot of profiling) and QueryPerformanceCounter showed up a lot more than I felt it should. So, after some reading up and testing, I am now using rdtsc/rdtscp.

Longer story: A long time ago, before computers had more than one core, the fastest way to time things (if the CPU supported it) was the rdtsc instruction. The resolution you got was a lot finer than any other kind of timing instruction available at the time, partly because it wasn’t really a timing instruction; it returned the number of clock cycles since the CPU was reset.

The problems:

  • If a CPU changes speed, the number of cycles per second is not constant, which makes it hard to use for timing.

  • Multi-core. One core’s rdtsc instruction might give a whole different value than another’s. Our thread might switch core, a lot. The cure for this was setting thread affinity (basically binding the thread to one specific core), which made the timing work, but was bad for everything else.

The way to handle timing in a “stable” manner was to just use QueryPerformanceCounter instead, and let Microsoft deal with whatever solution was doable/working/the best.

My problem is that QueryPerformanceCounter is not the optimal solution for “newish” computers. (But it does work on all Windows computers, which is awesome.) You do get less precise values though, and it takes longer to get them.

A while back Intel (AMD?) seems to have decided that it would be nice if rdtsc was usable in a multi-core, variable-frequency world, so Invariant TSC was born. You can ask your CPU if it supports it by executing the cpuid instruction with 0x80000007 in eax; if bit 8 of edx is set, your CPU has Invariant TSC, which means rdtsc can be used as a stable source for timing. They also introduced the rdtscp instruction, which does a little bit more synchronization than rdtsc, but for “normal” timing in a game, rdtsc seems to be enough. (Basically, rdtsc can be reordered with the instructions before it, while rdtscp waits for earlier instructions to finish first.)
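The cpuid check described above could look roughly like this. It is only a sketch: the function name has_invariant_tsc is made up for illustration, and the code assumes an x86 build using either MSVC (__cpuid from intrin.h) or GCC/Clang (__get_cpuid from cpuid.h). On anything else it just reports "no".

```c
#include <stdbool.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>   /* __cpuid on MSVC */
#elif defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>    /* __get_cpuid on GCC/Clang */
#endif

/* Hypothetical helper: true if cpuid leaf 0x80000007 reports
 * Invariant TSC (edx bit 8), false otherwise. */
static bool has_invariant_tsc(void)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4]; /* eax, ebx, ecx, edx */

    /* Leaf 0x80000000 returns the highest extended leaf in eax. */
    __cpuid(regs, (int)0x80000000u);
    if ((unsigned)regs[0] < 0x80000007u)
        return false;

    __cpuid(regs, (int)0x80000007u);
    return ((unsigned)regs[3] & (1u << 8)) != 0;
#elif defined(__i386__) || defined(__x86_64__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return false;

    return (edx & (1u << 8)) != 0;
#else
    return false; /* not an x86 build, so there is no rdtsc to ask about */
#endif
}
```

Something like `if (!has_invariant_tsc()) { /* fall back to QueryPerformanceCounter */ }` at startup is the obvious way to use it.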

On my machine (i7-4770K 3.5GHz) I did some quick tests: timing the difference between two calls to QueryPerformanceCounter / rdtsc / rdtscp, first without anything in between and then with a Sleep(1) in between.

With Sleep(1):

  • QueryPerformanceCounter returned a diff of: ~3418

  • rdtsc(p) returned a diff of: ~3500000

Without Sleep(1):

  • QueryPerformanceCounter returned a diff of: 0

  • rdtscp returned a diff of: ~24

  • rdtsc returned a diff of: ~18

So, QueryPerformanceCounter returns basically the value rdtsc returns. Divided by 1000. And with more overhead.
I know which one I’d pick.
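For reference, the back-to-back measurement can be sketched like this. It is not the exact test code behind the numbers above: the function names are made up, it takes the smallest gap over a number of samples (my choice here, to filter out interrupts and core switches), and it assumes an x86 build with the __rdtsc / __rdtscp intrinsics from MSVC or GCC/Clang. The tick-to-millisecond helper also assumes you already know the TSC rate, for example the nominal 3.5GHz of the machine above.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>      /* __rdtsc, __rdtscp on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc, __rdtscp on GCC/Clang */
#endif

/* Smallest gap (in TSC ticks) between two back-to-back rdtsc reads.
 * Returns UINT64_MAX if samples <= 0. */
static uint64_t min_rdtsc_gap(int samples)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; ++i) {
        uint64_t t0 = __rdtsc();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Same thing with rdtscp; the aux value it hands back is ignored here. */
static uint64_t min_rdtscp_gap(int samples)
{
    uint64_t best = UINT64_MAX;
    unsigned aux = 0;
    for (int i = 0; i < samples; ++i) {
        uint64_t t0 = __rdtscp(&aux);
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    (void)aux;
    return best;
}

/* Convert a tick count to milliseconds, given an assumed TSC rate in Hz. */
static double ticks_to_ms(uint64_t ticks, double tsc_hz)
{
    return (double)ticks * 1000.0 / tsc_hz;
}
```

With the numbers above, ticks_to_ms(3500000, 3.5e9) comes out at about 1 ms, which matches the Sleep(1) case.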

Bonus: if you are using Linux, rdtscp apparently returns the core id as well. Quite handy. ^_^
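A rough sketch of how that extra value could be picked apart. The big assumption here is the layout: as far as I know, Linux puts the cpu number in the low 12 bits of the value and the NUMA node in the bits above, but that is a kernel convention, not something the instruction itself promises. The type and function names are made up for illustration, and the intrinsic assumes an x86 build with MSVC or GCC/Clang.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>      /* __rdtscp on MSVC */
#else
#include <x86intrin.h>   /* __rdtscp on GCC/Clang */
#endif

/* Hypothetical holder for the ids packed into the rdtscp aux value. */
typedef struct {
    unsigned cpu;
    unsigned node;
} tsc_aux_ids;

/* Assumed Linux layout: low 12 bits = cpu number, bits above = NUMA node. */
static tsc_aux_ids decode_linux_tsc_aux(uint32_t aux)
{
    tsc_aux_ids ids;
    ids.cpu = aux & 0xfffu;
    ids.node = aux >> 12;
    return ids;
}

/* Read the timestamp and, in the same instruction, the ids of the core
 * that executed it. */
static uint64_t read_tscp(tsc_aux_ids *ids)
{
    unsigned aux = 0;
    uint64_t ticks = __rdtscp(&aux);
    *ids = decode_linux_tsc_aux(aux);
    return ticks;
}
```

The nice part is that the timestamp and the core id come from the same instruction, so you can tell whether two readings were taken on the same core before trusting their difference.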

UPDATE 2016.03.28

1) rdtscp only returns the core id on Linux machines.

2) rdtscp doesn’t work properly on AMD machines. EVEN if cpuid says it supports TCM. :(

(at least on Windows..) QueryPerformanceCounter it is.