Thursday, November 29, 2007

One Thread Per Core

Way back in 1990, when I was performance tuning Ingres on a multi-processor Sequent machine, I found that using processor affinity - ie tieing the main process to a particular CPU gave the best performance. Intuitively I figured that this was due to not having to context switch but never had the time to prove my theory.

More recently, numa issues have been surfacing again in the search for low latency.

A friend of mine has recenly been looking at k-way sorting and the thread scheduling again has come into play. I have always been curious about the analysis of Numa architecture but never got to spend that much time on it. In the old days of single core, context would have to be shared over the bus, Numa was a great advance, introducing a high performance inter-cpu/die channel with about 6GB/s throughput. Now with multicore it's all on the die but appears to be still an issue as the following results show from a friend show:


The k-way merge sort looks pretty good, almost twice as fast as a mono-threaded quicksort. Here some results for 10 millions numeric (double) values to sort :

sort 1st part done (parallel segments) in 3672ms
sort 2nd part done in 3157ms niter=19999993
Total time for k-way merge sorting 10000000 double values=6922 ms
Total time for monoquicksorting 10000000 double values=13609 ms


Now if you look at the performance variation when sorting 2 millions items, it appears that the best performance occurs when the number of parallel sorting threads equals the number of core available from the CPU (my box is a mono quad-core Xeon 5355 2.6Ghz) :

2 threads used:
sort 1st part done (parallel segments) in 1031ms
sort 2nd part done in 469ms niter=1999999
Total time for k-way merge sorting 2000000 double values=1562 ms
Total time for monoquicksorting 2000000 double values=1984 ms

3 threads used :
sort 1st part done (parallel segments) in 688ms
sort 2nd part done in 563ms niter=3333329
Total time for k-way merge sorting 2000000 double values=1281 ms
Total time for monoquicksorting 2000000 double values=2015 ms

4 threads used :
sort 1st part done (parallel segments) in 516ms
sort 2nd part done in 594ms niter=3999995
Total time for k-way merge sorting 2000000 double values=1125 ms
Total time for monoquicksorting 2000000 double values=2000 ms


8 threads used :
sort 1st part done (parallel segments) in 437ms
sort 2nd part done in 781ms niter=5999991
Total time for k-way merge sorting 2000000 double values=1234 ms
Total time for monoquicksorting 2000000 double values=2047 ms

16 threads used :
sort 1st part done (parallel segments) in 391ms
sort 2nd part done in 906ms niter=7999975
Total time for k-way merge sorting 2000000 double values=1313 ms
Total time for monoquicksorting 2000000 double values=2031 ms

2 comments:

Jordan Zimmerman said...

FYI - I've published my k-way implementation at: http://sourceforge.net/projects/kwaymerge

Bob said...

I'd guess the processor affinity for threads works well because it improves the cache hit rates over the runtime of the sort. Intel has tools (vtune?) that might be able to quantify this or disprove my theory.

With a run time in excess of a second, there will doubtlessly be some context switches to run OS timers and interrupts. The key question is which CPU picks up your thread after the inevitable context switch. If it's the last one you ran on, there's a reasonable chance that some of your working set may still be left in that CPU's cache. If not, it'll surely have to be reloaded while the CPU is stalled. Today's CPUs are so much faster than memory that cache misses can have a big impact on benchmark results. CPU affinity for threads should help to keep that miss rate down and performance up.