Benchmarking recent improvements in parallelism

Over the last few months we’ve been making various improvements to the performance of parallel programs with GHC. I thought I’d post a few benchmarks so you can see where we’ve got to. This is a fairly random collection of 6 parallel benchmarks (all using par/seq-style parallelism rather than explicit threading with forkIO). The main point here is that I took the programs unchanged. I haven’t made any attempt to modify the programs themselves to make them parallelize better (although other people might have done so in the past); the focus here has been on changing the GHC runtime to optimize these existing programs. The programs come mostly from old benchmarks for the GUM implementation of Parallel Haskell, and the sources can be found here.

  • matmult: matrix multiply
  • parfib: our old friend fibonacci, in parallel
  • partree: some operations on a tree in parallel
  • prsa: decode an RSA-encoded message in parallel
  • ray: a ray-tracer
  • sumeuler:  sum . map euler
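
To make the par/seq style concrete, here is a minimal sketch of what a sumeuler-like program might look like. It is not the benchmark's actual source: the chunking scheme, the names chunksOf, sumChunk and parSum, and the naive totient are all assumptions made for illustration. par and pseq are imported from GHC.Conc so that the sketch needs only base; Control.Parallel in the parallel package exports the same two functions.

```haskell
import GHC.Conc (par, pseq)

-- Euler's totient, computed naively: how many k in [1..n] are coprime to n.
euler :: Int -> Int
euler n = length [k | k <- [1 .. n], gcd n k == 1]

-- Split a list into chunks of at most n elements (hypothetical helper;
-- a non-positive n is treated as 1 so the recursion always terminates).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = h : chunksOf n t
  where
    (h, t) = splitAt (max 1 n) xs

-- The sequential kernel: sum . map euler over one chunk.
sumChunk :: [Int] -> Int
sumChunk = sum . map euler

-- Create a spark for the first chunk, evaluate the rest, then combine.
parSum :: [[Int]] -> Int
parSum [] = 0
parSum (c : cs) = s `par` (rest `pseq` (s + rest))
  where
    s = sumChunk c
    rest = parSum cs

-- Sum of totients over [1..n], one spark per chunk.
sumEuler :: Int -> Int -> Int
sumEuler chunkSize n = parSum (chunksOf chunkSize [1 .. n])

main :: IO ()
main = print (sumEuler 100 2000)
```

No threads are created explicitly here: par only hints that a value could be evaluated in parallel, and the runtime decides whether and where to run each spark. The chunk size sets how much work each spark carries.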

Here are the results. The first column of numbers is the time taken for GHC 6.10.1 to run the programs on one CPU, and the following three columns are the change in elapsed time relative to that figure when the programs are run on 4 CPUs (actually 4 cores of my 8-core x86_64 box) with, respectively, GHC 6.8.3, 6.10.1, and my current working version (HEAD + a couple of patches).

   Program   6.10.1   6.8.3 -N4   6.10.1 -N4   ghc-simonmar -N4
   matmult     8.55      -60.0%       -63.7%             -72.0%
    parfib     9.65      -72.6%       -70.2%             -76.3%
   partree     8.03      +26.4%       +52.7%             -40.7%
      prsa     9.52      +13.8%       -44.1%             -68.2%
       ray     7.04      +16.5%       +11.8%             +28.0%
  sumeuler     9.64      -71.2%       -73.1%             -74.0%
   -1 s.d.    -----      -68.8%       -71.6%             -78.0%
   +1 s.d.    -----      +20.1%        +6.4%             -27.1%
   Average    -----      -38.8%       -45.0%             -59.9%

The target is -75%: that’s a speedup of 4 on 4 cores. As you can see, 6.10.1 is already doing better than 6.8.3 on average, but the current version has made some dramatic improvements and is getting close to the ideal speedup on several of the programs. Something odd is going on with ray, which still runs slower on 4 cores than on one; I don’t know what yet!
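
For readers who want to turn the percentages into speedup factors, the arithmetic can be sketched as below. The function names are mine, and reading the Average row as a geometric mean of the elapsed-time ratios is an assumption: it reproduces the printed averages to within rounding, but the post does not say how they were computed.

```haskell
-- Convert a percentage change in elapsed time into a speedup factor:
-- -75% means the run takes a quarter of the time, i.e. a speedup of 4.
speedup :: Double -> Double
speedup pct = 1 / (1 + pct / 100)

-- Geometric mean of the elapsed-time ratios, reported as a percentage change.
-- (Assumed method; see the note above.)
geoMeanChange :: [Double] -> Double
geoMeanChange pcts = (exp (sum (map log ratios) / count) - 1) * 100
  where
    ratios = [1 + p / 100 | p <- pcts]
    count = fromIntegral (length ratios)

-- The three -N4 columns, copied from the table in program order.
ghc683, ghc6101, ghcSimonmar :: [Double]
ghc683      = [-60.0, -72.6,  26.4,  13.8, 16.5, -71.2]
ghc6101     = [-63.7, -70.2,  52.7, -44.1, 11.8, -73.1]
ghcSimonmar = [-72.0, -76.3, -40.7, -68.2, 28.0, -74.0]

main :: IO ()
main = do
  print (speedup (-75))  -- prints 4.0
  mapM_ (print . geoMeanChange) [ghc683, ghc6101, ghcSimonmar]
```

A positive percentage gives a factor below 1, meaning the 4-core run is slower than the one-CPU baseline.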

Here’s a summary of the improvements we made:

  • Lock-free work-stealing queues for load-balancing of sparks (par). This work was originally done by Jost Berthold during his internship at MSR in the summer of 2008, and after further improvements was merged into the GHC mainline after the 6.10.1 release.
  • Improvements to parallel GC: we now use the same threads for GC as for executing the program, and have improved both the barrier (stopping threads to do GC) and affinity (making sure each GC thread traverses data local to its CPU). Some of this has yet to hit the mainline, but it will shortly.
  • Eager blackholing: this reduces the chance that multiple threads repeat the same work in a parallel program. It’s a compile-time option (-feager-blackholing in the HEAD) and it costs a little execution time to turn it on, but it can improve parallelism quite a lot.
  • Running sparks in batches. Previously, each time we ran a spark we created a new thread for it. Threads are lightweight, but the cost can still be high relative to the size of the spark. So now we have Haskell threads that repeatedly run sparks (stealing from other CPUs if necessary) until there are no more sparks to run, eliminating the context-switch and thread-creation overhead for sparks. This means we can push the granularity much finer: parfib speeds up even with a very low threshold now.
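
As an illustration of the granularity point in the last bullet, a thresholded parfib can be sketched as follows. This is an assumed shape rather than the benchmark's actual source, and the threshold and argument in main are arbitrary. par and pseq come from GHC.Conc here to stay within base; Control.Parallel exports the same functions.

```haskell
import GHC.Conc (par, pseq)

-- Plain sequential Fibonacci, used once the problem is small enough.
fib :: Int -> Integer
fib n
  | n < 2 = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Parallel Fibonacci: above the threshold, spark one branch and evaluate
-- the other in the current thread; at or below it, fall back to fib.
-- The n < 2 guard keeps the recursion finite even for a tiny threshold.
parfib :: Int -> Int -> Integer
parfib threshold n
  | n <= threshold || n < 2 = fib n
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = parfib threshold (n - 1)
    y = parfib threshold (n - 2)

main :: IO ()
main = print (parfib 15 30)  -- prints 832040
```

Built with something like ghc -O2 -threaded -feager-blackholing and run with +RTS -N4 (newer GHC releases may also need -rtsopts at link time), a lower threshold creates more, smaller sparks, which is the regime where per-spark thread creation would hurt most.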

We’re on the lookout for more parallel benchmarks: each new program we find tends to stress the runtime in a different way, so the more code we have, the better.  Even if (or especially if) your program doesn’t go faster on a multicore – send it to us and we’ll look into it.


2 Responses to Benchmarking recent improvements in parallelism

  1. jonathanturner says:

    You say “actually 4 cores of my 8-core x86_64 box”. I’d love to see the speed ups there and how ghc’s improvements scale onto all 8 cores as well.

  2. simonmar says:

    Yes – this box tends to be busy with other people doing things so I get more reliable results when only using 4 cores, but I’ll try to find a quiet time to do the 8-core measurements.
