Yielding more improvements in parallel performance

GHC’s parallel GC makes heavy use of hand-written spinlocks. These are basically mutexes like those provided by the Unix pthreads API or equivalently Windows CrticicalSections, except that they have no support for blocking the thread and waking up, they just spin until the lock is acquired. I did it this way for a few reasons:

  1. I sometimes need to acquire a lock on one thread and release in on another. This is not supported by traditional mutexes.
  2. We expect all threads to be running and contentions to be short.
  3. I like to know exactly what code is running.

Unfortunately when using all the cores on the machine, it is common that one or more of our threads gets descheduled, and assuption (2) no longer holds. When this happens, one or more of the threads can be spinning waiting for the descheduled thread, and no progress is made (the CPU just gets warmer) until the OS decides to reschedule it, which might be a whole time slice.

This is something we knew about (see ticket #3553), but didn’t have a good solution for, and was recently encountered in another context here. It has been called the “last core parallel slowdown” and seems to affect Linux more than other OSs, presumably due to the way the Linux scheduler works. In my experience the effect is far more dramatic when using the 8th core of an 8-core box than when using the second core of a dual-core.

This problem was present in GHC 6.10, but is exacerbated in GHC 6.12 because we now do minor GCs in parallel, which can mean hundreds of all-core synchronisations per second. The reason we do minor GCs in parallel is for locality; the results in our ICFP’09 paper clearly show the performance benefits here.

Using traditional mutexes instead of our hand-rolled spinlocks would help, but it’s not as simple as just swapping out our spinlocks for mutexes: as noted above, sometimes we acquire one of these locks on
one thread and release it on another, and pthreads doesn’t allow that. It might be possible to restructure things such that this doesn’t happen, but I haven’t found a good way yet. An alternative is to use condition variables, but this lead to a severe reduction in performance when I tried it (see ticket #3553).

So as an experiment I tried adding an occasional call to ‘yield’ (sched_yield on Unix, SwitchToThread on Windows) inside the code that acquires a spinlock, with a tunable number of spins between each call to yield (I’m using 1000). I also added a ‘yield’ in the GC’s wait loop, so that threads with no work to do during parallel GC will repeatedly yield between searching for work.

To my surprise, this change helped not only the “last core” case, but also parallel performance across the board. I imagine one reason for this is that the yields help reduce contention on the memory bus,
particularly in the parallel GC.

Here are the results, measureing the benchmark programs used in our ICFP’09 paper. First, using all 8 cores of an 8-core, running 64-bit programs on Fedora 9, comparing GHC before and after the patch:

        Program   Elapsed
           gray    -40.3%
         mandel    -41.4%
        matmult    -12.7%
         parfib     -3.3%
        partree     -4.9%
           prsa    -14.6%
            ray    -10.9%
       sumeuler     -1.8%
 Geometric Mean    -17.7%

Now, using only 7 cores of the 8-core:

        Program   Elapsed
           gray    -15.6%
         mandel    -18.8%
        matmult     -5.5%
         parfib     +2.9%
        partree     -5.0%
           prsa     -9.3%
            ray     +1.7%
       sumeuler     -0.9%
 Geometric Mean     -6.6%

we even see a benefit when not using all the cores.

Here’s the difference between 7 and 8 cores, both with the new patch:

        Program   Elapsed
           gray    +39.1%
         mandel     -8.0%
        matmult     -4.1%
         parfib    -10.0%
        partree     -1.3%
           prsa    -15.3%
            ray    +37.3%
       sumeuler    -11.4%
 Geometric Mean     +1.5%

So the “last core” problem affects only two of our benchmarks (ray and gray), whereas the others all now improve when adding the last core.

This nicely illustrates the problem of extrapolating from a single benchmark – programs vary greatly in their behaviour, in my experience it’s impossible to find a single program that is “representative”. Using larger programs usually doesn’t help: often large programs tend to have small inner-loops. This group of 8 programs is quite meager, and expanding it is something I’d like to do (send me your parallel programs!).

Here’s the comparison on a dual-core, using 32-bit programs on Ubuntu Karmic, comparing GHC before and after the patch:

        Program  Elapsed
           gray   -17.2%
         mandel   -13.4%
        matmult    -6.7%
         parfib    +0.4%
        partree    -1.5%
           prsa    -1.0%
            ray    +1.6%
       sumeuler    -8.7%
 Geometric Mean    -6.0%

And since this is such a trivial patch, we can merge it into 6.12.2, which should hopefully be in the next Haskell Platform release.

This is really an interim solution, since the real solution is not to do “stop-the-world” GC at all, at least for minor collections. That’s something we’re also working on (hopefully for GHC 6.14), but in the meantime this patch gives some nice improvements for the 6.12 line, and shows that you can actually push a stop-the-world design quite a long way.

About these ads
This entry was posted in Uncategorized. Bookmark the permalink.