New paper: Runtime Support for Multicore Haskell

Here’s a paper on the internals of GHC’s parallelism support, showing some nice improvements in parallel performance over GHC 6.10.1: “Runtime Support for Multicore Haskell”. Abstract:

Purely functional programs should run well on parallel hardware because of the absence of side effects, but it has proved hard to realise this potential in practice. Plenty of papers describe promising ideas, but vastly fewer describe real implementations with good wall-clock performance. We describe just such an implementation, and quantitatively explore some of the complex design tradeoffs that make such implementations hard to build. Our measurements are necessarily detailed and specific, but they are reproducible, and we believe that they offer some general insights.

18 Responses to New paper: Runtime Support for Multicore Haskell

  1. goalieca says:

    Every foreign call is assumed to be blocking, so the HEC is given up each and every time. A costly assumption. Wouldn’t it be nice to have a “pure” annotation to avoid this?

  2. simonmar says:

    The “unsafe” annotation on foreign declarations avoids this overhead. An “unsafe” foreign call may therefore prevent other threads on the same CPU from running until it completes. See our paper on the FFI and concurrency for more details: http://www.haskell.org/~simonmar/papers/conc-ffi.pdf
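
    For illustration only (this sketch isn’t from the paper, and the C function names are made up), the two flavours look like this:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.Types (CInt)

    -- "unsafe": the HEC is not given up, so the call must be cheap and
    -- must not block or call back into Haskell.
    foreign import ccall unsafe "cheap_hash"
        c_cheapHash :: CInt -> CInt

    -- "safe" (the default): the HEC is released for the duration, so other
    -- Haskell threads keep running even if the C call blocks.
    foreign import ccall safe "blocking_read"
        c_blockingRead :: CInt -> IO CInt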

  3. Pingback: Playing with GHC’s parallel runtime « Control.Monad.Writer

  4. Pingback: Improved Means For Achieving Deteriorated Ends / Trying to Stay Ahead

  5. The psum example on page 2 does not typecheck, does it? sum xs is Int, but parList rwhnf is Strategy [a].
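
     For concreteness, one way the example could be repaired (just a guess at the intended code, using the old Control.Parallel.Strategies API) is to apply the strategy to the list before summing:

     psum :: [Int] -> Int
     psum xs = sum (xs `using` parList rwhnf)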

  6. simonmar says:

    Good point, I’ll fix the example. Thanks.

  7. Dave Bayer says:

    Great paper! Which list (e.g. GHC or cafe) is best for extended discussions of these methods?

    In contrast to -N7 for 8 cores, I’m getting best results using an extra thread or two, e.g. -N3 for 2 cores. Go figure. (This is stock 6.10.1; clearly I have to migrate to the daily builds!)

    I’m writing code that uses reverse search to visit large numbers of objects, e.g. 66,960,965,307 atomic lattices on 6 atoms in 14 hours. My inner recursions are so tight that simply passing an extra parameter costs me 50%; it doesn’t surprise me that adding a few lines to parallelize this code has even more overhead. But what happens when one adds a variable amount of work to each object visited? Ideally this “payload” work scales linearly with the cores; one has already paid off the overhead of parallelism, independent of the work done. For real-world work, the factors that interfere with true “payload” linearity (e.g. GC sync stalls) should be a greater concern than reducing the initial overhead of setting up parallelism in the first place. The tyranny of benchmarks overlooks this effect.

    I crave an -RTS option to limit the number of sparks. Single-threaded reverse search uses log space; the computation exhausts space if one creates too many sparks. I’ve worked around this by unrolling loops and very selectively sparking. I’d be happy to unsafely peek at what might have once been the spark count; even if my glimpse is off by half a second this would help as a heuristic. Far more efficient (and purer!) would be to turn `par` into a no-op whenever the spark count exceeds an -RTS maximum.
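
    A user-level sketch of that idea (impure and hacky, with a made-up limit; not the RTS feature I’m asking for) might look like:

    import Control.Parallel (par)
    import Data.IORef (IORef, newIORef, atomicModifyIORef)
    import System.IO.Unsafe (unsafePerformIO)

    -- a global spark budget; 8000 is an arbitrary number, not the RTS limit
    {-# NOINLINE sparkBudget #-}
    sparkBudget :: IORef Int
    sparkBudget = unsafePerformIO (newIORef 8000)

    -- behaves like `par` while the budget lasts, then degrades to a no-op
    parLimited :: a -> b -> b
    parLimited x y = unsafePerformIO $ do
        n <- atomicModifyIORef sparkBudget (\k -> (k - 1, k))
        return (if n > 0 then x `par` y else y)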

  8. simonmar says:

    The number of sparks is currently fixed at 8k; any new sparks beyond this are discarded. If we move to the WEAK policy as per the paper, then sparks won’t change the space behaviour of the program (unless they get executed, of course).

  9. Dave Bayer says:

    I’m curious about the evidence supporting the use of bang patterns in the definition of start in parBufferWHNF. This has thrown me on every reading of this paper; I felt that their benefit should be obvious, but it wasn’t to me.

    I ran 5×4 trials, roughly 40s each, on test code with and without the use of bangs, with 6.10.1 and 6.11.20090311. Tossing out the min and max, averaging the middle three trials, I found leaving out the bang patterns was 1.2% faster in 6.10.1, and 2.1% faster in 6.11.20090311. (6.11 was only 1.7% faster than 6.10.1, but I worked hard to amortize the parallel overhead, and I greatly appreciate the progress in keeping this overhead minimized.)

    If everyone else sees the bangs as a win, then I have an odd test case, but I felt I should ask…

  10. simonmar says:

    Admittedly I didn’t measure this. I’d expect the bangs to eliminate a little overhead in `parBufferWHNF`, so perhaps what you’re seeing here is a scheduling anomaly. When we have ThreadScope fully working we’ll be able to investigate.

  11. Dave Bayer says:

    I’ve also been puzzling over parBufferWHNF itself. It would seem that the main thread goes into a black hole until the last spark is claimed, at which point several other cores are still busy on other sparks. In the worst case with uneven grain sizes, the first spark is still running, and no new sparks are created until the first spark completes? If so, cores are going to waste.

    In any case, a “checkbook” buffer makes more sense to me. For those old enough to remember, when one gets to one’s last book of checks, there’s a slip on _top_ (not after the last check) for ordering more checks.

    On the same test case that I used in my previous comment, this is a 1% speedup. As I decrease my grain size, the difference climbs to 5%. At the smallest grain size, one sees a dramatic difference in the processor loads: parBufferWHNF sputters at less than 100% on each core, but doesn’t waste the cycles it gets. My alternative maintains 100% core loads, but only comes out 5% faster; somehow, cycles are going to waste. There’s something I’m not understanding here.

    Here is my “checkbook” buffering code. I can’t find wordpress documentation on formatting code for comments, and there’s no preview, so I have no idea how this will look:

    import Control.Parallel (par, pseq)
    import Control.Parallel.Strategies (Strategy)

    -- iterate par on a list

    pars :: [a] -> b -> b
    pars [] y = y
    pars (x:xs) y = x `par` pars xs y

    -- create initial n sparks, then self-replicating blocks of n+1 sparks

    batch :: Int -> Strategy [a]
    batch n xs = ys `pars` more zs `pseq` ()
      where
        (ys, zs) = splitAt n xs
        more []  = ()
        more us  = more ws `par` vs `pars` () where (vs, ws) = splitAt n us
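
    A hypothetical use, with made-up names (assuming the old using :: a -> Strategy a -> a):

    results = map expensive inputs `using` batch 100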

  12. Simon Marlow says:

    I’m not sure I follow your analysis of parBufferWHNF, but I wouldn’t be surprised to discover it has problems. One problem I see is that only the main thread creates the sparks; another problem is that the buffer size is fixed up front.

    I’d be interested to try out your alternative when I get a chance.

  13. Dave Bayer says:

    The issue that worries me is that once the main thread has to wait for a list element, it goes into a black hole, and doesn’t come out until GC or there are no more sparks. Thus, parBufferWHNF is continually running out of sparks, and only then reawakening the main thread. My alternative generally never runs out of sparks, so the main thread only awakens at GC; perhaps its unconsumed input is hastening GC, worsening performance.

    One wants a sliding window of sparks that take responsibility both for making sure that there are always new sparks available, and making sure that the finished work gets consumed in a timely manner, all while respecting the “variable on the left of par” rule to avoid space leaks. Then it would be ok that the main thread goes into a black hole. I can see ways to write this, but not as a generic canned strategy.

    I’ve verified your “use 3 of 4 cores”, “use 7 of 8 cores” principle on a number of machines. Using all N cores is sometimes no improvement on N-1 cores, and sometimes far worse. The one exception is Mac OS X; I’m routinely getting 1.85x to 1.9x speedups on my MacBook with two cores, once I calm down the system, unplug external devices and so forth. Or is “use 2 of 2 cores” always ok?

    Can eight core Mac OS X users make effective use of all eight cores? The computers that I build myself are intended to be GHC boxes, and I’m ok if they simply freeze until my computations finish. Is there a way to jigger Linux so that it does a better job of staying out of the way of GHC? Is there a better OS?

  14. simonmar says:

    If parBufferWHNF evaluates a list element that is under evaluation by another thread, it has to wait for that thread only, and then it can continue. I’m not sure what you mean by “goes into a black hole, and doesn’t come out until GC or there are no more sparks”. Are you referring to the black holes that we use to prevent duplicate work, or something else?

    I agree that we generally don’t want the main thread tripping over the worker threads. I’ve seen some evidence of this with ThreadScope; however, it does seem to resolve itself and the program settles down into smooth parallel execution.

    One problem with having a spark that creates more sparks is that it tends to rely on either FIFO or LIFO spark ordering.

  15. Dave Bayer says:

    Yes, I’m referring to black holes to prevent duplicate work.

    My only understanding comes from reading your paper a number of times, then making experiments and seeing if my code speeds up. I just meditated on each use of the word “black” a number of times, to see if I had misread your paper earlier. I still have the impression that if at any point the main thread needs a value that has already been started by a spark, then the main thread blocks. According to your explicitly stated priority scheme in the paper, the main thread will then only unblock at GC, or if there are temporarily no more sparks.

    I’d be thrilled if I’m wrong; I will have helped you to find an ambiguity in your paper.

    Here’s what I think happens, after multiple reads of your paper:

    I’m using 3 of 4 cores. I use parBufferWHNF with a buffer size of N >> 3 to process each element of a list, with the main thread consuming the list of results. Suppose that the work per list element is identical, and uses so little memory that GC is not necessary. Then all N sparks will run in turn, and the main thread will awaken to find an idle processor that pulled it out of the black hole, with the other two processors working on sparks N-1 and N. They’ll finish with no more sparks available, and wait around for the main thread to wake up and create more sparks.

    This is completely independent of N. Now, the main thread can race through consuming at least the first N-2 list elements, and creating N-2 sparks all at once, before it “catches up with the pack” and blocks on another list element already being evaluated. Had I matched N with the number of cores I’m using, the main thread would awaken roughly once per list element, which would be less efficient. At least with N large, we amortize the cost of having other processors stall while the main thread makes more sparks.

    If you can pick apart where I’m wrong here, I’d be happy to point to the passages in the paper that gave me this impression. Perhaps now my “checkbook” analogy makes sense? Order more sparks (checks) before all the current ones run out…

    (I’m worrying over a few percent here. The big picture is I’m thrilled to see my code run in parallel with so little effort on my part!)

  16. simonmar says:

    You’re right that if the main thread blocks on a black hole, it won’t wake up again until there are no more sparks or GC happens.

    We only run into problems after a while, because the main thread is creating sparks at a rate R, but they are being consumed at a rate (N-1)*R (where N is the number of CPUs). The buffer serves to delay the inevitable. Only when the sparks run out will the main thread get blocked, because sparks should be taken from list elements towards the tail of the list in preference.
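
    (A rough illustration with made-up numbers: with N = 4 and a buffer of 100 sparks, the pool drains at a net rate of about (N-2)*R = 2R, so the main thread’s head start lasts only around 50/R; doubling the buffer merely doubles that.)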

    So I agree that we need to be able to create more sparks on demand.

    Would you mind taking this discussion to the mailing list? This is a crappy UI for discussions.

  17. Dave Bayer says:

    Which list? E.g. GHC, cafe, … (That was my very first question above; when you didn’t answer, I thought you wanted the discussion here! 😉)

  18. simonmar says:

    glasgow-haskell-users would be the best place, I think.
