New paper: Parallel Performance Tuning for Haskell

Here’s our Haskell Symposium paper about parallel profiling with GHC and ThreadScope:

Parallel Performance Tuning for Haskell (Don Jones Jr., Simon Marlow, Satnam Singh) Haskell ’09: Proceedings of the second ACM SIGPLAN symposium on Haskell, Edinburgh, Scotland, ACM, 2009

Abstract:

Parallel Haskell programming has entered the mainstream with support now included in GHC for multiple parallel programming models, along with multicore execution support in the runtime. However, tuning programs for parallelism is still something of a black art. Without much in the way of feedback provided by the runtime system, it is a matter of trial and error combined with experience to achieve good parallel speedups.

This paper describes an early prototype of a parallel profiling system for multicore programming with GHC. The system comprises three parts: fast event tracing in the runtime, a Haskell library for reading the resulting trace files, and a number of tools built on this library for presenting the information to the programmer. We focus on one tool in particular, a graphical timeline browser called ThreadScope.

The paper illustrates the use of ThreadScope through a number of case studies, and describes some useful methodologies for parallelizing Haskell programs.

Posted in Uncategorized | Leave a comment

The new GHC build system is here!

The new GHC build system has been now been merged in.  GHC developers can look forward to increases in productivity and faster build times thanks to the new non-recursive make design.

Here are some quick stats:

Lines of build-system code (including Makefile and Haskell code):

  • old build system: 7793
  • new build system: 5766 (about 2000 fewer lines, or a 26% reduction)

Furthermore, this doesn’t count the code for ‘cabal make’, which is still in Cabal but is no longer used by GHC.

Time to validate with -j2 (the default; test suite is still single-threaded):

  • old:  28 mins
  • new: 28 mins

Single and dual-core builds don’t see much difference.  However, adding more cores starts to demonstrate the improved parallelism: validate with -j4 (still single-threaded test suite):

  • old: 25.3 mins
  • new: 24.0 mins

Parallelism in the new build system is a lot better. It can build libraries in parallel with each other, profiled libraries in parallel with non-profiled libraries, and even libraries in parallel with stage 2. There’s very little explicit ordering in the new build system, we only tell make about dependencies.

Time to do ‘make’ when the tree is fully up-to-date:

  • old: 2m 41s
  • new: 4.1s

Time to do ‘make distclen’:

  • old: 5.7s
  • new: 1.0s

We also have all-new build-system documentation.

The biggest change you’ll notice is that the build system now expresses all the dependencies, so whatever you change, you should be able to say ‘make’ to bring everything up to date. Sometimes this can result in more rebuilding than you were expecting, or more than is strictly necessary, but it should save time in the long run as we run
into fewer problems caused by things being inconsistent or out-of-date in the build.

We stretched GNU make to its limits. On the whole it performed pretty well: even for a build of this size, the time and memory consumed by make itself is negligible. The most annoying problem we encountered was the need to split the build into phases to work around GNU make’s lack of support for dependencies between included makefiles.

On the whole I’m now convinced that non-recursive make is not only useful but practical, provided you stick to some clear idioms in your build-system design.

Posted in Uncategorized | Leave a comment

New paper: Runtime Support for Multicore Haskell

Here’s a paper on the internals of GHC’s parallelism support, showing some nice improvements in parallel performance over GHC 6.10.1: “Runtime Support for Multicore Haskell“.   Abstract:

Purely functional programs should run well on parallel hardware because of the absence of side effects, but it has proved hard to realise this potential in practice. Plenty of papers describe promising ideas, but vastly fewer describe real implementations with good wall-clock performance. We describe just such an implementation, and quantitatively explore some of the complex design tradeoffs that make such implementations hard to build. Our measurements are necessarily detailed and specific, but
they are reproducible, and we believe that they offer some general insights.

Posted in Uncategorized | 18 Comments

Ever wondered how big a closure is?

Somebody asked me yesterday how to find out how big the runtime representation of a type is. I hacked this up using the internal unpackClosure# primitive:

{-# LANGUAGE MagicHash,UnboxedTuples #-}
module Size where

import GHC.Exts
import Foreign

unsafeSizeof :: a -> Int
unsafeSizeof a = 
  case unpackClosure# a of 
    (# x, ptrs, nptrs #) -> 
      sizeOf (undefined::Int) + -- one word for the header
        I# (sizeofByteArray# (unsafeCoerce# ptrs) 
             +# sizeofByteArray# nptrs)

Try it in GHCi:

Prelude> :!ghc -c Size.hs
Prelude> :l Size
Ok, modules loaded: Size.
Prelude Size> unsafeSizeof 3.3
16
Prelude Size> unsafeSizeof "a"
24
Prelude Size> unsafeSizeof (1,2,3,4)
40
Prelude Size> unsafeSizeof True
8
Prelude Size> unsafeSizeof $! Data.Complex.(:+) 2 3
24
Prelude Size> unsafeSizeof $! 3
16
Prelude Size> unsafeSizeof $! (3::Integer)
16
Prelude Size> unsafeSizeof $! (3::Int)
16
Prelude Size> unsafeSizeof $! (2^64::Integer)
24

I’m on a 64-bit machine, obviously. It doesn’t always do the right thing, but for ordinary algebraic types it should work most of the time. Remember to use $!, as the size returned for an unevaluated thunk is always just one word (unpackClosure# doesn’t work for thunks).

Posted in Uncategorized | 5 Comments

Benchmarking recent improvements in parallelism

Over the last few months we’ve been making various improvements to the performance of parallel programs with GHC. I thought I’d post a few benchmarks so you can see where we’ve got to. This is a fairly random collection of 6 parallel benchmarks (all using par/seq-style parallelism rather than explicit threading with forkIO). The main point here is that I just took the programs unchanged – I haven’t made any attempt to modify the programs themselves to  make them parallelize better (although other people might have done so in the past), the focus here has been on changing the GHC runtime to optimize these existing programs. The programs come mostly from old benchmarks for the GUM implementation of Parallel Haskell, and the sources can be found here.

  • matmult: matrix multiply
  • parfib: our old friend fibonacci, in parallel
  • partree: some operations on a tree in parallel
  • prsa: decode an RSA-encoded message in parallel
  • ray: a ray-tracer
  • sumeuler:  sum . map euler

Here are the results.  The first column of numbers is the time taken for GHC 6.10.1 to run the programs on one CPU, and the following three columns are the difference in elapsed time when the programs are run on 4 CPUs (actually 4 cores of my 8-core x86_64 box) with respectively GHC 6.8.3, 6.10.1, and my current working version (HEAD + a couple of patches).

------------------------------------------------------------
  Program   6.10.1   6.8.3 -N4  6.10.1 -N4  ghc-simonmar -N4
------------------------------------------------------------
  matmult     8.55   -60.0%     -63.7%       -72.0%
   parfib     9.65   -72.6%     -70.2%       -76.3%
  partree     8.03   +26.4%     +52.7%       -40.7%
     prsa     9.52   +13.8%     -44.1%       -68.2%
      ray     7.04   +16.5%     +11.8%       +28.0%
 sumeuler     9.64   -71.2%     -73.1%       -74.0%
  -1 s.d.    -----   -68.8%     -71.6%       -78.0%
  +1 s.d.    -----   +20.1%      +6.4%       -27.1%
  Average    -----   -38.8%     -45.0%       -59.9%

The target is -75%: that’s a speedup of 4 on 4 cores. As you can see, 6.10.1 is already doing better than 6.8.3, but the current version has made some dramatic improvements and is getting close to the ideal speedup on several of the programs.  Something odd is going on with ray, I don’t know what yet!

Here’s a summary of the improvements we made:

  • Lock-free work-stealing queues for load-balancing of sparks (par). This work was originally done by Jost Berthold during his internship at MSR in the summer of 2008, and after further improvements was merged into the GHC mainline after the 6.10.1 release.
  • Improvements to parallel GC: we now use the same threads for GC as for executing the program, and have made improvements to the barrier (stopping threads to do GC), and improvements to affinity (making sure each GC thread traverses data local to that CPU).  Some of this has yet to hit the mainline, but it will shortly.
  • Eager blackholing: this reduces the chance that multiple threads repeat the same work in a parallel program.  It’s  a compile-time option (-feager-blackholing in the HEAD) and it costs a little execution time to turn it on, but it can improve parallelism quite a lot.
  • Running sparks in batches. Previously, each time we run a spark we created a new thread for it. Threads are lightweight, but the cost can still be high relative to the size of the spark. So now we have Haskell threads that repeatedly run sparks (stealing from other CPUs if necessary) until there are no more sparks to run, eliminating the context-switch and thread-creation overhead for sparks. This means we can push the granularity quite a lot: parfib speeds up even with a very low threshold now.

We’re on the lookout for more parallel benchmarks: each new program we find tends to stress the runtime in a different way, so the more code we have, the better.  Even if (or especially if) your program doesn’t go faster on a multicore – send it to us and we’ll look into it.

Posted in Uncategorized | 2 Comments

Explicit Stack Traces

ghc.exe: panic! (the 'impossible' happened)
  (GHC version 6.11.20081202 for i386-unknown-mingw32):
        idInfo

Argh!  When working on GHC, or in any Haskell in general, it is occasionally useful to know something about what was going on when a program crashed.  Currently we have several ways of doing this; for example the ghci-debugger or using the cost-center-stacks that are part of the profiling tools in GHC.  For my internship here at MSR, I’ve been looking at putting another tool in the GHC-users toolbelt, that of selective explicit-call-stack (ecs) traces (in a theme of things described on this wiki page); and today I’ve had a breakthrough in terms of using the tool I’ve been working on to actually help debug a real issue I was having in GHC! So, our panic above was coming from this bit of source:

varIdInfo :: Var -> IdInfo
varIdInfo (GlobalId {idInfo_ = info}) = info
varIdInfo (LocalId {idInfo_ = info}) = info
varIdInfo other_var = pprPanic "idInfo" (ppr other_var)

What the ecs-extension I’m working on does is allow you to add an annotation (an awesome new feature from that’s been recently folded into GHC Head) onto this function to say that I want to debug this function.

import GHC.ExplicitCallStack.Annotations
{-# ANN varIdInfo Debug #-}

And because there’s a few things that need neatening up in the implementation, the pprPanic needs rewriting slightly:

varIdInfo :: Var -> IdInfo
varIdInfo (GlobalId {idInfo_ = info}) = info
varIdInfo (LocalId {idInfo_ = info}) = info
varIdInfo other_var = throwStack (\s -> pprPanic ("idInfo\n" ++ show s) (ppr other_var) :: SomeException)

Recompiling this source (for the curious, recompiling meant building a stage3 ghc), passing in a -fexplicit-call-stack argument to GHC will essentially create a version of this function that tells all callers of this function to pass in their source locations, and will make the pprPanic print out this location.

What this really means is that the original panic becomes slightly more helpful:

ghc.exe: panic! (the 'impossible' happened)
  (GHC version 6.11.20081202 for i386-unknown-mingw32):
        idInfo
in varIdInfo, basicTypes/Var.lhs:238,30

I’ve been told that just having one level of stack information is enough to help debug many problems; in this case it wasn’t quite, but we can just go an add a Debug annotation to varIdInfo to get more stack information.  If a function that has a Debug annotation calls a function with a Debug annotation, the stack is grown an extra level.  So, lather-rinse-repeat enough times, and it’s possible to get quite a lot of information.

ghc.exe: panic! (the 'impossible' happened)
  (GHC version 6.11.20081202 for i386-unknown-mingw32):
        idInfo
in varIdInfo, basicTypes/Var.lhs:238,30
in idInfo, basicTypes/Id.lhs:168,10
in idInlinePragma, basicTypes/Id.lhs:633,37
in preInlineUnconditionally, simplCore/SimplUtils.lhs:619,12
in simplNonRecE, simplCore/Simplify.lhs:964,5
in simplLam, simplCore/Simplify.lhs:925,13
in simplExprF', simplCore/Simplify.lhs:754,5
in simplExprF, simplCore/Simplify.lhs:741,5
in completeCall, simplCore/Simplify.lhs:1120,24
in simplVar, simplCore/Simplify.lhs:1032,29
in simplExprF', simplCore/Simplify.lhs:746,39
...
in simplExprF', simplCore/Simplify.lhs:750,39
...
in simplLazyBind, simplCore/Simplify.lhs:339,33
in simplRecOrTopPair, simplCore/Simplify.lhs:295,5
in simplTopBinds, simplCore/Simplify.lhs:237,35
in simplifyPgmIO, simplCore/SimplCore.lhs:629,5
in simplifyPgm, simplCore/SimplCore.lhs:562,22
in doCorePass, simplCore/SimplCore.lhs:156,40

But what, I hear you cry, are those “…”‘s?  Well since Haskell uses recursion to implement iteration, it would be a very bad thing if we tried to keep a stack that included all sites recursively entered.  Instead, we have a rule that any call site may appear in the stack only once, and if it would appear multiple times, only the newest one is kept, with any others being replaced by “…”s.  Here we can see that simplExprF’ recursively called itself a few times.

Anyways, there are several issues to solve before this is completely usable for prime time, so I can’t promise when it’ll be available; but just thought I’d share what I’m working on;

Tristan (ToRA|MSR)

Posted in Uncategorized | 3 Comments

Redesigning GHC’s build system

Over recent months we’ve been migrating more and more of GHC’s build system from make (+GNU extensions) into Cabal.  It seemed like the obvious thing to do: Cabal is a generic build and packaging system for Haskell libraries, and GHC’s build system was duplicating much of the functionality of Cabal.  What’s more, many libraries needed two sets of metadata to describe them: one set for GHC’s build system, and another set for Cabal, so that users could build and install updated versions of these libraries outside of the GHC build.

So in changing GHC’s build system to use Cabal under the hood, we eliminated some duplicated effort, which is always a good thing.  We got to piggyback off the Cabal project, which is ticking along nice thanks to the efforts of Duncan Coutts and many other contributors.

But it wasn’t perfect, even from the outset.  We knew we wouldn’t be able to use Cabal as it stands.  For one thing, Cabal uses ghc --make,  which doesn’t build in parallel, and we really want parallel builds (building GHC can take a while).  Secondly, when working on GHC we often want to build individual modules, sometimes with extra flags, just to try something out.  Cabal doesn’t support this, it can only build the whole package.  Thirdly, the GHC build system has a lot of configuration information that we need to propagate into Cabal somehow.  So the solution we came up with for all this was to have Cabal generate a Makefile from the package metadata, so that we could build the package using make.  Make knows how to build in parallel, it knows how to make just a single target, and it can import configuration information from other makefiles.

This was a cute hack (or an ugly one, depending on your point of view), but it certainly has its downsides: for one thing, it’s really hard to modify the code that generates the Makefile, as it’s part of Cabal.  For someone new to GHC, it’s even hard to find the code that generates the Makefile.  The build system is getting less maleable.  We hadn’t really fixed the duplication problem either, since we’d had to write another build system, in the form of an auto-generated Makefile.  Users began to complain.

Using Cabal is still morally the right thing to do.  It’s just that the marriage of Cabal with our existing Makefile-based build system isn’t a happy one.  If Cabal becomes a better make than make, we will be able to use it more effectively, but for now we have decided to back off and redesign the build system in the light of the recent problems we’ve had.

GHC’s build system has always been something of a struggle to use.  Building GHC is a 2-stage bootstrap process, where we first build the “stage 1″ compiler, use that to build a bunch of libraries, and then use stage 1 + libraries to build stage 2, which is the compiler that we actually install.  There are a lot of dependencies between different parts of the tree, most of which are not expressed explicitly in the build system.  Partly this is because we use recursive make, which as many will know is considered to be harmful.  The result is that we often need to “make clean” to fix inconsistencies caused by changing parts of the tree without rebuilding enough, or simply by pulling the latest changes into your tree.  We actually recommend doing “make distclean” before pulling new changes, which is obviously not ideal.

So we’re going to try the non-recursive make approach, coupled with a very light use of Cabal for preprocessing the library metadata into Makefile bindings (that way we don’t have to keep the metadata in two forms).  There’s a design sketch for the new build system on the wiki.  Over the next few weeks we’ll be working on the new build system in a repository branch.  Watch this space!

Posted in Uncategorized | 1 Comment