Ever wondered how big a closure is?

Somebody asked me yesterday how to find out how big the runtime representation of a type is. I hacked this up using the internal unpackClosure# primitive:

{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Size where

import GHC.Exts
import Foreign

unsafeSizeof :: a -> Int
unsafeSizeof a = 
  case unpackClosure# a of 
    (# x, ptrs, nptrs #) -> 
      sizeOf (undefined::Int) + -- one word for the header
        I# (sizeofByteArray# (unsafeCoerce# ptrs) 
             +# sizeofByteArray# nptrs)

Try it in GHCi:

Prelude> :!ghc -c Size.hs
Prelude> :l Size
Ok, modules loaded: Size.
Prelude Size> unsafeSizeof 3.3
16
Prelude Size> unsafeSizeof "a"
24
Prelude Size> unsafeSizeof (1,2,3,4)
40
Prelude Size> unsafeSizeof True
8
Prelude Size> unsafeSizeof $! Data.Complex.(:+) 2 3
24
Prelude Size> unsafeSizeof $! 3
16
Prelude Size> unsafeSizeof $! (3::Integer)
16
Prelude Size> unsafeSizeof $! (3::Int)
16
Prelude Size> unsafeSizeof $! (2^64::Integer)
24

I’m on a 64-bit machine, obviously. (For example, 3.3 is a boxed Double: one header word plus one word for the unboxed payload, 16 bytes in total.) It doesn’t always do the right thing, but for ordinary algebraic types it should work most of the time. Remember to use $!: the size returned for an unevaluated thunk is always just one word, because unpackClosure# doesn’t work for thunks.

Posted in Uncategorized | 5 Comments

Benchmarking recent improvements in parallelism

Over the last few months we’ve been making various improvements to the performance of parallel programs with GHC. I thought I’d post a few benchmarks so you can see where we’ve got to. This is a fairly random collection of 6 parallel benchmarks (all using par/seq-style parallelism rather than explicit threading with forkIO). The main point is that I took the programs unchanged: I haven’t made any attempt to modify the programs themselves to make them parallelize better (although other people might have done so in the past); the focus here has been on changing the GHC runtime to optimize these existing programs. The programs come mostly from old benchmarks for the GUM implementation of Parallel Haskell, and the sources can be found here.

  • matmult: matrix multiply
  • parfib: our old friend fibonacci, in parallel
  • partree: some operations on a tree in parallel
  • prsa: decode an RSA-encoded message in parallel
  • ray: a ray-tracer
  • sumeuler:  sum . map euler

Here are the results.  The first column of numbers is the time taken for GHC 6.10.1 to run the programs on one CPU, and the following three columns are the difference in elapsed time when the programs are run on 4 CPUs (actually 4 cores of my 8-core x86_64 box) with respectively GHC 6.8.3, 6.10.1, and my current working version (HEAD + a couple of patches).

------------------------------------------------------------
  Program   6.10.1   6.8.3 -N4  6.10.1 -N4  ghc-simonmar -N4
------------------------------------------------------------
  matmult     8.55   -60.0%     -63.7%       -72.0%
   parfib     9.65   -72.6%     -70.2%       -76.3%
  partree     8.03   +26.4%     +52.7%       -40.7%
     prsa     9.52   +13.8%     -44.1%       -68.2%
      ray     7.04   +16.5%     +11.8%       +28.0%
 sumeuler     9.64   -71.2%     -73.1%       -74.0%
  -1 s.d.    -----   -68.8%     -71.6%       -78.0%
  +1 s.d.    -----   +20.1%      +6.4%       -27.1%
  Average    -----   -38.8%     -45.0%       -59.9%

The target is -75%: that’s a speedup of 4 on 4 cores. As you can see, 6.10.1 is already doing better than 6.8.3, but the current version has made some dramatic improvements and is getting close to the ideal speedup on several of the programs.  Something odd is going on with ray, I don’t know what yet!
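To spell out the arithmetic relating the table’s numbers to speedups, here’s a tiny sketch (the function name is mine, just for illustration):

```haskell
-- Converting a percentage change in elapsed time into a speedup
-- factor: a delta of -75% means the 4-core run takes 25% of the
-- 1-core time, i.e. a speedup of 1 / 0.25 = 4.
speedupFromDelta :: Double -> Double
speedupFromDelta delta = 1 / (1 + delta / 100)

main :: IO ()
main = do
  print (speedupFromDelta (-75))    -- the 4x target
  print (speedupFromDelta (-59.9))  -- the current average, about 2.5x
```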

Here’s a summary of the improvements we made:

  • Lock-free work-stealing queues for load-balancing of sparks (par). This work was originally done by Jost Berthold during his internship at MSR in the summer of 2008, and after further improvements was merged into the GHC mainline after the 6.10.1 release.
  • Improvements to parallel GC: we now use the same threads for GC as for executing the program, and have made improvements to the barrier (stopping threads to do GC), and improvements to affinity (making sure each GC thread traverses data local to that CPU).  Some of this has yet to hit the mainline, but it will shortly.
  • Eager blackholing: this reduces the chance that multiple threads repeat the same work in a parallel program.  It’s  a compile-time option (-feager-blackholing in the HEAD) and it costs a little execution time to turn it on, but it can improve parallelism quite a lot.
  • Running sparks in batches. Previously, each time we ran a spark we created a new thread for it. Threads are lightweight, but the cost can still be high relative to the size of the spark. So now we have Haskell threads that repeatedly run sparks (stealing from other CPUs if necessary) until there are no more sparks to run, eliminating the context-switch and thread-creation overhead for sparks. This means we can push the granularity quite a lot: parfib speeds up even with a very low threshold now.
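For readers unfamiliar with par/seq-style parallelism, here is a minimal sketch of the style parfib uses (the threshold value is illustrative, and the code assumes the parallel package for Control.Parallel):

```haskell
import Control.Parallel (par, pseq)

-- par sparks its first argument for possible parallel evaluation;
-- pseq forces its first argument before returning its second.
-- Below a threshold we fall back to the sequential version, to keep
-- the sparks coarse enough to be worth running.
parfib :: Int -> Integer
parfib n
  | n < 11    = seqfib n           -- illustrative threshold
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = parfib (n - 1)
    y = parfib (n - 2)

seqfib :: Int -> Integer
seqfib n
  | n < 2     = 1
  | otherwise = seqfib (n - 1) + seqfib (n - 2)
```

Compile with -threaded and run with +RTS -N4 to use 4 cores; each `par` creates a spark that the runtime may (or may not) run in parallel.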

We’re on the lookout for more parallel benchmarks: each new program we find tends to stress the runtime in a different way, so the more code we have, the better.  Even if (or especially if) your program doesn’t go faster on a multicore – send it to us and we’ll look into it.

Posted in Uncategorized | 2 Comments

Explicit Stack Traces

ghc.exe: panic! (the 'impossible' happened)
  (GHC version 6.11.20081202 for i386-unknown-mingw32):
        idInfo

Argh!  When working on GHC, or on any Haskell code in general, it is occasionally useful to know what was going on when a program crashed.  We currently have several ways of finding out: for example the GHCi debugger, or the cost-centre stacks that are part of GHC’s profiling tools.  For my internship here at MSR, I’ve been looking at putting another tool in the GHC user’s toolbelt: selective explicit-call-stack (ecs) traces (in the theme of things described on this wiki page).  Today I had a breakthrough: I used the tool I’ve been working on to help debug a real issue I was having in GHC!  So, our panic above was coming from this bit of source:

varIdInfo :: Var -> IdInfo
varIdInfo (GlobalId {idInfo_ = info}) = info
varIdInfo (LocalId {idInfo_ = info}) = info
varIdInfo other_var = pprPanic "idInfo" (ppr other_var)

What the ecs extension I’m working on does is let you add an annotation (annotations being an awesome new feature that has recently been folded into GHC HEAD) to this function, saying that I want to debug it:

import GHC.ExplicitCallStack.Annotations
{-# ANN varIdInfo Debug #-}

And because there are a few things that still need neatening up in the implementation, the pprPanic call needs rewriting slightly:

varIdInfo :: Var -> IdInfo
varIdInfo (GlobalId {idInfo_ = info}) = info
varIdInfo (LocalId {idInfo_ = info}) = info
varIdInfo other_var = throwStack (\s -> pprPanic ("idInfo\n" ++ show s) (ppr other_var) :: SomeException)

Recompiling this source with the -fexplicit-call-stack flag (for the curious, recompiling meant building a stage 3 GHC) essentially creates a version of this function that tells all callers to pass in their source locations, and makes the pprPanic print out that location.

What this really means is that the original panic becomes slightly more helpful:

ghc.exe: panic! (the 'impossible' happened)
  (GHC version 6.11.20081202 for i386-unknown-mingw32):
        idInfo
in varIdInfo, basicTypes/Var.lhs:238,30

I’ve been told that just one level of stack information is enough to debug many problems; in this case it wasn’t quite enough, but we can just go and add a Debug annotation to the calling function to get more stack information.  If a function that has a Debug annotation calls a function with a Debug annotation, the stack grows an extra level.  So, lather-rinse-repeat enough times, and it’s possible to get quite a lot of information.

ghc.exe: panic! (the 'impossible' happened)
  (GHC version 6.11.20081202 for i386-unknown-mingw32):
        idInfo
in varIdInfo, basicTypes/Var.lhs:238,30
in idInfo, basicTypes/Id.lhs:168,10
in idInlinePragma, basicTypes/Id.lhs:633,37
in preInlineUnconditionally, simplCore/SimplUtils.lhs:619,12
in simplNonRecE, simplCore/Simplify.lhs:964,5
in simplLam, simplCore/Simplify.lhs:925,13
in simplExprF', simplCore/Simplify.lhs:754,5
in simplExprF, simplCore/Simplify.lhs:741,5
in completeCall, simplCore/Simplify.lhs:1120,24
in simplVar, simplCore/Simplify.lhs:1032,29
in simplExprF', simplCore/Simplify.lhs:746,39
...
in simplExprF', simplCore/Simplify.lhs:750,39
...
in simplLazyBind, simplCore/Simplify.lhs:339,33
in simplRecOrTopPair, simplCore/Simplify.lhs:295,5
in simplTopBinds, simplCore/Simplify.lhs:237,35
in simplifyPgmIO, simplCore/SimplCore.lhs:629,5
in simplifyPgm, simplCore/SimplCore.lhs:562,22
in doCorePass, simplCore/SimplCore.lhs:156,40

But what, I hear you cry, are those "…"s?  Well, since Haskell uses recursion to implement iteration, it would be a very bad thing if we tried to keep a stack that included every recursively-entered call site.  Instead, we have a rule that any call site may appear in the stack only once; if it would appear multiple times, only the newest occurrence is kept, and the others are replaced by "…".  Here we can see that simplExprF’ recursively called itself a few times.

Anyway, there are several issues to solve before this is ready for prime time, so I can’t promise when it’ll be available; I just thought I’d share what I’m working on.

Tristan (ToRA|MSR)

Posted in Uncategorized | 3 Comments

Redesigning GHC’s build system

Over recent months we’ve been migrating more and more of GHC’s build system from make (+GNU extensions) into Cabal.  It seemed like the obvious thing to do: Cabal is a generic build and packaging system for Haskell libraries, and GHC’s build system was duplicating much of the functionality of Cabal.  What’s more, many libraries needed two sets of metadata to describe them: one set for GHC’s build system, and another set for Cabal, so that users could build and install updated versions of these libraries outside of the GHC build.

So in changing GHC’s build system to use Cabal under the hood, we eliminated some duplicated effort, which is always a good thing.  We got to piggyback off the Cabal project, which is ticking along nicely thanks to the efforts of Duncan Coutts and many other contributors.

But it wasn’t perfect, even from the outset.  We knew we wouldn’t be able to use Cabal as it stands.  For one thing, Cabal uses ghc --make, which doesn’t build in parallel, and we really want parallel builds (building GHC can take a while).  Secondly, when working on GHC we often want to build individual modules, sometimes with extra flags, just to try something out; Cabal doesn’t support this, as it can only build the whole package.  Thirdly, the GHC build system has a lot of configuration information that we need to propagate into Cabal somehow.  So the solution we came up with was to have Cabal generate a Makefile from the package metadata, so that we could build the package using make.  Make knows how to build in parallel, it knows how to build just a single target, and it can import configuration information from other makefiles.

This was a cute hack (or an ugly one, depending on your point of view), but it certainly has its downsides: for one thing, it’s really hard to modify the code that generates the Makefile, as it’s part of Cabal.  For someone new to GHC, it’s hard even to find the code that generates the Makefile.  The build system was getting less malleable.  We hadn’t really fixed the duplication problem either, since we’d had to write another build system in the form of an auto-generated Makefile.  Users began to complain.

Using Cabal is still morally the right thing to do.  It’s just that the marriage of Cabal with our existing Makefile-based build system isn’t a happy one.  If Cabal becomes a better make than make, we will be able to use it more effectively, but for now we have decided to back off and redesign the build system in the light of the recent problems we’ve had.

GHC’s build system has always been something of a struggle to use.  Building GHC is a 2-stage bootstrap process: we first build the “stage 1” compiler, use that to build a bunch of libraries, and then use stage 1 plus those libraries to build stage 2, which is the compiler we actually install.  There are a lot of dependencies between different parts of the tree, most of which are not expressed explicitly in the build system.  Partly this is because we use recursive make, which, as many will know, is considered harmful.  The result is that we often need to “make clean” to fix inconsistencies caused by changing parts of the tree without rebuilding enough, or simply by pulling the latest changes into the tree.  We actually recommend doing “make distclean” before pulling new changes, which is obviously not ideal.

So we’re going to try the non-recursive make approach, coupled with a very light use of Cabal for preprocessing the library metadata into Makefile bindings (that way we don’t have to keep the metadata in two forms).  There’s a design sketch for the new build system on the wiki.  Over the next few weeks we’ll be working on the new build system in a repository branch.  Watch this space!

Posted in Uncategorized | 1 Comment

Comparing concurrent linked-list implementations in Haskell

Here’s a paper I wrote recently with Martin Sulzmann and Edmund Lam: Comparing the performance of concurrent linked-list implementations in Haskell (to be presented at DAMP’09).  The basic idea is to analyse the performance and scaling characteristics of the various concurrency synchronisation methods available in GHC: STM, MVars, and IORef with atomicModifyIORef, by using each of these to implement a concurrent linked-list abstraction.

The bottom line is that we found a difference of two orders of magnitude between the slowest (STM) and the fastest (IORef + atomicModifyIORef).  It’s not too surprising that STM is slow here, since each transaction is linear in the size of the data structure; this is a well-known problem that is the subject of a number of other research papers.  However, we did find we could make an STM-based solution that runs nearly as fast as the IORef version by adding a single operation: readTVarIO, which has the same behaviour as atomically . readTVar, but doesn’t have the overhead of setting up and committing a transaction.  Using a rewrite rule this could even be automated.
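The stated equivalence can be sketched as follows (assuming a version of the stm package that provides readTVarIO; the helper name is mine):

```haskell
import Control.Concurrent.STM

-- readTVarIO reads a TVar directly in IO, skipping the overhead of
-- setting up and committing a transaction. On a TVar that isn't
-- being mutated concurrently, it agrees with atomically . readTVar.
readBothWays :: TVar Int -> IO (Int, Int)
readBothWays tv = do
  fast <- readTVarIO tv            -- no transaction machinery
  slow <- atomically (readTVar tv) -- full transaction
  return (fast, slow)

main :: IO ()
main = do
  tv <- newTVarIO 42
  (fast, slow) <- readBothWays tv
  print (fast == slow)
```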

However, a major problem with all of the algorithms in our paper, except for the naive STM implementation, is that they don’t compose.  That’s what STM was for in the first place.  Fortunately there’s a way to have the best of both worlds: fast scalable concurrent data structures written by Clever People(TM) using compare-and-swap, with the full composability of STM operations.  The idea is called transactional boosting and is due to Maurice Herlihy and Eric Koskinen. In some sense this is admitting defeat: we can’t get good performance from STM, so we’ll work around it. On the other hand, it seems unlikely that STM will ever achieve the performance of hand-tuned concurrent data structures, and after all, this technique just follows a long tradition in high-level programming languages of letting you embed low-level bits when you really need performance.

To implement this in GHC all we would need is a way to register actions to perform at transaction commit or abort time, which is something we’ve been meaning to implement for some time anyway.  Maybe there’s a budding GHC hacker out there who would like to take this on?
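One hypothetical shape for that hook API, purely as a sketch (these names and signatures are my invention; nothing like this exists in GHC yet):

```haskell
-- Hypothetical API sketch: register IO actions to run when the
-- enclosing transaction commits or aborts. With these, a boosted
-- data structure could perform a fast CAS-based operation eagerly
-- and register a compensating action to undo it on abort.
onCommit :: IO () -> STM ()
onAbort  :: IO () -> STM ()
```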

Posted in Uncategorized | Leave a comment

Bootstrapping cabal-install

After installing GHC 6.10.1, you probably want to get cabal-install installed as quickly as possible, so you can grab all that Haskell library goodness from Hackage with the minimum of fuss.  Once the Haskell Platform is available, this will be much easier, but until then we have to bootstrap cabal-install using just Cabal.

After a release I find myself doing this on multiple machines, so today I made a little script to automate it, and I thought I’d share it.  Obviously it’s just a quick hack, only works with GHC 6.10.1, has no error-checking etc., but it does work on both Windows and Unix.  Enjoy!

Update: Duncan Coutts pointed out to me that there’s a Windows cabal.exe binary available.


#!/bin/sh

set -e

for i in $TMPDIR /tmp c:/temp; do
  if test -d $i; then
    dir=$i
    break
  fi
  dir=$HOME
done

cd $dir
wget http://hackage.haskell.org/packages/archive/zlib/0.5.0.0/zlib-0.5.0.0.tar.gz
tar xzf zlib-0.5.0.0.tar.gz
cd zlib-0.5.0.0
ghc --make Setup
./Setup configure --user
./Setup build
./Setup register --inplace --user

cd $dir
wget http://hackage.haskell.org/packages/archive/HTTP/3001.1.4/HTTP-3001.1.4.tar.gz
tar xzf HTTP-3001.1.4.tar.gz
cd HTTP-3001.1.4
ghc --make Setup
./Setup configure --user
./Setup build
./Setup register --inplace --user

cd $dir
wget http://hackage.haskell.org/packages/archive/cabal-install/0.6.0/cabal-install-0.6.0.tar.gz
tar xzf cabal-install-0.6.0.tar.gz
cd cabal-install-0.6.0
ghc --make Setup
./Setup configure --user
./Setup build

# Don't need these libs any more...
ghc-pkg unregister zlib-0.5.0.0
ghc-pkg unregister HTTP-3001.1.4

# Now use cabal-install to install itself and its dependencies
./dist/build/cabal/cabal update
./dist/build/cabal/cabal install cabal-install

# Clean up...
cd $dir
rm -rf zlib-0.5.0.0*
rm -rf HTTP-3001.1.4*
rm -rf cabal-install-0.6.0*

echo "ALL DONE!"

Posted in Uncategorized | 7 Comments

GHC 6.10.1 is out!

It seems appropriate to initialise this blog with a release announcement, so here it is: the GHC developers commend the new GHC 6.10.1 release to the community!

There’s plenty of juicy stuff in this release; here are the headlines:

  • New language features
  • Type families have been completely re-implemented
  • Now comes with Haddock 2, which supports all GHC extensions
  • Parallel garbage collection
  • Base provides extensible exceptions
  • The GHC API is easier to use
  • External core (output only) now works again
  • Data Parallel Haskell (DPH) comes as part of GHC

Get all the details in the full release notes.

What’s more, this release has undergone more testing and polishing than ever before, so we hope this will be one of the highest-quality GHC releases ever.  Our test suite has over 2200 individual test programs, each run 6 ways as part of our automated nightly builds, and at release time we had zero unexpected failures (that is, all known bugs are logged in the bug tracker).  The Haskell Platform team have also been testing GHC 6.10.1 pre-releases against all of the packages on Hackage, to ensure that we can make this transition as smooth as possible.

You might notice some other improvements, such as parallel performance: there is ongoing work to analyse and improve the performance of parallel code, and some of that work made it into 6.10.1 (but there’s a lot more to come, some of which can be found in the HEAD).

We have a new parallel garbage collector, which should probably be considered a “technology preview”: you get it by default when enabling multi-core execution with e.g. +RTS -N4, but keep an eye on the GC time with +RTS -s.  If you find that parallel GC isn’t helping, or is even making things worse, then tell us about it, and disable the parallel GC with -g1.

Enjoy, and send us your bugs and feedback!

Posted in Announcements | Leave a comment