Haskell High Performance Programming
上QQ阅读APP看书,第一时间看更新

Heap profiling

From the profiling report (+RTS -p) we were able to infer how much different cost centres allocated space, along with a rough estimate of time spent in cost centres in total during the program's lifetime. What if we wanted to see how space usage varies across that lifetime? That would be useful to pinpoint space leaks that manifest themselves only at certain events.

GHC includes a heap profiler, which put simply snapshots heap usage at small fixed intervals and generates a time-dependent report in the form of a .hp file. To enable the heap profiler for an executable, the same -prof flag for GHC is enough. Some limited heap profiling is also supported when compiled without profiling. The same cost centres used for time and allocation profiling are also used for heap profiling, if the heap profile is generated or narrowed down based on cost centres.

To extract a heap report, we need to use some of the -h family of Runtime System options. Those options are as follows:

-h<break-down> Heap residency profile (hp2ps) (output file <program>.hp)
   break-down: c = cost centre stack (default)
               m = module
               d = closure description
               y = type description
               r = retainer
               b = biography (LAG,DRAG,VOID,USE)

A subset of closures may be selected thusly:

-hc<cc>,...  specific cost centre(s) (top of stack only)
-hC<cc>,...  specific cost centre(s) (anywhere in stack)
-hm<mod>...  all cost centres from the specified modules(s)
-hd<des>,... closures with specified closure descriptions
-hy<typ>...  closures with specified type descriptions
-hr<cc>...   closures with specified retainers
-hb<bio>...  closures with specified biographies (lag,drag,void,use)

This help message is quite dense. Basically, there are two separate concepts: break-down and closure subset selection.

The break-down subset defines the kinds of thing we associate heap residencies with. That is, the space taken by heap objects is accumulated in one of the following:

  • cost centres (-hc): Pin-point heap residency to automatic or manual (SCC) cost centers.
  • per-module (-hm): Pin-point residency to Haskell modules.
  • closure descriptions (-hd): Constructors and GHC internals.
  • Type description (-hy): Types such as Double, String, Maybe, and so on. Unknown and polymorphic types are given some approximation.
  • Retainers (-hr): We'll discuss retainers shortly.
  • Biography (-hb): The state a heap object is in, also discussed later.

The usual strategy is to break the heap profile down by cost centres (-hc). The complementary concept to heap break-down is closure subsets. By restricting profiling to some subsets of closures, we can narrow down the parts of the program we are actually interested in. The same options apply for subset selection as for break-down.

There can be only one break-down option, but multiple closure subset selections. With break-down, we choose the level at which we wish to inspect the program; with closure subsets, we define the parts of the program we are interested in.

For example, this combination will generate a heap profile broken down by cost centres in the Main module of all values of type Double:

+RTS -hc -hmMain -hyDouble

You can also compile without profiling and still use the -h Runtime System option, at least with a recent GHC. This eliminates the profiling overhead, leaving only minimal overhead from taking residency samples from the heap, but then the profiler options are limited to a -h only. When profiling is disabled, a -h (long form -hT) is pretty much identical to -hd when profiling is enabled.

Cost centre-based heap profiling

Let's take a real example of heap profiling. The following program calculates the Taylor polynomial of degree 800 for the sin function. The approximation is given by this formula:

This is implemented by this program:

-- file: heap-profiling.hs

sin' :: Double -> Double
sin' x = go 0 x where
  go n x
    | n > precision = x
    | otherwise     = go (n + 1) $ x +
        (-1) ** n * x ** (2 * n + 1) / factorial (2 * n + 1)

  precision = 800

  factorial n = product [1..n]

main = print $ sum $ map sin' [0,0.1..1]

To extract a heap profile of this program, we compile it with:

ghc -rtsopts -prof -fprof-auto heap-profiling.hs

./heap-profiling +RTS -hc -i0.05

The -i0.05 flag sets the interval we want to sample the heap. By default, this is 0.1 seconds. We halve it for our program to extract enough details.

By default, when just -h is given, the heap profile is generated based on cost centres (-hc). The report is written in this case to a file named heap-profiling.hp. It is hard to deduce anything meaningful from the .hp file directly. Instead, GHC comes with a binary called hp2ps, which produces nice graphs from heap profile reports. We can generate a PostScript file from a .hp file with hp2ps:

hp2ps -c -d -e8in heap-profiling.hp

I included some additional arguments there to make the output more readable:

  • Colorized output with -c
  • Sorting by standard deviation (-d) will push more static bars to the bottom of the graph
  • The Encapsulated PostScript option (-e) will output in portrait mode (the default is landscape)

The graph is written in a file called heap-profiling.ps. It looks like this:

We see 11 spikes from 11 invocations of sin'. Due to the sampling interval, the spikes are not exactly even, and you might get fewer than 11 spikes on your machine. Crank up the sampling interval if there aren't enough details visible. If we decreased the interval a lot, we would see a lot more details. But this increases overhead, and finer details are not always better because they can obscure the bigger picture.

Looking at this graph, we can rest assured that our program isn't leaking memory over time. However, when we think about the space complexities of calculating a factorial or a series expansion (a sum, basically), those are constant-space. So our program is behaving suboptimally.

The problems in our program should be pretty easy to spot now. First, our factorial function is based on product, which, similar to sum, requires at least optimization-level -O to act in constant-space. The other thing is that, in go, the argument x is non-strict, which results in a chain of thunks being built.

With -O, strictness analyzer is smart enough to fix both of these problems. But if we don't want to rely on optimizer magic, we could fix our program with a bang and a strict fold:

-- file: heap-profiling-optimized.hs
{-# LANGUAGE BangPatterns #-}

import Data.List (foldl')

sin' :: Double -> Double
sin' x = go 0 x where
  go n !x
    | n > precision = x
    | otherwise     = go (n + 1) $ x +
        (-1) ** n * x ** (2 * n + 1) / factorial (2 * n + 1)

  precision = 800

  factorial n = foldl' (*) 1 [1..n]

main = print $ sum $ map sin' [0,0.1..1]

The optimized program produces a solid heap profile. The pinned data is data that the garbage collector cannot touch. Primitive data types are allocated pinned. From the new heap profile we can infer that we are not doing unnecessary allocations anymore:

Objects outside the heap

The heap profiler cannot catch all data. Data allocated in the C land belongs to this category. The internal representation of a ByteString type is such that some allocations happen outside what is reachable for the heap profiler. This can be puzzling, as the following example illustrates.

A simple key-based encryption schema is obtained by taking a key k of length n and plain text of any length. Split the plain text into chunks, p0pm, each with length n, truncating the last chunk. The first cipher text block is given by b0 = k `xor` p0, and for the rest, bn = b{n-1} `xor` pn.

In Haskell, one way to implement this cipher is:

-- file: encryption.hs

import qualified Data.ByteString as B
import Data.Bits (xor)
import System.Environment (getArgs)

encrypt :: B.ByteString -> B.ByteString -> B.ByteString
encrypt key plain = go key plain
  where
    keyLength = B.length key

    go k0 b
        | B.null b  = B.empty
        | otherwise =
            let (b0, bn) = B.splitAt keyLength b
                r0       = B.pack $ B.zipWith xor k0 b0
                in r0 `B.append` go b0 bn

Here we just loop over the ByteString type, split it into two halves, the first of which is of the same size as our key (or less if we have reached the end of the ByteString type). We XOR the first half and the key (the first round) or previous chunk (subsequent rounds). This becomes a chunk of output. Splitting a ByteString type is O(1), and I intentionally included some naive O(n) appending and intermediate lists there.

Decryption is symmetric to encryption. Here is the implementation:

decrypt :: B.ByteString -> B.ByteString -> B.ByteString
decrypt key plain = go key plain
  where
    keyLength = B.length key

    go k0 b
        | B.null b  = B.empty
        | otherwise =
            let (b0, bn) = B.splitAt keyLength b
                r0       = B.pack $ B.zipWith xor k0 b0
                in r0 `B.append` go r0 bn

Now let's see how well our implementation performs. Here's a simple main program that reads the key and input data from files:

main = do
    [action, keyFile, inputFile] <- getArgs
    key <- B.readFile keyFile
    input <- B.readFile inputFile
    case action of
        "encrypt" -> B.writeFile (inputFile ++ ".out") $ encrypt key input
        "decrypt" -> B.writeFile (inputFile ++ ".out") $ decrypt key input

Let's compile with profiling and optimizations:

ghc -O -rtsopts -prof -fprof-auto encryption.hs

To test at a good enough scale, I used a key size of 1 MB and plain text size of 24 MB. Those can be generated as random samples:

dd if=/dev/urandom of=key.bin bs=1M count=1
dd if=/dev/urandom of=plain.bin bs=1M count=24

Then run the program and convert the heap report into a graph:

./encryption +RTS -h -RTS encrypt key.bin plain.bin
hp2ps -c -e8in encryption.hp

(Note that, when reproducing this graph yourself, you might need to play with the interval setting to get similar granularity. For instance, try a slightly bigger value, for example -i0.005.)

This profile is quite peculiar. Those spikes are attributed to what is happening with B.pack $ B.zipWith … as that naively constructs from two ByteString a third ByteString via an intermediate list (which isn't optimized away in this case). But it doesn't make sense that there needs to be about 80 MB allocated simultaneously; there's a key of size 1 MB, one 24 MB plain text, and then the resulting ciphertext, also with size of 24 MB, so a total of about 30 MB is unknown. What is that extra?

Doing the math from the graph, we see that, for every 1 MB of output ciphertext produced, an extra 1 MB is allocated. The culprit is B.append and non-tail non-productive recursion.

When we call B.append, it will evaluate its first argument to whnf, collapsing the list representation and allocating the first 1 MB. Evaluating its second argument will allocate the next 1 MB. But the catch is that B.append is a copying operation, so it also allocates the 1 MB for the output in addition. Both the output and the blocks will be allocated until the final output ciphertext can be constructed, resulting in twice as much space used as necessary.

A better option, as discussed in the previous chapter, is to use the Data.ByteString.Builder module to produce the output. The necessary changes are minimal. An implementation of the encryption step is given here:

-- file: encryption-optimized.hs

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Builder as Builder
import Data.Bits (xor)
import System.Environment (getArgs)

encrypt :: B.ByteString -> B.ByteString -> B.ByteString
encrypt key plain = L.toStrict $ Builder.toLazyByteString $ go key plain
  where
    keyLength = B.length key

    go k0 b
        | B.null b  = mempty
        | otherwise =
            let (b0, bn) = B.splitAt keyLength b
                r0       = mconcat $ map Builder.word8 $ B.zipWith xor k0 b0
                in r0 `mappend` go b0 bn

The heap profile is now very concise, as shown in the following screenshot:

This is about as good as we can get with strict ByteStrings. Of course, the schema allows for streaming so using lazy ByteStrings would be more appropriate.

Retainer profiling

Retainer profiling (+RTS -hr) is designed to help track down space leaks. The retainer profiler breaks down the heap by retainer sets. The system stack, thunks, and explicitly mutable objects are retainers. All live objects are retained by one or more retainer objects.

Let's take an example:

let xs = [1..100000]
    a  = sum xs

Here a is a retainer for xs. xs is not yet nothing but an unevaluated, cheap thunk (as is a). If we asked for length xs, forcing the evaluation of xs, then xs becomes a large evaluated structure in the heap, which is retained by a. The retainer profiler spots such retaining and reports a as a retainer for a lot of data. Of course, if we evaluated a to WHNF, then it would stop being a retainer, because constructors are not retainers.

Let's take another illustratives example and actually do some retainer profiling. The following performs two calculations on two distinct lists:

-- file: mean.hs

mean  xs = sum xs / fromIntegral (length xs)

sumlg xs = sum (map log xs)

main = do
    print $ mean [1..1000000]
    print $ sumlg [1..1000001]

Asking for a retainer profiling, we need to compile with profiling, execute with the Runtime System flag -hr, and finally generate a graph from the heap profile:

ghc -rtsopts -prof -fprof-auto mean.hs
./mean +RTS -hr -i0.02
hp2ps -e8in -c mean.hp

Looking at the program and the heap profile graph, the shape looks like what one would expect with no optimizations enabled: first, the first list is fully evaluated and retained as a whole, and then discarded as we have retrieved its mean. Then a similar process takes place for the second list. What is not quite clear though is what the (2)SYSTEM and (9)SYSTEM mean retainer sets are.

When we look at the graph more closely, we see that those system-things become retainers after the heap usage at our mean and sumlg functions have peaked, implying that our functions have pretty much done their work. After the peak, the list is consumed and it is up to the garbage collector to dispose of the list. There is a slight delay there, which is why the list shows up in a SYSTEM retainer before it is garbage-collected.

The retainer sets are numbered. A dump of all retainer sets is written to the file <program>.prof. The retainer sets in the graph correspond to these lines in that file:

SET 76 = {<Main.sumlg,Main.main,Main.CAF>}
SET 8 = {<Main.mean,Main.main,Main.CAF>}
SET 2 = {<SYSTEM.SYSTEM>}
SET 9 = {<SYSTEM.SYSTEM>, <Main.mean,Main.main,Main.CAF>}

We see that the retainers are given as cost centre stacks. This is crucial when pinpointing retainers to correct calle-sites.

Recall the ByteString encryption example from before. What would the retainer profile for that unoptimized program look like? Much to our surprise, it is in crucial ways different from the cost centre-based heap profile. Look at this:

Only some spikes (from the intermediate list) – where are the large ByteStrings? There sure were some in the other heap profile!

This is a shortcoming of the heap profiler, or a consequence of the foreign pointer representation used in the bytestring library. The retainer profiler is unable to follow ByteStrings to their retainers, and in fact it isn't a given that the cost centre-based report is always totally correct either. This is something that needs to be kept in mind when using libraries, such as bytestring, that use foreign pointers in their data structures.

It is a good idea to always check the Runtime System memory statistics with +RTS -s, as that will always give exact memory usage.

Biographical profiling

The last kind of break-down we can ask from GHC is the biographical breakdown. Biographical profiling sorts heap objects into four categories:

  • State LAG: From creation of the object to its first use
  • State USE: From first use of the object until its last use
  • State DRAG: From final use of the object until the last reference is dropped
  • State VOID: Object is never used in its lifetime

Here's an example biographical profile of the ByteString encryption program from before. Like in the retainer profile, here too the big ByteStrings are not shown. The now familiar spikes are shown as lagging, as expected of a large list that is constructed and then consumed. Then some big data structures seem to show up at the end of the program execution. Something that is used and something that is lagging:

I used the following Runtime System options in generating this graph:

./encrypt key.bin plain.bin +RTS -hb -i0.002

Again, experiment with the interval setting to find the correct granularity on your machine.

Looking at this graph, it's hard to say much about anything. A more useful use case for biographical profiling is to use biographical data in the subset selection, and break-down using some other criteria, such as cost centres.

A usual example would be to find cost centres that produce heap objects in DRAG or VOID states. This is achieved with:

program +RTS -hc -hbdrag,void

Then we could find retainers for those cost centres with the +RTS -hr -hccc1,cc2, and so on program. Note that GHC cannot currently do both biographical and retainer profiling simultaneously, so +RTS -hr -hbdrag,void is unfortunately not allowed.