Handling binary and textual data_Haskell High Performance Programming-QQ阅读轻小说男生网

上QQ阅读APP看书，第一时间看更新

Handling binary and textual data

The smallest piece of data is a bit (0 or 1), which is isomorphic to Bool (True or False). When you need just one bit, a Bool should be your choice. If you need a few bits, then a tuple of Bools will fit the purpose when performance is not critical. A [Bool] is sometimes convenient, but should only be chosen for convenience in some situations.

For high-performance binary data, you could define your own data type with strict Bool fields. But this has an important caveat, namely that Bool is not a primitive but an algebraic data type:

data Bool = False | True

The consequence is that you cannot unpack a Bool similar to how you could an Int or Double. In Haskell, Bool values will always be represented by pointers. Fortunately for many bit-fiddling applications, you can define a data type like this:

data BitStruct = BitStore !Bool !Bool !Bool

This will get respectable performance. However, if you need a whole array of bits it quickly becomes inconvenient to define a field per bit.

Representing bit arrays

One way to define a bit array in Haskell that still retains the convenience of Bool is:

import Data.Array.Unboxed
type BitArray = UArray Int Bool

This representation packs 8 bits per byte, so it's space efficient. See the following section on arrays in general to learn about time efficiency – for now we only note that BitArray is an immutable data structure, like BitStruct, and that copying small BitStructs is cheaper than copying BitArrays due to overheads in UArray.

Consider a program that processes a list of integers and tells whether they are even or odd counts of numbers divisible by 2, 3, and 5. We can implement this with simple recursion and a three-bit accumulator. Here are three alternative representations for the accumulator:

-- file: bitstore.hs 

{-# LANGUAGE BangPatterns #-}

import Data.Array.Unboxed
import Data.Bits (xor)
type BitTuple = (Bool, Bool, Bool)
data BitStruct = BitStruct !Bool !Bool !Bool deriving Show
type BitArray = UArray Int Bool

And the program itself is defined along these lines:

go :: acc -> [Int] -> acc
go acc              []     = acc
go (two three five) (x:xs) = go ((test 2 x `xor` two)
(test 3 x `xor` three)
(test 5 x `xor` five)) xs

test n x = x `mod` n == 0

I've omitted the details here. They can be found in the bitstore.hs file.

The fastest variant is BitStruct, then comes BitTuple (30% slower), and BitArray is the slowest (130% slower than BitStruct). Although BitArray is the slowest (due to making a copy of the array on every iteration), it would be easy to scale the array in size or make it dynamic. Note also that this benchmark is really on the extreme side; normally programs do a bunch of other stuff besides updating an array in a tight loop.

If you need fast array updates, you can resort to mutable arrays discussed later on. It might also be tempting to use Data.Vector.Unboxed.Vector Bool from the vector package, due to its nice interface. But beware that that representation uses one byte for every bit, wasting 7 bits for every bit.

Handling bytes and blobs of bytes

The next simplest piece of information after bits is a byte, which is eight bits. In Haskell, the Word8 type represents a byte. Often though, whole words are more useful. The Word type has the same size as Int, defined by the underlying architecture. Types Word16, Word32, and Word64 consist of respective numbers of bits.

Like a bit array, a byte array could be represented as a UArray. But a more standard solution is to use ByteString from the bytestring package. The bytestring package uses a blazingly fast pointer representation internally, while the API looks just like the API for standard lists.

Let's test how fast it really is:

-- file: bytestring-perf.hs

import qualified Data.ByteString as B
import System.IO (stdin)

go :: Int -> Int -> IO Int
go 0 s = return $! s
go n s = do bs <- B.hGet stdin (1024 * 1024)
            go (n-1) $! B.length bs + s

main = go 2048 0 >>= print

This program reads two gigabytes of binary data from its standard input in one megabyte chunks and prints the total of bytes read. Test it with this:

$ ghc -rtsopts -O bytestring-perf.hs
$ time ./bytestring-perf +RTS -s < /dev/zero

On my machine, the program takes 0.25 seconds and allocates about 2.1 gigabytes in heap – meaning there was hardly any space overhead from our use of ByteString and speed was respectable as well.

The Data.ByteString.ByteString datatype is strict, meaning that all bytes of a ByteString will be in memory. The Data.ByteString.Lazy module defines its own ByteString, which is lazy:

data ByteString = Empty
                | Chunk {-# UNPACK #-} !S.ByteString ByteString

Note

Note that you can unbox strict ByteStrings in your own data types as well.

Using lazy ByteStrings, we could rewrite our program as follows:

-- file: bytestring-lazy-perf.hs

import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString as S
import System.IO (stdin)

size = 2048 * 1024 * 1024

go :: Int -> [S.ByteString] -> Int
go s (c:cs) | s >= size = s
            | otherwise = go (s + S.length c) cs

main = do
    bs <- B.hGetContents stdin
    print $ go 0 (B.toChunks bs)

This program has very similar memory footprint to the strict ByteString version, but is about 20% slower. That slowdown comes from different chunk sizes. hGetContents uses a hard-coded chunk size of 32 KB (described in the documentation of ByteString). In our previous example, we used a chunk size of 1024 KB, which is a better fit when a lot of bytes are read in. If you changed the chunk size of the strict program variant to 32 KB, the difference between the strict and lazy variants would be negligible, though lazy ByteStrings produce more GC traffic.

Thanks to lazy ByteStrings, we could use hGetContents to get an infinite ByteString and turn our loop into a pure function. Pure code is in general more valuable than just raw performance.

Starting with bytestring 0.10.0.0, the Data.ByteString.Short module provides byte arrays with zero memory overhead. A normal ByteString has a memory overhead of a few Word, and a ByteString, once allocated, cannot be moved by GC. This means that multiple small ByteString could contribute to heap fragmentation, or wasted space. ShortByteString, on the other hand, can be moved by GC, but their API is not nearly as complete as the ByteString API, and should only be used for internal optimization.

Working with characters and strings

The standard Char data type is defined to hold any character of the ISO 10646 character set. Char represents every character with 31 bits.

The text representation chosen in Prelude and base libraries is String = [Char]. This representation has the convenient property that an understanding of, and operations on, lists carries over to Strings. Furthermore, it's trivial to write programs that process infinite data sequentially without any extra libraries.

Other than being convenient for the programmer, linked lists have a huge overhead, making them ill-suited for high performance string processing. Furthermore, String isn't totally Unicode-correct because some strings' lengths depend on their case.

We could fix performance with a UArray ix Char. However, this still wouldn't get us Unicode-correctness.

The bytestring package provides a simple 8-bit character interface for ByteStrings in separate Char8 modules (one for strict and lazy ByteStrings), which may sometimes be all you need if you are sure you're working with 8-bit (ISO 8859) encoded strings.

Using the text library

The library of choice for general text processing nowadays is text. Its API is designed to resemble String functions, but is faster and Unicode-correct. The Text datatype stores values UTF-16 encoded. That's 16 bits for most characters and 32 bits for obscure characters. Compare this to 31 bits in a Char.

Note

Note that due to the different representations, there is an overhead when converting Strings and Texts. It doesn't always make sense to convert from String to Text or vice versa.

Similar to the bytestring library, the text library provides strict and lazy variants under different modules. I/O operations with strict and lazy Text are provided under corresponding modules.

Unlike the bytestring library, the text library uses internally an array representation and employs a technique called stream fusion to eliminate the need for intermediate values. Basically this means that pipelines such as T.length . T.toUpper . T.init will be optimized into a single loop over the input value when optimizations are enabled. Functions that are fused away are indicated in the documentation of text with the phrase Subject to fusion.

The text-icu package provides bindings for the mature International Components for Unicode (ICU) library on top of text.

Builders for iterative construction

Builder abstractions can be used to efficiently compose multiple small chunks into one big ByteString, Text, or even String. The text and bytestring packages provide modules, Data.ByteString.Builder and Data.Text.Lazy.Builder, which define Builder types that compose as monoids.

Say we have a data type, Tree, defined by:

data Tree = Tree !(Int, Tree) !(Int, Tree)
          | Leaf !ByteString

We want a ByteString serialization of Tree values, so that for an example value (this requires enabling the OverloadedStrings extension):

Tree (1,Leaf "one") (2, Tree (3,Leaf "three") (4,Leaf "four"))

This encodes as [1:"one",2:[3:"three",4:"four"]].

We can accomplish this with the following function:

-- file: builder-encoding.hs

import Data.ByteString (ByteString)
import qualified Data.ByteString.Builder as B
import Data.Monoid ((<>))
import System.IO (stdout)

encodeTree :: Tree -> B.Builder
encodeTree (Tree (l1, t1) (l2, t2)) = B.charUtf8 '['
    <> B.intDec l1 <> B.charUtf8 ':'<> encodeTree t1
    <> B.charUtf8 ','
    <> B.intDec l2 <> B.charUtf8 ':'<> encodeTree t2
    <> B.charUtf8 ']'
encodeTree (Leaf bs) = B.charUtf8 '"'
    <> B.byteString bs <> B.charUtf8 '"'

I also added a main to test the encoder:

main = B.hPutBuilder stdout $ encodeTree $
    Tree (1,Leaf "one") (2, Tree (3,Leaf "three") (4,Leaf "four"))

The ByteString Builder skips all unnecessary intermediate data structures. A ready Builder value can be rendered as a lazy ByteString, meaning it can be consumed lazily. So it's completely possible to create even infinite ByteStrings with Builder. As a final bonus, if you are writing the resulting ByteString into a Handle, you can use hPutBuilder, which puts the result straight in the handle's buffer, skipping all intermediate allocations.

The Data.Text.Lazy.Builder API is similar to the ByteString Builder API. The biggest difference is that text Builders can be constructed only from Chars, and lazy and strict Texts, so they're clearly fit for textual data only.

Builders for strings

Strings are only lists of characters, and lists admit a rather elegant Builder type:

type Builder = [Char] -> [Char]

Turning a String into a Builder is accomplished by applying concatenation partially, and to execute the Builder we just apply it to the empty list,[]:

string :: String -> Builder
string str = (str ++)
toString :: Builder -> String
toString b = b []

With this representation, builders are concatenated with normal function composition (.). Now we can write the previous tree-encoding example using our string builder:

-- file: string-builder.hs

data Tree = Tree !(Int, Tree) !(Int, Tree)
          | Leaf !String

encodeTree :: Tree -> Builder
encodeTree (Tree (l1, t1) (l2, t2)) =
    string "[" . string (show l1) . string ":" . encodeTree t1 .
    string "," . string (show l2) . string ":" . encodeTree t2 . string "]"
    encodeTree (Leaf str) = string "\"" . string str . string "\""

main = putStrLn $ toString $ encodeTree $
Tree (1,Leaf "one") (2, Tree (3,Leaf "three") (4,Leaf "four"))

It's not hard to see that this builder also creates its result lazily. Coincidentally, the standard Show type class defines its serializations via:

type ShowS = String -> String

This is exactly the same as our Builder.