
Handling binary and textual data
The smallest piece of data is a bit (0 or 1), which is isomorphic to Bool
(True
or False
). When you need just one bit, a Bool should be your choice. If you need a few bits, then a tuple of Bools will fit the purpose when performance is not critical. A [Bool]
is sometimes convenient, but should only be chosen for convenience in some situations.
For high-performance binary data, you could define your own data type with strict Bool fields. But this has an important caveat, namely that Bool is not a primitive but an algebraic data type:
data Bool = False | True
The consequence is that you cannot unpack a Bool similar to how you could an Int or Double. In Haskell, Bool values will always be represented by pointers. Fortunately for many bit-fiddling applications, you can define a data type like this:
data BitStruct = BitStore !Bool !Bool !Bool
This will get respectable performance. However, if you need a whole array of bits it quickly becomes inconvenient to define a field per bit.
Representing bit arrays
One way to define a bit array in Haskell that still retains the convenience of Bool is:
import Data.Array.Unboxed type BitArray = UArray Int Bool
This representation packs 8 bits per byte, so it's space efficient. See the following section on arrays in general to learn about time efficiency – for now we only note that BitArray
is an immutable data structure, like BitStruct
, and that copying small BitStructs is cheaper than copying BitArrays due to overheads in UArray
.
Consider a program that processes a list of integers and tells whether they are even or odd counts of numbers divisible by 2, 3, and 5. We can implement this with simple recursion and a three-bit accumulator. Here are three alternative representations for the accumulator:
-- file: bitstore.hs {-# LANGUAGE BangPatterns #-} import Data.Array.Unboxed import Data.Bits (xor) type BitTuple = (Bool, Bool, Bool) data BitStruct = BitStruct !Bool !Bool !Bool deriving Show type BitArray = UArray Int Bool
And the program itself is defined along these lines:
go :: acc -> [Int] -> acc go acc [] = acc go (two three five) (x:xs) = go ((test 2 x `xor` two) (test 3 x `xor` three) (test 5 x `xor` five)) xs test n x = x `mod` n == 0
I've omitted the details here. They can be found in the bitstore.hs
file.
The fastest variant is BitStruct, then comes BitTuple (30% slower), and BitArray is the slowest (130% slower than BitStruct). Although BitArray is the slowest (due to making a copy of the array on every iteration), it would be easy to scale the array in size or make it dynamic. Note also that this benchmark is really on the extreme side; normally programs do a bunch of other stuff besides updating an array in a tight loop.
If you need fast array updates, you can resort to mutable arrays discussed later on. It might also be tempting to use Data.Vector.Unboxed.Vector Bool
from the vector
package, due to its nice interface. But beware that that representation uses one byte for every bit, wasting 7 bits for every bit.
Handling bytes and blobs of bytes
The next simplest piece of information after bits is a byte, which is eight bits. In Haskell, the Word8 type represents a byte. Often though, whole words are more useful. The Word type has the same size as Int, defined by the underlying architecture. Types Word16, Word32, and Word64 consist of respective numbers of bits.
Like a bit array, a byte array could be represented as a UArray
. But a more standard solution is to use ByteString
from the bytestring
package. The bytestring
package uses a blazingly fast pointer representation internally, while the API looks just like the API for standard lists.
Let's test how fast it really is:
-- file: bytestring-perf.hs import qualified Data.ByteString as B import System.IO (stdin) go :: Int -> Int -> IO Int go 0 s = return $! s go n s = do bs <- B.hGet stdin (1024 * 1024) go (n-1) $! B.length bs + s main = go 2048 0 >>= print
This program reads two gigabytes of binary data from its standard input in one megabyte chunks and prints the total of bytes read. Test it with this:
$ ghc -rtsopts -O bytestring-perf.hs $ time ./bytestring-perf +RTS -s < /dev/zero
On my machine, the program takes 0.25 seconds and allocates about 2.1 gigabytes in heap – meaning there was hardly any space overhead from our use of ByteString
and speed was respectable as well.
The Data.ByteString.ByteString
datatype is strict, meaning that all bytes of a ByteString will be in memory. The Data.ByteString.Lazy
module defines its own ByteString, which is lazy:
data ByteString = Empty | Chunk {-# UNPACK #-} !S.ByteString ByteString
Note
Note that you can unbox strict ByteStrings in your own data types as well.
Using lazy ByteStrings, we could rewrite our program as follows:
-- file: bytestring-lazy-perf.hs import qualified Data.ByteString.Lazy as B import qualified Data.ByteString as S import System.IO (stdin) size = 2048 * 1024 * 1024 go :: Int -> [S.ByteString] -> Int go s (c:cs) | s >= size = s | otherwise = go (s + S.length c) cs main = do bs <- B.hGetContents stdin print $ go 0 (B.toChunks bs)
This program has very similar memory footprint to the strict ByteString version, but is about 20% slower. That slowdown comes from different chunk sizes. hGetContents
uses a hard-coded chunk size of 32 KB (described in the documentation of ByteString). In our previous example, we used a chunk size of 1024 KB, which is a better fit when a lot of bytes are read in. If you changed the chunk size of the strict program variant to 32 KB, the difference between the strict and lazy variants would be negligible, though lazy ByteStrings produce more GC traffic.
Thanks to lazy ByteStrings, we could use hGetContents
to get an infinite ByteString and turn our loop into a pure function. Pure code is in general more valuable than just raw performance.
Starting with bytestring
0.10.0.0, the Data.ByteString.Short
module provides byte arrays with zero memory overhead. A normal ByteString has a memory overhead of a few Word, and a ByteString, once allocated, cannot be moved by GC. This means that multiple small ByteString could contribute to heap fragmentation, or wasted space. ShortByteString, on the other hand, can be moved by GC, but their API is not nearly as complete as the ByteString API, and should only be used for internal optimization.
Working with characters and strings
The standard Char data type is defined to hold any character of the ISO 10646 character set. Char represents every character with 31 bits.
The text representation chosen in Prelude
and base libraries is String = [Char]
. This representation has the convenient property that an understanding of, and operations on, lists carries over to Strings. Furthermore, it's trivial to write programs that process infinite data sequentially without any extra libraries.
Other than being convenient for the programmer, linked lists have a huge overhead, making them ill-suited for high performance string processing. Furthermore, String isn't totally Unicode-correct because some strings' lengths depend on their case.
We could fix performance with a UArray ix Char
. However, this still wouldn't get us Unicode-correctness.
The bytestring
package provides a simple 8-bit character interface for ByteStrings in separate Char8
modules (one for strict and lazy ByteStrings), which may sometimes be all you need if you are sure you're working with 8-bit (ISO 8859) encoded strings.
Using the text library
The library of choice for general text processing nowadays is text
. Its API is designed to resemble String functions, but is faster and Unicode-correct. The Text datatype stores values UTF-16 encoded. That's 16 bits for most characters and 32 bits for obscure characters. Compare this to 31 bits in a Char.
Note
Note that due to the different representations, there is an overhead when converting Strings and Texts. It doesn't always make sense to convert from String to Text or vice versa.
Similar to the bytestring
library, the text
library provides strict and lazy variants under different modules. I/O operations with strict and lazy Text are provided under corresponding modules.
Unlike the bytestring
library, the text
library uses internally an array representation and employs a technique called stream fusion to eliminate the need for intermediate values. Basically this means that pipelines such as T.length . T.toUpper . T.init
will be optimized into a single loop over the input value when optimizations are enabled. Functions that are fused away are indicated in the documentation of text with the phrase Subject to fusion.
The text-icu
package provides bindings for the mature International Components for Unicode (ICU) library on top of text.
Builders for iterative construction
Builder abstractions can be used to efficiently compose multiple small chunks into one big ByteString, Text, or even String. The text
and bytestring
packages provide modules, Data.ByteString.Builder
and Data.Text.Lazy.Builder
, which define Builder types that compose as monoids.
Say we have a data type, Tree
, defined by:
data Tree = Tree !(Int, Tree) !(Int, Tree) | Leaf !ByteString
We want a ByteString serialization of Tree
values, so that for an example value (this requires enabling the OverloadedStrings extension):
Tree (1,Leaf "one") (2, Tree (3,Leaf "three") (4,Leaf "four"))
This encodes as [1:"one",2:[3:"three",4:"four"]]
.
We can accomplish this with the following function:
-- file: builder-encoding.hs import Data.ByteString (ByteString) import qualified Data.ByteString.Builder as B import Data.Monoid ((<>)) import System.IO (stdout) encodeTree :: Tree -> B.Builder encodeTree (Tree (l1, t1) (l2, t2)) = B.charUtf8 '[' <> B.intDec l1 <> B.charUtf8 ':'<> encodeTree t1 <> B.charUtf8 ',' <> B.intDec l2 <> B.charUtf8 ':'<> encodeTree t2 <> B.charUtf8 ']' encodeTree (Leaf bs) = B.charUtf8 '"' <> B.byteString bs <> B.charUtf8 '"'
I also added a main
to test the encoder:
main = B.hPutBuilder stdout $ encodeTree $ Tree (1,Leaf "one") (2, Tree (3,Leaf "three") (4,Leaf "four"))
The ByteString Builder skips all unnecessary intermediate data structures. A ready Builder value can be rendered as a lazy ByteString, meaning it can be consumed lazily. So it's completely possible to create even infinite ByteStrings with Builder. As a final bonus, if you are writing the resulting ByteString into a Handle
, you can use hPutBuilder
, which puts the result straight in the handle's buffer, skipping all intermediate allocations.
The Data.Text.Lazy.Builder
API is similar to the ByteString Builder API. The biggest difference is that text Builders can be constructed only from Chars, and lazy and strict Texts, so they're clearly fit for textual data only.
Builders for strings
Strings are only lists of characters, and lists admit a rather elegant Builder
type:
type Builder = [Char] -> [Char]
Turning a String
into a Builder
is accomplished by applying concatenation partially, and to execute the Builder
we just apply it to the empty list,[]
:
string :: String -> Builder string str = (str ++) toString :: Builder -> String toString b = b []
With this representation, builders are concatenated with normal function composition (.
). Now we can write the previous tree-encoding example using our string builder:
-- file: string-builder.hs data Tree = Tree !(Int, Tree) !(Int, Tree) | Leaf !String encodeTree :: Tree -> Builder encodeTree (Tree (l1, t1) (l2, t2)) = string "[" . string (show l1) . string ":" . encodeTree t1 . string "," . string (show l2) . string ":" . encodeTree t2 . string "]" encodeTree (Leaf str) = string "\"" . string str . string "\"" main = putStrLn $ toString $ encodeTree $ Tree (1,Leaf "one") (2, Tree (3,Leaf "three") (4,Leaf "four"))
It's not hard to see that this builder also creates its result lazily. Coincidentally, the standard Show
type class defines its serializations via:
type ShowS = String -> String
This is exactly the same as our Builder.