I originally wrote all of these data processing computations in python. It seemed a sensible choice because all of the data was in the form of CSV files and slicing up text is easy and fun in python. Unfortunately, python was too slow, even when I rewrote my code to by multicore. I was spending a lot of time waiting around for results and it was slowing down the rate of scientific progress.
I rewrote all of the code in Haskell and it's much faster! Unfortunately, it also crashes. A lot! It keeps running out of memory. Maybe this has happened to you too. So let me tell you why my (and possibly your) Haskell program keeps running out of memory.
First of all, I am on Windows and if you install Haskell Platform for Windows, it is still 32-bit. There is a 64-bit GHC for Windows, but you will have to install it manually, and who has time for that when there's science to do? If you're on Linux, and are lucky enough to have packages which aren't ancient, then you might have a 64-bit GHC already. If you're on Windows, you're out of luck until they release as new Haskell Platform and it's been a while since the last one. Being stuck in 32-bit means that Haskell programs are limited to 4GB of memory. It's worse than that though because the way GHC compiles thing, you're actually limited to only 2GB of memory. Actually it seems like it's more like 1.7GB. Pretty bad for big data work. It's too bad for me because I have a powerful high-memory multi-core desktop I built for crunching data and for some reason it's running Windows.
- Tip #1 - run Haskell programs on Linux.
Of course, you shouldn't actually need to load the whole data set into memory. That's the beauty of lazy evaluation and garbage collection, you can create computations on streams of data and write your code like there's just one big in-memory list, but actually the whole file is not in memory at once. Right??? Well yes, but only if you do it right. There are a number of ways to do it wrong.
The key to having lazy evaluation work for you is to only evaluate a lazy list once. If you evaluate it twice, the first evaluation will load the whole list into memory and it can't be garbage collected because it has to stay around for the second evaluation. There are two ways I can think of to deal with this. You can sequence your evaluations or you can fuse them. The classic example is computing an average. The naive way to compute an average is to first compute the sum of the elements of the list (first evaluation), then compute the length of the list (second evaluation), and then divide the sum by the length. Naively computing the average of a list of numbers which is 2GB will crash your Haskell program due to the dual evaluation interfering with garbage collection. In the sequential approach you could load the list from a file and compute the sum. Then load the list from a file again (new list) and compute the length. Each instance of the list will be garbage collected separately. This is inefficient, but may be your only option if you are using computations you didn't write. For instance, I use the stddev function from Math.Statistics and I don't want to write that myself, so sequential is the best option there. A more efficient approach is fusion, when you evaluate the list once and compute everything you want to computer in a single recursive function. In the case of average, you could write a function which evaluates the list once, keeping both a running sum and a running length. At the end you have the sum and length and you can compute the average. If you are writing your own computations then this is a great option.
- Tip #2 - Only evaluate a large list once - sequence or fuse your computations
That works great if all of your data can be processed sequentially. However, some of my computations actually require that I load the whole dataset into memory at once or else come up with some fancy workarounds. An example of this sort of computation is sorting. It's much easier to sort a list if you have the whole list in memory at once. Of course this only works if your data is less than 1.7GB (or you're on Linux). I thought my data was small enough, but again the crashing, out of memory. Well it turns out that all the Haskell types you know and love are actually quite terrible from a memory use perspective. How big do you think a String is? An Integer? Too big, is the correct answer. Fortunately, there are more efficient alternatives that are less popular among Haskell code examples, such as ByteString (strict if you have small strings, lazy if you have big ones) and Int. Not only are these more memory efficient, but they can also be faster when your Haskell code is compiled down into C or assembly. Ints, for instance, can be loaded directly into CPU registers.
- Tip #3 - Use Int instead of Integer and ByteString instead of String
So those are my tips so far. I've managed to crunch some big datasets since making these changes to my Haskell code. There is great potential for Haskell in the world of big data which is still ruled by Java tools. Sometimes though the convenience of Haskell being a high-level language hides some of the things that are happening under the hood which can affect whether your program succeeds which startling speed or runs out of memory.
If anyone wants to build a high-performance cloud for Haskell-based big data processing, let me know, I have some cool ideas for how to make that work.