tag:blogger.com,1999:blog-71823715687637198542024-03-08T14:09:12.630-08:00Step Three: Profit!blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comBlogger46125tag:blogger.com,1999:blog-7182371568763719854.post-15541932641921903282014-04-11T15:26:00.002-07:002014-04-11T15:26:14.810-07:00Haskell is Not PHP And That's Okay<span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">It's interesting what people think programming is and how thinking in Haskell has changed that for me. I remember seeing something about how <a href="http://developers-beta.slashdot.org/story/13/11/15/1342249/zuckerberg-to-teach-10-million-kids-0-based-counting">Mark Zuckerberg wanted to teach kids 0-based counting so they could be computer programmers</a>.</span><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">Well guess what, I don't use 0-based counting anymore!</span><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">In Haskell instead of x = arr[0] you say something like (x:xs) = arr. You could also say x = head arr, but I find the former idiom happens more often in my code. 
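A minimal sketch of both idioms side by side (firstItem is just an illustrative name, not a standard function):

```haskell
-- Pattern matching ("destructuring") instead of 0-based indexing.
-- firstItem is an illustrative name for this sketch.
firstItem :: [a] -> a
firstItem (x:xs) = x    -- x binds the head, xs the rest; no arr[0]
firstItem []     = error "empty list"

main :: IO ()
main = do
  let arr = [10, 20, 30] :: [Int]
  print (firstItem arr)   -- same result as: print (head arr)
```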
Getting the first element of a list is a subset of the functionality of destructuring that is as pervasive in Haskell code as array indices are in PHP.</span><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">I mention this not because how you get items out of a list in different languages is interesting in itself, but because I used to think that 0-based math is one of the things that makes you a programmer. It turns out I don't need it and I don't miss it.</span><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">Other things I am surprised that I don't miss are multi-assignment variables and loops.</span><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">I used to think that these were core elements of what it means to program. 
I used to think that languages without loops were just toys (and a lot of them still are).</span><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">Things I still use are if statements and function calls. These seem to be really quite fundamental to what it means to program a computer. There are languages which lack these. For instance, you can use SKI calculus to write computations and that is a language without an if. However, you end up recreating some form of branching combinator if you try to write programs this way. </span><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">Things I do miss are strict evaluation and dynamic types.</span><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">Lazy evaluation can do some cool things, but it can make your programs more confusing. 
This especially comes up when debugging why a program is running out of memory. Sometimes I am implementing a straightforward standard eager evaluation algorithm and it would be nice to not have to convert this into a lazy-friendly form. I think this might be more fundamentally difficult than converting from iterative to recursive functions. At least it has been for me so far.</span><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">Static typing is great and part of what makes Haskell work well, but sometimes it's a pain. Parsing data structures from dynamically typed languages, for instance JSON, is painful in Haskell. JSON has mixed-type arrays and Haskell doesn't, so it's hard to translate between the two, whereas in Python you just call loads(jsondata) and get a native Python data structure. So fun and easy! Another place where dynamic typing is nice is function/operator overloading. Haskell has this for numbers with type classes. You can do a + b and as long as they are the same type it will work whether a and b are of type Integer or Float or whatever. However, Haskell also has three kinds of strings (String, ByteString, ByteString.Lazy) and I have to use three different append functions (append, ByteString.append, ByteString.Lazy.append) to concatenate them. 
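In practice the three appends look something like this (a sketch; the qualified module aliases are just local choices):

```haskell
-- Three string types, three different ways to append.
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL

main :: IO ()
main = do
  putStrLn ("foo" ++ "bar")                          -- String: list append
  B.putStrLn (B.append (B.pack "foo") (B.pack "bar"))    -- strict ByteString
  BL.putStrLn (BL.append (BL.pack "foo") (BL.pack "bar")) -- lazy ByteString
```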
Python has two string types (str and bytes) and + works on either (as well as on numbers).</span><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><br style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;" /><span style="background-color: white; color: #404040; font-family: Roboto, arial, sans-serif; font-size: 13px; line-height: 18.200000762939453px;">In summary, I used to have some ideas about what programming was and what was important in a language. The more languages I learn the more I realize that the things I thought were really important are not the important things after all. I used to care a lot about surface-level features like whether a language had lambdas or how it supported objects. I've come to realize that when you really get down to it high-level languages are just ways to glue together what are essentially chunks of C code underneath. I'm willing to try out different sets of abstractions until I find the one that gets me to my goal with minimum time and effort. Haskell has proven to be a very practical language for getting things done, but Python is still ahead of the game for messing around with data in text files.</span>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-27934989567881657902014-03-19T21:12:00.001-07:002014-03-19T21:12:27.365-07:00Why Does My Haskell Program Keep Running Out of Memory?Recently I've been doing some "big data" or "data science" work. I define data as being "big" when there's too much of it for me to look at the actual data and instead I am forced to only look at the results of computations derived from the data. 
I consider this work to be "science" because people offer hypotheses about what the data might say, based on their understanding of the real-world situations that generated the data, and then I can look at the results of computations on the data and provide evidence for or against these hypotheses. The hypotheses let me know what kind of computations to write and the results inform the formulation of new hypotheses. Fun stuff!<br />
<br />
I originally wrote all of these data processing computations in Python. It seemed a sensible choice because all of the data was in the form of CSV files and slicing up text is easy and fun in Python. Unfortunately, Python was too slow, even when I rewrote my code to be multicore. I was spending a lot of time waiting around for results and it was slowing down the rate of scientific progress.<br />
<br />
I rewrote all of the code in Haskell and it's much faster! Unfortunately, it also crashes. A lot! It keeps running out of memory. Maybe this has happened to you too. So let me tell you why my (and possibly your) Haskell program keeps running out of memory.<br />
<br />
First of all, I am on Windows and if you install Haskell Platform for Windows, it is still 32-bit. There is a 64-bit GHC for Windows, but you will have to install it manually, and who has time for that when there's science to do? If you're on Linux, and are lucky enough to have packages which aren't ancient, then you might have a 64-bit GHC already. If you're on Windows, you're out of luck until they release a new Haskell Platform, and it's been a while since the last one. Being stuck in 32-bit means that Haskell programs are limited to 4GB of memory. It's worse than that, though: because of the way GHC compiles things, you're actually limited to only 2GB of memory. Actually it seems like it's more like 1.7GB. Pretty bad for big data work. It's too bad for me because I have a powerful high-memory multi-core desktop I built for crunching data and for some reason it's running Windows.<br />
<br />
<ul>
<li>Tip #1 - run Haskell programs on Linux.</li>
</ul>
<br />
Of course, you shouldn't actually need to load the whole data set into memory. That's the beauty of lazy evaluation and garbage collection: you can create computations on streams of data and write your code like there's just one big in-memory list, but actually the whole file is not in memory at once. Right??? Well yes, but only if you do it right. There are a number of ways to do it wrong.<br />
<br />
The key to having lazy evaluation work for you is to only evaluate a lazy list once. If you evaluate it twice, the first evaluation will load the whole list into memory and it can't be garbage collected because it has to stay around for the second evaluation. There are two ways I can think of to deal with this. You can sequence your evaluations or you can fuse them. The classic example is computing an average. The naive way to compute an average is to first compute the sum of the elements of the list (first evaluation), then compute the length of the list (second evaluation), and then divide the sum by the length. Naively computing the average of a 2GB list of numbers will crash your Haskell program due to the dual evaluation interfering with garbage collection. In the sequential approach you could load the list from a file and compute the sum. Then load the list from a file again (new list) and compute the length. Each instance of the list will be garbage collected separately. This is inefficient, but may be your only option if you are using computations you didn't write. For instance, I use the stddev function from Math.Statistics and I don't want to write that myself, so sequential is the best option there. A more efficient approach is fusion, where you evaluate the list once and compute everything you want to compute in a single recursive function. In the case of average, you could write a function which evaluates the list once, keeping both a running sum and a running length. At the end you have the sum and length and you can compute the average. If you are writing your own computations then this is a great option.<br />
<br />
<ul>
<li>Tip #2 - Only evaluate a large list once - sequence or fuse your computations</li>
</ul>
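The average example above, sketched roughly (a strict left fold carrying a running sum and count; the function names are mine):

```haskell
{-# LANGUAGE BangPatterns #-}
-- Sequencing vs. fusion for computing an average (illustrative names).
import Data.List (foldl')

-- Naive: two traversals; the list stays alive between sum and length.
naiveAverage :: [Double] -> Double
naiveAverage xs = sum xs / fromIntegral (length xs)

-- Fused: one strict pass keeping only a running (sum, count),
-- so the list can be garbage collected as it streams by.
fusedAverage :: [Double] -> Double
fusedAverage xs = s / fromIntegral n
  where
    (s, n) = foldl' step (0, 0 :: Int) xs
    step (!acc, !len) x = (acc + x, len + 1)

main :: IO ()
main = print (fusedAverage [1 .. 1000000])  -- prints 500000.5
```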
<div>
That works great if all of your data can be processed sequentially. However, some of my computations actually require that I load the whole dataset into memory at once or else come up with some fancy workarounds. An example of this sort of computation is sorting. It's much easier to sort a list if you have the whole list in memory at once. Of course this only works if your data is less than 1.7GB (or you're on Linux). I thought my data was small enough, but again the crashing, out of memory. Well it turns out that all the Haskell types you know and love are actually quite terrible from a memory use perspective. How big do you think a String is? An Integer? Too big, is the correct answer. Fortunately, there are more efficient alternatives that are less popular among Haskell code examples, such as ByteString (strict if you have small strings, lazy if you have big ones) and Int. Not only are these more memory efficient, but they can also be faster when your Haskell code is compiled down into C or assembly. Ints, for instance, can be loaded directly into CPU registers.</div>
<div>
<ul>
<li>Tip #3 - Use Int instead of Integer and ByteString instead of String</li>
</ul>
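A sketch of Tip #3 in practice: strict ByteString for the text, a machine-word Int for the accumulator (sumLines is an invented example, not from my real code):

```haskell
{-# LANGUAGE BangPatterns #-}
-- Compact types: strict ByteString for text, Int for counters.
import qualified Data.ByteString.Char8 as B
import Data.List (foldl')

-- Sum one integer per line of input (invented example function).
sumLines :: B.ByteString -> Int
sumLines = foldl' step 0 . B.lines
  where
    step !acc line = case B.readInt line of
      Just (n, _) -> acc + n
      Nothing     -> acc    -- skip unparseable lines

main :: IO ()
main = print (sumLines (B.pack "10\n20\n30\n"))  -- prints 60
```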
<div>
So those are my tips so far. I've managed to crunch some big datasets since making these changes to my Haskell code. There is great potential for Haskell in the world of big data, which is still ruled by Java tools. Sometimes, though, the convenience of Haskell being a high-level language hides some of the things that are happening under the hood, which can affect whether your program succeeds with startling speed or runs out of memory.</div>
</div>
<div>
<br /></div>
<div>
If anyone wants to build a high-performance cloud for Haskell-based big data processing, let me know; I have some cool ideas for how to make that work.</div>
blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-13835953194369243562014-02-26T13:41:00.001-08:002014-02-26T13:45:16.842-08:00Bitcoin's Failure to ScaleThe golden age of Bitcoin is over, but it's not because of the reason you'd think. The recent drop in the price of Bitcoin after the collapse of MtGox is irrelevant because as I've said before<a href="http://www.stepthreeprofit.com/2014/02/the-price-of-bitcoins-is-irrelevant.html"> the price of Bitcoin is irrelevant</a>. However, the MtGox collapse and the <a href="http://blog.coinbase.com/post/77766809700/joint-statement-regarding-mtgox">joint statement regarding this collapse</a> from major players in the Bitcoin industry highlight that Bitcoin has taken a wrong turn and is now plowing ahead in the wrong direction.<br />
<br />
Why is Bitcoin useful? If you've read my post "<a href="http://www.stepthreeprofit.com/2014/02/the-price-of-bitcoins-is-irrelevant.html">The Price of Bitcoins is Irrelevant</a>" you know that I consider Bitcoin to be useful for one important function: transferring money online for the purpose of buying and selling goods and services. In the past, we've had decentralized currency exchange in the form of cash and centralized electronic currency exchange in the form of ACH transfers and credit card payments through banks, but we haven't had a good means for currency exchange which is both electronic and decentralized. Bitcoin is useful because it provides exactly these properties, or at least it used to.<br />
<br />
The problem with the modern Bitcoin economy is that it is becoming less and less decentralized. Much of the Bitcoin exchange is now happening through a handful of services such as MtGox and Coinbase. They are essentially taking on the role of unregulated banks and are starting to act in much the same way that banks did before regulation. The collapse of MtGox is a tale as old as money, with an origin before the rise of modern banks. Before banks as we know them existed, there were goldsmiths that would hold onto your gold and other valuables while you were off on the crusades. They provided physical security for your physical wealth in exchange for a fee. You got a paper receipt as proof of your deposit of valuables. Of course eventually someone showed up to get his gold back and found out that his piece of paper was worthless because the king had raided the vaults to finance his war efforts.<br />
<br />
The problem with banks is when they replace your actual money with virtual money in the form of an account balance. This is a promise from the bank that they will give you back an equal amount of money as you gave them to hold. Of course, a promise is only worth anything if it's fulfilled. An account balance denominated in Bitcoins is no better than a handwritten IOU. There's no way to know if the vaults are in fact empty.<br />
<br />
It's not necessary for Bitcoin companies to implement their services this way, by converting your actual Bitcoin assets into account balances. Bitcoin holdings are independently verifiable by examining the blockchain. Therefore the responsible way to operate is for Bitcoin companies to merely act as proxies. Rather than running the Bitcoin client yourself, a company such as Coinbase can run it for you, providing an easy-to-use interface, dollar/Bitcoin exchange services, and secure backups of your private keys. However, the majority of Bitcoin services don't operate this way. They do not actually keep your Bitcoins for you; instead, the Bitcoins you deposit or receive as payment go into that company's account and in exchange you only get a promise. They can at any time freeze your account, become insolvent, or otherwise break their promise and thereby steal your money. As an example of this problem, look at the Coinbase et al. joint statement in the part where it calls for Bitcoin companies to have "clear policies to not use customer assets for proprietary trading or for margin loans in leveraged trading". The fact that they even have the option to do this means that Bitcoin has failed to live up to its potential. A decentralized currency shouldn't have these problems or else we're just using banks again, unregulated banks, prone to all of the failures and abuses we've come to know and fear.<br />
<br />
You might argue that this has nothing to do with Bitcoin. You can still run the client and be fully decentralized. People are free to build these centralized bank-like services on top of Bitcoin and other people are free to ignore them. However, there's a reason that people use services like MtGox and Coinbase. Bitcoin has some flaws which make it a usability nightmare and the centralized services fix those flaws. Let's discuss some aspects of the Bitcoin design and why they failed to scale:<br />
<ul>
<li>Mining as the means of issuance</li>
<li>Requiring the full transaction history to make transactions</li>
<li>Fluctuating exchange rate</li>
</ul>
<div>
Mining as the means of issuance is one of the key innovations of Bitcoin and it worked quite well in the early days to ensure fairness in a fully decentralized way. However, as mining has scaled, it has failed to remain decentralized. <a href="http://blockchain.info/pools">Mining power is now concentrated in just a few mining pools</a>. This is a direct effect of the way mining is done, with the difficulty being set by the hashrate. As the difficulty increases, the ability for individuals to successfully mine declines, forcing consolidation into pools. To be fair, the founders of Bitcoin could not have anticipated dedicated ASICs for mining. In the early days, the difficulty was discussed as something which would go up and down over time, not something that would go forever upwards.</div>
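To see the pooling pressure concretely, here's a back-of-the-envelope sketch of a solo miner's expected time between blocks (assuming Bitcoin's roughly 144 blocks per day from the 10-minute target; the hashrates below are illustrative only):

```haskell
-- Why solo mining stops working as the network hashrate grows.
-- Assumes ~144 blocks/day (Bitcoin's 10-minute block target).
expectedDaysPerBlock :: Double -> Double -> Double
expectedDaysPerBlock myHashrate networkHashrate =
  networkHashrate / (myHashrate * 144)

main :: IO ()
main =
  -- An illustrative 1 GH/s miner against a 10 PH/s network:
  -- tens of thousands of days between blocks on average,
  -- hence the pressure to consolidate into pools.
  print (expectedDaysPerBlock 1e9 1e16)
```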
<div>
<br /></div>
<div>
Requiring the full transaction history to make transactions is a straightforward scaling problem. The size of the transaction history grows over time with the number of transactions. For existing clients, dealing with this is just a matter of storage space and keeping up with new transactions. For new clients, the entire history must be downloaded, which delays their introduction into the network. This has also caused recentralization. In the early days of Bitcoin, everyone ran the client; in fact, there was no other option. This was fully decentralized, the way Bitcoin was meant to be used. More and more users are migrating to Bitcoin services which manage the transaction history for you. The signup for these can be instant and they are especially useful for mobile users that don't have sufficient resources to run a full Bitcoin client all the time. This once again recentralizes Bitcoin use.</div>
<div>
<br /></div>
<div>
In the early days of Bitcoin, the exchange rate was low but stable. Bitcoins were often obtained through mining instead of purchase as mining was something anyone could do. The exchange rate was not something to worry too much about as it changed slowly, and mostly upward. As the interest in Bitcoin grew, the volatility of the exchange rate increased to the point that it is a significant consideration for customers and vendors that would like to transact using Bitcoins. This has led to a desire for holding account balances in dollars. Services like Coinbase and Bitpay will let you transact entirely in dollars. This is not in itself a bad thing, but it means that once again you have to use a centralized service as the Bitcoin client has no way of converting your Bitcoins into dollars for storage.</div>
<div>
<br /></div>
<div>
So three aspects of the Bitcoin design that some would say are integral to its character as a cryptocurrency have all failed to maintain decentralization as Bitcoin has scaled. I argue that in fact these characteristics create pressure to centralize at scale. This is very bad for Bitcoin as it means that as it scales it will lose more and more of its advantage over traditional online payment methods.</div>
<div>
<br /></div>
<div>
Here is my three-point plan for getting back to decentralized cryptocurrencies:</div>
<div>
<ul>
<li>Don't use services that give you an account balance instead of holding your actual Bitcoins</li>
<ul>
<li>Blockchain seems like the only viable choice right now</li>
<li>In the past I have supported Coinbase, but unfortunately I must suggest moving off of it</li>
</ul>
<li>Build services that maintain the decentralized operation of Bitcoin</li>
<ul>
<li>Most of the services provided by companies like MtGox could be offered without account balances, storing actual Bitcoins for users</li>
<li>Bitcoin clients in the cloud are a good compromise</li>
<li>These services should be independently auditable by looking at the Blockchain</li>
</ul>
<li>Build a new cryptocurrency without these scaling problems</li>
<ul>
<li>Mining was a cool idea, but it must be replaced</li>
<li>Clients should be able to connect quickly without the full transaction history</li>
<li>A stable exchange rate</li>
<li>While we're at it, no transaction malleability</li>
</ul>
</ul>
<div>
Let me know if you're interested in working on these things with me. I have several ideas for experiments that I think might be good ways to test the waters.</div>
</div>
blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-20115520009164220402014-02-14T10:18:00.000-08:002014-02-14T10:18:28.245-08:00The Price of Bitcoins is IrrelevantMuch of the press coverage and discussion of Bitcoin has focused on the price of a Bitcoin, which has fluctuated greatly. It was $15 when I first started getting interested in Bitcoins. I had previously done some mining when Bitcoins were worth about $0.01 each, but I found the whole user experience at the time to be unusable. The $15 price point was when I discovered Coinbase and determined that maybe one day it would be feasible to actually exchange Bitcoins for goods and services. Since then, the price has gone up exponentially and this has caused a lot of emotion: excitement from speculators, bitterness from people that missed out, disdain from people calling Bitcoins beanie babies for nerds, even outright hatred from Charles Stross. Suddenly everyone is asking me about Bitcoins. My aunt even asked me about them at Thanksgiving. Now that the price has gone down somewhat from its high at around $1000 per Bitcoin, people are proclaiming that this heralds the end of Bitcoins, that they knew all along it was a fad, and that they were smart for not investing. Every fluctuation in the exchange rate is viewed as a portent which confirms the feelings of the commentator.<br />
<div>
<br /></div>
<div>
I can see why people like to talk about the price of Bitcoins, especially the press. It's an indicator that's very easy to track and easy to visualize. The swooping curves, either upwards or downwards, are visually engaging. It's fun to talk about Bitcoin millionaires, and it's fun to talk in a schadenfreude sense about people losing their life savings in foolish Bitcoin investments. It's all very exciting and makes for good entertainment news.</div>
<div>
<br /></div>
<div>
I have a different perspective, and it is that the price of Bitcoins is irrelevant. The focus on price is due to a misunderstanding of what Bitcoins are, what they're good for, and why they're interesting. Let me break it down for you.</div>
<div>
<br /></div>
<div>
What is money?</div>
<div>
<br /></div>
<div>
Money is a:</div>
<div>
<ol>
<li>unit of measurement</li>
<li>store of value</li>
<li>vehicle for speculation</li>
<li>means of exchange</li>
</ol>
<div>
A given type of money can be any or all of these. People think Bitcoin is all of them and this is the root of the confusion as Bitcoin is bad at 1&2 while being good at 3&4. Therefore some people think Bitcoin is "good money" and some people think it's "bad money". It's both!<br />
<br />
<b>Unit of Measurement</b><br />
We use money as a unit of measurement every day. When you say things like "$X is too much for a cup of coffee" or "My time is worth $X an hour", you are measuring the value of things in terms of dollars. A common misconception people have about Bitcoin is that Bitcoin is a unit of measurement and that something would cost, for instance, 1 Bitcoin. In actuality, when you see a price in Bitcoins it is calculated dynamically from a price in dollars. Some Bitcoin payment processing services like Bitpay will do this for you automatically based on the current exchange rate. Otherwise the vendor can recalculate prices daily based on the current exchange rate. So prices are actually measured in dollars and just displayed in Bitcoins. The reason is that vendors have to spend dollars to acquire the products they sell you. Having a fixed cost in dollars to acquire goods and then selling them at a fixed price in Bitcoins while the dollar-Bitcoin exchange rate fluctuates is a nonsensical situation for vendors. You can hypothesize about a world in which vendors buy their stock of products with Bitcoins and then sell them for Bitcoins, but this is an imaginary world and not the one we are currently operating in. We do not live in a Bitcoin-based economy. We live in a dollar-based economy where Bitcoin only fills a small link in that chain between the customer and the vendor. Therefore, at the present time, Bitcoin is not a good unit of measurement. Most prices in Bitcoin are just a marketing tactic to let people know you accept Bitcoins.<br />
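The dynamic calculation itself is trivial; something like this (a sketch with made-up exchange rates):

```haskell
-- How a Bitcoin price display is typically derived: the real price
-- lives in dollars and is divided by the current exchange rate.
-- The rates below are illustrative only.
priceInBtc :: Double -> Double -> Double
priceInBtc usdPrice usdPerBtc = usdPrice / usdPerBtc

main :: IO ()
main = do
  let coffeeUsd = 4.00
  print (priceInBtc coffeeUsd 500)   -- the same coffee at $500/BTC
  print (priceInBtc coffeeUsd 1000)  -- and again at $1000/BTC
```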
<br />
<b>Store of Value</b><br />
Bitcoin is, simply put, a terrible store of value because the value is measured in dollars (see above). Obviously if the dollar value decreases then that's bad, but a store of value which increases in value is also not a very good store of value. A good store of value maintains its value (in dollars) consistently over time. Since the price of Bitcoins in dollars is determined by a market-based exchange rate, it is a poor store of value. Of course dollars aren't a particularly good store of value, even when they are stored in banks, due to inflation. Some national currencies might be an even worse store of value than Bitcoins if they are undergoing hyperinflation. However, if the choice is between Bitcoins and dollars, dollars are a superior store of value with significantly less fluctuation in value in the short term and a good track record for holding their value in the long term. The best store of value is a diversified investment portfolio. If you want to put some Bitcoins in the mix that's fine, but often when people buy Bitcoins they buy too many Bitcoins and are not sufficiently diversified. Consider mutual funds, real estate, and small business investments along with high-risk speculative investments such as Bitcoins and Internet startup stock options.<br />
<br />
<b>Vehicle for Speculation</b><br />
When discussing what money is good for, people sometimes forget to include that it is a vehicle for speculation. This is true of all currencies, including dollars, because of currency exchange markets. Bitcoins are an excellent vehicle for speculation. Unlike the currency exchange markets, the Bitcoin-dollar exchange has a low barrier to entry and low overhead. You can start your exchange rate speculation today with only $1 of starting capital. The high volatility of the exchange rate offers many opportunities to buy low and sell high. With the multiple exchange markets there are also ample opportunities for arbitrage. It's a day trader's dream. Of course I'm not saying that Bitcoin is a good speculative investment or that you will make money playing the market. Successful speculation involves guessing when the prices are going to go up and when they are going to go down, which is where the fun and the risk come in. So while Bitcoin is a fun and easy way to speculate on currency exchange markets, the actual price of Bitcoin is unimportant to speculation. All that matters is that the price keeps going up and down with sufficient frequency that you have opportunities to place bets on which direction it's going to go.<br />
<br />
It's important to understand the difference between speculation and other types of investment. Long-term investments are like farming. With farming, seeds cost less than crops, so you invest in buying the seeds with the belief that if you wait a while the seeds will grow into crops and you can sell them for more than you put in. Value is created with time and effort. Businesses do this as well, so when you invest in a business you are buying a share of the larger value that's going to be created in the future. Speculative investment is more like betting on the result of a coin flip. You buy at a certain price hoping that the price will go up rather than down. Unlike the crops produced by farming or the value produced by businesses, Bitcoins do not naturally become more valuable over time. The rise in the price of Bitcoins has been based entirely on fluctuations in demand, which makes Bitcoins a speculative investment. So when people compare Bitcoins to beanie babies or tulip bulbs, there is a meaningful parallel there when speaking of Bitcoins as an investment. People are going to get rich and people are going to lose their shirts. That's how gambling works. While fun and perhaps the most discussed monetary feature of Bitcoin, I consider speculation to be the least interesting aspect of Bitcoin. The real value of Bitcoins is as a means of exchange.<br />
<br />
<b>Means of Exchange</b><br />
Where Bitcoin really shines is as a means of exchange, even though for some reason the press never seems to cover this aspect of Bitcoin. The great feature of Bitcoin is that you can buy things with it! Also, you can sell things in exchange for Bitcoins! Buying and selling things is perhaps the oldest feature associated with money and a very handy feature indeed. Cash is a form of money that provides this feature but also has some downsides, particularly for Internet sales. Credit and debit cards work for Internet sales, but also have some undesirable features which are actually pretty weird if you think about them. Bitcoin offers an improvement over credit and debit cards for buying and selling goods and services, both in person and over the Internet.<br />
<br />
Here are some great features of Bitcoin as compared to credit and debit cards:<br />
<br />
<ul>
<li>Security - The customer only authorizes a specific transaction with the merchant, so the merchant can't steal from the customer or leak information that would allow hackers to steal from the customer.</li>
<li>Privacy - The customer doesn't provide any private information to the merchant such as their home address. Due to the increased security, this private information is not necessary to verify transactions.</li>
<li>Cost - Credit card processing is expensive for the merchant, and this is reflected in higher prices for customers. All transactions carry what is essentially a sales tax, but instead of being used to build schools and roads it just adds to the profit of the banks. Merchants have to pay swipe fees, a percentage of the sale, monthly fees, and setup fees, and they have monthly minimums they must pay even if they end up making no sales. Bitcoin is much cheaper for the merchant, as there are only per-transaction fees and they are comparatively very low. Think about this: why do credit card transactions have percentage-based fees when a digital currency transaction requires the same amount of work for the processor whether it's for $1 or $1000?</li>
<li>Ease of use - Bitcoin is the easiest way to accept money on the Internet. No special equipment is required and you don't need a credit card processor or even a bank account. You can start accepting Bitcoins as payment right now for no cost. The easiest way is to set up a Coinbase account. It takes a couple of minutes and it will provide you with some HTML code you can put on your website to start accepting payments. It's really that easy!</li>
</ul>
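The percentage-versus-flat-fee point is easy to make concrete. Here is a small sketch; the 2.9% + $0.30 card rate and the $0.05 flat fee are purely illustrative assumptions, not quotes from any real processor:

```python
# Illustrative comparison of percentage-based card fees vs. a flat
# per-transaction fee. The rates below are assumptions for the sake
# of the example, not actual pricing from any real processor.

def card_fee(amount, percent=0.029, per_txn=0.30):
    """Typical card pricing: a percentage of the sale plus a fixed fee."""
    return amount * percent + per_txn

def flat_fee(amount, per_txn=0.05):
    """Flat pricing: the processor does the same work regardless of amount."""
    return per_txn

for amount in (1.0, 1000.0):
    print(amount, round(card_fee(amount), 2), flat_fee(amount))
```

Under the percentage model the merchant pays nearly a hundred times more on a $1000 sale than on a $1 sale, even though the processing work is identical.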
<div>
So Bitcoin is a terrible form of money for all uses except for buying and selling things. For buying and selling things it's great and really quite revolutionary. Here's the thing about using Bitcoins for buying and selling goods and services: the price of Bitcoins is irrelevant. There is no reason to hold onto Bitcoins because they are a terrible store of value and they're a terrible investment unless you just like to gamble. Vendors do not price in Bitcoins (although they may display prices in Bitcoins) because we do not live in a Bitcoin-based economy. So if you want to buy something with Bitcoins, you purchase just the exact number of Bitcoins you need to match the price in dollars at the current exchange rate and send them to the merchant. The merchant then converts those Bitcoins immediately into dollars and deposits them in a bank account. The Bitcoin-dollar exchange rate only matters for the duration of the transaction, a short enough time period that the price is stable. Merchant services like Coinbase and BitPay automate this whole process for you so that you can use Bitcoin as a medium of exchange but only ever deal with dollars on either side.</div>
<div>
<br /></div>
<div>
So if the price of Bitcoins is irrelevant, how can we track the rise of Bitcoin and tell how adoption is going? The real measure of value is in the amount of goods and services being transacted using Bitcoins. Every time a new vendor starts accepting Bitcoins, that's when the real value of Bitcoins goes up. Unfortunately, there's not a handy chart of this, so it will probably never be reported on by the press. However, if you're in Austin for SXSW, hit me up and I'll show you where you can buy tacos with your Bitcoins.</div>
<div>
<br /></div>
</div>
</div>
blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-73578579720535185162013-02-09T18:41:00.001-08:002013-02-09T18:41:33.071-08:00Space Party: Space CaptainMy game studio, Hot Trouble, is working on a local co-op video game called Space Party. It's inspired by Artemis, Space Team, FTL, and Puzzle Pirates. In this game you and your friends all take on different roles crewing a spaceship. Each role has its own minigame that you have to play to do your job and keep the spaceship running.<br />
<br />
We're releasing each of the minigames as a standalone game as part of the OneGameAMonth.com initiative. The first one is out now and it's called <a href="http://goo.gl/mkDTa">Space Captain</a>. You pilot a ship around different sectors of the galaxy looking for an Earth-like planet to colonize. Watch out for the other planets though, as they're inhabited by hostile aliens that will chase you down and destroy your ship. You don't have any weapons, so your only option is to run.blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-490230071383302192013-01-25T13:45:00.000-08:002013-01-25T13:45:01.932-08:00JSON: It's Time to Move OnI love JSON. I love it because it's not XML. I used to think XML was a pretty good idea compared to unpacking structs, but the more it was used for generic tasks like RPC and config file formats, the more it became clear that it was really only suitable for documents. This makes sense, as that's what it was designed to do. XML was being used to represent data structures, and the problem with that is that there is a mismatch between what XML is good at expressing and the sort of data structures you generally want to encode for computational tasks.<br />
<br />
JSON is obviously a better choice for a number of common data types and structures such as floats, strings, maps, and lists. The syntax is easier to read and more concise for encoding these types. More important, however, is that there is a clear mapping between the data structures and their encoding. In XML, this was something you had to invent yourself or adopt from one of a number of incompatible standards, such that XML became a proliferation of different languages speaking about the same things.<br />
<br />
JSON has served us well, but much like XML, as it's been used for more and more things, its shortcomings are becoming apparent. JSON suffers from essentially the same problem as XML: a lack of universal mappings for common items that need to be encoded and decoded consistently.<br />
<br />
The missing type which most commonly causes me trouble with JSON is byte strings. JavaScript has only one string type, while other languages often have two: one for byte strings and one for Unicode strings. JavaScript strings are sequences of 16-bit code units (effectively UTF-16): string literals can include Unicode escape sequences, there is no pure byte string type, and String.charAt(x) counts code units rather than bytes. Accordingly, most JSON encoders assume all strings are Unicode. Therefore, JSON in practice has only a Unicode string type and does not support byte strings.<br />
<br />
Many applications, however, have byte strings. The most common solution is to base64 encode your byte strings into ASCII characters and encode them into JSON as unicode strings. In addition to being slower, it requires increased semantic complexity. Both the sender and receiver of the JSON now need to know where the base64 encoded strings are in the nested JSON data structure so that they can be encoded and decoded between byte strings and base64.<br />
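In Python, for example, the workaround looks something like this; the field names are hypothetical, and the key problem is that both sides must agree out of band which fields are base64-encoded:

```python
import base64
import json

# Both sides must agree, out of band, that the "avatar" field (a
# hypothetical example) carries base64-encoded bytes.
raw = b"\x89PNG\r\n\x1a\n"  # arbitrary binary data, not valid UTF-8
message = {"user": "alice", "avatar": base64.b64encode(raw).decode("ascii")}
wire = json.dumps(message)

# The receiver must already know which fields to base64-decode; nothing
# in the JSON itself distinguishes "avatar" from an ordinary string.
received = json.loads(wire)
avatar = base64.b64decode(received["avatar"])
assert avatar == raw
```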
<br />
This has caused people to invent their own protocols on top of or around JSON. For instance, you can tag every string as to whether it needs to be base64 decoded or not. Another solution is to remove all byte strings from the JSON and instead include tagged offsets. The binary data can then be appended to the end of the JSON data as a packed binary blob and the offsets used to extract individual byte strings. A very simple solution I've seen is to encode the whole data structure using a binary-friendly format such as BSON or MessagePack, base64 encode the entire result, and send it as a single JSON string.<br />
<br />
The advantage to building something on top of or around JSON is that the encoder and decoder do all of the work of analyzing the data structure and patching incompatibilities with standard JSON. The disadvantage is that now you're using a nonstandard protocol which is going to need to be implemented for both the sender and receiver, for every language you want to use.<br />
<br />
The best solution overall is to realize the limitations of JSON and decide on a new protocol which fixes these limitations. There are several alternatives to JSON already, but they focus more on efficiency of encoding and decoding rather than on the more fundamental semantic mismatch issues. Of the binary formats I've looked at (BSON, BJSON, MessagePack), only BSON has separate data types for unicode strings and byte strings. I'm not specifically advocating BSON, but at least they have the right idea on that front.<br />
<br />
This new protocol doesn't even necessarily need to be a binary protocol. It just needs to support byte strings as a semantic type. In the end, everything needs to be JSON-compatible in order to be browser-compatible, so building something on top of JSON would probably be a fine solution. People are already doing this, as I mentioned above. The next step is to give it a name and release it on github so that everyone can use it and start adding support for more languages.<br />
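As a sketch of what a byte-string-aware layer on top of JSON might look like, here is a minimal codec that wraps byte strings in a tagged object. The `__bytes__` tag is my own invention for illustration, not an existing standard:

```python
import base64
import json

TAG = "__bytes__"  # sentinel key; an invented convention, not a standard

def encode(obj):
    """Serialize to JSON, wrapping byte strings in a tagged object."""
    def default(value):
        if isinstance(value, bytes):
            return {TAG: base64.b64encode(value).decode("ascii")}
        raise TypeError(value)
    return json.dumps(obj, default=default)

def decode(text):
    """Deserialize, unwrapping tagged objects back into byte strings."""
    def object_hook(mapping):
        if set(mapping) == {TAG}:
            return base64.b64decode(mapping[TAG])
        return mapping
    return json.loads(text, object_hook=object_hook)

roundtrip = decode(encode({"name": "demo", "blob": b"\x00\xff"}))
assert roundtrip == {"name": "demo", "blob": b"\x00\xff"}
```

A real protocol would also need to handle maps that legitimately contain the tag key, but this shows the basic shape: the codec, not the application, knows where the byte strings are.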
<br />
Here is my minimum feature list for a new encoding:<br />
<ul>
<li>It should be JSON-style where you just give it a data structure and it serializes it, and you give it a string and it deserializes it into a data structure. (As opposed to Protobuf/Thrift style with schemas)</li>
<li>Support for all the JSON datatypes - string, float, map, list, boolean, null</li>
<li>Add support for byte strings in addition to unicode strings</li>
<li>Add support for integers in addition to floats</li>
<li>Add support for dates</li>
<li>Browser-compatible, which probably means encoded as JSON between the client and server</li>
</ul>
<div>
Nice-to-have optional features:</div>
<div>
<ul>
<li>Sets as well as lists</li>
<li>Ordered maps as well as unordered maps</li>
</ul>
<div>
In the meantime, I've switched to using BSON when not in browsers and I'm still using JSON in the browser. This is not a good solution, but it's the best available at the moment that doesn't require inventing a custom protocol.</div>
</div>
<div>
<br /></div>
blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-6143347617010373432012-10-01T10:24:00.002-07:002012-10-01T10:24:56.339-07:00Adventure Time Game JamI was recently fortunate enough to participate in the <a href="http://www.adventuretimegamejam.com/">Adventure Time Game Jam</a>, sponsored by Fantastic Arcade. They managed to get licensing rights from Pendleton Ward and Cartoon Network to use Adventure Time characters in games, under the condition that we could only distribute our games through the game jam site, and that Cartoon Network could post the ones they like on their own site.<br />
<br />
There were about 700 participants, and approximately 100 games were produced. The winning game was by indie studio Vlambeer. It was such a great game too!<br />
<br />
My own team consisted of myself as programmer, <a href="http://coriejohnson.com/">Corie Johnson</a> as UI/UX/graphic designer, and <a href="http://celinethefeline.com/">Celine Suarez</a> as voice actress and graphic artist. Corie also recorded the opening theme song and composed an original rap which she performed for the ending screen.<br />
<br />
It was a unique experience. The game jam took place in an abandoned yoga studio next to the Alamo Drafthouse South Lamar. When we first arrived, there were no chairs. Our Internet was stolen from the Drafthouse. There was a man in the corner in an Einstein's Arcade t-shirt making Ethernet cables, and each time he finished one, one more person got to get online. In another corner, Vlambeer were sitting on the ground playing Infinite Swat with Xbox controllers on a laptop.<br />
<br />
For some reason pizza and beer kept arriving from unknown origins for 48 hours. All of the audio was recorded on iPhones in the shower at the space where we were doing the game jam. The ending rap was composed and the main theme recorded in the car driving to and from the space. There was no time to waste on second guessing decisions as the clock was constantly ticking. In the end I think we had one of the most finished games. You can <a href="http://www.adventuretimegamejam.com/submissions/20-lumpy-space-chess">download it from the site</a>. Also check out how it was mentioned in the <a href="http://www.wired.com/gamelife/2012/09/adventure-time-game-jam/?pid=3102">top 8 coolest games from the jam on Wired</a>!<br />
<br />
For me it was great working with such a talented team. I basically just hacked code nonstop. I did the whole game in <a href="http://kineticjs.com/">KineticJS</a>, which is a great HTML5 graphics framework, and I used <a href="http://buzz.jaysalvat.com/">Buzz</a> for the sound. These libraries saved me a lot of time and I learned a lot about the affordances and limitations of HTML5 games.<br />
<br />blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-64148538874659418462012-07-20T08:44:00.001-07:002012-07-20T08:46:20.996-07:00Freefall Tutorial ScreencastI made a <a href="http://youtu.be/gqLQM7YJx4s">screencast</a> walking through the tutorial for building a Freefall example app. In this particular example, I built a simple leaderboard services where you can post scores and then get a sorted list of all the scores. Make sure to watch it in HD so that the text is legible.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /><iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/gqLQM7YJx4s?feature=player_embedded' frameborder='0'></iframe></div>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-82013233001776398622012-07-17T15:27:00.005-07:002012-07-17T15:27:57.709-07:00Freefall Tutorial DocsI've been working on some <a href="https://github.com/blanu/freefalldb/tree/master/doc">tutorial docs</a> for using Freefall. There is a usage tutorial that describes all of the different commands you can use with the command line tool. There is also an app development tutorial which walks you through developing a simple leaderboard service. By the end of the tutorial you should have a leaderboard up and running on Google App Engine!<br />
<br />
If you go through the tutorial, please let me know how it goes. Any feedback on the documentation would be helpful as I want this to be a tool which people can actually use.<br />
<br />blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-84452410128785589242012-07-16T20:32:00.000-07:002012-07-16T20:36:03.878-07:00Austin on KickstarterI was recently asked if I knew anyone that had done a Kickstarter campaign and might want to be on a crowdfunding panel here in Austin. I started thinking about all of the local folks I knew that had done Kickstarters. There were quite a few! I wanted to share them with you. These are just people I know personally (and friends of friends), so there must be a lot more projects going on in Austin that I don't know about. This is exciting to think about.<br />
<br />
Here is the list:<br />
<br />
<a href="http://www.kickstarter.com/projects/872179144/tammany-hall?ref=live">Tammany Hall</a>, <a href="http://www.kickstarter.com/projects/872179144/the-great-fire-of-london?ref=users">The Great Fire of London</a> - Pandasaurus Games / Nathan McNair<br />
<a href="http://www.kickstarter.com/projects/dystopianholdings/inevitable-dystopian-tabletop-gaming?ref=live">Inevitable</a> - Dystopian Holdings / Jonathan Leistiko<br />
<a href="http://www.kickstarter.com/projects/wileywiggins/thunderbeam-for-the-ipad?ref=live">Thunderbeam</a> - Karakasa Games / Wiley Wiggins<br />
<a href="http://www.kickstarter.com/projects/liseman/growerbot-your-social-gardening-assistant?ref=history">growerbot</a> - Luke Iseman<br />
<a href="http://www.kickstarter.com/projects/bigpoppae/big-poppa-es-poetry-project?ref=search">Big Poppa E's Poetry Project</a> - Big Poppa E<br />
<a href="http://www.kickstarter.com/projects/1830488365/the-blue-hit-recordings-project?ref=users">The Blue Hit Recordings Project</a> - The Blue Hit<br />
<a href="http://www.beatboxbeverages.com/">Beatbox Beverages</a> (not launched yet) - Beatbox Beverages / Aimy Steadman<br />
<a href="http://www.kickstarter.com/projects/brandonwiley/cat22-culture-art-and-technology-in-the-22nd-centu?ref=live">CAT22</a> - me!<br />
<br />blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-69025395373565120592012-07-11T15:42:00.000-07:002012-07-11T15:42:35.833-07:00Freefall ScalingFreefall is a cloud-based NoSQL database which is designed from the ground up to be ultra-scalable. For the most part, users of Freefall don't need to know anything about the details. It just works. However, scalability enthusiasts might be interested in knowing what's going on under the hood.<br />
<br />
The first step in designing Freefall to scale is that it runs on top of Google App Engine. This means that, for the most part, Google will autoscale the number of servers in order to handle the rate of incoming requests. There are plenty of parameters you can tweak on the App Engine web dashboard in order to optimize performance, but for basic functionality you don't need to change a thing.<br />
<br />
The next step is the separation of frontend and backend services. The public API of your services, the actions and the views, are handled by the frontend servers. The transforms, the internal logic of your application and the bulk of the computation, are handled by the backend servers. This means that a client will never block while waiting for a time-consuming computation to complete. Actions and views are designed to return very quickly, freeing up the frontend servers to handle more requests, while the backend servers compute asynchronously. Therefore, when the service is overloaded with too many writes to process, the failure mode is stale data, not catastrophic collapse.<br />
<br />
The next step is to separate reads and writes. Views are read-only. In fact, views are pre-computed and pre-serialized and cached in memory. So when you load a view all the frontend server has to do is read the pre-serialized bytes out of memory and write them to the HTTP socket. Views are therefore extremely fast. Actions are write-only. The purpose of an action is to change the application's state. All the frontend servers do for an action is to deserialize the incoming data and add the request to a queue. The actual processing of the action is done asynchronously on the backend. Actions are therefore extremely fast.<br />
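A toy sketch of this read/write split, in plain Python rather than Freefall's actual code: views are nothing but cached pre-serialized bytes, and actions do nothing but enqueue.

```python
import json
import queue

# A toy sketch of the read/write split described above -- not Freefall's
# actual code. Views read pre-serialized bytes; actions only enqueue.

view_cache = {}               # view name -> pre-serialized JSON bytes
action_queue = queue.Queue()  # processed asynchronously by backend workers

def serve_view(name):
    """Frontend read path: just return cached bytes, no computation."""
    return view_cache[name]

def submit_action(name, payload):
    """Frontend write path: deserialize and enqueue, nothing else."""
    action_queue.put((name, json.loads(payload)))

def backend_step(state):
    """Backend: apply one queued action, then refresh the cached view."""
    name, data = action_queue.get()
    if name == "increment":
        state["count"] += data["by"]
    view_cache["count"] = json.dumps(state).encode("utf-8")

state = {"count": 0}
submit_action("increment", '{"by": 3}')
backend_step(state)
assert serve_view("count") == b'{"count": 3}'
```

In Freefall the queue is consumed by App Engine backend servers; here a single `backend_step` call stands in for that asynchronous processing.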
<br />
On the backend, an action and a transform are essentially the same, the only differences between them being whether it is part of the public API or internal logic and whether the input is supplied by the client or from internally stored data. From a processing perspective, they operate in the same way. The input data is loaded and the transform function is run. It makes changes to the output based on the input and the results of computation. After the output has been changed, two things happen: views are calculated, and transforms are triggered. If the particular model which is the output of the transform is marked as a view, then the state of that model is serialized and cached in memory so that it can be retrieved by the client. If any transforms are configured to be triggered by the output model then they are called and the process repeats again. Eventually all of the transforms have been processed and all of the views have been calculated and the system returns to a state of rest until the next action is performed.<br />
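The change-then-trigger cycle described above can be sketched as a small dataflow loop. This is my own illustration of the idea, not Freefall's implementation:

```python
# A toy dataflow-propagation loop in the spirit of the description above
# (my own sketch, not Freefall's implementation). Models are named values;
# each transform reads its input models and replaces its single output model.

models = {"purchases": [], "total": 0}
triggers = {}   # input model name -> transforms it triggers when changed
views = {}      # "view" models: outputs cached for clients to read

def transform(inputs, output):
    """Register a transform in the dataflow graph (cycles are disallowed)."""
    def register(fn):
        for name in inputs:
            triggers.setdefault(name, []).append((fn, inputs, output))
        return fn
    return register

@transform(inputs=["purchases"], output="total")
def total(purchases):
    return sum(purchases)

def propagate(changed):
    """Run every transform triggered by a changed model, then repeat for
    the models those transforms changed, until the system is at rest."""
    pending = [changed]
    while pending:
        name = pending.pop()
        for fn, inputs, output in triggers.get(name, []):
            models[output] = fn(*(models[i] for i in inputs))
            views[output] = models[output]  # assume "total" is marked as a view
            pending.append(output)

models["purchases"] = [5, 7]
propagate("purchases")
assert views["total"] == 12
```

Because the graph is acyclic, the loop always terminates once every downstream transform has run.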
<br />
That's the system in a nutshell, but I glossed over some details which are important to the technical aspects of how Freefall scales so well. If we were to process one transform at a time then that could make things very slow. So instead, App Engine can process multiple transforms at once by launching simultaneous backend servers which pull tasks from the queue. This allows for highly parallelized computation, similar to MapReduce. There are a number of ways that parallel computation can go awry, but everything works out in Freefall because of some clever design elements. First of all, the structure of transforms creates a data flow graph. Transforms can have multiple inputs, but only one output, and cycles are not allowed. The structure of the graph therefore partially serializes computation because a transform isn't executed until its inputs have changed.<br />
<br />
Additionally, transforms are pure, side-effect free functions. So the value of the output is entirely determined by the value of the inputs (and the computation). It therefore doesn't matter what order we run the computations in as long as they have the correct inputs. This may seem confusing because the transform functions modify the output state. However, this is all a ruse to make it easier to write transforms in a more familiar syntax. Transforms do not actually modify the output model, but rather they are monadic functions, which is to say that they produce monads. A monad in this case is a list of requested changes to make to the model. The model is not actually modified; the modifications that are requested are just collected and returned at the end of the function. This is important because it means we can run the function as many times as we like without fear of it actually modifying the database.<br />
<br />
In fact, we do sometimes run the function multiple times. Freefall is a Software Transactional Memory (STM) database. We run the set of requested changes in a transaction, possibly simultaneously with a lot of other transactions. If any two transactions modify the same model then we abort and retry. The function which was aborted is rerun using the new values for the model. This is the one case in which it does matter in which order we run the functions, as the rerun function might return a different output given its new input. However, this is essentially a case of two things happening simultaneously, and so in the interest of moving forward one of the two simultaneous events is chosen to happen first and the other second. Deadlock is therefore avoided and consistency is maintained (because everything happens in transactions).<br />
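Here is a toy illustration (not Freefall's code) of the two ideas combined: a transform returns a list of requested changes rather than mutating anything, and the changes are committed in a versioned transaction that retries on conflict:

```python
# A toy illustration of the ideas above (not Freefall's code): a transform
# returns a list of requested changes instead of mutating the model, and
# the changes are applied in a versioned transaction that retries on
# conflict, optimistic-concurrency style.

def add_score(model, new_score):
    """Pure: reads the model, returns requested changes, mutates nothing."""
    changes = []
    if new_score > model.get("best", 0):
        changes.append(("set", "best", new_score))
    return changes

class Store:
    def __init__(self, model):
        self.model = dict(model)
        self.version = 0

    def commit(self, read_version, changes):
        if read_version != self.version:  # someone else committed first
            return False
        for op, key, value in changes:
            if op == "set":
                self.model[key] = value
        self.version += 1
        return True

def run_transaction(store, fn, *args):
    while True:  # on abort, retry with fresh inputs
        version = store.version
        changes = fn(dict(store.model), *args)
        if store.commit(version, changes):
            return

store = Store({"best": 10})
run_transaction(store, add_score, 25)
assert store.model["best"] == 25
```

With a single thread the commit always succeeds on the first try; under concurrent backends, a conflicting commit bumps the version and the losing transaction reruns with fresh inputs.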
<br />
So that's basically all of the magic: queues, caching, asynchronous monadic functions, and software transactional memory. The result is a database that won't fall over under read load and can be scaled up to handle arbitrarily high write load by launching more backend processing servers.<br />
<br />blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-68921173369993897772012-07-11T14:51:00.000-07:002012-07-11T14:51:30.139-07:00Freefall: What Is It And Why Is It Awesome?To describe it simply, Freefall is a NoSQL database. It is similar to other NoSQL databases such as CouchDB, SimpleDB, or MongoDB. However, there are some key differences in Freefall, particularly in how you use it.<div>
<br /></div>
<div>
Unlike most databases, you do not run Freefall on your own servers. It runs as a Google App Engine app. You run your own instance and you pay Google for the bandwidth and computation time. The reason you need to run your own instance is that Freefall isn't just a generic database. You specify the services provided by your application and then Freefall generates a custom App Engine app to provide those services. It also generates custom client libraries to call the services. So your experience as a developer is of a high-level API provided as a library for the language of your choice to access your specific services. In this way Freefall is similar to Rails because it provides most of the infrastructure and you just provide your application-specific code. Most importantly, you don't need to know anything at all about Google App Engine! It's all taken care of by Freefall. You just need to define your specific services and then you're ready to go.</div>
<div>
<br /></div>
<div>
For example, let's say you wanted a simple high score tracking service. You could define a "reportScores" action which reports a new score for a given playerid. You could then define a "highest" transform which discards all scores for a given player which aren't the highest seen. You could then define a "highScores" view which returns a list of the high scores. Freefall would then generate all of the code to make a server which supports these functions and client libraries with high-level methods such as "void reportScores(String playerid, float score)". You can then deploy your server-side code to App Engine and call the client library to access your high score service.</div>
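Freefall generates this code for you; purely as an illustration of the shape of the three pieces (with names adapted to Python style), a hand-written version of the leaderboard might look like this:

```python
# A plain-Python sketch of the leaderboard example described above -- an
# illustration of the action/transform/view shape, not code that Freefall
# actually generates.

scores = {}  # raw state written by the action: playerid -> scores seen
best = {}    # derived state maintained by the transform

def report_scores(playerid, score):
    """Action: the public write API; it only records the new score."""
    scores.setdefault(playerid, []).append(score)
    highest(playerid)  # trigger the downstream transform

def highest(playerid):
    """Transform: keep only the highest score seen for each player."""
    best[playerid] = max(scores[playerid])

def high_scores():
    """View: the public read API; a sorted list of the best scores."""
    return sorted(best.values(), reverse=True)

report_scores("p1", 10.0)
report_scores("p2", 40.0)
report_scores("p1", 30.0)
assert high_scores() == [40.0, 30.0]
```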
<div>
<br /></div>
<div>
As you can see from the above example, Freefall is an MVC framework. You define actions which are the public API for changing the state of your application. You can also define internal transforms which derive new state from the state changed by actions, or from other transforms. Eventually, the state changes reach one of the defined views, in which case the changes become publicly accessible. Actions and views together form the public API of your service, while transforms represent the internal logic on your service. Together they form a data flow graph in which actions flow into transforms and then into views. Actions are the input and views are the output.</div>
<div>
<br /></div>
<div>
Transforms are a powerful feature of Freefall as they allow for arbitrary computations to take place to process your data. You can do validation, authorization, sorting, joins, and filtering. Transforms are similar to CouchDB views, except that they can take multiple inputs and they can be chained by using the output of one transform as the input for another. This makes them much more flexible than CouchDB views. Additionally, transforms (and actions) are full python functions. They can even import modules! They are stored in .py files, not in the database, so you can use version control to keep track of your code.</div>
<div>
<br /></div>
<div>
So when you really get down to it, Freefall is much more than a NoSQL database. It's a framework for doing data-driven server-side computations in a convenient and scalable way. Other databases just store the data and require you to do computation client-side, or they provide limited or awkward server-side computation. Freefall provides all of the power of python on the server, without all of the hassle and with much better scalability than setting up a python-based web server.</div>
<div>
<br /></div>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-33279539098020126622012-07-11T14:03:00.000-07:002012-07-11T14:03:30.195-07:00Announcing Freefall: Cloud Services for Mobile AppsFor the past few years I've been working as a scalability consultant for Internet startups, mostly on scaling websites. People call me when their Rails servers are crushed by the popularity of their product, and I fix them. There's no magic bullet to scaling. There are a few principles of good design and they are largely not followed, so my job is to bring things back in line with best practices for optimum scaling. For a while I've been thinking about taking these best practices and packaging them up into something people could use directly, rather than implementing the same set of optimizations for each client. However, I found that web companies tend to already be committed to a particular stack. As I've also been doing Android and iOS development lately, I thought mobile developers might be the ideal market for a new, super-scalable backend system. I've noticed that mobile developers aren't that interested in hacking backend code. They'd rather just get on with their mobile apps and leave the backend to someone else. There are a lot of cloud backend services already available, but my idea was different: a universal backend for anything from leaderboards to MMOs.<div>
<br /></div>
<div>
I was pitching this idea to any mobile developer that would listen at SXSW Interactive this year and I was lucky enough to pitch it to John Warren at <a href="http://minicorestudios.com/">Minicore Studios</a>. I've known John for a while as I am a friend of the St. Edward's Digital MBA program from which he graduated. He'd been talking about starting a game development studio, and sure enough he had done so and they had a booth at SXSW Screenburn. My pitch to John was that he would not need to hire server-side developers or sysadmins to run the servers for the online components of his games. His mobile front-end developers could just continue writing their games on Android, iOS, PC, and Xbox. There would be client libraries available in all of the necessary languages and the developers could use them like any other library, not ever thinking about the server side of things. The system would be flexible to whatever he needed to do with online services, not limited to a static set of services such as leaderboards and achievements like most cloud services. Best of all, there would be no monthly fees, it would be open source and built on top of <a href="https://developers.google.com/appengine/">Google App Engine</a>, so you just pay Google for your bandwidth and computation, and you only pay for what you use.</div>
<div>
<br /></div>
<div>
I guess John thought it was a good idea, because he hired me on the spot to build this service. I've been working all summer at Minicore building an open source persistent world server for the indie game development community to use free of charge. What can I say? This has been a dream opportunity for me. After spending so much time fixing broken design I was able to build something which follows best practices from the ground up. It's flexible, it's fast, and most of all it scales like crazy.</div>
<div>
<br /></div>
<div>
I'm going to post more technical details soon, as well as documentation and guides. Right now I just wanted to let the world know that this is happening. I just got the first service working, leaderboards for Minicore's <a href="https://play.google.com/store/apps/details?id=minicore.MemTanks.GamePro&feature=search_result#?t=W251bGwsMSwxLDEsIm1pbmljb3JlLk1lbVRhbmtzLkdhbWVQcm8iXQ..">Tanks for the Memories for Android</a>, and I decided it was time to make a post. In the meantime, check out the <a href="https://github.com/blanu/freefalldb">project source</a> (without documentation as of yet) and give a <a href="https://twitter.com/johnewarren">shout out to John</a> to thank him for making this possible for the indie game community.</div>
<div>
<br /></div>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-86267139104917763782012-04-14T14:31:00.003-07:002012-04-14T14:34:39.059-07:00High-level Languages for the 6502<span style="font-family: arial; text-align: -webkit-auto; font-size: small; ">What's up, hackers? I know you guys don't particularly care about this subject, but I have become temporarily obsessed with it, so here for your perusal is my research on high-level languages for the 6502!</span><div style="font-family: arial; text-align: -webkit-auto; font-size: small; "><br /></div><div style="font-family: arial; text-align: -webkit-auto; font-size: small; "><a href="http://ahefner.livejournal.com/20528.html">Common-Lisp assembler</a> - This is a lisp-syntax assembler, except you can also use lisp structures such as conditionals and loops, and define new lisp functions. It compiles to normal assembly, so I think it's more like macros than an actual lisp runtime. 
(See also <a href="http://josephoswald.nfshost.com/comfy/summary.html">COMFY</a>)</div><div style="font-family: arial; text-align: -webkit-auto; font-size: small; "><br /></div><div style="font-family: arial; text-align: -webkit-auto; font-size: small; "><a href="http://code.google.com/p/python-on-a-chip/">Python on a chip</a> - A subset of python syntax and VM!</div><div style="font-family: arial; text-align: -webkit-auto; font-size: small; "><ul><li style="margin-bottom: 0.3em; ">Requires roughly 55 KB program memory</li><li style="margin-bottom: 0.3em; ">Initializes in 4KB RAM; print "hello world" needs 5KB; 8KB is the minimum recommended RAM.</li><li style="margin-bottom: 0.3em; ">Supports integers, floats, tuples, lists, dicts, functions, modules, classes, generators, decorators and closures</li><li style="margin-bottom: 0.3em; ">Supports 25 of 29 keywords and 89 of 112 bytecodes from Python 2.6</li><li style="margin-bottom: 0.3em; ">Can run multiple stackless green threads (round-robin)</li><li style="margin-bottom: 0.3em; ">Has a mark-sweep garbage collector</li><li style="margin-bottom: 0.3em; ">Has a hosted interactive prompt for live coding</li><li style="margin-bottom: 0.3em; ">Licensed under the GNU GPL ver. 2</li></ul><div><span><a href="http://www.dwheeler.com/6502/">Lots of great stuff</a> about 6502 languages, particularly Forth. - Great info on how to implement your own languages on the 6502. Forth is an obvious choice, but I haven't found any good Forth implementations as of yet. There seem to be a lot of Forth projects that may or may not be related which I need to evaluate.</span></div><div><span><br /></span></div></div>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-34021176739486573592011-06-13T20:53:00.000-07:002011-06-13T20:59:09.971-07:00New Blog, Step Three: Privacy!This summer I have an internship with the Tor Project through Google Summer of Code. 
I started a blog for that, and to generally talk about privacy stuff, which will never be a profitable endeavor. Ironically, I'm getting paid by Google to work on it, but for Google this is an entirely non-profit endeavor in the name of charity and the general betterment of the world.<div><br /></div><div>Privacy and profit have something in common in that each is an elusive goal. Thanks to my friend <a href="http://dasyatidae.net/~drake/">Drake Wilson</a> for suggesting the related name for the new site, <a href="http://stepthreeprivacy.org/">Step Three: Privacy!</a></div><div><br /></div><div>I'll still be posting on here when I have startup and coding related matters to discuss, although I would recommend coders check out the other blog as well, as there's some neat stuff on there if you do networking stuff in Python.</div>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-40815123173941274912011-01-29T07:51:00.000-08:002011-01-29T09:13:58.536-08:00How to Help in Egypt: A Historical Perspective and a Call to ActionIt's really amazing how the censorship bar keeps getting raised. When I co-founded <a href="http://freenetproject.org/">Freenet</a> over ten years ago, there were lots of assumptions shared about what online censorship was and how far people would be willing to go, and also about what free speech was and what people wanted to communicate online. These assumptions have carried through to the design of today's censorship-resistant systems. For instance, <a href="http://torproject.org/">Tor</a> still uses SSL because we used to think that no one in their right mind would block SSL, since then they'd be blocking HTTPS and critical systems such as online commerce. The essential assumption was that there was a certain level to which censors would not go, and we just needed to hide our traffic below that level. 
This was a good assumption for a long time, but now the game has changed.<div><br /></div><div>Iran was the first wake-up call. They went farther than China was ever willing to go by severely throttling SSL specifically. This was really a smart move because it didn't slow down the ability to read pages on the Internet, as most are unencrypted. It did slow down Tor and the ability to log in to any sites that use SSL for logins (hopefully all of them at this point). Since logins are required for most publishing services such as email, Twitter, Facebook, etc., this throttled the ability to send information out. Of course online commerce was affected, but they were willing to accept that. The Iran attack became the new gold standard in online censorship. All systems need to adapt to this new reality. Since SSL is now a target, SSL is no longer a good wrapper for traffic. This is why I started <a href="https://github.com/blanu/Dust">Dust</a>, to provide <a href="http://blanu.net/Dust.pdf">a more modern transport layer</a> for bypassing current censorship methods. I really believe we can make something which is undetectable and thus cannot be throttled or blocked. I think information theory is on our side here and that this is a war we can win.</div><div><br /></div><div>However, Egypt raised the bar yet again by simply unplugging the Internet. This is remarkable as it not only shows how much farther censors are willing to go now, but also how the nature of online freedom of speech has changed. One of the classic examples we used to use to discuss the purpose of Freenet was that China blocked access to <a href="http://cnn.com/">CNN</a>. This seems like a comically naive goal now. People aren't trying to access news from major publishers. They're organizing protests via Twitter. This is totally decentralized content; it's peer-to-peer communication.</div>
If the cables aren't plugged in, there are no clever ways we can encode the data to get it past the censors. This particular situation is a hardware problem. The infrastructure is centralized in such a way that it's easy for the government of a country to just switch it off, and so this is what happened. Some respected individuals have called for the <a href="https://twitter.com/#!/shervin/status/30764964721463296">building</a> of a <a href="http://www.shareable.net/blog/the-next-net">new Internet</a> without these problems.</div><div><br /></div><div>I think this is a very noble endeavor, but I want to be straight with you about the problems with this idea. Essentially, we've tried this and it doesn't work. We've been trying this for ten years. Building an Internet out of Wifi mesh points is like wiring a city for electricity using USB cables. The 3G and 4G wireless Internet that we have now is connected by a high-speed wired backbone, and this is what makes it work well. There are many problems with an entirely mesh-based network, but the primary one is range. Once you start looking at coverage areas and doing the math, you quickly discover that the number of mesh nodes required to cover any decent area is astronomical, particularly because you need to connect out to the larger Internet either by crossing the border to a friendly nation or connecting to a satellite link.
They have hand-off protocols for switching towers seamlessly. That's why you can talk on the phone while driving down the highway. A femtocell is essentially a "fake" cellular phone tower that intercepts your phone signals and routes them over your own network connection instead of the phone company backbone. You normally get these to improve reception in areas with poor or non-existent tower coverage. Places like, for instance, Egypt right now.</div><div><br /></div><div>My proposal is to combine portable, battery-powered femtocells with a custom backend that, instead of routing your data packets over an ethernet connection, stores the data for exchange on a store-and-forward mesh network, much as in the FidoNet network referred to in the Rushkoff article. Then, instead of having fixed towers and moving phones we have moving towers.</div><div><br /></div><div>This is all kind of technical, I suppose, so let me break it down for you in terms of a use case. You're in Egypt and you want to get news about what's going on, send out videos of important happenings to the world, and organize with your fellow citizens to take political action. You have a phone with a camera and text and MMS messaging. There are mobile cell phone towers roaming around the city. (This is something you'd need to organize and is a whole issue in itself, but some people specialize in this kind of theory. It's solvable.) When you come in range of an access point, you can send text and MMS messages. You also receive any that have been sent to you. The access point is actually another citizen with a backpack femtocell and battery. They could be walking, although I've also seen a similar plan executed using motorcycles. The tower stores your sent messages. The towers move in such a pattern that they come into range of each other. At this point they exchange stored messages. When a tower comes in range of a phone for which it has stored messages, it sends them to the phone and then deletes them. 
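The tower-side exchange logic described above can be sketched in a few lines. This is an illustrative model only, with made-up names: a real implementation would also need message IDs so that towers deduplicate copies and avoid delivering the same message twice from two different towers.

```python
# Sketch of store-and-forward exchange between backpack femtocell towers.
# Names and messages are hypothetical; no dedup or expiry is modeled.
class Tower:
    def __init__(self, name):
        self.name = name
        self.stored = {}  # recipient -> list of undelivered messages

    def accept(self, recipient, message):
        # A phone in range hands the tower a message for later delivery.
        self.stored.setdefault(recipient, []).append(message)

    def exchange(self, other):
        # Two towers in range of each other: both end up holding the union
        # of their stored messages.
        for recipient, msgs in other.stored.items():
            for m in msgs:
                self.accept(recipient, m)
        other.stored = {k: list(v) for k, v in self.stored.items()}

    def deliver(self, recipient):
        # A phone in range receives its messages; the tower then deletes them.
        return self.stored.pop(recipient, [])

a, b = Tower("backpack-1"), Tower("backpack-2")
a.accept("contact-x", "video of the square")
a.exchange(b)                   # the towers meet; b now carries the message
print(b.deliver("contact-x"))   # ['video of the square']
print(b.deliver("contact-x"))   # [] -- already delivered and deleted
```
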
Sending messages to people in your phone contact list works the same as always. Getting information out of the country just requires sending a text or MMS message to someone that is known to have a satellite, dial-up, or other link outside. Once the information moves through the mesh to them, they can send it on.</div><div><br /></div><div>I think this is the right way to do decentralized mesh networking in situations like what is happening now in Egypt. This is something we can build right now. I'm ready to start on this whenever you are. The first step is that we're going to need some femtocells. After that, it becomes a software problem again, like hacking a Wifi router to run OpenWRT.</div><div><br /></div><div>If this is a topic you're interested in, I will be giving a talk about this on March 11 at the Dorkbot SXSW event: <a href="http://dorkbot.org/dorkbotaustin/2011/01/dorkbot-teaming-up-with-ignite-austin-for-sxsw-awesomeness/">The Vision of the Future: 2021</a>. Come by and say hi and we can figure out how to make this happen.</div>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-1208537374439389102011-01-05T18:57:00.000-08:002011-01-05T20:45:05.671-08:00Retro Indie Game Development with HTML5 - The SeriesLike many web developers I've become interested in the recent developments in <a href="http://diveintohtml5.org/">HTML5</a>. Web browsers can now do things they've never been able to do, and it's an exciting time. It's also an exciting time for games. Two phenomena in game development have arisen which make game development fun again: retro games and indie games, although they are often found in conjunction. Retro games use the old-school 8-bit graphics and sound we loved when we were kids. 
Some of these retro titles are from studios, such as the new <a href="http://en.wikipedia.org/wiki/Mega_Man_9">Megaman</a> games or <a href="http://en.wikipedia.org/wiki/Cave_Story">Cave Story</a> for the Wii. There has also been a rising tide of indie games such as <a href="http://minecraft.net/">Minecraft</a> and the games of the <a href="http://www.humblebundle.com/">Humble Indie Bundle</a> such as <a href="http://braid-game.com/">Braid</a>. Some of these have retro graphics and some, like <a href="http://machinarium.net/demo/">Machinarium</a>, have pretty nice art. Not to say that low-resolution art isn't nice; "pixel art" has become its own genre with its own talented artists. Some of these indie games have actually been quite successful, with both Minecraft and the Humble Indie Bundle raising millions of dollars in sales.<div><br /></div><div>I think the simultaneous rise of HTML5, the retro style, and the commercial success of the indie development methodology have created a great opportunity for the hacker turned entrepreneur. Additionally, all of the open source code, DIY tools, and <a href="http://creativecommons.org/choose/">Creative Commons</a> licensed artwork provide a relatively low barrier to entry. Take, for instance, <a href="http://realmofthemadgod.com/">Realm of the Mad God</a>. This is a really fun retro indie MMO. It was actually developed as part of a <a href="http://www.tigsource.com/2009/10/24/tigsource-presents-assemblee-competition/">contest</a> where artists first made Creative Commons licensed art assets and programmers then used these to make a game. The <a href="http://github.com/amitp/mapgen2">map generation code</a> is also open source. You could go out and write a game like this today and, even better, you don't have to do it in Flash like they did. 
A game like this could be written in pure HTML/CSS/Javascript, which is good news for web developers who have already been working in this medium for a while.</div><div><br /></div><div>My plan is to write a series of posts about all of the great things I've discovered about developing retro indie HTML5 games as I've been working on my own game. While this may seem like a very specific topic, it opens the door to a variety of topics with nice concrete examples. For instance, I've often wondered: HTML5 sounds cool, I guess, but what is it actually good for? In developing my game, it became quite apparent that it would be pretty much impossible without some very specific HTML5 features: not the obvious things like the <a href="http://diveintohtml5.org/canvas.html#divingin">Canvas</a> and <a href="http://www.w3schools.com/html5/tag_audio.asp">Audio</a> APIs, but specifically <a href="http://www.whatwg.org/specs/web-workers/current-work/">Web Workers</a>, which have been indispensable.</div><div><br /></div><div>So watch the blog for future posts in this series. I'm going to start with <a href="http://www.kesiev.com/akihabara/">Akihabara</a>, the HTML5 game library specifically designed for retro games. Also let me know if there's anything specific you're interested in, and if I have something to share on the subject then I'll try to make a post about it.</div><div><br /></div><div><br /></div>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-26569421904339785272010-06-17T22:13:00.001-07:002010-06-17T22:18:14.375-07:00The Original Introduction Problem in P2P Networks<div><a href="http://bitcoin.org/">BitCoin</a> was released this week, a very interesting P2P currency based on proof-of-work with a novel method to deal with double-spending via a P2P timestamp server. 
Cool stuff.</div><div><br /></div><div>On the BitCoin forums, a <a href="http://www.bitcoin.org/smf/index.php?topic=84.0">discussion</a> was going on regarding how new BitCoin nodes connect to IRC in order to find other BitCoin nodes. This method was somewhat controversial because it drew the ire of the IRC network admins, to whom it looked like a botnet. Additionally, if the IRC server goes down then new users can't join the BitCoin network. However, what are you going to do? When you first run a node, it doesn't know about any other nodes. It's a tough situation.</div><div><br /></div><div>This is a common problem in P2P, known as Original Introduction, although bootstrapping is also a good word for it. The problem with bootstrapping is that you can't decentralize it. Whether it's IRC or HTTP or DNS, the client needs to be hardcoded with an address or list of addresses which is sufficiently fresh that at least one of the listed addresses is still active. After the first node is reached, you are no longer in Original Introduction mode and can use the full range of techniques for decentralization, such as gossip. Unless, of course, you get disconnected from the network and all of your known peers go away, in which case you're back to bootstrapping.</div><div><br /></div><div>There are two properties that are at odds when you choose a bootstrapping method: robustness (scalability/reliability) and freshness. Robustness is increased at the expense of freshness by caching on multiple servers, as is usually done with HTTP peer lists. Freshness is maximized (at least up to the TCP timeout) at the expense of robustness by having everyone connected, as with IRC. 
Of course, the key is finding the right mix of robustness and freshness because you need both for the bootstrap to be successful.</div><div><br /></div><div>Here are some of my current favorite methods for bootstrapping:</div><div><br /></div><div>Append list of fresh peers to executable or installer dynamically on download. People usually get the application from its official website, so the website is already a point of failure for new users. You're already hardcoding an address in the application, the address that the application will use to bootstrap. So instead just add fresh peers at the moment of download. You need some fancy code in the executable to read the list off the end, but I've implemented this in an NSIS installer and it's not that hard. Most software developers are upset by the idea of this method.</div><div><br /></div><div>Connect via XMPP to Google App Engine application. This gives the freshness of IRC, but with more robust scaling. App Engine is mostly for writing web apps, but it provides email and XMPP handling as well. It would be simple to write one application that could handle peer lists via either XMPP or HTTP with the same handler code. I'm currently using this in an application and it works well and is very reliable. I only wish there was a second App Engine to use as a fallback because it does have occasional downtime.</div><div><br /></div><div>An alternative to requiring all nodes to include the complexity of a protocol like IRC or XMPP is to have a few special sentinel nodes which sit on the network and collect addresses of connected nodes via the usual decentralized methods available to an active node. These sentinel nodes periodically upload fresh addresses, say via HTTP POST to a number of websites. A new node can then download a fresh address list from any of the websites which is currently functioning and reachable. 
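From a new node's point of view, the sentinel scheme reduces to "try cached mirror URLs until one yields a fresh peer list." Here is a minimal sketch of that client-side policy; the mirror URLs are placeholders, and the fetch function is injected (in real code it would be an HTTP download via something like urllib) so the retry logic is easy to see on its own:

```python
# Client-side bootstrap policy: try each cached mirror until one works.
# Mirror URLs below are placeholders, not real services.
def bootstrap(mirrors, fetch):
    for url in mirrors:
        try:
            peers = fetch(url)   # real code: download and parse a peer list
        except OSError:
            continue             # mirror down or unreachable: try the next
        if peers:
            return peers         # first non-empty fresh list wins
    raise RuntimeError("all bootstrap mirrors failed")

# Simulated fetch: the first mirror is down, the second has fresh peers.
def fake_fetch(url):
    if url == "http://mirror-a.example/peers.txt":
        raise OSError("connection refused")
    return ["10.0.0.5:8333", "10.0.0.9:8333"]

print(bootstrap(["http://mirror-a.example/peers.txt",
                 "http://mirror-b.example/peers.txt"], fake_fetch))
# ['10.0.0.5:8333', '10.0.0.9:8333']
```
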
If you have 5 sentinels each uploading every 5 minutes (staggered), then you'll have updates roughly once a minute. This is on par with IRC in terms of freshness and is as robust as you care to make it by varying the number of HTTP mirrors and the number of sentinels.</div><div><br /></div>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-76925535709622348882010-06-07T21:52:00.000-07:002010-06-08T00:19:23.745-07:00The Truth About Mobile Bandwidth PricingAT&T just ended unlimited bandwidth for the iPhone and people seem to be confused about what this means. As a follow-up to my post on <a href="http://www.stepthreeprofit.com/2009/04/truth-about-consume-bandwidth-pricing.html">consumer bandwidth pricing</a>, let me break down the mobile bandwidth pricing strategies for you.<div><br /></div><div>It's not really a cap; it's a pricing strategy. Also, it's not about keeping a few extreme users from ruining the network for everyone. For congestion management you'd need peak usage pricing like electricity companies use, only for geographical areas instead of (or in addition to) time-based pricing. For instance, raise the price of bandwidth in Manhattan during daytime and at the Austin Convention Center's cell tower during SXSW. Cumulative usage-based pricing doesn't solve congestion. 
It's just a strategy to raise prices.</div><div><br /></div><div>Here's the breakdown of how much you'll pay per month depending on your data usage on the various networks that support smartphones.</div><div><br /></div><div><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://spreadsheets.google.com/oimg?key=0Avlr_GhOXqwfdHhBd2dORmJ4UzdaaG14T2FCZG52UkE&oid=1&zx=dkrofy6ouyng"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 450px; height: 320px;" src="http://spreadsheets.google.com/oimg?key=0Avlr_GhOXqwfdHhBd2dORmJ4UzdaaG14T2FCZG52UkE&oid=1&zx=dkrofy6ouyng" border="0" alt="" /></a></div><div>As you can see, AT&T starts low and then after the 2GB "cap" quickly cuts across all the prices of the carriers that offer unlimited bandwidth. If you actually use less than 2GB/month, it's still a pretty good deal, second only to Sprint. At 4GB/month, it's the most expensive.<br /></div><div><br /></div><div>Also notice that T-Mobile is more expensive if you get a 2-year contract than if you have no contract. This is their terrible new pricing plan in which they no longer subsidize phones in order to lock you into a contract. Instead, they essentially finance your phone by having you pay less up front but then more per month. When your 2-year contract is up, you will have paid more than you saved on the initial phone purchase. So if you get a T-Mobile phone, don't get a contract. Just buy the phone outright.</div><div><br /></div>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-17375938462081826232009-04-21T13:49:00.000-07:002009-04-21T15:19:14.623-07:00The Truth About Consumer Bandwidth PricingThere's been a lot of noise made recently about Time Warner instituting bandwidth caps. 
Everyone was angry at Time Warner, while Time Warner claimed it was losing money because of a few people hogging all the bandwidth, that usage-based pricing is more fair and also necessary to pay for building up their networks, and that all of this BitTorrent traffic and streaming video is killing their networks and needs to be capped.<br /><br />I have an inside perspective on this matter because when I was the Director of Product Management at BitTorrent, we often spoke with ISPs. We knew that Comcast was throttling BitTorrent traffic long before it made it into the news, and I flew down to Comcast headquarters in Philadelphia to discuss the situation. I was surprised when they told me that they had plenty of bandwidth and that BitTorrent wasn't anywhere close to crushing their network. Their problem was that they don't want to sell bandwidth, a commodity with a price racing to zero. They want to sell entertainment services, which have a higher profit margin. They are therefore threatened by online video as it competes with cable TV.<br /><br />The consumer ISP strategy thus has a twofold purpose: raise the price of bandwidth, and at the same time make the Internet a less appealing way to watch video. Both of these purposes are accomplished by bandwidth caps. Additionally, the new pricing models make it complicated to determine exactly how much you're going to be paying for bandwidth, allowing the ISPs to increase prices covertly. If they were to just declare that prices were going up because they felt like it, people would be very angry indeed, and it might lead to government regulation of pricing.<br /><br />In order to unravel the mystery of the new pricing models, I've made some graphs that show how much you will pay in dollars for a number of total gigabytes transferred in a month. 
I was very surprised by the results.<br /><br />To start, here is a graph of a lot of different plans, such as various Time Warner plans, AT&T DSL, and the main 3G mobile carriers.<br /><br /><br /><img src="http://chart.apis.google.com/chart?chs=400x300&cht=lxy&chxt=x,y&chxr=0,1,300%7C1,0,141660&chco=ff0000,00ff00,0000ff,888800,008888,880088,ffff00,00ffff,ff00ff&chdl=ATTEVDO%7CSprintEVDO%7CTWCBC%7CTWCLite%7CATT%7CComcast%7CAmazonS3%7CVerizonEVDO%7CTWCTurbo&chd=t:1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C60,60,60,60,60,540,1020,1500,1980,2460,7260,12060,16860,21660,33660,45660,69660,93660,117660,141660%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C60,60,60,60,60,110,160,210,260,310,810,1310,1810,2310,3560,4810,7310,9810,12310,14810%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C15,17,19,21,23,25,27,29,31,33,53,73,90,90,90,90,90,90,90,90%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250%7C43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C0.085,0.255,0.425,0.595,0.765,0.935,1.105,1.275,1.445,1.615,3.315,5.015,6.715,8.415,12.665,16.915,25.415,33.915,42.415,50.915%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C60,60,60,60,60,340,620,900,1180,1460,4260,7060,9860,12660,19660,26660,40660,54660,68660,82660%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,125,150,150,150&chds=1,300,0.085,141660,1,300,0.085,141660,1,300,0.085,141660,1,300,0.085,141660,1,300,0.085,141660,1,300,0.085,141660,1,300,0.085,141660,1,300,0.085,141660,1,300,0.085,141660" /><br /><br />On the bottom is gigabytes and on the left is dollars. Yes, dollars. 
300 GB would cost you $140,000 on AT&T 3G. You'll notice that only the 3G providers show up at all, everything else being squished into a single line on the bottom. This is because while Time Warner charges overages of $1/GB, Sprint is charging $50/GB, Verizon $280/GB, and AT&T a ridiculous $480/GB after you exceed the 5GB cap. Everyone is mad about the Time Warner caps, but it's really the 3G caps that are totally insane. Every iPhone user is on AT&T, so when Hulu for iPhone comes out it's going to be crazy.<br /><br />So don't use more than 5GB of 3G per month or else you're getting ripped off. Let's compare some ISPs just in the 1-5GB range to see how they stack up.<br /><br /><img src="http://chart.apis.google.com/chart?chs=300x300&cht=lxy&chxt=x,y&chxr=0,1,5%7C1,0,150&chco=ff0000,00ff00,0000ff,888800,008888,880088,ffff00&chdl=TWCLite%7CATT%7CComcast%7CAmazonS3%7CAnyEVDO%7CTWCBC%7CTWCTurbo&chd=t:1,2,3,4,5%7C15,17,19,21,23%7C1,2,3,4,5%7C80,80,80,80,80%7C1,2,3,4,5%7C43,43,43,43,43%7C1,2,3,4,5%7C0.085,0.255,0.425,0.595,0.765%7C1,2,3,4,5%7C60,60,60,60,60%7C1,2,3,4,5%7C150,150,150,150,150%7C1,2,3,4,5%7C75,75,75,75,75&chds=1,5,0.085,150,1,5,0.085,150,1,5,0.085,150,1,5,0.085,150,1,5,0.085,150,1,5,0.085,150,1,5,0.085,150" /><br /><br />Amazon S3 is included here at the bottom just to show how much more expensive consumer bandwidth is than hosting bandwidth. The bottom tier of Time Warner service is a clear winner here, followed by the original capper Comcast. 3G services are in the middle, with premium tier cable and DSL services losing. 
In this bandwidth bracket, you don't really get much benefit from upgrading your service.<br /><br /><br />Now let's look at ISP choices excluding 3G.<br /><br /><img src="http://chart.apis.google.com/chart?chs=300x300&cht=lxy&chxt=x,y&chxr=0,1,300%7C1,0,150&chco=ff0000,00ff00,0000ff,888800,008888,880088&chdl=TWCLite%7CATT%7CComcast%7CAmazonS3%7CTWCBC%7CTWCTurbo&chd=t:1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C15,17,19,21,23,25,27,29,31,33,53,73,90,90,90,90,90,90,90,90%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250%7C43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C0.085,0.255,0.425,0.595,0.765,0.935,1.105,1.275,1.445,1.615,3.315,5.015,6.715,8.415,12.665,16.915,25.415,33.915,42.415,50.915%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,125,150,150,150&chds=1,300,0.085,150,1,300,0.085,150,1,300,0.085,150,1,300,0.085,150,1,300,0.085,150,1,300,0.085,150" /><br /><br />The lowest Time Warner tier wins again if you use little bandwidth, and then Comcast wins everything else up to 250GB where they have put a hard cap.<br /><br /><br />Now let's look in depth at just the Time Warner tiers.<br /><img 
src="http://chart.apis.google.com/chart?chs=300x300&cht=lxy&chxt=x,y&chxr=0,1,300%7C1,0,150&chco=ff0000,00ff00,0000ff,888800,008888,880088&chdl=TWCLite%7CTWCStandard%7CTWCSuperLite%7CTWCBC%7CTWCSuperTurbo%7CTWCTurbo&chd=t:1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C30,30,30,30,30,30,30,30,30,30,40,50,60,70,95,105,105,105,105,105%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C43,43,43,43,43,43,43,43,43,43,43,43,43,53,78,103,118,118,118,118%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C15,17,19,21,23,25,27,29,31,33,53,73,90,90,90,90,90,90,90,90%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,125,150,150,150%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100,150,200,250,300%7C55,55,55,55,55,55,55,55,55,55,55,55,55,55,70,95,130,130,130,130&chds=1,300,0,150,1,300,0,150,1,300,0,150,1,300,0,150,1,300,0,150,1,300,0,150" /><br /><br />The graph is interesting because Time Warner imposes an overage fee cap of $75. This causes the lowest tier to come out best for both low and high numbers of gigabytes. The lowest tier charges $15/month for 1GB and $2/GB for each additional GB, up to $75 in overages, meaning that your total bill is capped at $90. You therefore get unlimited bandwidth for $90 with that plan. Whereas their highest tier plan is $75 for 100 GB and then $1/GB after that up to $75 in overage charges. You get unlimited bandwidth for $150 with this plan. So the lowest tier wins and the highest tier loses. 
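The overage arithmetic can be written out as a small function, using the plan parameters quoted above (base fee, included gigabytes, per-gigabyte overage rate, and the $75 cap on overage charges):

```python
# Monthly bill under a capped-overage plan. Plan parameters are the
# Time Warner figures quoted in the text.
def monthly_cost(gb, base, included, per_gb, overage_cap):
    overage = max(0, gb - included) * per_gb
    return base + min(overage, overage_cap)

lite  = lambda gb: monthly_cost(gb, base=15, included=1,   per_gb=2, overage_cap=75)
turbo = lambda gb: monthly_cost(gb, base=75, included=100, per_gb=1, overage_cap=75)

print(lite(300))    # 90  -- the lowest tier is effectively unlimited at $90
print(turbo(300))   # 150 -- the highest tier is unlimited at $150
print(lite(10))     # 33  -- matches the TWCLite point at 10GB on the graphs
```
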
The middle tiers only come into play for medium amounts of bandwidth.<br /><br />So, let's look at medium amounts of bandwidth where the multiple tiers come into play.<br /><img src="http://chart.apis.google.com/chart?chs=300x300&cht=lxy&chxt=x,y&chxr=0,1,100%7C1,0,150&chco=ff0000,00ff00,0000ff,888800,008888,880088&chdl=TWCLite%7CTWCStandard%7CTWCSuperLite%7CTWCBC%7CTWCSuperTurbo%7CTWCTurbo&chd=t:1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100%7C30,30,30,30,30,30,30,30,30,30,40,50,60,70,95,105%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100%7C43,43,43,43,43,43,43,43,43,43,43,43,43,53,78,103%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100%7C15,17,19,21,23,25,27,29,31,33,53,73,90,90,90,90%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100%7C150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100%7C75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,100%7C55,55,55,55,55,55,55,55,55,55,55,55,55,55,70,95&chds=1,100,0,150,1,100,0,150,1,100,0,150,1,100,0,150,1,100,0,150,1,100,0,150" /><br /><br />This graph shows a situation similar to the one pitched by Time Warner. There are multiple tiers and you get the best deal by choosing the right tier for the amount of bandwidth you use. However, note that the goal is not to avoid overages. The goal is to avoid having your overage charges cost more than the monthly charge of the next plan up. So while the lowest tier only includes 1GB/month, it's the best plan up to around 10GB/month. Similarly, the standard plan will be better than an upgrade up to 50GB/month. The highest tier is only good for people who use >80 GB/month. And Time Warner Business Class is, as shown on all of the graphs, always just a terrible deal.<br /><br /><br />It was just discovered that AT&T DSL is implementing bandwidth caps. They have a different model because they don't have a cap on overage fees. That sounds like it would probably be a worse deal than Time Warner. 
Let's take a look, first at just the different AT&T DSL tiers.<br /><img src="http://chart.apis.google.com/chart?chs=300x300&cht=lxy&chxt=x,y&chxr=0,1,100%7C1,0,100&chco=ff0000,00ff00,0000ff,888800&chdl=ATTPro%7CATTBasic%7CATTElite%7CATTExpress&chd=t:1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C30,30,30,30,30,30,30,30,30,30,30,30,30,30,45,50,60,70%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C20,20,20,20,20,20,20,20,20,20,20,30,40,50,75,80,90,100%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,45,55%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C25,25,25,25,25,25,25,25,25,25,25,25,25,35,60,65,75,85&chds=1,100,0,100,1,100,0,100,1,100,0,100,1,100,0,100" /><br /><br />This is the more classical model that you'd expect with overages. Since there are no caps on overage fees, you get the best deal by choosing a plan matched to your usage. If you guess incorrectly, you overpay. The ordering of plans from cheapest to most expensive becomes inverted from low usage to high usage.<br /><br />Now let's compare the various AT&T DSL plans to the various Time Warner cable plans.<br /><img 
src="http://chart.apis.google.com/chart?chs=300x300&cht=lxy&chxt=x,y&chxr=0,1,100%7C1,0,150&chco=ff0000,00ff00,0000ff,888800,008888,880088,ffff00,00ffff,ff00ff&chdl=TWCSuperLite%7CTWCBC%7CTWCLite%7CATTExpress%7CTWCStandard%7CATTPro%7CATTBasic%7CATTElite%7CTWCSuperTurbo%7CTWCTurbo&chd=t:1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C15,17,19,21,23,25,27,29,31,33,53,73,90,90,90,90,90,90%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150,150%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C30,30,30,30,30,30,30,30,30,30,40,50,60,70,95,100,105,105%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C25,25,25,25,25,25,25,25,25,25,25,25,25,35,60,65,75,85%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C43,43,43,43,43,43,43,43,43,43,43,43,43,53,78,83,93,103%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C30,30,30,30,30,30,30,30,30,30,30,30,30,30,45,50,60,70%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C20,20,20,20,20,20,20,20,20,20,20,30,40,50,75,80,90,100%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,45,55%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75%7C1,2,3,4,5,6,7,8,9,10,20,30,40,50,75,80,90,100%7C55,55,55,55,55,55,55,55,55,55,55,55,55,55,70,75,85,95&chds=1,100,0,150,1,100,0,150,1,100,0,150,1,100,0,150,1,100,0,150,1,100,0,150,1,100,0,150,1,100,0,150,1,100,0,150,1,100,0,150" /><br /><br />There are a lot of lines on this graph, but you only need to look at the bottom. The lowest tier of Time Warner again wins for low bandwidth. After that, successive AT&T DSL plans win. Despite the fact that their pricing structure is worse, their actual prices are better than Time Warner's as long as you're good at guessing how much bandwidth you're going to use.
If you're bad at guessing, only the lowest two tiers of Time Warner could ever possibly be better than AT&T DSL, and only for a small range of usage. So if you're bad at guessing your usage, your best bet is to get the highest tier of AT&T DSL.<br /><br /><span style="font-weight: bold;">Conclusions</span><br /><br />I was surprised by the outcome of these charts. The Time Warner caps are not that big of a deal, and the AT&T caps are even less of one. What you really need to watch out for is the 3G caps. Those are just totally off the rails.<br /><br />The best deal for consumer Internet is AT&T DSL, even with the caps and overage fees. If you know how much bandwidth you're going to use, buy the appropriate tier. If you don't know how much bandwidth you're going to use, you're safest buying the highest tier.<br /><br />If you're going to go with Time Warner, the lower tiers are a better deal. Go with the lowest tier you can and only upgrade if your overage fees are costing you more than the next tier. Never buy the highest tier or business class; they are ripoffs.<br /><br />3G is a terrible deal. If you use less than 5GB a month, all the 3G providers are priced the same and are not a very good deal for Internet. Use the lowest tier of Time Warner instead. Under no circumstances use more than 5GB of 3G data in a month; you will get ripped off big time.<br /><br />Also, Hulu for iPhone is going to be a train wreck.blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-31384494788259902022009-02-27T12:06:00.000-08:002009-02-27T12:47:24.356-08:00Diakonos: A Programmer's Text Editor in RubyA text editor (or for some an IDE) is the most important tool a programmer has, other than the programming language itself. Religious wars over editors are inevitable because people spend so much time with their editor. Some people flip-flop, but many people become both functionally and emotionally attached.
No one wants to spend time learning new keybindings when they could be programming instead.<br /><br />Personally, I use nano. This is not out of ignorance, mental damage, or a deep moral perversion, as my friends that use emacs and vi insist. I want an editor which is small and quick to install. It must be available on all platforms and easy to install (if there's no Debian/Ubuntu package in the main repositories, forget it). I'm not going to mess around with configuring it. And I basically just don't like vi. So nano has been winning the war for my soul for many years. However, like all programmers, I dream of a better world. I wouldn't mind a slightly (or even somewhat) better editor, but everything I've ever tried has lacked the beautiful simplicity of nano. With more features comes more hassle.<br /><br />Then I found <a href="http://purepistos.net/diakonos/">Diakonos</a>. It's a console-based text editor (which I like because I ssh into my server and edit things there as much as I edit them locally), and it's written in Ruby. It has the modern features, such as multiple buffers, syntax highlighting, and syntax-aware indentation. It's scriptable, either through the Ruby interface or through external programs (in any language) which are fed the old buffer on stdin and output new buffer contents on stdout.<br /><br />Like all editors under my consideration, it has <a href="http://purepistos.net/diakonos/">packages</a> in the main repositories of both Debian and Ubuntu. It also has Windows and OS X binaries (and a Ruby gem for you Ruby guys). It's as quick and easy to install as nano, and though it has lots more features, they are not obtrusive. The <a href="http://wiki.purepistos.net/doku.php?id=Diakonos:Getting-Started">keybindings</a> are the "standard" Windows-style ones (ctrl-x cut, ctrl-c copy, ctrl-v paste).
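To make the external-program scripting interface mentioned above concrete: a filter just reads the old buffer on stdin and writes the new buffer to stdout. Here's a minimal sketch (the trailing-whitespace stripping is my own example, not a script bundled with Diakonos):

```python
import sys

def strip_trailing(text):
    # The transformation itself: remove trailing whitespace per line.
    return "".join(line.rstrip() + "\n" for line in text.splitlines())

if __name__ == "__main__" and not sys.stdin.isatty():
    # Diakonos feeds the old buffer on stdin and replaces it with
    # whatever this script writes to stdout; any language works as
    # long as it follows this read-transform-write shape.
    sys.stdout.write(strip_trailing(sys.stdin.read()))
```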
You can of course configure it to emacs or whatever style you want, but I am personally happy to use a similar set of keys across my editor and web browser.<br /><br />I am particularly excited about finally having an editor that's not written in C. This is a personal issue. Many people like C, but I just think it's time for us to move on as a society. I have a T-shirt that says "I would code in C for love, but not for money." While you may love C, autoconf, and make, I am personally very excited about an editor both written in and scriptable in Ruby. It seems like a step towards the future. It's also nice to have a fresh codebase which doesn't inherit several decades of design decisions.<br /><br />My apologies for insulting your favorite text editors and programming languages, my Internet friends. I meant no harm. Just check out <a href="http://purepistos.net/diakonos/">Diakonos</a> for a bit and see what you think. It has a feel which is both fresh and yet somehow also classic. A "modern classic" if you will. And it's fun. In a way I can't really articulate, it's just enjoyable to use. Also, the author is a really nice guy and the IRC channel isn't full of obnoxious jerks (#mathetes on freenode), just good folks like you and me, hacking on code. I'll see you there!blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-7614180930483093312009-02-20T10:55:00.000-08:002009-02-20T14:37:46.326-08:00Startup Camp Austin, Feb 28thNext Saturday, Feb 28th, from 1pm-6pm, is the second annual Startup Camp Austin!<br /><br />Last year's Startup Camp Austin was pretty great. A lot has changed since then in the Austin Startup Scene. It's really quite booming. 
With events like SXSW Accelerator and the CapitalFactory application deadline coming up at the beginning of March, we decided that now was a good time to get together again and talk about the ongoing developments of interest to Austin startups.<br /><br />There are still a few slots left, so if you'd like to do a presentation, pitch, or demo, or lead a roundtable discussion, <a href="http://barcamp.org/StartupCampAustin">sign up on the wiki</a> and I'll save you a slot in the program. Also feel free to just add discussion topics and we can discuss whatever anyone feels like discussing.<br /><br />Also, please RSVP on the <a href="http://www.facebook.com/home.php#/event.php?eid=51278069210">Facebook event</a> so that we know how much food to provide.<br /><br />I have to say, I'm pretty excited about this camp. The first Startup Camp was kind of scary because we'd never put on an Unconference before and I had just moved back to Austin and started my own startup. I really wanted to help make Austin a great place for startups, but it was just one person's dream. Since then, things have become so exciting! There are lots of events for startups now, from SD2020 to SXSW Accelerator. There are several new experiments in funding going on, including a startup incubator and a startup organized as a co-op. Coworking spaces and BarCamps have become hot items, sprouting up in Dallas, Houston, and San Antonio as well. So many of my friends have lost or quit their jobs due to the economic turmoil, and instead of feeling down about it, they have decided that now is a great time to start a startup. It's really a very optimistic time for startup entrepreneurs as we see opportunities in every problem.<br /><br />So if you're currently at a startup, are interested in starting one, or are just curious about how things are going in the Austin startup community, come to the ACTLab next Saturday. It's located on the UT Campus in the Communications Building (CMB) on the 4th floor, in Studio 4B.
The Communications Building is on the southeast corner of Dean Keaton and Guadalupe, across from Madam Mam's. There's a parking lot right across the street (south of Madam Mam's) which is usually $6 to park all day.<br /><br />I hope to see you there!blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-65939520085442072592009-01-30T09:34:00.000-08:002009-01-30T10:32:43.506-08:00Austin Gets Its Own Startup IncubatorI love having a startup in Austin. I think it's a great place to do a startup right now. At the Tech Happy Hour last night, one early stage investor likened Austin to a gasoline-soaked pile of rags just waiting for a spark. Indeed!<br /><br />For a while I've felt that the missing element in the Austin startup scene is an early stage, small investment startup incubator in the spirit of <a href="http://ycombinator.com/">Y Combinator</a> and <a href="http://www.techstars.org/">TechStars</a>. Austin is a great place to bootstrap, and angel and VC funding are available, but for many young entrepreneurs the best way to get started is with a startup incubator. You get to meet people with startup experience, you get to pitch, and you get some press. It's one of the best ways to get started, especially if you're on the engineering side and you need to meet people with business experience.<br /><br />Austin finally has such a venture, and it's called <a href="http://capitalfactory.com/">Capital Factory</a>. You can read their <a href="http://www.prweb.com/releases/2009/01/prweb1935684.htm">press release</a> to get the sales pitch, but let me just break down the numbers for you. If you're one of the three companies picked, you get $20,000 for 5%, giving you a $380,000 pre-money valuation, which is comparable to or slightly better than YC and TechStars.
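As a sanity check on that valuation figure, here's the standard dilution arithmetic (function name mine):

```python
def valuations(investment, equity_fraction):
    # Post-money valuation is what the investment implies the whole
    # company is worth; pre-money is that minus the cash injected.
    post_money = investment / equity_fraction
    return post_money - investment, post_money

pre, post = valuations(20_000, 0.05)  # the Capital Factory terms
# pre == 380000.0, post == 400000.0
```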
There are of course many intangibles to compare between the various incubators, but it basically comes down to where you want to start your company: the Bay Area, Boulder, or Austin. For myself, I choose Austin!<br /><br /><br />They're also still looking for a few good investor-mentors, so if you want to help the Austin startup scene and you've got some time and money to invest, check them out. I can't wait until pitch day to see what new startups are launched!blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-24627728634693870492009-01-23T10:48:00.000-08:002009-01-26T11:56:30.244-08:00P2P Money with App Engine, OAuth, and QR CodesIn honor of National Service Day, I decided to take a day off from my regularly scheduled <a href="http://ringlight.us/">Ringlight</a> hacking and work on some community service hacking. In Austin we have a complementary currency called the <a href="http://www.austintimeexchange.org/">Austin Time Exchange Network</a> (ATEN). There's a lot to say about <a href="http://en.wikipedia.org/wiki/Complementary_currency">complementary currency</a> and its role in helping economies during a downturn. However, I want to delve mainly into the technical details of my hack, so if you're interested in the theory I recommend <a href="http://www.amazon.com/Future-Money-Creating-Wealth-Wiser/dp/0712699910/ref=sr_1_5?ie=UTF8&s=books&qid=1232738278&sr=8-5">Bernard Lietaer</a>. The basic idea is that you can pay people for their time in ATEN currency, denominated in hours rather than dollars. This is quite good for situations, like the current economy, where no one has dollars they want to spend but people still have work they want to do and get done. There's no shortage of needs or workers, only a shortage of money. So let's make our own money! Problem solved!
You'll still use dollars to pay taxes, your mortgage, and Wal-Mart, but you can use ATEN hours to buy local goods and services from people in Austin that accept this currency.<br /><br />The goal of this project, named <a href="http://www.austintimemachine.org/">Austin Time Machine</a> (ATM) is to provide a means to withdraw electronic currency into a physical paper form (cash) and later deposit paper to an electronic account. This is particularly useful for the sorts of situations which are normally "cash only", for instance festivals where it's unreasonable to expect all of the booths to have computers and Internet. Since the paper currency is backed by a separate online currency (in this case <a href="http://www.opensourcecurrency.org/">OpenSourceCurrency.org</a>), the ATM service doesn't need to manage things like account balance. It only needs to keep track of bill serial numbers and manage authentication to the "bank" so that it can transfer credits to and from user accounts.<br /><br />So on to the technical details. The first interesting bit is that <a href="http://www.opensourcecurrency.org/">OpenSourceCurrency.org</a> supports <a href="http://oauth.net/">OAuth</a> for authenticating users. Additionally, I implemented the whole service on <a href="http://code.google.com/appengine/">App Engine</a>, which is wonderful because I don't have to run it on my server or manage uptime. However, this meant that I had to port the <a href="http://oauth.googlecode.com/svn/code/python/oauth/">python OAuth library</a> to use the App Engine API. In particular, I had to replace all of the use of httplib with <a href="http://code.google.com/appengine/docs/python/urlfetch/">App Engine's urlfetch service</a>. This code will be useful to anyone attempting to authenticate to external services from inside an App Engine application. This app also provides a handy example of how to write an OAuth client. 
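For the curious, the signing step that the OAuth library handles looks roughly like this. This is a generic sketch of OAuth 1.0 HMAC-SHA1 signing (single-valued parameters assumed), not the ported library's actual code:

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def oauth_signature(method, url, params, consumer_secret, token_secret=""):
    # Percent-encode and sort the request parameters, then join the
    # HTTP method, URL, and parameter string into the OAuth 1.0
    # "signature base string".
    enc = lambda s: quote(str(s), safe="~")
    pairs = sorted((enc(k), enc(v)) for k, v in params.items())
    base = "&".join([method.upper(), enc(url),
                     enc("&".join(f"{k}={v}" for k, v in pairs))])
    # The signing key is the two secrets joined by '&'; the token
    # secret is empty for the initial request-token call.
    key = (enc(consumer_secret) + "&" + enc(token_secret)).encode()
    digest = hmac.new(key, base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()
```

The important property is that both sides compute the same signature from the same parameters, regardless of the order in which the parameters were supplied.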
It's a little bit more complicated than it needs to be, but it's not that bad if you use an OAuth library to generate the signatures and such. It basically involves POSTing some fields to a few URLs and providing callback URLs that the website will POST back to. You pass some tokens around this way and end up with a token which, when included in a call to whatever web service you're trying to access, will serve to authenticate you as acting on behalf of the user.<br /><br />The next component of the app is the storage of serial numbers when you withdraw bills and the verification of serial numbers when you deposit. Nothing particularly exciting here. I created an <a href="http://code.google.com/appengine/docs/python/datastore/">App Engine Model</a> for each bill and save and access them using the standard App Engine ORM API. This is worth checking out if you haven't used App Engine before, though, because it's a simple example of how it works, and it's very different from SQL. Basically, you assign a unique (string) key to each object, and this is how you access it. The mechanisms you might expect from SQL, such as the UNIQUE keyword, are absent.<br /><br />With all of the nitty-gritty storage and OAuth stuff taken care of, the bulk of the application is very simple. OpenSourceCurrency.org is a Rails app and so exposes a simple REST and JSON (or XML) API to do transactions. There are a couple of gaps in the API (from the perspective of this particular app) which I work around in this code. The API only lets you transfer money <span style="font-weight: bold;">from</span> the current user to a specified destination user, and you need the userid of the destination user. Withdrawal is easy: I transfer money from the authenticated user to my own account, since I happen to know my userid. For deposit, I perform a tricky maneuver. I charge the user a 0.1 hour fee, transferring it from their account to mine just like in a withdrawal.
The result of that call includes their userid in the JSON output. I then take that userid and have the ATM service log into my own account (specifying credentials via HTTP Auth, not OAuth) and transfer from my account to the account of the user, specified by their userid. A bit complicated! However, I'm working with <a href="http://herestomwiththeweather.blogspot.com/">Tom Brown</a>, creator of the OpenSourceCurrency.org API, to create a simpler API.<br /><br />Finally, once you've made a withdrawal, the bill needs to be generated so you can print it. This is currently done with just a little bit of HTML. A PDF export would be nice for printing multiple bills on one page, but for the prototype HTML was of course the fastest. The QR code generation turned out to be extremely simple because the <a href="http://code.google.com/apis/chart/">Google Chart API</a> recently added QR code support. So the QR code is just a single HTML img tag with a URL which will automatically generate a QR code. Nice!<br /><br />Feel free to play with all this stuff. Check out <a href="http://blip.tv/file/get/Herestomwiththeweather-StupidCurrencyTricksOAuthAndQRCode777.flv">Tom's screencast on using the ATM</a>, <a href="http://www.austintimemachine.org/">the live ATM site</a>, and of course the <a href="http://blanu.net/atm/">source code</a> (also available as a <a href="http://blanu.net/atm.zip">zip</a>).blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.comtag:blogger.com,1999:blog-7182371568763719854.post-8434009218060354962008-12-09T14:39:00.000-08:002008-12-09T17:09:54.251-08:00Scalable Clustering with Thrift and SQSSince the <a href="http://ringlight.us/">Ringlight beta</a> launch, we're edging up towards 100 users.
It's certainly not the load that the engineers at Twitter have to deal with, but I would like to impress upon you my <span style="font-weight: bold;">Law of Scaling</span>:<br /><br /><span style="color: rgb(0, 0, 153); font-weight: bold;">Every power of ten, something different breaks (or becomes unusably slow).</span><br /><br />So even with modest growth from 10 to 100 users, it's probably time to fix something.<br /><br />One principle of scalable design is to <span style="color: rgb(51, 51, 255); font-weight: bold;">decouple slow operations from the user interface</span>.<br /><br />For instance, subscribers of <a href="http://ringlight.us/present/Plans">Ringlight Personal Edition</a> have the added feature of one-click backup of all of their files. However, this operation can take a long time to complete. Even just generating the list of files to back up can be time-consuming if you have a really large number of files. Therefore, it is advantageous to move all of this out of the website and into a background process. The web application just records that you have clicked the one-click backup option and then alerts the background process that it's time to figure out exactly what needs to be done about this. This sort of architecture will keep your page loads snappy and your users happy even on a heavily overloaded website, as you're not wasting their time making them wait for the page to load.<br /><br />There are a number of ways to communicate between your web application and the background process. Of particular interest from a scalability standpoint are message queueing services such as <a href="http://rubyforge.org/projects/starling/">Starling</a> and <a href="http://aws.amazon.com/sqs/">SQS</a>.
These allow for high scalability by letting many producers (your web server instances) talk to many consumers (your background processes).<br /><br />Starling is a server written in Ruby (for Python, see <a href="http://code.google.com/p/peafowl/">Peafowl</a>) that you run yourself. SQS is a hosted service that you pay for based on usage (number of messages sent and bandwidth used). Both are reasonable choices and have pretty similar APIs. You connect to the service and then push strings onto a particular queue (identified by a queue name, which is also a string). Other processes can fetch strings from the queue given its name. Pretty easy! They also both have client libraries in most major languages, so integration into your app shouldn't be very difficult.<br /><br />Of course, they only support strings, so if you have fancy objects that you want to send then you'll need to serialize and deserialize them to and from strings. There are of course language-specific ways to do this (Java Object Serialization, Python pickles, etc.), but I prefer <a href="http://incubator.apache.org/thrift/">Thrift</a> because it's fast, efficient, and works the same across multiple languages. This is handy because you can implement different components in different languages, which is sometimes useful. For instance, my web server is in Java and my background process is in Python.<br /><br />Thrift also provides some additional handy components besides serialization, in particular a transport layer that provides RPC semantics over arbitrary transport mechanisms. It comes by default with socket and HTTP transports.<br /><br />What I have implemented and made available for you, in case you might find it useful, is an <a href="http://blanu.net/thrift-sqs/">SQS transport for Thrift</a>. It effectively provides cross-language multicast RPC in a few lines of code.
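The overall shape of queue-backed one-way RPC can be sketched in-process, with json and a queue.Queue standing in for Thrift serialization and SQS (names are mine; this is an illustration of the pattern, not the actual transport code):

```python
import json
import queue

# Producers serialize method calls to strings and push them onto a
# named queue; consumers pop the strings and dispatch to a handler.
# SQS or Starling would play the queue's role across machines, and
# Thrift the serializer's.
backup_queue = queue.Queue()

def produce(method, *args):
    # Queue services only carry strings, so the call is serialized.
    backup_queue.put(json.dumps({"method": method, "args": list(args)}))

class Handler:
    # The consumer-side object whose methods get invoked.
    def backup(self, username):
        return f"queued backup for {username}"

def consume(handler):
    # Deserialize one message and dispatch it to the handler.
    msg = json.loads(backup_queue.get())
    return getattr(handler, msg["method"])(*msg["args"])

produce("backup", "alice")
print(consume(Handler()))  # prints: queued backup for alice
```

Because the calls travel one way, this only works for async, void-returning methods, which is exactly the constraint described below.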
The key piece of code is <a href="http://blanu.net/thrift-sqs/py/TSqsClient.py">TSqsClient</a>, which provides the SQS transport using the <a href="http://code.google.com/p/boto/"><span style="text-decoration: underline;">boto</span></a> library for Python. This is the piece that you'll need to port if you want to support other languages. The rest of the code is just for example purposes and is derived from simple-thrift-queue, which is a nice example of how to build an application using Thrift. The available methods are defined in the <a href="http://blanu.net/thrift-sqs/squeue.thrift">thrift file</a>. It's important that they are defined as async and void, as this is a one-way transport. The <a href="http://blanu.net/thrift-sqs/py/qproducer.py">producer</a> calls methods on the stub classes generated by the Thrift compiler. These method calls are queued up in SQS. The <a href="http://blanu.net/thrift-sqs/py/qconsumer.py">consumer</a> gets the method calls from SQS and calls the methods on the <a href="http://blanu.net/thrift-sqs/py/qhandler.py">handler</a> class. Additionally, there are a couple of utilities: one to <a href="http://blanu.net/thrift-sqs/py/qget.py">fetch a single message</a> from SQS, so you can test the producer, and one to <a href="http://blanu.net/thrift-sqs/py/qclear.py">clear the queue</a> if you send too many messages.<br /><br />One nice thing about using Thrift is that you can swap out the transport easily. You can replace my SQS transport with a Starling one, or ditch queues altogether and use sockets or HTTP. The advantage of using SQS is that the producers and consumers can all be on different machines or the same machine; it makes no difference. Used together, you have a very flexible and very scalable system with very few lines of code.
<span style="font-weight: bold; color: rgb(51, 51, 255);">Just update your thrift file and handler class to use your API and everything else is handled for you!</span>blanuhttp://www.blogger.com/profile/10742095104796078540noreply@blogger.com