Tuesday, December 9, 2008

Scalable Clustering with Thrift and SQS

Since the Ringlight beta launch, we're edging up towards 100 users. It's certainly not the load that the engineers at Twitter have to deal with, but I would like to impress upon you my Law of Scaling:

Every power of ten, something different breaks (or becomes unusably slow).

So even with modest growth from 10 to 100 users, it's probably time to fix something.

One principle of scalable design is to decouple slow operations from the user interface.

For instance, subscribers of Ringlight Personal Edition have the added feature of one-click backup of all of their files. However, this operation can take a long time to complete. Even just generating the list of files to back up can be time consuming if you have a really large number of files. Therefore, it is advantageous to move all of this out of the website and into a background process. The web application just records that you have clicked the one-click backup option and then alerts the background process that it's time to figure out exactly what needs to be done about this. This sort of architecture will keep your web page loading snappy and your users happy even on a heavily overloaded website, as you're not wasting their time making them wait for the page to load.
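Here's a minimal sketch of that split, using only the Python standard library. Everything in it is an assumption made for illustration: the names `handle_backup_click` and `backup_worker` are hypothetical, and an in-process `queue.Queue` with a thread stands in for the separate background process that a real deployment would use.

```python
import queue
import threading

# Hypothetical in-process stand-in for the channel between web app and worker.
backup_requests = queue.Queue()
completed = []


def handle_backup_click(user_id):
    """Web handler: record the request and return right away."""
    backup_requests.put(user_id)
    return "Backup scheduled"


def backup_worker():
    """Background worker: does the slow part outside the request cycle."""
    while True:
        user_id = backup_requests.get()
        # Placeholder for the slow work: listing and copying the user's files.
        completed.append(user_id)
        backup_requests.task_done()


# Daemon thread so it does not keep the interpreter alive on exit.
worker = threading.Thread(target=backup_worker, daemon=True)
worker.start()
```

The handler returns as soon as the request is recorded, so the page load never waits on the file listing.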

There are a number of ways to communicate between your web application and the background process. Of particular interest from a scalability standpoint are message queueing services such as Starling and SQS. These scale well because they let many producers (your web server instances) talk to many consumers (your background processes).

Starling is a server written in Ruby (for Python, see Peafowl) that you run yourself. SQS is a hosted service that you pay for based on usage (number of messages sent and bandwidth used). Both are reasonable choices and have pretty similar APIs. You connect to the service, and then push strings onto a particular queue (identified by a queue name, which is also a string). Other processes can fetch strings from the queue given its name. Pretty easy! They also both have client libraries in most major languages, so integration into your app shouldn't be very difficult.
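To make the shape of that API concrete, here's a toy in-memory stand-in. It is not the Starling or SQS client API; the class `InMemoryQueueService` and its `push` and `fetch` methods are hypothetical names that only mirror the "push a string onto a named queue, fetch a string by queue name" pattern described above.

```python
from collections import defaultdict, deque


class InMemoryQueueService:
    """Toy stand-in for a queue service: named FIFO queues of strings."""

    def __init__(self):
        # Queues are created on first use, keyed by queue name.
        self._queues = defaultdict(deque)

    def push(self, queue_name, message):
        """Append a string to the named queue."""
        if not isinstance(message, str):
            raise TypeError("queue messages must be strings")
        self._queues[queue_name].append(message)

    def fetch(self, queue_name):
        """Remove and return the oldest message, or None if the queue is empty."""
        messages = self._queues[queue_name]
        return messages.popleft() if messages else None
```

A producer would call `push("backups", "alice")` and any consumer that knows the queue name can later call `fetch("backups")`. A real service adds what this toy leaves out, such as network access, persistence, and many consumers competing for messages.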

Of course, they only support strings, so if you have fancy objects that you want to send, you'll need to serialize and deserialize them to and from strings. There are language-specific ways to do this (Java Object Serialization, Python Pickles, etc.), but I prefer Thrift because it's fast, efficient, and works the same way across multiple languages. This is handy because it lets you implement different components in different languages. For instance, my web server is in Java and my background process is in Python.

Thrift also provides some additional handy components besides serialization, in particular a transport layer that provides RPC semantics over arbitrary transport mechanisms. It comes by default with socket and HTTP transports.

What I have implemented, and made available in case you find it useful, is an SQS transport for Thrift. It effectively provides cross-language multicast RPC in a few lines of code. The key piece of code is TSqsClient, which provides the SQS transport using the boto library for Python. This is the piece that you'll need to port if you want to support other languages. The rest of the code is just for example purposes and is derived from simple-thrift-queue, which is a nice example of how to build an application using Thrift.

The available methods are defined in the thrift file. It's important that they are defined as async and void, as this is a one-way transport. The producer calls methods on the stub classes generated by the Thrift compiler. These method calls are queued up in SQS. The consumer gets the method calls from SQS and calls the methods on the handler class.

Additionally, there are a couple of utilities: one to fetch a single message from SQS, so you can test the producer, and one to clear the queue if you send too many messages.
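The producer/consumer flow can be sketched in plain Python. To be clear, this is not TSqsClient and it doesn't use Thrift, boto, or SQS: the classes `InMemoryQueue`, `QueueClient`, and `BackupHandler`, and the `consume` function, are all hypothetical stand-ins, and JSON takes the place of Thrift's serialization. It only illustrates the one-way pattern, where each method call becomes one queued message and returns nothing.

```python
import json
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in for an SQS queue: strings in FIFO order."""

    def __init__(self):
        self._messages = deque()

    def write(self, body):
        self._messages.append(body)

    def read(self):
        """Return the oldest message, or None if the queue is empty."""
        return self._messages.popleft() if self._messages else None


class QueueClient:
    """Producer side: turns each method call into one queued message."""

    def __init__(self, message_queue):
        self._queue = message_queue

    def __getattr__(self, method_name):
        # Only reached for names not found normally, i.e. the remote methods.
        def call(*args):
            message = {"method": method_name, "args": list(args)}
            self._queue.write(json.dumps(message))
            # One-way call: nothing comes back to the producer.

        return call


class BackupHandler:
    """Consumer side: the class whose methods actually do the work."""

    def __init__(self):
        self.scheduled = []

    def schedule_backup(self, user_id):
        self.scheduled.append(user_id)


def consume(message_queue, handler):
    """Drain the queue, dispatching each call to the handler.

    Returns the number of messages processed.
    """
    processed = 0
    while True:
        body = message_queue.read()
        if body is None:
            return processed
        message = json.loads(body)
        getattr(handler, message["method"])(*message["args"])
        processed += 1
```

Because `QueueClient` only needs an object with a `write` method, and `consume` only needs one with `read`, the queue behind them could be replaced without touching the handler. That is the same property that makes the transport swappable in the Thrift version.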

One nice thing about using Thrift is that you can swap out the transport easily. You can replace my SQS transport with a Starling one, or ditch queues altogether and use sockets or HTTP. The advantage of using SQS is that the producers and consumers can all be on different machines or on the same machine; it makes no difference. Used together, Thrift and SQS give you a very flexible and very scalable system with very few lines of code. Just update your thrift file and handler class to use your API and everything else is handled for you!