Thursday, June 17, 2010

The Original Introduction Problem in P2P Networks

BitCoin was released this week, a very interesting P2P currency based on proof-of-work with a novel method to deal with double-spending via a P2P timestamp server. Cool stuff.

On the BitCoin forums, a discussion was going on regarding how new BitCoin nodes connect to IRC in order to find other BitCoin nodes. This method was somewhat controversial because it was drawing the ire of the IRC network admins because it looked like they were running a botnet. Additionally, if the IRC server goes down then new users can't join the BitCoin network. However, what are you going to do? When you first run a node, it doesn't know about any other nodes. It's a tough situation.

This is a common problem in P2P, known as Original Introduction, although bootstrapping is also a good word for it. The problem with bootstrapping is that you can't decentralize it. Whether it's IRC or HTTP or DNS, the client needs to be hardcoded with an address or list of addresses which is sufficiently fresh that at least one of the listed addresses is still active. After the first node is reached, you are no longer in Original Introduction mode and can use the full range of techniques for decentralization, such as gossip. Unless, of course, you get disconnected from the network and all of your known peers go away, in which case you're back to bootstrapping.

There are two properties that are at odds when you chose a bootstrapping method: robustness (scalability/reliability) and freshness. Robustness is increased at the expense of freshness by caching on multiple servers, as is usually done with HTTP peer lists. Freshness is maximized (at least up to the TCP timeout) at the expense of robustness by having everyone connected, as with IRC. Of course, the key is finding the right mix of robustness and freshness because you need both for the bootstrap to be successful.

Here are some of my current favorite methods for bootstrapping:

Append list of fresh peers to executable or installer dynamically on download. People usually get the application from its official website, so the website is already a point of failure for new users. You're already hardcoding an address in the application, the address that the application will use to bootstrap. So instead just add fresh peers at the moment of download. You need some fancy code in the executable to read the list off the end, but I've implemented this in an NSIS installer and it's not that hard. Most software developers are upset by the idea of this method.

Connect via XMPP to Google App Engine application. This gives the freshness of IRC, but with more robust scaling. App Engine is mostly for writing web apps, but it provides email and XMPP handling as well. It would be simple to write one application that could handle peer lists via either XMPP or HTTP with the same handler code. I'm currently using this in an application and it works well and is very reliable. I only wish there was a second App Engine to use as a fallback because it does have occasional downtime.

An alternative to requiring all nodes to include the complexity of a protocol like IRC or XMPP is to have a few special sentinel nodes which sit on the network and collect addresses of connected nodes via the usual decentralized methods available to an active node. These sentinel nodes periodically upload fresh addresses, say via HTTP POST to a number of websites. A new node can then download a fresh address list from any of the websites which is currently functioning and reachable. If you have 5 sentinels each uploading every 5 minutes (staggered), then you'll have updates roughly once a minute. This is on par with IRC in terms of freshness and is robust as you care to make it by varying the number of HTTP mirrors and the number of sentinels.