Friday, August 15, 2008

Autocomplete Form Fields (with jQuery)

I recently added a feature to Ringlight which lets you share a private file with a particular user. There is a field to enter the username you want to share with, and obviously this field needs to autocomplete as you type. Users expect this feature now, so it's not really optional.

jQuery makes it very easy to add this feature:
  1. Download the autocomplete extension for jQuery. You'll need both the javascript and CSS files.
  2. Add an input field to your form: <input type="text" id="youAutocompleteMe">
  3. Set the cacheLength option to speed up responsiveness: options={cacheLength: "20"};
  4. Call the autocomplete extension: $("#youAutocompleteMe").autocomplete(url, options);
  5. You'll need a server-side script installed at the url you pass to autocomplete to return matches.
The autocomplete function will call the given url, passing the text currently in the input box in the q parameter. Your server-side code should return a newline-delimited list of matches.
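
Putting steps 2 through 4 together, the client side is only a few lines. Here's a minimal sketch; the element id and options come straight from the steps above, while "usersearch" is just a placeholder for whatever URL your server-side matching script lives at.

// Assumes jquery.js plus the autocomplete plugin's javascript and CSS files are included on the page.
$(document).ready(function() {
  var options = {cacheLength: "20"};  // step 3: cache more responses than the default
  // "usersearch" is a placeholder URL; the script behind it receives the typed text
  // in the q parameter and returns a newline-delimited list of matching usernames.
  $("#youAutocompleteMe").autocomplete("usersearch", options);
});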

That's it! It's pretty easy. The key to performance is setting that cacheLength parameter, as it defaults to 1, which doesn't provide much caching at all.

Friday, August 8, 2008

AJAX File Upload Progress Bars (with jQuery)

I recently added progress bars (actually, a percentage instead of a bar, but it was the same to implement) to Ringlight uploads and downloads. I was surprised to find that the available server-side libraries for dealing with file uploads seemed to be inadequate for adding this functionality to my website.

The basic technique for adding progress bars is relatively simple. With jQuery (a client-side sketch follows the steps):

  1. Install the Ajax File Upload extension.
  2. Install the periodic execution extension.
  3. Register some Ajax event callbacks to reveal the progress bar on the page when the upload starts and also to catch errors.
  4. Call the $.ajaxFileUpload function with the URL of the upload handler script, the id of the file input element, and the callback function to handle output from the upload handler.
  5. Have the upload handler return a JSON object with an id for the upload.
  6. Call your progress bar update function with the periodic execution extension: $.periodic(updateProgressBar);
  7. The updateProgressBar function should fetch the upload's progress from a server-side script, supplying the upload id and a callback function: $.getJSON("fetchProgress", {id: id}, function(data) {/* update progress bar with data.percentDownloaded*/});
  8. The fetchProgress script should return upload progress information in a JSON object. I return percentDownloaded, but you can include anything you'd like, such as upload rate. You should also provide error information here, such as whether the upload failed.
  9. The callback function for fetchProgress should update the page to reflect updated progress. For instance, updating a percentage to completion could be as simple as $("#percent").empty().append(data.percentDownloaded);
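
Here is a rough client-side sketch of steps 3 through 7 (plus the page update from step 9). Only #percent, fetchProgress, percentDownloaded, $.periodic, and $.getJSON come straight from the steps above; the $.ajaxFileUpload option names (url, fileElementId, dataType, success, error), the "upload" URL, and the #progress and #fileToUpload ids are assumptions based on the plugins from steps 1 and 2, so check your copies for the exact names. To keep the sketch simple, the page generates the upload id itself and sends it along with the upload, rather than waiting for the handler's JSON response, so the poller has an id while the file is still in flight.

var uploadId = "" + new Date().getTime();  // crude client-generated id (an assumption; see above)

function updateProgressBar() {
  // Step 7: ask the server how far along this upload is.
  $.getJSON("fetchProgress", {id: uploadId}, function(data) {
    // Step 9: update the page; here just a percentage, as in the example above.
    $("#percent").empty().append(data.percentDownloaded);
  });
}

function startUpload() {
  $("#progress").show();  // step 3 (simplified): reveal the progress display when the upload starts
  $.ajaxFileUpload({
    url: "upload?id=" + uploadId,   // step 4: the upload handler script (placeholder URL)
    fileElementId: "fileToUpload",  // step 4: id of the <input type="file"> element
    dataType: "json",
    success: function(data) {
      // step 5: the handler's JSON output (including its id for the upload) arrives here
    },
    error: function(data, status, e) {
      $("#percent").empty().append("upload failed");
    }
  });
  $.periodic(updateProgressBar);    // step 6: poll for progress with the periodic execution extension
}
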
This was all very simple to implement, and jQuery made it possible in very few lines of code. The difficulty was in providing a percentDownloaded value, because browsers generally don't include a Content-Length header for the file part of a multipart upload. The file upload handling libraries generally solve this problem by either 1) not providing a content length at all or 2) loading the whole file into memory (or onto disk, in some cases) and then finding the length of it. Either way, not very useful for a progress bar! This failure to handle streaming uploads is a common problem in libraries, and if you avoid it in the libraries you write, the world will be most appreciative.

In the meantime, there are a number of action items that require your attention. First, calculate the file length yourself by taking the HTTP request's Content-Length and subtracting the size of everything that is not the file. I did this with the following shoddy algorithm (sketched in code after the list):
  1. Extract the MIME boundary from the request's Content-Type header.
  2. Subtract the size of the MIME boundary twice (there is a boundary on both sides of the file).
  3. Subtract 2 because the second boundary has a trailing "--".
  4. Subtract 4 because each boundary has a trailing two-character newline (\r\n).
  5. Subtract 8 because my numbers were always off by exactly 8. I'm not sure where this additional 8 is coming from.
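
For concreteness, here is that arithmetic as a small sketch. It is a literal restatement of the subtraction steps above, nothing more, and the function and parameter names are mine.

// contentLength is the value of the request's Content-Length header;
// boundary is the MIME boundary string extracted in step 1.
function estimateFileLength(contentLength, boundary) {
  return contentLength
    - 2 * boundary.length  // a boundary on both sides of the file
    - 2                    // the trailing "--" on the second boundary
    - 4                    // the trailing \r\n after each boundary
    - 8;                   // the mysterious extra 8
}
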
As I said, this algorithm is shoddy, a kludge not fit for use in production. However, it works for now! It needs extensive testing and tweaking on a variety of browsers. The next steps:
  1. Improve the algorithm so that it's robust enough to work with most browsers
  2. Submit a patch to Python's FieldStorage class to support this algorithm
  3. Submit a bug report to Mozilla requesting that they supply content length in file uploads
For now, my upload progress bars are working, so I'm happy.

Tuesday, July 29, 2008

Flash Crowd Preparation - Load Testing With Siege

The official launch of Ringlight is approaching, so it's time to prepare for launch-day flash crowds, in particular from Slashdot (or Digg, etc.). On the bright side, these sorts of flash crowds are not as fearsome as they used to be, in that most website hosting these days provides adequate bandwidth and CPU to handle the load. The weak point is most likely your application itself, so it's worth load testing it.

So how much load is a Slashdot flash crowd? On the order of 1-10 requests per second. If you can handle 100 then you're more than fine. Additionally, this load only lasts about 24 hours. These are small numbers as compared to the continuous load you can expect for a popular site, but scaling up for your post-launch flash crowd is good preparation for the traffic to come.

The first step in scaling is to load test your site. Don't go writing a layer 4 load balancer in C with async I/O just yet. First, find out what is slow and just how slow it is. I like to use siege for this because it lets you start simple.

First, apt-get install siege. Then, run siege.config to make a new config file (you can edit it later).
Next, try the simplest possible test: siege http://yoursite.com/

Don't run siege on someone else's website as they are likely to think they are under attack (and they are!) and block your IP.

Siege will launch a bunch of connections and keep track of core statistics: availability (should be 100%), response time (should be less than 1 second), and transaction rate (you're shooting for 100 transactions/second).

By default, siege will just keep attacking your server until you tell it to stop (ctrl-C). To run a shorter test, use -r to specify the number of repetitions. Note, however, that this is not the number of transactions that will be made. siege uses a number of simultaneous connections (15 by default, set it with -c), so if you specify -r 10 -c 10 for 10 repetitions with 10 simultaneous connections, then there will be 100 transactions.

Other important options are -d to set the random delay between requests and -b to use no delay at all. The use of -b is less realistic in terms of real load, but will give you an idea of the maximum throughput that your system can handle.

I tested my site with siege -b -c 100 -r 100 in order to hit it hard and see the max throughput. I found that most pages could handle 100 transactions/second, but one page was doing a scant 5 t/s. Unacceptable! I added memcached caching for some of that page's internal state and it now benchmarks at about 90 t/s. That's perfectly acceptable for the launch crowd, but this benchmark relies on cache hits. With 100% cache misses, I'd be back to 5 t/s. So the real performance is somewhere in the middle, depending on the ratio of cache hits to misses. What this ratio is depends on real traffic patterns as well as the size of the cache and the number of pages. That makes it hard to say, but I can give it a shot.

The way to simulate something like this with siege is to put a number of URLs in a file called urls.txt and then run siege -i. siege will then pick URLs to hit randomly from the file. Put multiple copies of the same URL in the file in order to simulate relative weighting of the URLs. By providing a file containing all of my caching-dependent URLs, weighted based on estimated popularity, I can see how well my cache is holding up and tweak caching settings as necessary to get adequate performance.
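
For example, a urls.txt that hits a cache-dependent page twice as often as two other pages might look like this (the paths are placeholders, reusing the yoursite.com host from above):

http://yoursite.com/
http://yoursite.com/cached-page
http://yoursite.com/cached-page
http://yoursite.com/another-page

siege -i then picks randomly from the file, so the duplicated URL receives roughly twice the traffic of each of the others.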

Friday, July 11, 2008

Startup Camp Austin

There's been a lot of discussion lately about Austin's startup culture.
Is Austin a good place to do a startup? Should you head to California instead? What sorts of startups work best in Austin? Where should you look for funding? Is Austin Ventures the only game in town? Why doesn't Austin have an equivalent of Tech Stars or Y Combinator?

There's a growing movement to make Austin a great place to do startups. Projects like the Startup District and various coworking spaces, both commercial and noncommercial, are examining what we need to provide in order for Austin startups to succeed.

In the spirit of these discussions, I'm organizing Startup Camp Austin, a one-day event on August 2nd where Austin's startup community will come together to discuss the issues, challenges, and advantages of having a startup in Austin.

RSVP on the Facebook Event page and sign up to give a presentation, moderate a discussion, demo your product, or pitch your idea on the BarCamp wiki.

Monday, June 30, 2008

Java and Memory Leaks

People are often surprised that I prefer Java as the primary language for coding serious applications. They assume that it must be ignorance of other languages, enforced slavery by my employer, or simple insanity. I assure you, however, that while I have experience programming in 15 different languages and I enjoy many of them, I still choose Java for doing real work.

This is because while first class functions, closures, and metaobjects are all very cool and fun, I don't think these are the important factors when writing, say, a web application that you need to scale up to lots of users. What really matters are the libraries and the tools. These will save you more time than not having to type semi-colons at the ends of lines.

An example of what I mean is memory profiling. I recently wrote a handy load testing tool for Ringlight which generates an up-to-date Google sitemap by hitting every URL on the website, comparing hashes to see if a page has changed, and updating the sitemap's lastmod field. There are currently around 200,000 pages and hitting them all at once is a good test of the memory cache, database responsiveness, average page load time, etc. Interestingly enough, this process causes the server to run out of memory and crash.

Excellent! The load test revealed a memory leak, just what a load test should do. If the application were in, say, Python, what I would do is run the code under the primitive profiler that's available. This would spit out a stats file. I could then write some code using the bstats.py library to sort the stats in various ways, looking for areas that are consuming lots of memory.

Luckily, the server is in Java, so I can get a 15-day trial of YourKit Java Profiler (there are free ones as well, but YourKit is the best). My code runs on the server; the user interface runs on my local machine. They automatically communicate over the network so that I can get realtime graphs of memory consumption as my app runs. I can take snapshots of the memory state, run test scenarios, compare the snapshots to see only the memory retained between tests, drill down through paths of method calls that look suspicious, check for common memory allocation gotchas, etc. It's an excellent tool and it makes finding memory leaks easy.

In this case, the memory leak seems to be inside the Java database access layer (JDBC). It appears that this is because the MySQL JDBC driver intentionally does not garbage collect results. You must wrap the use of any ResultSet objects with a finally clause that will attempt to close them. Of course, this is just good coding style anyway, and I had already done this in my application's database access methods. Unfortunately, I later decided that I didn't like the way one of these methods was called and so added an additional database access method, and this time I had forgotten to add the finally clause. As this new method became more prevalent in my code, the memory leak got worse.
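
In code, the pattern is the usual close-in-finally dance. This is only a sketch: the query, the process call, and the connection variable (an existing java.sql.Connection) are invented for illustration.

// Close the ResultSet (and Statement) in a finally clause so they are released
// even when an exception is thrown partway through.
Statement statement = null;
ResultSet results = null;
try {
    statement = connection.createStatement();
    results = statement.executeQuery("SELECT name FROM users");
    while (results.next()) {
        process(results.getString("name"));  // hypothetical per-row work
    }
} finally {
    if (results != null) {
        try { results.close(); } catch (SQLException e) { /* ignore */ }
    }
    if (statement != null) {
        try { statement.close(); } catch (SQLException e) { /* ignore */ }
    }
}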

Of course I'm sure that you, dear readers, would never be guilty of such inconsistent coding practices. This memory leak is a result of my fast and loose coding style. Some might say that it is this style which leads me to use languages with good tools. Others I know prefer to write their own memory profilers, object graph inspectors, and even syntax style checkers from scratch. Personally, I prefer to spend my time writing applications, at least while engaged in the professional business of writing applications. When not at work, I enjoy inventing my own impractical languages as much as anyone.

Wednesday, June 25, 2008

Mini-Bio

I realized as I was e-mailing an introduction to a new business contact today that I never really took the time to properly introduce myself. Here's the same miniature biography that I sent to my new contact:

I have worked for the last ten years in open source and peer-to-peer software, both in community projects and in tech startups. I founded a number of open source peer-to-peer software projects, including Freenet, Tristero, Alluvium, and Project Snakebite. I've also worked in peer-to-peer Internet video delivery at Swarmcast as Senior Engineer and then at BitTorrent as the Director of Product Management.

I currently have a startup here in Austin called Ringlight, where I make social file-sharing software. It can be thought of as "peer-to-peer meets Web 2.0" or "google for your desktop". It indexes all of the files on all of your computers and makes them available on a website, so you can access your files from anywhere that has a web browser, send links to your friends, search, tag, bookmark, etc.

I also do some consulting in the areas of product design, product management, and engineering architecture.
I'm always happy to meet anyone in Austin with a startup and I'd love to hear what you're working on. Let me know if you'd like to have lunch anytime this week or next. Also, I will be at Jelly this Friday and most any Friday, so feel free to stop by if you'd like. Jelly is a co-working group that meets on Fridays at Cafe Caffeine. It's a great place to meet other people in startups.

Other things to check out if you want to know more about me: blog, personal website, twitter, LinkedIn profile.

Friday, June 6, 2008

Twitter Integration for your Website

Social sharing of content is a popular feature for websites. It seems like every blog post these days is accompanied with a list of bookmarklet buttons: Digg, StumbleUpon, del.icio.us, etc. However, what about adding the ability to post links to Twitter from your website? It's not quite as simple as a bookmarklet, but it's still pretty easy to do.

The Ringlight website is written in Java (the client is in Python), so when I needed a library for Twitter access I picked jtwitter, as it has no external dependencies.

It's so easy to use that it's almost too easy. Check out my code:

// jtwitter: authenticate and post the message (here, a link to the shared file) as a status update
Twitter twit = new Twitter(username, password);
Status status = twit.updateStatus(message);

So easy, right? Now you can go to any file on Ringlight, click on Share on Twitter, and the link is posted on Twitter. You don't even have to be logged into the Ringlight website, so any random user on the Internet can start twittering links to my copy of Accelerando.