Monday, June 30, 2008

Java and Memory Leaks

People are often surprised that I prefer Java as the primary language for coding serious applications. They assume that it must be ignorance of other languages, enforced slavery by my employer, or simple insanity. I assure you, however, that while I have programmed in 15 different languages and enjoy many of them, I still choose Java for doing real work.

This is because while first class functions, closures, and metaobjects are all very cool and fun, I don't think these are the important factors when writing, say, a web application that you need to scale up to lots of users. What really matters are the libraries and the tools. These will save you more time than not having to type semi-colons at the ends of lines.

An example of what I mean is memory profiling. I recently wrote a handy load testing tool for Ringlight which generates an up-to-date Google sitemap by hitting every URL on the website, comparing hashes to see if each page has changed, and updating the sitemap's lastmod field for the pages that have. There are currently around 200,000 pages and hitting them all at once is a good test of the memory cache, database responsiveness, average page load time, etc. Interestingly enough, this process causes the server to run out of memory and crash.
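
The change-detection step can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not the actual Ringlight tool: the class `SitemapTracker` and its methods are hypothetical names, SHA-256 is an assumed choice (the post does not say which hash is used), and fetching each page is left to the caller.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers one content hash and one lastmod date per URL.
class SitemapTracker {
    private final Map<String, String> hashByUrl = new HashMap<>();
    private final Map<String, LocalDate> lastmodByUrl = new HashMap<>();

    // Hex-encoded SHA-256 of the page body (assumed hash; any stable hash would do).
    static String sha256Hex(String body) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(body.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Records a freshly fetched page; returns true only if the content changed
    // (or the URL is new), in which case lastmod is set to fetchedOn.
    boolean record(String url, String body, LocalDate fetchedOn) {
        String hash = sha256Hex(body);
        if (hash.equals(hashByUrl.get(url))) {
            return false; // unchanged: keep the old lastmod
        }
        hashByUrl.put(url, hash);
        lastmodByUrl.put(url, fetchedOn);
        return true;
    }

    // Renders one sitemap entry. A real sitemap would also XML-escape the URL.
    String toUrlElement(String url) {
        return "<url><loc>" + url + "</loc><lastmod>" + lastmodByUrl.get(url) + "</lastmod></url>";
    }
}
```

Calling `record` for every URL on each run leaves lastmod untouched for pages whose hash is the same as last time, so only pages that really changed get a new date.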

Excellent! The load test revealed a memory leak, which is just what a load test should do. If the application were in, say, Python, what I would do is run the code under the primitive profiler that's available. This would spit out a stats file. I could then write some code using the pstats library to sort the stats in various ways, looking for areas which are consuming lots of memory.

Luckily, the server is in Java, so I can get a 15-day trial of YourKit Java Profiler (there are free ones as well, but YourKit is the best). My code runs on the server, the user interface runs on my local machine. They automatically communicate over the network so that I can get realtime graphs of memory consumption as my app runs. I can take snapshots of the memory state, run test scenarios, compare the snapshots to see only the memory retained between tests, drill down through paths of method calls that look suspicious, check for common memory allocation gotchas, etc. It's an excellent tool and it makes finding memory leaks easy.

In this case, the memory leak seems to be inside the Java database access layer (JDBC). It appears that this is because the MySQL JDBC driver intentionally does not garbage collect results. You must wrap the use of any ResultSet objects with a finally clause that will attempt to close them. Of course, this is just good coding style anyway and I had already done this in my application's database access methods. Unfortunately, I later decided that I didn't like the way one of these methods was called and so added an additional database access method, and this time I forgot to add the finally clause. As this new method became more prevalent in my code, the memory leak got worse.
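
For readers who have not seen the pattern, here is a minimal sketch of a database access method with the finally clause in place. It is not the Ringlight code: `QueryHelper`, `queryLong`, and `closeQuietly` are hypothetical names, and only standard `java.sql` calls are used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper class illustrating the close-in-finally pattern.
class QueryHelper {
    // Returns the first column of the first row as a long, or -1 if there are no rows.
    static long queryLong(Connection conn, String sql) throws SQLException {
        PreparedStatement stmt = null;
        ResultSet rs = null;
        try {
            stmt = conn.prepareStatement(sql);
            rs = stmt.executeQuery();
            return rs.next() ? rs.getLong(1) : -1L;
        } finally {
            // Runs on normal return and on exception alike.
            // Close in reverse order of creation.
            closeQuietly(rs);
            closeQuietly(stmt);
        }
    }

    // Closes a resource if it was opened, ignoring any failure during close
    // so that it cannot mask an exception from the query itself.
    static void closeQuietly(AutoCloseable resource) {
        if (resource == null) {
            return;
        }
        try {
            resource.close();
        } catch (Exception ignored) {
            // Nothing useful can be done here in this sketch.
        }
    }
}
```

On Java 7 and later, a try-with-resources statement gives the same guarantee with less code, which makes it harder to forget the cleanup when adding a new access method.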

Of course I'm sure that you, dear readers, would never be guilty of such inconsistent coding practices. This memory leak is a result of my fast and loose coding style. Some might say that it is this style which leads me to use languages with good tools. Others I know prefer to write their own memory profilers, object graph inspectors, and even syntax style checkers from scratch. Personally, I prefer to spend my time writing applications, at least while engaged in the professional business of writing applications. When not at work, I enjoy inventing my own impractical languages as much as anyone.

Wednesday, June 25, 2008

Mini-Bio

I realized as I was e-mailing an introduction to a new business contact today that I never really took the time to properly introduce myself. Here's the same miniature biography that I sent to my new contact:

I have worked for the last ten years in open source and peer-to-peer software, both in community projects and tech startups. I founded a number of open source peer-to-peer software projects, including Freenet, Tristero, Alluvium, and Project Snakebite. I've also worked in peer-to-peer Internet video delivery at Swarmcast as Senior Engineer and then at BitTorrent as the Director of Product Management.

I currently have a startup here in Austin called Ringlight, where I make social file-sharing software. It can be thought of as "peer-to-peer meets Web 2.0" or "google for your desktop". It indexes all of the files on all of your computers and makes them available on a website, so you can access your files from anywhere that has a web browser, send links to your friends, search, tag, bookmark, etc.

I also do some consulting in the areas of product design, product management, and engineering architecture.

I'm always happy to meet anyone in Austin with a startup and I'd love to hear what you're working on. Let me know if you'd like to have lunch anytime this week or next. Also, I will be at Jelly this Friday and most any Friday, so feel free to stop by if you'd like. Jelly is a co-working group that meets on Fridays at Cafe Caffeine. It's a great place to meet other people in startups.

Other things to check out if you want to know more about me: blog, personal website, twitter, LinkedIn profile.

Friday, June 6, 2008

Twitter Integration for your Website

Social sharing of content is a popular feature for websites. It seems like every blog post these days is accompanied by a list of bookmarklet buttons: Digg, StumbleUpon, del.icio.us, etc. However, what about adding the ability to post links to Twitter from your website? It's not quite as simple as a bookmarklet, but it's still pretty easy to do.

The Ringlight website is written in Java (the client is in Python), so when I picked a library for Twitter access, I picked jtwitter as it has no external dependencies.

It's so easy to use that it's almost too easy. Check out my code:

Twitter twit=new Twitter(username, password);
Status status=twit.updateStatus(message);

So easy, right? Now you can go to any file on Ringlight, click on Share on Twitter, and the link is posted on Twitter. You don't even have to be logged into the Ringlight website, so any random user on the Internet can start twittering links to my copy of Accelerando.

Wednesday, June 4, 2008

Cross-Platform Monitoring of Filesystem Events

A recent problem with deployment of the Ringlight client has been that users with a large number of folders have been experiencing annoying amounts of CPU usage. This is because the most fundamental functionality that Ringlight provides is periodically rescanning your filesystem to automatically find changes to the filesystem. Rescan too infrequently and changes won't appear on the website when expected, confusing users. Rescan too frequently and users will complain about too much CPU usage. There are many applications that require rescanning the filesystem, from virus scanners to automatic backup programs, and they all have to deal with this problem.
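
To make the cost concrete, here is a minimal sketch of the rescanning approach in Java. It is an illustration under assumptions rather than the Ringlight client (which is written in Python): `Rescanner`, `snapshot`, and `changed` are hypothetical names, and modification time is assumed to be a good enough signal that a file changed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

// Hypothetical helper: detects changes by comparing two full scans of a tree.
class Rescanner {
    // Walks the whole tree once and records the last-modified time of every regular file.
    // This full walk is the part that costs CPU and disk time on every rescan.
    static Map<Path, Long> snapshot(Path root) throws IOException {
        Map<Path, Long> state = new HashMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p)) {
                    state.put(p, Files.getLastModifiedTime(p).toMillis());
                }
            }
        }
        return state;
    }

    // Returns every path that was added, removed, or modified between two snapshots.
    static Set<Path> changed(Map<Path, Long> before, Map<Path, Long> after) {
        Set<Path> result = new TreeSet<>();
        for (Map.Entry<Path, Long> entry : after.entrySet()) {
            // Covers both new files (no old value) and modified files (different time).
            if (!entry.getValue().equals(before.get(entry.getKey()))) {
                result.add(entry.getKey());
            }
        }
        for (Path p : before.keySet()) {
            if (!after.containsKey(p)) {
                result.add(p); // file disappeared since the last scan
            }
        }
        return result;
    }
}
```

Every call to `snapshot` touches every file in the tree whether or not anything changed, so the work grows with the number of folders and the rescan frequency, which is exactly the trade-off described above.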

An attractive alternative to rescanning the filesystem is to use filesystem monitoring events. Rather than periodically scanning to see if anything has changed, instead let the operating system notify you when something has changed. Very efficient! Unfortunately, unlike a simple recursive traversal of directories, this approach has to be implemented separately on each major platform and each OS has its own pitfalls and gotchas. I will focus primarily on building this in Python, although the same underlying mechanisms can be used in any language with appropriate bindings.

On Windows, filesystem events are available using the Python for Windows Extensions. It is not a particularly simple API to use, being a direct binding to the Windows system calls.

On OS X, PyKQueue offers a binding to kqueue, which is also available on BSD.

On Linux, there are two kernel interfaces, dnotify and inotify, depending on the kernel version: inotify is available from 2.6.13 onward, while older kernels only have dnotify. You can call inotify directly with pyinotify. You can also use a more generic library such as Gamin, which will use either inotify or dnotify, whichever is available. Of course, really old versions of Linux don't even have dnotify, and you'll have to fall back to periodic rescanning.

Every platform's filesystem monitoring API is different and each has its own issues. However, they generally share a common set of issues as well:

  • Network drives don't generate events - you'll have to use periodic scanning for these
  • Every folder to be watched must be registered separately - you can't request notifications for an entire directory tree. You have to register all the directories separately and when a new directory is created you have to remember to register it as well. You sometimes need to keep a file descriptor around for each directory you're watching, so watch out for running out of file descriptors.
  • No special handling is done for shortcuts, aliases, or symlinks - if you're monitoring, say, a directory, and that directory has a shortcut (or the equivalent for that OS), you need to monitor two objects: the shortcut itself (in case its target is changed), and the targeted file or directory.
  • Sometimes deleting a directory won't send deletion events for files in that directory or subdirectories. You have to maintain a copy of this information yourself and perform a virtual recursive delete on your database when the parent directory deletion event is received.

The Ringlight client, being a cross-platform application the primary function of which is to monitor filesystem changes (and report them to the server, where the real fun happens), naturally takes all this into account. I am planning to release the filesystem monitoring core of the Ringlight client as open source, as there are no good cross-platform filesystem monitoring libraries available and it's really a shame that people have to reimplement all of this for their applications.
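
Two of the issues in the list above, per-directory registration and the missing deletion events for children, can be sketched with Java's standard `java.nio.file.WatchService` (Java 7 and later). This is a sketch under assumptions and not the Ringlight client, which is written in Python against the platform APIs named earlier: `TreeWatcher` and its methods are hypothetical names, and network drives and symlinks are not handled.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;

// Hypothetical watcher: one registration per directory, plus our own index of known paths.
class TreeWatcher implements AutoCloseable {
    final WatchService watcher;
    final Map<WatchKey, Path> watchedDirs = new HashMap<>();
    final Set<Path> index = new HashSet<>(); // our copy of what exists on disk

    TreeWatcher(Path root) throws IOException {
        watcher = FileSystems.getDefault().newWatchService();
        registerTree(root);
    }

    // Registers every directory under start separately and indexes every path found.
    void registerTree(Path start) throws IOException {
        Files.walkFileTree(start, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                WatchKey key = dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                        StandardWatchEventKinds.ENTRY_DELETE, StandardWatchEventKinds.ENTRY_MODIFY);
                watchedDirs.put(key, dir);
                index.add(dir);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                index.add(file);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    // Waits for one batch of events; returns false if the wait timed out.
    boolean processEvents(long timeout, TimeUnit unit) throws IOException, InterruptedException {
        WatchKey key = watcher.poll(timeout, unit);
        if (key == null) {
            return false;
        }
        Path dir = watchedDirs.get(key);
        for (WatchEvent<?> event : key.pollEvents()) {
            if (dir == null || event.kind() == StandardWatchEventKinds.OVERFLOW) {
                continue; // after an overflow a real client would fall back to a rescan
            }
            Path child = dir.resolve((Path) event.context());
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                if (Files.isDirectory(child)) {
                    registerTree(child); // new directories must be registered too
                } else {
                    index.add(child);
                }
            } else if (event.kind() == StandardWatchEventKinds.ENTRY_DELETE) {
                removeRecursively(child);
            }
        }
        if (!key.reset()) {
            watchedDirs.remove(key); // directory is gone or no longer accessible
        }
        return true;
    }

    // "Virtual recursive delete": drop the deleted path and everything beneath it.
    void removeRecursively(Path deleted) {
        index.removeIf(p -> p.startsWith(deleted));
    }

    @Override
    public void close() throws IOException {
        watcher.close();
    }
}
```

A caller would construct a `TreeWatcher` for the root folder and call `processEvents` in a loop. Because `Path.startsWith` compares whole path components, removing `/docs/a` does not accidentally remove `/docs/ab`.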

By the way, the users seem quite happy with the new version of the client that uses filesystem monitoring events. No complaints about excessive CPU usage anymore!