Wednesday, May 28, 2008

Adding Search (with Lucene)

The time has finally come for me to add search functionality to Ringlight. There is enough content now that just clicking around is getting tedious. After all, it's up to almost 200,000 files now.

There are a number of things to consider when adding search to your site. For instance, you can usually get by pretty well with just integrating Google search into your website. This is fast, easy, and doesn't require messing with your backend code at all.

However, this is not really what I want. I want to let users search for files, not web pages, and I want the results integrated nicely with everything else. For instance, it would be cool to use a search query as a radio playlist like you can do on Hype Machine. So I'll need to build my own search engine.

This is not really that hard to do. I would recommend you read some articles and then download Managing Gigabytes for Java. Those articles are by Tom from AudioGalaxy. You may remember AudioGalaxy as the best thing to happen and unhappen to music in my lifetime. I know I do. More importantly, it was deliciously scalable and for the most part it was just a search engine. So don't go writing one without learning some tips from the best.

I'm sure that a little engineering and MG4J could produce a highly scalable search engine. However, I didn't really want to spend that much time on it, so I went with a higher-level solution in the form of Lucene for Java. There is also a popular version for Python. I would recommend waiting a while if you're considering using Lucy (Lucene in C with Python and Ruby bindings) because I don't consider it mature. I'd also stay away from layers on top of Lucene like Solr: if you're looking for tools to make Lucene easier to use, you're missing the point that it's already easy to use.

Download the official Lucene release and you'll see that it comes with source code for a demo. One class, IndexFiles, shows you how to add information to the search index. Another class, SearchFiles, shows you how to search the index to retrieve items. Like most demo code, it's both too simple (doesn't provide enough use cases of the library to let you fully understand it) and too complex (has a bunch of command-line arguments that it has to parse and such). However, it will do. I have working search for my whole site after two days of fiddling around and working on it part-time.
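
If the split between IndexFiles and SearchFiles seems abstract, here is a rough sketch of the same two steps using nothing but the Java standard library. To be clear, this is not Lucene's API and not the code running on Ringlight. TinyIndex, its methods, and the file paths are all made up for illustration, and it assumes plain ASCII words with no ranking, stemming, or persistence. It just shows the basic idea of an inverted index: one step records which files contain which words, the other looks words up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): word -> paths of files containing it. */
class TinyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Index step: split the text into words and record the path under each word. */
    public void add(String path, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(path);
        }
    }

    /** Search step: return the paths that contain every word in the query, sorted. */
    public List<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> hits = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);   // first term: start with its matches
            } else {
                result.retainAll(hits);         // later terms: keep only the overlap
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    /** Lowercase the text and split on anything that is not an ASCII letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        // Made-up paths and descriptions, purely for the example.
        index.add("music/coltrane/blue_train.mp3", "John Coltrane Blue Train");
        index.add("music/davis/kind_of_blue.mp3", "Miles Davis Kind of Blue");
        System.out.println(index.search("blue"));
        System.out.println(index.search("blue coltrane"));
    }
}
```

Lucene does a great deal more than this (scoring, analyzers, an on-disk index that survives restarts), which is exactly why I'd rather use it than grow something like the above into a real search engine.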