An attractive alternative to rescanning the filesystem is to use filesystem monitoring events. Rather than periodically scanning to see if anything has changed, instead let the operating system notify you when something has change. Very efficient! Unfortunately, unlike a simple recursive traversal of directories, this approach has to be implemented separately on each major platform and each OS has its own pitfalls and gotchas. I will focus primarily on building this in python, although the same underlying mechanisms can be used in any language with appropriate bindings.
On Windows, filesystem events are available using the Python for Windows Extensions. It is not a particularly simple API to use, being a direct binding to the Windows system calls.
On OS X, PyKQueue offers a binding to kqueue, which is also available on BSD.
On Linux, there are two kernel interfaces, dnotify and inotify, depending on whether the kernel version is lesser or greater than 2.6.13. You can call inotify directly with pyinotify. You can also use a more generic library such as Gamin, which will use either inotify or dnotify, whichever is available. Of course, really old versions of Linux don't even have dnotify, and you'll have to fall back to periodic rescanning.
Every platform's filesystem monitoring API is different and each has different issues, however they generally share a common set of issues as well:
- Network drives don't generate events - you'll have to use periodic scanning for these
- Every folder to be watched must be registered separately - you can't request notifications for an entire directory tree. You can to register all the directories separately and when a new directory is create you have to remember to register it as well. You sometimes need to keep a file descriptor around for each directory you're watching, so watch out for running out of file descriptors.
- No special handling is done for shortcuts, aliases, or symlinks - if you're monitoring, say, a directory, and that directory has a shortcut (or the equivalent for that OS), you need to monitor two objects: the shortcut itself (in case its target is changed), and the targeted file or directory.
- Sometimes deleting a directory won't send deletion events for files in that directory or subdirectories. You have to maintain a copy of this information yourself and perform a virtual recursive delete on your database when the parent directory deletion event is received.
By the way, the users seem quite happy with the new version of the client that users filesystem monitoring events. No complaints about excessive CPU usage anymore!