September 12, 2009 · Will McGugan

Evolution of an Auto-Complete

My latest hobby project has been pushed live, in invite-only beta form. It was previously known as Links Desktop, but I have now dubbed it Loci Desktop, after the Loci Method.

[Screenshot: auto-complete in Loci Desktop, suggesting URLs on locidesktop.com]

One feature of Loci Desktop is that it will auto-complete URLs when you add new icons to your ‘desktop’. Auto-complete is one of those features that users expect these days. They want the app to figure out what they want with as few key-presses as possible – and quite rightly so; typing is such a chore!

The auto-complete system for Loci Desktop, in its initial state, was straightforward to implement. The JavaScript and front-end work was the most time-consuming part of the job; the back-end Python code was trivial.

Amoeba

Alas, it was too slow to be practical. The URLs I was auto-completing against came from Alexa.com's list of the top one million sites, stored in MySQL and queried through the Django ORM. The query searched the URLs for a substring and sorted the matches by Alexa rank, so the most popular sites were listed first.
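For the curious, a minimal sketch of that first approach, assuming a hypothetical Site model with `url` and `rank` fields (the names are illustrative, not Loci Desktop's actual code):

```python
from django.db import models

class Site(models.Model):
    url = models.CharField(max_length=255, db_index=True)
    rank = models.PositiveIntegerField()  # Alexa rank; 1 is most popular

def autocomplete(fragment, limit=10):
    # `icontains` becomes SQL LIKE '%fragment%', which can't use the
    # index on `url`, so MySQL compares the fragment against every row
    # before sorting the matches by rank.
    return Site.objects.filter(url__icontains=fragment).order_by('rank')[:limit]
```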

Although it worked perfectly, the auto-complete code on the back-end hammered the server and took too long to return its results. Reducing the number of URLs to 100,000 helped, but still didn't make it as usable as auto-complete in a desktop app.

Opposable Thumbs

There are still some beta invites for Loci Desktop available. Contact me if you want one.

I'm no expert on what goes on under the hood in a database, but the conclusion I came to was that there was no way the DB could use an index for substring searches, so it had to resort to comparing the substring with every entry in the database. With a million entries, that could never be fast.

Caching helped, but only for URLs that had been searched for previously. Then it occurred to me that if the results for all possible searches were cached, auto-complete would be blisteringly fast. I almost dismissed that idea as crazy talk, but mulled it over anyway.

It turned out to be practical. There are a lot of substrings in any given URL. For example, “facebook” contains 8 one-character substrings, 7 two-character substrings (‘fa’, ‘ac’, ‘ce’, ‘eb’, ‘bo’, ‘oo’, ‘ok’), and so on. So there are going to be a lot of substrings for each URL – but many substrings are common to a great number of URLs, and I only need to store 10 ‘hits’ for each substring.
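Here's a sketch of how such an index might be built, assuming the site list is available as (rank, url) pairs; again, the names are illustrative rather than the actual Loci Desktop code:

```python
from collections import defaultdict

def substrings(url):
    """Yield every substring of `url`."""
    for start in range(len(url)):
        for end in range(start + 1, len(url) + 1):
            yield url[start:end]

def build_index(sites, hits_per_key=10):
    """Map each substring to the 10 highest-ranked URLs containing it.

    `sites` is a sequence of (rank, url) pairs. Processing them in rank
    order means the first 10 URLs recorded for any substring are also
    the most popular, so no sorting is needed at query time.
    """
    index = defaultdict(list)
    for rank, url in sorted(sites):
        for sub in set(substrings(url)):  # dedupe substrings per URL
            hits = index[sub]
            if len(hits) < hits_per_key:
                hits.append(url)
    return index
```

A real version would presumably cap the substring length to keep the index size manageable, but the principle is the same: trade a big, brute-force pre-computation for trivially cheap lookups.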

Generating this substring index took quite a bit of brute-force processing, but once it was uploaded to the server I could use a single, extremely efficient query to generate the auto-completed URLs. The query time went down from more than a second to 0.002 seconds! A very satisfying result, which meant that the auto-complete would update almost as fast as I could type, at about 150 milliseconds per request.
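The serving side then reduces to an exact match on an indexed column, which the database can answer with a single index lookup. A hypothetical model for the precomputed table might look like this (my assumed schema, not the actual one):

```python
from django.db import models

class Completion(models.Model):
    # `unique=True` gives the column an index, so an exact-match query
    # is a single B-tree lookup rather than a table scan.
    substring = models.CharField(max_length=30, unique=True)
    urls = models.TextField()  # the top-10 URLs, comma-separated

def autocomplete(fragment):
    try:
        return Completion.objects.get(substring=fragment).urls.split(',')
    except Completion.DoesNotExist:
        return []
```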

Making Tools

Another optimization was to offload a little work to the client by caching results in JavaScript. It was trivial to implement, but not a particularly big win, as it only sped up auto-completes for URLs that had been requested previously (such as when you delete characters).

Geek here, make fire!

Although these optimizations made the auto-complete nice and fast, the small delay in receiving the first list of URLs meant that it wasn't obvious there was auto-complete at all if you hadn't used it before. It would be preferable if the auto-complete selection appeared after the first key-press. So I generated a mapping of every letter and digit onto the corresponding list of URLs and used that to auto-complete the first character, rather than make a round-trip to the server.
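As a sketch of that idea, the first-character map could be generated offline from the same substring index and embedded in the page as JSON (the function and file names here are hypothetical):

```python
import json
import string

def write_first_char_map(index, path='first_char.json'):
    # Pull the precomputed top hits for every single-character key and
    # dump them as JSON the page can embed, so the first key-press
    # completes instantly without a round-trip to the server.
    mapping = {ch: index.get(ch, [])
               for ch in string.ascii_lowercase + string.digits}
    with open(path, 'w') as f:
        json.dump(mapping, f)
```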

Making the first character auto-complete virtually instantaneous really made it feel snappier from the start. So a big win, for minimal effort.

Conclusion

Databases are highly tuned pieces of software, but you can get big wins if you massage your data into a more efficient format!

Bram
Maybe you would be better off with a search index/database such as Solr or Whoosh?
Will McGugan
Bram, I didn't consider that. I guess the urls could be treated as small documents.
Euan Goddard
I don't think that Solr or Whoosh would give much better performance than a DB server for this sort of thing. Although Whoosh is great, it is written in Python, so it is likely to be a lot slower than MySQL. In my experience the overhead of getting Solr running in Apache Tomcat probably isn't worth the effort either (although you might see a significant speed-up over Whoosh).

All that said, I'm only going on gut-feel rather than hard fact :)
Naveen Michaud-Agrawal
Hey Will, what you want to use is a trie (http://en.wikipedia.org/wiki/Trie). The simplest form is an N-ary tree (in this case 26, 1 branch for each letter), and the substring is searched by branching at each letter.