For Every Evernote—Its Own Lucene Index!
One of the clever architectural techniques Evernote uses to make notes instantly shareable and easy to organize is creating a shard for every single note, built on three de facto standard open source technologies: MySQL, Tomcat, and Lucene.
The graphic shows you how each note gets a shard containing three different storage systems for Metadata, Resources, and Searchable Text.
"All of the metadata about each note goes into structured tables in MySQL. And by “metadata”, I mean all of the fields in the data model structures for a Note and its Resources, except for the Resource’s raw data body and any recognition/alternate data files.
Those Resource files are de-duplicated in software on each shard (using MD5+length) and then stored on a relatively simple hierarchical file system using a folder tree derived from the MD5 checksum.
The combination of MySQL and the file system allows us to store the full contents of the data model and support the vast majority of our API calls. Text-based searches on our servers require some sort of Full-Text Search (FTS) engine to provide any sort of usable performance across large data sets." --Dave Engberg, Evernote
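The de-duplication scheme Engberg describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-level folder layout, the `digest-length` file name, and the function names `resource_key`, `resource_path`, and `store_resource` are all hypothetical, since the quote says only that the folder tree is derived from the MD5 checksum.

```python
import hashlib
import tempfile
from pathlib import Path


def resource_key(data: bytes) -> tuple[str, int]:
    """Identify a resource body by its MD5 hex digest plus its length."""
    return hashlib.md5(data).hexdigest(), len(data)


def resource_path(root: Path, data: bytes) -> Path:
    """Derive a nested folder path from the checksum (layout is assumed)."""
    digest, length = resource_key(data)
    # Assumed layout: two levels of two hex characters each, then the file.
    return root / digest[:2] / digest[2:4] / f"{digest}-{length}"


def store_resource(root: Path, data: bytes) -> tuple[Path, bool]:
    """Write the body once; return (path, True if it was newly written)."""
    path = resource_path(root, data)
    if path.exists():
        return path, False  # duplicate: same MD5 and length already stored
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path, True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        first, created_first = store_resource(root, b"same image bytes")
        second, created_second = store_resource(root, b"same image bytes")
        print(first == second, created_first, created_second)
```

Spreading files across folders named after checksum prefixes keeps any single directory from growing too large, and because the path is computed from the content, checking for a duplicate is a single existence test.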
Evernote initially used MyISAM's FTS engine within MySQL itself to index the searchable text metadata in notes. They tried a few things with MyISAM, including batch updates, but they eventually gave up and switched to Apache Lucene, a proven search library.
Why did they make the change? Evernote had high standards: "When users create or update notes, they expect those notes to immediately match any text searches," said Dave Engberg, the author of the post. Of the options they tried, Lucene was the one that could give them virtually synchronous text indexing for each individual note after its creation.
When you use Evernote, every single note now has its own Lucene search index occupying a separate directory on the file system.
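To make "virtually synchronous" indexing concrete, here is a toy Python sketch. It is not Lucene and not Evernote's code: a small in-memory inverted index stands in for the real search library, and the names `TinyIndex`, `NoteStore`, and `index_key` are hypothetical. The point it illustrates is only that the index is updated in the same call that saves the note, and that each index is kept separate from the others.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index (term -> note ids). A stand-in, not Lucene."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.terms_by_note = {}

    def upsert(self, note_id, text):
        # Remove stale postings first so an edited note stops matching old words.
        for term in self.terms_by_note.get(note_id, set()):
            self.postings[term].discard(note_id)
        terms = set(re.findall(r"\w+", text.lower()))
        for term in terms:
            self.postings[term].add(note_id)
        self.terms_by_note[note_id] = terms

    def search(self, query):
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        # A note matches only if it contains every query term.
        hits = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*hits)


class NoteStore:
    """Saves a note and updates its index before returning to the caller."""

    def __init__(self):
        self.notes = {}
        # One independent index per key, mirroring one index per directory.
        self.indexes = defaultdict(TinyIndex)

    def save(self, index_key, note_id, text):
        self.notes[note_id] = text
        self.indexes[index_key].upsert(note_id, text)  # no batch delay

    def search(self, index_key, query):
        return self.indexes[index_key].search(query)


if __name__ == "__main__":
    store = NoteStore()
    store.save("index-a", 1, "Buy milk and eggs")
    print(store.search("index-a", "milk"))
```

Because `save` does not return until the index has been updated, a search issued right after a save already sees the new text, which is the behavior the batch-update approach could not deliver.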
Maintaining the level of performance they wanted wasn't so simple, however, and it required some Lucene, MySQL, and even hardware tuning. Go ahead and read the post via the Resource Box link if you're interested in all the gory details of how they made Lucene work well for them.
Before you do, let's hear some thoughts from the search gurus out there (or just anyone, really :) ). Do you think Evernote has the right idea? Lucene is currently performing twice as many I/O operations as MySQL, but they expect to bring that down with further tuning.