A few years ago, I got a job on a team that was building a search engine from scratch, akin to a real-estate directory (but not exactly that), where a user searches for houses in a town and can filter them by criteria such as price and proximity to amenities.
Due to an extraordinary degree of collective laziness, we decided against IaaS offerings and went with a PaaS we were familiar with: Google App Engine (GAE). Very early on we noticed that users would only ever search for houses around a certain area, so we would never look across the entire dataset. There had to be a way to capitalize on this, and sure enough, we found it: there was no need for a backend. The dataset for a locality and its surroundings was small enough to be downloaded to the user’s browser and interrogated there.
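To make that concrete, here is a minimal sketch of the kind of filtering a client can do once it holds a locality's dataset. In the real site this logic would run as JavaScript in the browser; it's shown in Python here, and the record fields, values, and function names are illustrative, not the actual schema:

```python
import math

# Hypothetical listing records, as they might arrive in the downloaded
# locality dataset (field names are assumptions, not the real schema).
listings = [
    {"id": 1, "price": 250_000, "lat": 52.52, "lon": 13.40},
    {"id": 2, "price": 480_000, "lat": 52.53, "lon": 13.41},
    {"id": 3, "price": 310_000, "lat": 52.60, "lon": 13.50},
]

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    r = 6371  # Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def search(listings, max_price, center, radius_km):
    """Filter entirely on the client: no server round-trip required."""
    lat, lon = center
    return [
        l for l in listings
        if l["price"] <= max_price
        and distance_km(lat, lon, l["lat"], l["lon"]) <= radius_km
    ]

results = search(listings, max_price=400_000, center=(52.52, 13.40), radius_km=5)
```

Once the dataset is cached locally, every refinement of the filters is instant; the server is never consulted again.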
I should clarify what I mean by backend. I’m referring to the component of a website that handles requests from a user’s browser and builds responses dynamically. I’m excluding anything that’s not directly involved in keeping the site operational (e.g. cron jobs).
It didn’t take long for everyone to be convinced that, provided our assumption held (that users would see no performance impact), eliminating the time and cost of building, testing, and maintaining a backend that can handle growing traffic would be an immense win, because we could then serve the entire site out of something like S3. It turns out there are other benefits as well:
- Zero potential for vendor lock-in, as everything is plain files that any web server since the dawn of the web can serve.
- Near zero effort to scale.
- Zero influence on the code from constraints imposed by the number of users.
Below is a simplified file structure of the site:
Except for index.html, every file is named after a cryptographic hash of its content. This kind of content-addressed hierarchy, which resembles Git’s object store, has several advantages:
- Updating the site is an atomic operation (i.e. updating the root updates everything in one go), so an in-progress or failed update is never visible, and updates are resumable.
- Everything can be infinitely cached by browsers and proxies, without the need to ever consult the server.
- Easy rollback to previous versions. As a bonus, this also lets us provide a history feature.
- Simple integrity verification — helpful when moving files around.
So how is this file hierarchy built? To answer that, first a little background on how we collect the data. We continuously crawl 50,000 sources spread across 3,000 localities, with each source averaging 200 listings. Each listing may have a dozen associated resources such as PDFs and photos. That amounts to about 10 million searchable items (50,000 × 200). Data for each source is parsed and packaged into a single unprocessed blob; these blobs are indicated by the magenta color in the diagram above.
An updater process then walks the existing tree (yes, the actual files making up the live website) and determines what needs to be added, starting with the magenta blobs, which indicate whether the data for that source needs to be processed at all. New blobs are then added from the bottom up (existing files are never overwritten, except index.html), and finally the root is updated to make the changes visible. A copy of the old root is saved and pointed to by the new root, allowing us to walk back through history.
The site is updated once every hour, while the sources are updated twice a day. Given that a single user mostly searches in one area, they end up using the site as if it were an offline app. This makes the site incredibly fast, especially on laggy or unreliable links like 3G.
The updater and crawler discussed above are part of a collection of programs that run on a bunch of throwaway VMs. They’re throwaway because they only hold a cache of crawled data; the source of truth is the live website. In other words, the file hierarchy is not just a view of the data for user consumption but the actual database as well, and consequently the only thing we need to back up. We do snapshot our VMs too, but only to speed up recovery.
You might think: since we’re dealing with VMs anyway, why all this fanciness with generating a static site? The difference is how mission-critical those VMs are. Their reliability and performance are not that important, because they’re not bound by the millisecond requirements of answering user queries. If they die, there is no impact on the immediate operation of the site, and we can simply start another instance off a snapshot with minimal planning. It is this freedom that is very attractive.
In a future post, I’ll write about the site’s performance and various issues we had to deal with.