The startup world at once fosters and excoriates the latest buzz phrases in tech: cloud computing, the pivot, nosql, crunches and bubbles, and big data.
“Big data” may currently be one of the most over-used terms but in practice it can refer to solid principles appropriate for building data systems that handle, well, a lot of data.
Nathan Marz describes these approaches in his book Big Data. Go buy this book!
It’s available in eBook form through the Manning Early Access Program (MEAP). Manning Publishing has built up a great library over the past couple years and Big Data looks like another promising read. The first six chapters are available online. Nathan is putting the final touches on these chapters before making the remaining chapters available. Looks like the book in its entirety is due out this summer.
I had a chance to read Big Data. It’s full of great ideas and I realized that we’ve implemented a core data pipeline using an approach similar to the one to which Nathan ascribes.
In the technical community there is a lot of discussion around storage frameworks, but in my opinion there really isn’t enough discussion around frameworks for both storage and computation. The book addresses both these aspects.
The central idea of the book is the lambda architecture, which provides a practical way to implement an arbitrary function on arbitrary data. The function is supported using three components: the batch layer, the serving layer, and the speed layer. These components alone would make for a great talk or blog post.
But that’s not exactly what I want to focus on here.
The first chapter of the book discusses the desired properties of a big data system. This list includes low latency reads and updates, horizontal scalability, extensibilty, ad hoc query support, minimal maintenance and debug time, and human fault-tolerance. These are true and most would agree they are important.
That last point stood out to me. Human fault-tolerance is one aspect that is easily overlooked when building big data systems and its related to the point about minimal maintenance.
A lot of things can go wrong when building data systems. I’ve been there. Work long enough on engineering systems and you will see and make mistakes, some stupid, some unlucky, some mind boggling. As Nathan points out, mistakes can include deploying incorrect code that corrupts potentially a lot of data, mistakenly deleting data, and incorrect job scheduling that causes a job to overwrite or corrupt data.
Hand-in-hand with human fault-tolerance goes maintenance. Maintenance can include debugging code, deploying code (if you’re deploying code manually you likely have bigger problems), adding nodes to a distributed setup, and generally keeping production systems running smoothly. The more maintenance a production system requires the more that manual human intervention becomes a priority. Setting aside the fact that resources are pulled away from working on actual features – this effort only increases opportunities for further mistakes. I’ve seen small teams lose multiple days dealing with not only the initial problem, but also dealing with the cascading errors caused by the efforts to fix the original bug.
Machines are good at automation. People are not. We should build our systems with this knowledge front and center. If the objective function is quality of life then systems should be built to minimize the maintenance parameters and maximize the automation parameters.
When dealing with “small” data, making a backup, downloading a backup, fixing specific pieces of corrupted data, or re-running a job is usually bounded in time, on the order of minutes or low number of hours. Performing these operations with big data is prohibitively time-consuming and expensive. In the limit, these operations become intractable.
A question that then arises is how do we design and build big data systems for human fault tolerance and minimal maintenance? The book addresses these aspects, which I’ve summarized here.
Simplicity. The more complex a system or component, the greater the risk of something going wrong. The lambda architecture pushes the complexity out of the core, batch components and out to the transient pieces of the system whose outputs are discardable after a short time. This includes distributed databases – rather than relying on the serving layer to hold state they are continuously over-written by the batch processes.
Immutability. Building data immutability into the core of your system imparts to it an inherent capacity to be human fault-tolerant. Keeping data immutable enables fault-tolerance in 2 big ways. One, it empowers you to work from an immutable master data set that represents a more raw state of your system. If data downstream is corrupted or lost the master data set can be retrieved and computations can be re-run on these data. This has a profound impact on the way you engineer and maintain your data pipeline. Second, it minimizes over-engineering. If updates and deletes were supported you would need to build and maintain an index over all these data to retrieve and update specific objects. In contrast, immutable data is read- and append-only. The master data set can be as simple as a flat file. HDFS is a good example of storage that supports immutable data.
Recomputation. Rather than building computations to incrementally update different data, build computations instead as “recomputations” that can be run in batch across the entire master data set. Even better is if these recomputations are idempotent so that they can be run any number of times over your data. These types of processes are better suited for human fault-tolerance. If any intermediate data are corrupted this approach allows you to simply re-deploy code and re-process. An incremental update approach would require you to determine which data were corrupted, which were okay, and figure out a way to fix just the right pieces of the data. This only increases the dependency on human intervention, which as we know only increases the possibility of more errors. Using recomputations also improves generalizability of your algorithms. Many algorithms, especially in machine learning, are easier to implement in “batch” mode.
These strategies may sound counter-intuitive. They demand increased storage and processing time, two things a startup team is wont to lack. These disadvantages however are outweighed in the long run by the simplicity, stability and human fault-tolerance of your system.
While reading the book a lot of the ideas rang true. They reflect some of our concrete implementations. One of our core pipelines embodies the ideas of the lambda architecture and simplicity, immutability and recomputation. We build daily snapshots of our master data set, allowing us to recover and redeploy full computations, while processing new data in a low latency workflow (a speed layer).
Complex systems break down. Complexity is difficult to extend, modify, maintain, and reason about. Incorporating the strategies discussed in Big Data can allow you to spend more time building features and less time stuck maintaining the status quo. Go buy the book for a lot more coverage and detail on this topic.
Update: Just after finishing this post I came across Nathan’s keynote page on the Strata Conf website, a talk he happens to be giving tomorrow morning! Total coincidence that his talk is titled “Human Fault Tolerance”. Glad to know the author of the book really appreciates this aspect to big data system design.
Update II: Feb 28th’s Manning Deal of the Day also happens to be Big Data.
Update III: Here is a link to Nathan Marz’s StrataConf talk on human fault-tolerance.