Data Council Blog

Rolling Your Own Distributed Column Store


When solving your customers' technical challenges pushes you to break the rules

A re-wording of one of the key maxims for startup success could be "KISS" - "keep it simple, stupid." If you've ever run your own startup, you also know the mantras of "focus" and "fail fast," and the critical reminder of how your product should be a "pain-killer not a vitamin."

Then there is another set of axioms like "don't reinvent the wheel" and (my favorite from the security world) "don't bring a knife to a gunfight," which in practice means "don't attempt to roll your own crypto."

And why don't you roll your own crypto? Because many other, likely smarter, people have spent decades of work and research creating mathematically sound, peer-reviewed, tested-in-the-real-world cryptographic algorithms that have stood the test of time and whose strengths (and limitations) are well understood.

The same axiom could easily be applied to building a database. Let me say it plainly. "For the love of god, don't try to build your own database!"

So why - why on earth - would a startup ever do such a thing?

In his upcoming DataEngConf NYC talk, Sam Stokes of Honeycomb will be explaining to us the challenging analytics problem Honeycomb is working to address for its customers, and why this challenge drove the company to develop their own custom columnar storage engine from scratch.

Meet Sam Stokes of Honeycomb

Sam is a software engineer inexorably pulled towards operations. After years of watching systems catch fire, he knows we need better smoke detectors. Sam co-founded Rapportive and built recommendation systems at LinkedIn. He enjoys rock climbing and new cocktail recipes.



To explain why the company started down this road, it helps to understand the problem they are trying to solve. Honeycomb is building better tools for production observability - their goal is to visually weaponize grepping through logs in a powerful way like you've never seen before. They augment log aggregation by overlaying it with a real-time query system so engineers can rapidly ask questions and test hypotheses about how their systems are performing. To do that, they store raw events and let analysts run time-series queries against them in real time.
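To make "time-series queries over raw events" concrete, here is a minimal Python sketch. It is not Honeycomb's implementation: the event fields (`ts`, `endpoint`, `status`, `duration_ms`) and the helper `count_per_bucket` are hypothetical names made up for illustration. The point is only that when raw events are kept, any filter can be chosen at query time and the results still bucketed by time.

```python
from collections import defaultdict

# Hypothetical raw events, one record per request, stored unaggregated.
events = [
    {"ts": 0, "endpoint": "/home", "status": 200, "duration_ms": 12},
    {"ts": 20, "endpoint": "/login", "status": 500, "duration_ms": 340},
    {"ts": 65, "endpoint": "/login", "status": 500, "duration_ms": 410},
    {"ts": 70, "endpoint": "/home", "status": 200, "duration_ms": 9},
    {"ts": 130, "endpoint": "/login", "status": 200, "duration_ms": 35},
]


def count_per_bucket(events, predicate, bucket_seconds=60):
    """Count events matching `predicate`, grouped into fixed-width time buckets."""
    counts = defaultdict(int)
    for event in events:
        if predicate(event):
            # Round the timestamp down to the start of its bucket.
            bucket_start = (event["ts"] // bucket_seconds) * bucket_seconds
            counts[bucket_start] += 1
    return dict(sorted(counts.items()))


# The question is decided at query time, not at ingest time.
login_errors = count_per_bucket(
    events, lambda e: e["endpoint"] == "/login" and e["status"] >= 500
)
print(login_errors)
```

Because nothing was aggregated on the way in, a different question (say, slow requests on `/home`) only needs a different predicate, not a new metric.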

Existing time-series databases can't offer this flexibility as they throw away data by pre-aggregating metrics and suffer a combinatorial explosion of storage costs in the presence of high cardinality. And existing log aggregators are optimised for text search, not for time-series analytics of structured data.
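A quick back-of-the-envelope calculation shows where the combinatorial explosion comes from. The field names and cardinalities below are invented for illustration, and the product is a worst-case upper bound (real data is usually sparser), but it shows how a single high-cardinality field multiplies the number of pre-aggregated series a metrics store may have to keep.

```python
from math import prod

# Hypothetical number of distinct values per field.
cardinalities = {
    "endpoint": 200,
    "status": 10,
    "host": 500,
}


def max_preaggregated_series(cards):
    """Upper bound on distinct series if every field combination is tracked."""
    return prod(cards.values())


before = max_preaggregated_series(cardinalities)

# Add one high-cardinality field, e.g. a per-customer identifier.
with_customer = {**cardinalities, "customer_id": 100_000}
after = max_preaggregated_series(with_customer)

print(f"without customer_id: {before:,} possible series")
print(f"with customer_id:    {after:,} possible series")
```

Storing raw events avoids this multiplication: storage grows with the number of events, not with the number of possible field combinations.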

Because the founding team started the company after working with Scuba at Facebook (which espouses some very similar principles), Honeycomb decided the best way to solve the problems their customers had with real-time analytics exploration was to build their own distributed columnar data store.

But don't forget - Honeycomb is a resource-constrained startup, so their success has depended on the careful constraints they've been able to establish based on the very specific use-cases of their customers. They've also used best-of-breed open source tools as building blocks wherever they could.

As Sam told me, "What surprised me most was how simple a data store implementation could be if you lean on existing primitives - like Kafka, filesystems, and even rsync - while at the same time you're laser-focused on solving a tightly constrained problem rather than building a general purpose database."

This talk will cover some familiar ground regarding the advantages of the columnar model for analytics use cases, but will go beyond that to describe the whole distributed system - how data is ingested and partitioned, how Honeycomb runs distributed analytics queries, and how they handle operational concerns like fault tolerance and bootstrapping new nodes. 
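For readers new to the columnar model, the sketch below illustrates the basic idea in plain Python. It is an in-memory toy under stated assumptions, not a description of Honeycomb's storage engine: the field names and the helpers `to_columns` and `mean_where` are hypothetical, and every row is assumed to have the same fields. Each field is stored as its own list, so an analytics query reads only the columns it mentions.

```python
# Hypothetical row-oriented events; all rows share the same fields.
rows = [
    {"ts": 0, "endpoint": "/login", "duration_ms": 300, "user_agent": "curl"},
    {"ts": 5, "endpoint": "/home", "duration_ms": 10, "user_agent": "firefox"},
    {"ts": 9, "endpoint": "/login", "duration_ms": 400, "user_agent": "chrome"},
    {"ts": 14, "endpoint": "/login", "duration_ms": 200, "user_agent": "curl"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def mean_where(columns, value_field, filter_field, filter_value):
    """Average `value_field` over rows where `filter_field` equals `filter_value`.

    Only the two named columns are read; other columns are never touched.
    """
    selected = [
        value
        for value, key in zip(columns[value_field], columns[filter_field])
        if key == filter_value
    ]
    return sum(selected) / len(selected) if selected else None


columns = to_columns(rows)
print(mean_where(columns, "duration_ms", "endpoint", "/login"))
```

In an on-disk store the same layout would mean reading two column files rather than every full row, which is the usual argument for columnar storage in analytics workloads.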

The engineer's axiom, "don't build your own database," might be true, but don't ever pass up the opportunity to hear an awesome talk given by someone who did.


Data Engineering, Event Updates, Databases

Written by Pete Soderling

Pete Soderling is the founder of Data Council & Data Community Fund. He helps engineers start companies.