Sub-second complex queries over large data volumes in real time: sounds like a tall order. That's what it takes to operate distributed systems at scale, when failures are diverse, unpredictable and hard to detect. To track down problems quickly, you need to look for patterns and correlations in your data, trying different ways of breaking it down. "Does the problem occur on just one host, or one partition, or for particular customers?"
Building Honeycomb, we needed to ingest and store customer events, and support complex queries over those events, including aggregates (like means and percentiles) and breakdowns by fields of arbitrary cardinality. We made the unusual choice to implement our own data store - from scratch, in Go.
This talk describes Retriever, a low-latency, distributed, schemaless data store. Inspired by the Scuba paper from Facebook, Retriever partitions events across multiple nodes and fans out queries for parallelism. Unlike Scuba, Retriever stores events on disk rather than in memory, using a column-oriented storage model. The mantra is "only read what you need": columnar storage lets us scan only the fields needed by the current query, and supports sparse fields and flexible schemas.
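To make the "only read what you need" idea concrete, here is a minimal in-memory sketch in Go of a column-oriented event store. This is purely illustrative and not Retriever's actual design or API: each field gets its own column, a sparse field simply has entries for fewer rows, and an aggregate query touches only the one column it needs.

```go
package main

import "fmt"

// ColumnStore is a toy column-oriented store (illustrative only, not
// Retriever's real implementation). Each field maps to its own column;
// on disk this would be a separate file per column.
type ColumnStore struct {
	rows    int
	columns map[string]map[int]float64 // field -> row index -> value
}

func NewColumnStore() *ColumnStore {
	return &ColumnStore{columns: make(map[string]map[int]float64)}
}

// AddEvent appends a schemaless event: only the fields it carries are
// written, so sparse fields and new fields cost nothing for other rows.
func (cs *ColumnStore) AddEvent(event map[string]float64) {
	row := cs.rows
	cs.rows++
	for field, v := range event {
		if cs.columns[field] == nil {
			cs.columns[field] = make(map[int]float64)
		}
		cs.columns[field][row] = v
	}
}

// Mean scans exactly one column -- "only read what you need": however
// wide the events are, no other field is ever touched by this query.
func (cs *ColumnStore) Mean(field string) float64 {
	col := cs.columns[field]
	if len(col) == 0 {
		return 0
	}
	var sum float64
	for _, v := range col {
		sum += v
	}
	return sum / float64(len(col))
}

func main() {
	cs := NewColumnStore()
	cs.AddEvent(map[string]float64{"duration_ms": 120, "status": 200})
	cs.AddEvent(map[string]float64{"duration_ms": 80}) // sparse: no "status"
	cs.AddEvent(map[string]float64{"duration_ms": 100, "status": 500})
	fmt.Println(cs.Mean("duration_ms")) // averages 120, 80, 100
}
```

A row-oriented store would have to read every field of every event to answer the same query; storing each field as its own column turns that into a scan of a single, much smaller slice of data.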
With the limited resources of a startup, simplicity is key. We lean heavily on the filesystem for operational tasks, and leverage Kafka for replication and fault-tolerance. I'll share lessons learned from operating a hand-rolled database at production scale with paying customers.
Sam Stokes is a software engineer who can't leave well enough alone. He's compelled to fix broken things, whether they are software systems, engineering processes or cultures. After watching too many systems catch fire, he's building better smoke detectors at Honeycomb; in a past life he cofounded Rapportive and built recommendation systems at LinkedIn.