Data Council Blog


Building a Column-Oriented, Distributed Data Store for Analytics - The Story of Druid



Druid is a modern data store built for analytics use cases. As data volumes have exploded and companies have sought deeper insights from their data, ad-hoc analytics has become difficult because more data is buried in distributed systems like Hadoop and Spark. The query model of these systems can result in long latencies, making them suboptimal for interactive analytics applications.

Meet Fangjin Yang of Imply



Fangjin Yang, co-founder and CEO of Imply, has contributed extensively as an author of Druid. Druid takes a columnar approach designed from the ground up to solve problems like the ones described here, which Yang will discuss in greater detail at DataEngConf SF '17.



Where does the story of Druid start?

Druid was developed by Fangjin and several collaborators at Metamarkets in 2011 out of the need to supply clients with real-time analytics for their programmatic advertising campaigns. Because programmatic ad buying moves so quickly, existing tools and architectures for building analytics systems were unable to meet the millisecond response times the market demanded. Fangjin and his team began work on Druid to solve the latency and architectural issues they found in the serving layers commonly used for analytical applications.


From open-source project to funded startup

Metamarkets developed and ran Druid in production for years, open-sourcing it along the way. In 2015, Fangjin decided the software had proven itself in the real world with enough users, and he launched a new company, Imply, to offer a Druid-backed platform that lets other companies build their own analytics-based applications more quickly. Backed by Khosla Ventures, Imply combines Druid on the back end with visualization tools on the front end to form its complete offering.


What technical lessons did Fangjin learn?

When I spoke to Fangjin about his talk and asked what surprised him most during the development of Druid, he told me that he didn't expect to see so many new problems emerge when working with data at scale. It took several years for Druid to be thoroughly battle-tested in a battery of situations, most of which a small group of engineers working on a couple of main use cases simply couldn't foresee.


Druid now has several installations that hold petabytes of data and ingest millions of events per second. So consider the platform hardened and ready for your own custom analytics applications.


To learn more about the architecture and use cases of Druid, join us at DataEngConf SF '17 for Fangjin Yang's talk: Interactive Exploratory Analytics with Druid.




Data Engineering, Speaker Spotlight


Written by Pete Soderling

Pete Soderling is the founder of Data Council & Data Community Fund. He helps engineers start companies.