ABOUT THE TALK

At Heroku, we offer Apache Kafka as a service to a large number of distinct users, each with a varying number of use cases. Our goal is to provide this service to users while removing many of the operational headaches that comes with running infrastructure like this at scale.


While we run a robust service today, it took a lot of effort to get here. Early on, we uncovered a variety of failures scenarios. One such problem surfaced when a large number of concurrent admin commands, such as topic or consumer group creation, would cause a cascading broker failure. Our users would then observe not only enormous latencies around these admin commands, but also see a considerable drop in throughput until our infrastructure automatically recovered. It was important for us to address this because it would only get worse as our user base grew.


Once we overcame this hurdle, we discovered another problem later on where brokers would fail when a large number of users would constantly go above their quota. This was, in part, a result of our decision to leverage quotas to ensure our users could coexist and behave among each other. However, it turned out to be another case of cascading broker failures since the underlying quota implementation, based on traffic shaping, had a memory leak that caused out-of-memory errors one broker at a time.


In this talk, we will continue with our stories above in greater detail and share more anecdotes of how we discovered and addressed interesting behaviors and gotchas as we scaled our infrastructure to ensure a robust and trustworthy service for our growing user base.

Jeff Chao

| Heroku

Jeff Chao
BUY TICKETS


VIEW ON MAP

Location subheader text. Can be left blank if not needed.

Company Name

Company address, lorem ipsum dolor sit amet

BROUGHT TO YOU BY:

partner-85.png
partner-canvas.png
partner-dropbox.png

FEATURED MEETINGS