How do you take a platform designed for large-scale storage of unstructured key-value data and optimize it for the structured world of Spark? In this talk we'll look at real-world lessons learned from integrating Riak, the distributed key-value NoSQL database, with Spark, covering both the challenges and the solutions. We'll also dive into more advanced topics we encountered while creating the open source Spark-Riak connector, including:

  • How to handle what is traditionally schema-less data across widely divergent use cases
  • Using dynamic data mapping to efficiently bridge NoSQL data into the Spark world of RDDs and DataFrames
  • How to optimize performance using advanced techniques such as parallel data extraction and the cluster’s coverage plan
  • Real-world examples using Spark SQL and Spark Streaming for time series use cases
  • Leveraging Riak’s built-in leader election service (LES) for Spark Master high availability (HA), removing the need for Apache ZooKeeper

John Musser

Director Platform Enablement and FordLabs | Ford Motor Company

John Musser is VP of Engineering for Basho Technologies, creators of the NoSQL database Riak. John is a recognized industry expert, having founded ProgrammableWeb, the leading online API resource for developers, as well as the DevOps service API Science. He is often quoted in the media, including the Wall Street Journal, New York Times, Forbes, and Wired, and speaks at conferences including OSCON, QCon, SXSW, Dreamforce, and Web 2.0. He also consults on API and big data strategy with clients including Google, Microsoft, AT&T, and Salesforce. He has taught at Columbia University and the University of Washington.
