Scaling model training: from flexible training APIs to resource management with Kubernetes

Kelley Rivoire | Engineering Manager | Stripe

Model training is often a manual process: notebooks or command-line scripts run on a shared server or even a laptop. This is convenient for building intuition, but at some point it fails to scale. Notebooks and command-line scripts generally aren't reproducible, which can lead to confusion about what was running in production at any given time. Similarly, as a machine learning application benefits from a growing number of models (e.g. a common pattern is developing user-specific models as well as a generic model) or increasingly large datasets, simple tasks like keeping track of training runs and managing computational resources quickly become untenable to handle manually.

To help solve these problems, we built an easy-to-use API, which we call Railyard, for training machine learning models, allowing fast, reliable iteration on model training. The Railyard workflow provides an API contract for users: Railyard will fetch your features and labels, split the data into training and test sets, pass along any extra JSON you supplied to the API, and handle serialization and evaluation of your fitted estimator, completing your job. Railyard is a Scala service that exposes JSON endpoints for training models and fetching the results of model training runs. The service kicks off the training jobs and performs job-state management to track what is being trained and when each job starts and finishes.
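The contract described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Railyard's actual API: the request field names (data_source, test_fraction, seed, custom_params), the run_training_job and majority_workflow functions, and the toy MajorityClassifier are all hypothetical, and JSON stands in for whatever serialization format the real service uses.

```python
import json
import random
from collections import Counter
from typing import Any, Callable, Dict, List, Sequence, Tuple

Row = Sequence[float]
# Hypothetical contract: a workflow receives training features, training labels,
# and the caller's extra JSON parameters, and returns a fitted estimator
# that exposes predict(rows).
TrainFn = Callable[[List[Row], List[int], Dict[str, Any]], Any]
FetchFn = Callable[[str], Tuple[List[Row], List[int]]]


class MajorityClassifier:
    """Toy estimator standing in for a real model: always predicts one label."""

    def __init__(self, label: int) -> None:
        self.label = label

    def predict(self, rows: List[Row]) -> List[int]:
        return [self.label for _ in rows]


def majority_workflow(features: List[Row], labels: List[int],
                      params: Dict[str, Any]) -> MajorityClassifier:
    # The user-written part: fit and return an estimator.
    if not labels:
        return MajorityClassifier(params.get("default_label", 0))
    return MajorityClassifier(Counter(labels).most_common(1)[0][0])


def run_training_job(request_json: str, fetch_data: FetchFn,
                     train: TrainFn) -> Dict[str, Any]:
    """Service-side steps: fetch, split, train, evaluate, serialize."""
    request = json.loads(request_json)
    features, labels = fetch_data(request["data_source"])

    # Seeded shuffle so the train/test split is reproducible across runs.
    indices = list(range(len(features)))
    random.Random(request.get("seed", 0)).shuffle(indices)
    n_test = max(1, int(len(indices) * request.get("test_fraction", 0.25)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Extra JSON from the request is passed through to the workflow untouched.
    estimator = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx],
                      request.get("custom_params", {}))

    predictions = estimator.predict([features[i] for i in test_idx])
    correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return {
        "model": json.dumps(vars(estimator)),  # stand-in for real serialization
        "accuracy": correct / n_test,
        "n_train": len(train_idx),
        "n_test": n_test,
    }
```

In this sketch the user only writes something like majority_workflow; fetching, splitting, evaluation, and serialization stay on the service side, which is what makes each run repeatable from its JSON request alone.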

Railyard uses Kubernetes as the execution engine for all model training runs; Kubernetes performs resource allocation and management. This allows us to flexibly support different resource types for training runs that require, for example, more memory or GPUs. The combination of a flexible API and execution engine facilitates continuous retraining of thousands of models every week, allowing us to quickly evolve machine learning models, especially for adversarial machine learning applications like fraud, where models degrade more quickly. As part of continuous retraining, we can evaluate not only individual models but also more sophisticated compositions of models.
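To make the resource-type point concrete, here is a rough sketch of how a per-job resource request could be turned into a Kubernetes Job manifest. The manifest fields (batch/v1 Job, resources.requests and resources.limits, restartPolicy) are standard Kubernetes, and nvidia.com/gpu is the resource name exposed by the NVIDIA device plugin. Everything else is an assumption for illustration: the build_training_job_manifest function, the job naming scheme, the label, and the image name do not come from Railyard.

```python
from typing import Any, Dict


def build_training_job_manifest(job_id: str, image: str, cpu: str = "1",
                                memory: str = "4Gi", gpus: int = 0) -> Dict[str, Any]:
    """Build a batch/v1 Job manifest (as a plain dict) for one training run.

    job_id must be usable inside a Kubernetes object name (lowercase
    alphanumerics and '-'); this sketch does not validate it.
    """
    requests: Dict[str, Any] = {"cpu": cpu, "memory": memory}
    limits: Dict[str, Any] = dict(requests)
    if gpus > 0:
        # GPUs are an extended resource and are declared under limits.
        limits["nvidia.com/gpu"] = gpus

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"train-{job_id}",             # hypothetical naming scheme
            "labels": {"app": "model-training"},   # hypothetical label
        },
        "spec": {
            "backoffLimit": 0,  # do not retry automatically; surface the failure
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "resources": {"requests": requests, "limits": limits},
                    }],
                }
            },
        },
    }


# A memory-heavy job and a GPU job differ only in the resources they ask for.
big_memory_job = build_training_job_manifest("fraud-0001", "example/trainer:latest",
                                             cpu="4", memory="64Gi")
gpu_job = build_training_job_manifest("deep-0002", "example/trainer:latest", gpus=1)
```

The design point is that the training API only has to carry a small resource description per job; scheduling onto a node with enough memory or a free GPU is left to Kubernetes.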

In this talk, I'll describe the lessons we learned from building and evolving the Railyard API to support heterogeneous production model training workflows, for models ranging from logistic regression to deep learning, and from scaling model training using Kubernetes.

Kelley Rivoire
Engineering Manager | Stripe

Kelley Rivoire is an engineering manager at Stripe, where she leads the data infrastructure group, encompassing the storage systems for Stripe's data, the platforms for batch and streaming computation and machine learning that enhance Stripe's products and internal operations, and the core data pipelines powering analytics. As an engineer, she built Stripe's first real-time machine learning evaluation of user risk. Previously, she worked on nanophotonics and 3D imaging as a researcher at HP Labs after receiving a PhD from Stanford.