Version and Deploy Datasets at Scale

ABOUT THE TALK

Quilt (https://github.com/quiltdata/quilt) is an open-source tool that packages and versions datasets at scale as a virtual "data package" that can be imported directly into Python (and soon other data science platforms). As Docker and containerization have done for software, packaging data enables reproducibility and portability, for example in machine learning and data analysis. Data packages can be pushed to a registry (https://github.com/quiltdata/quilt/tree/master/registry), from which they can be deployed wherever the data are needed: on premises, in the public cloud, or on data scientists' laptops. Quilt automates deserialization and transport so that, in code (e.g., Python), a data package is a tree of DataFrames or other ready-to-access objects.
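
To make the "tree of ready-to-access objects" idea concrete, here is a minimal sketch in plain Python. It is not Quilt's API: the class names GroupNode and DataNode are hypothetical, and the standard-library csv module stands in for the deserialization step, so leaves yield lists of row dicts rather than DataFrames.

```python
import csv
import io


class DataNode:
    """Hypothetical leaf node: deserializes its raw bytes on first access."""

    def __init__(self, raw: bytes):
        self._raw = raw
        self._cache = None

    def __call__(self):
        # Deserialize lazily and cache the result for later calls.
        if self._cache is None:
            text = self._raw.decode("utf-8")
            self._cache = list(csv.DictReader(io.StringIO(text)))
        return self._cache


class GroupNode:
    """Hypothetical group node: children are exposed as attributes."""

    def __init__(self, **children):
        self.__dict__.update(children)


# Assumed example data; a real package would be fetched from a registry.
pkg = GroupNode(tables=GroupNode(points=DataNode(b"x,y\n1,2\n3,4\n")))
rows = pkg.tables.points()  # navigate the tree, then load the leaf
```

The point of the sketch is only the access pattern: the consumer walks a tree by name and receives a deserialized object, without handling files or formats directly.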

In this talk, Kevin will present the Quilt package specification, describe the implementation of the Quilt registry, and discuss the current roadmap for future development. A Quilt package consists of a package manifest, which is a JSON-encoded Merkle tree that catalogs and describes the package's contents, and a set of binary data fragments. Quilt uses Pandas and Arrow to deserialize data fragments into DataFrames and other first-class Python objects based on the package manifest. Quilt borrows implementation and design ideas from existing data versioning systems, including git and Dat. Unlike those systems, Quilt focuses on data deployment and on creating snapshots of data for consumption rather than on tracking updates to data.
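
The manifest-plus-fragments layout described above can be sketched as follows. This is an illustration under stated assumptions, not the Quilt package specification: the field names ("type", "hash", "children"), the use of SHA-256, and the function build_manifest are all hypothetical.

```python
import hashlib
import json


def fragment_hash(data: bytes) -> str:
    """Content address of one binary fragment (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(tree: dict, store: dict) -> dict:
    """Turn a nested dict of {name: bytes or dict} into a manifest node.

    Leaves record the hash of their fragment. Groups record a hash derived
    from their children's names and hashes, so the root hash identifies
    the whole tree (the Merkle property).
    """
    children = {}
    for name, value in sorted(tree.items()):
        if isinstance(value, dict):
            children[name] = build_manifest(value, store)
        else:
            digest = fragment_hash(value)
            store[digest] = value  # fragments are stored by content hash
            children[name] = {"type": "leaf", "hash": digest}
    summary = json.dumps(
        {name: child["hash"] for name, child in children.items()},
        sort_keys=True,
    )
    group_hash = hashlib.sha256(summary.encode("utf-8")).hexdigest()
    return {"type": "group", "hash": group_hash, "children": children}


fragments = {}
manifest = build_manifest(
    {"readme": b"hello", "tables": {"points": b"x,y\n1,2\n"}}, fragments
)
manifest_json = json.dumps(manifest, indent=2)  # the JSON-encoded manifest
```

Because every group hash depends on the hashes beneath it, changing any fragment changes the root hash, while unchanged fragments keep their hashes and need not be stored or transferred again.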

Kevin Moore | Quilt

Kevin Moore

CEO | Quilt

Experience more insightful data science & engineering talks at DataEngConf NYC '17