Single-cell sequencing generates a new kind of genomic data, and with it new storage and compute challenges. I'll talk about recent work parallelizing analysis of this data using a variety of distributed backends (Apache Spark, Dask, Pywren, Apache Beam). I'll also discuss the Zarr format for storing and working with N-dimensional arrays, that several scientific domains have recently gravitated toward in response to challenges using HDF5 in parallel and in the cloud.

Download Slides