Apache Spark is one of the most popular open source big data systems, with APIs available for Scala, Java, Python, and R. Spark's core building block, the RDD, allows you to write functional programs that are automatically distributed, and the new Dataset API allows us to also intermix relational-style programming. In addition to allowing multiple types of programming on the same data, the Dataset API brings a more thorough optimizer that is better able to understand our programs. This talk will introduce the Dataset API while also looking at how the optimizer and formats differ and the best ways to take advantage of them. The talk will wrap up by looking at ways in which Spark Datasets can sometimes fail us (not everything is magic), and ways to work around those failures.
Holden Karau currently works on improvements to Apache Spark and helps other developers contribute to Spark as a principal software engineer at IBM. She also frequently gives external-facing talks on Apache Spark.