Technical Talks

Beyond Spark RDDs: DataFrames & Datasets for mixing functional & relational code
Apache Spark is one of the most popular open source big data systems, with APIs available for Scala, Java, Python, and R. Spark's core building block, the RDD, lets you write functional programs that are automatically distributed, and the new Dataset API also lets us intermix relational-style programming. In addition to allowing multiple styles of programming on the same data, the Dataset API brings a more thorough optimizer that is better able to understand our programs. This talk will introduce the Dataset API, look at how the optimizer and formats differ, and cover the best ways to take advantage of them. The talk will wrap up by looking at ways in which Spark Datasets can sometimes fail us (not everything is magic) and how to work around them.

Principal Software Engineer
Holden Karau
IBM
Holden Karau currently works on improvements to Apache Spark and helps other developers contribute to Spark as a principal software engineer at IBM. She also frequently gives external-facing talks on Apache Spark.