ABOUT THE TALK

You have your Hadoop cluster, and you are ready to fill it up with data, but wait: Which format should you use to store your data? Should you store it in Plain Text, Sequence File, Avro, or Parquet? (And should you compress it?) HDFS or Block/Object Store? Which query engine? This talk will take a closer look at some of the trade-offs, and will cover the How, Why, and When of choosing one format over another.

Picking your distribution and platform is just the first of many decisions you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each data format has different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others. On top of choosing a data format, you also need to decide which query engine works best for that format and workload. And let’s not forget the question: “Do I store that in HDFS or a block/object store?”

This talk will take a closer look at some of these trade-offs. Attendees will learn, based on a few real-world use cases, the How, Why, and When of choosing one format over another (and whether your choice of query engine affects that decision). Covering the four major data formats (Plain Text, Sequence Files, Avro, and Parquet), we will provide insight into what they are and how best to use and store them in HDFS or a block/object store.

Stephen O'Sullivan | Silicon Valley Data Science
