Data Council Blog


Open Source Highlight: Apache Iceberg

Apache Iceberg is an open table format for very large analytic datasets. You can use it with Presto or Spark to add tables in a high-performance format that aims to work just like a SQL table.

Iceberg was initially developed at Netflix and is now open source, with contributors from Apple, LinkedIn, GoDataDriven, Lyft, WeWork, and more. It is of great value to companies that need to query huge quantities of data; as its site points out, "Iceberg is used in production where a single table can contain tens of petabytes of data and even these huge tables can be read without a distributed SQL engine."

Users don't need to know about partitioning to get fast queries, in part thanks to hidden partitioning and other features aimed at preventing user mistakes. With hidden partitioning, the table derives partition values from ordinary column values, so queries filter on the columns themselves rather than on a separate partition column. Members of the community are also working on making it easy to migrate from Hive to Iceberg, so stay tuned for more information on that front.
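To make the idea of hidden partitioning a little more concrete, here is a minimal sketch in plain Python. It is a toy model, not the Iceberg API: the `ToyTable` class, the `event_ts` column, and the sample dates are all made up for illustration. The point is only that the writer and the reader both work with the timestamp column, while the table derives the partition (here, a calendar day) and skips partitions that cannot match.

```python
from collections import defaultdict
from datetime import date, datetime


class ToyTable:
    """Toy illustration of hidden partitioning; this is NOT the Iceberg API."""

    def __init__(self):
        # Partition value (a calendar day) -> rows stored under that day.
        self.partitions = defaultdict(list)

    @staticmethod
    def _day(ts: datetime) -> date:
        # Partition transform: derive the partition value from the column value.
        return ts.date()

    def append(self, row: dict) -> None:
        # The writer supplies only event_ts; the table works out the partition.
        self.partitions[self._day(row["event_ts"])].append(row)

    def scan(self, start: datetime, end: datetime):
        """Return (rows with start <= event_ts < end, number of partitions opened)."""
        first, last = self._day(start), self._day(end)
        rows, opened = [], 0
        for day, part in self.partitions.items():
            if day < first or day > last:
                continue  # pruned: no row in this partition can match the filter
            opened += 1
            rows.extend(r for r in part if start <= r["event_ts"] < end)
        return rows, opened


# Hypothetical sample data: one event per day at noon.
table = ToyTable()
for d in (1, 2, 3):
    table.append({"id": d, "event_ts": datetime(2021, 3, d, 12, 0)})

# The query filters on the timestamp only; it never mentions a partition column.
rows, opened = table.scan(datetime(2021, 3, 2, 6, 0), datetime(2021, 3, 2, 18, 0))
print([r["id"] for r in rows], opened)
```

In this sketch only one of the three daily partitions is opened, even though the query never names a partition. Real Iceberg tables track far more than this (snapshots, manifests, file-level statistics), so treat the example as an analogy for the user-facing behavior rather than a description of how Iceberg is implemented.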

If you'd like to hear more about Iceberg, as well as Arrow and other tools that can help you architect an open cloud data lake platform, make sure to check out the talk that Ryan Murray from Dremio recently gave as part of our virtual London meetup.

Data Engineering, SQL, Open Source