Data Council Blog

PyTorch Lightning, ksqlDB and More: Top 10 Links from Across the Web

Here are 10 recent relevant links for data professionals, from blog posts and tutorials to podcast episodes:

1. PyTorch Lightning: a gentle introduction

Former Data Council speaker Will Falcon published an interesting post on PyTorch Lightning, the lightweight PyTorch wrapper born out of his Ph.D. AI research at NYU CILVR and Facebook AI Research (FAIR). Framed as "a gentle introduction", it includes a side-by-side comparison of building a simple MNIST classifier in PyTorch and in PyTorch Lightning, to illustrate how to refactor one into the other. This is highly recommended reading if you work on AI/ML, whether as a professional researcher, as a student, or in production.

Should Datacoral Power Your New Data Infrastructure?

Today's companies aim to be data-driven, but data infrastructure is time-intensive and costly to build, maintain, and secure. A coral is the exoskeleton of a small marine animal that attaches to and grows on almost anything. Once it starts growing, it can create large reefs, which support a diverse ecosystem of plants and animals. So what happens if you apply that philosophy to the world of data?

How to "Democratize" the Responsibility for Data Quality Across your Organization

 

 

Writing endless data transformations wasn't sustainable for an engineering team handling hundreds of inputs. Here's how Clover Health enabled their business users to help.

It's rare to find an ETL system that's completely static. As organizations change and grow, they develop new business requirements. Their data pipelines must therefore change and adapt, ultimately becoming more robust and full-featured. Yet constant development can make already brittle ETL systems seem even more fragile.

Furthermore, systems with large numbers of different types of inputs bring special challenges - building, testing and managing an exploding number of data transformations can become a daunting project for the engineering team. 

The Clover Health ETL system supports hundreds of inputs and more than 500 custom transformations in production as well as a large number of custom connections between their different ETL pipelines. When hearing about the magnitude of the system, one might rightfully wonder, "how does Clover guarantee and maintain data quality across so many different inputs and transforms?"

Exploring the development trajectory of Clover's system makes for a fascinating story; their data team's successes and pitfalls offer illustrative lessons to other engineers seeking to increase the robustness of their own ETL systems.

The Future of Distributed Databases is Relational

 

 

What if developers could ditch their No-SQL solutions and still get scalability from a more traditional relational datastore?

I've been noticing an interesting pattern recently where developers seem to be rejecting some of the newer, more en vogue data stores with limited functionality and use-cases (while promising easier scale) and returning to the comfortable tried-and-true paradigm of relational databases. It seems that we've hit a watershed point where developers finally believe they don't necessarily need to make a trade-off between database features on one hand and easy scalability on the other.

One such company enabling this return to the golden era of RDBMS is Citus Data. Citus is blazing a trail in 'cloud-proofing' the gold standard of relational databases, PostgreSQL, through extensions that allow their customers to achieve much easier horizontal scalability than ever before.

Functional Data Engineering — a modern paradigm for batch data processing


Batch data processing — historically known as ETL — is extremely challenging. It’s time-consuming, brittle, and often unrewarding. Not only that, it’s hard to operate, evolve, and troubleshoot.

In this post, we’ll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process. This post distills fragments of wisdom accumulated while working at Yahoo, Facebook, Airbnb and Lyft, with the perspective of well over a decade of data warehousing and data engineering experience.
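As a rough illustration of what a functional approach to a batch job can look like, here is a minimal sketch. It assumes one common reading of the idea: each task is a pure function of its inputs, and each run overwrites a whole partition, so re-running it is harmless. The names `Event`, `daily_totals` and `write_partition` are hypothetical and are not taken from the post.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)  # frozen: input records are immutable
class Event:
    day: str
    user: str
    amount: float


def daily_totals(events: Tuple[Event, ...], day: str) -> Dict[str, float]:
    """Pure task: the result depends only on the arguments, with no side effects."""
    totals: Dict[str, float] = {}
    for event in events:
        if event.day == day:
            totals[event.user] = totals.get(event.user, 0.0) + event.amount
    return totals


def write_partition(
    warehouse: Dict[str, Dict[str, float]], day: str, rows: Dict[str, float]
) -> Dict[str, Dict[str, float]]:
    """Return a new warehouse in which the partition for `day` is fully replaced."""
    updated = dict(warehouse)
    updated[day] = dict(rows)  # overwrite the whole partition, never append
    return updated


events = (
    Event("2018-01-01", "ann", 10.0),
    Event("2018-01-01", "bob", 5.0),
    Event("2018-01-01", "ann", 2.5),
    Event("2018-01-02", "ann", 1.0),
)

warehouse = write_partition({}, "2018-01-01", daily_totals(events, "2018-01-01"))
# Re-running the same task for the same day leaves the warehouse unchanged.
rerun = write_partition(warehouse, "2018-01-01", daily_totals(events, "2018-01-01"))
```

Because the task overwrites its partition rather than appending to it, a retry or backfill for a given day produces the same state as the first run.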

Data Science in the Media

 

 

The past few years have been an interesting time for data science everywhere, and in the media in particular! We’ve seen some incredible new technologies emerge, like open-source machine learning platforms, as well as machine learning services. These developments have opened the door for new consumer products, like conversational AIs, and new technologies in the media and advertising industries.

How Data Has Evolved at The New York Times

 

Whether you love or hate their paywall, the Times successfully balances competing business frictions using a deep view of data.

Since our initial DataEngConf in 2015, The New York Times has been a key supporter of the conference. The very first ever DataEngConf talk was a keynote given by Chris Wiggins, the Times' Chief Data Scientist, who presented a broad yet fascinating perspective on "Data Science at The New York Times" (video here).

In the years since, we've had deeply technical talks from both data engineers and data scientists at the Times, and I'm excited that their involvement in DataEngConf this year is as large as it's ever been.

How Dremio Uses Apache Arrow to Increase the Performance

 

(Image source: http://arrow.apache.org/)

What if all the best open-source data platforms could easily share (ahem) data with each other?

As data has proliferated and open-source software (OSS) has continued to dominate both the stacks and the business models of the top tech companies in the world, the number of different types of data platforms and tools we've seen emerge has accelerated.

Having a hard time keeping up with the differences between Kudu, Parquet, Cassandra, HBase, Spark, Drill and Impala? You're not alone, and obviously this is one of the reasons we bring together top OSS contributors to these platforms to share at DataEngConf.

But one innovation attempts to bind all of the above projects together by enabling them to share a common memory format. It's a new top-level Apache project called Arrow that aims to dramatically decrease the amount of wasted computation that occurs when serializing and deserializing in-memory objects. This serialization pattern is common in analytics applications that move data between systems, each of which has its own internal memory representation.
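To make the cost being described more concrete, here is a small standard-library sketch. It is not Arrow's API; it only contrasts handing a column of numbers to another component by serializing and re-parsing it with handing over a view of the same memory. The function names are hypothetical, and JSON stands in for whatever wire format two systems might use.

```python
import json
from array import array

# "System A" keeps a column of 64-bit floats in one contiguous buffer.
column = array("d", [1.5, 2.5, 3.0, 4.0])


def handoff_via_serialization(col: array) -> array:
    """Encode to text, then decode into a brand-new buffer (a full copy)."""
    payload = json.dumps(list(col))          # serialize
    return array("d", json.loads(payload))   # deserialize


def handoff_via_shared_buffer(col: array) -> memoryview:
    """Expose the existing buffer directly: no encoding step and no copy."""
    return memoryview(col)


copied = handoff_via_serialization(column)
shared = handoff_via_shared_buffer(column)

# Changing the original shows which consumer really shares memory with it.
column[0] = 9.0
```

After the final assignment, `copied` still holds the old value while `shared` reflects the update, which is the difference between exchanging copies and agreeing on one in-memory layout.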

Introducing our Data Startups Track

 

Machine Learning, Neural Nets, "AI" and Computer Vision are changing the world. Discover the data startups that matter.

As an engineer turned founder, I've been passionate for years about helping other technical founders succeed. Founders face a unique set of challenges, and building support communities that help them overcome those obstacles moves innovation forward.

More broadly speaking, I'm also a proponent of bringing engineers together - hence our efforts in the data community via meetups, our conference series and via organizing other, smaller, events for engineers, data scientists and CTOs through Hakka Labs for the past 5 years.

This is why I'm so excited to be introducing the intersection of these two efforts - supporting startups and supporting the data community - into our upcoming DataEngConf NYC.

How Big Data Can Help Improve the Meteorological Risk Models That Are Out of Date

According to a recent article published in The New York Times, water damage from Hurricane Harvey extended far beyond flood zones. Now that the rescue efforts are underway, it’s clear that much of the damage occurred outside of the typical boundaries drawn on official FEMA flood maps.