Data Council Blog

Large Datasets, Are Dashboards Dead, and More: Top 10 Links From Across the Web

Here's our August 2020 roundup of good reads and great podcast episodes for anyone working with data:

1. Processing Large Datasets with Python

AI engineer and author J.T. Wolohan was recently a guest on Heroku’s Code[ish] podcast to discuss his book, “Mastering Large Datasets with Python.” Listen to the episode here or read the transcript for some practical advice on using Python to deal with massive datasets, especially in the context of machine learning.

2. Real-Time Leaderboard with Kafka Connect and KSQL

Software engineer Avinash Bhardwaj wrote about a challenge facing his employer, Pratilipi, a giant Indian language storytelling platform with 21M+ monthly active readers. While it would like to provide its 250,000+ writers with real-time statistics, this is no mean feat when dealing with a reported “1.5B+ million requests per day spread across 40+ microservices with daily ingestion of 50GB data and 100k+ API requests per second.” Read the post to find out how Bhardwaj proposes to solve this with Kafka Connect and KSQL.
Update: Here's a recent episode of DC_THURS with Vinoth Chandar, Senior Engineer at Confluent, and Nishith Agarwal from Uber, discussing Apache Hudi, which they co-authored.


3. Dagster, the Data Orchestrator

Data Council habitué Nick Schrock wrote a well-worth-reading update on Dagster, which his team at Elementl now refers to as a data orchestrator. Published one year after Dagster was made public, the post covers what its creators have built so far, as well as lessons learned and principles developed along the way, which Schrock also summarized in a Twitter thread.

4. Are Dashboards Dead?

Dashboards are dead, writes Taylor Brownlow, Head of Data at Count, a company building a data analysis platform around notebooks. Her point? “Data’s going portrait-mode,” with notebooks beating dashboards for collaboration and reporting. But not so fast, argues Fishtown Analytics CEO Tristan Handy: dashboards aren’t dead… although we shouldn’t shy away from discussing their problems. If you are even remotely interested in analytics, make sure to check out his post, which also includes very relevant comments from reader Alexander Jia.

5. Build It and They Won’t Come

Data engineer Kenny Ning wrote an account of something that happens often, but is rarely talked about: building an ML project that never gets used by its intended business audience. His post-mortem on Better’s blog details what went wrong and what his team learned: do the simplest thing first, among other things. “I’ve had many many failed ML projects that sound just like this - kudos for sharing this story!” former Head of Data Science at Shopify Cameron Davidson-Pilon commented on Twitter.

6. Data Analyst 3.0

Sisu’s Solutions Engineering lead Sid Sharma published an analytics-centered piece on the evolution of data workflows, announcing a new phase he refers to as “Data Analyst 3.0.” After a quick look at the history of data analytics, he shares his take on its present and future: while the 2.0 phase was defined by a BI workflow, this new 3.0 phase will also be augmented by AI, as a way to automate the path to “why.” “We’re at the cusp of a third phase that not only affords better, faster processing of data, but also lets operational data analysts impact business decisions like never before,” he forecasts.

7. Sanity Checks for A/B Tests

There are four things you should check before sharing the results of your A/B tests, Radical CEO Nimrod Priell advises. Since these experiments aren’t perfect in real life, you will want to make sure that the effects you are observing aren’t the result of heterogeneous treatment; that you aren’t mixing shifts; that you’ve checked for exposure; and that your data isn’t suffering “death by a thousand (statistically insignificant) cuts.” Read his well-written post for more details on each of these pitfalls and how to avoid them.

8. A New KPI for Data Reliability

Data Council speaker and Monte Carlo CEO Barr Moses has a confession to make: she is obsessed with data availability. However, she doesn’t think it is accurately captured by traditional methods, and recommends focusing on a different KPI: “Data downtime — periods of time when your data is partial, erroneous, missing, or otherwise inaccurate — is an important measurement for any company striving to be data-driven,” she argues. Check out her proposed formula and reasoning in her blog post.
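
For readers who want a feel for how such a KPI could be computed, here is a minimal sketch. It assumes the commonly cited shape of the formula (number of incidents multiplied by the sum of time to detection and time to resolution); the post itself is not quoted here, so treat that formula, the function name `data_downtime`, and the sample figures as illustrative assumptions rather than Moses's exact method.

```python
from datetime import timedelta


def data_downtime(
    incidents: int,
    time_to_detection: timedelta,
    time_to_resolution: timedelta,
) -> timedelta:
    """Estimate total data downtime over a reporting period.

    Assumed formula (not taken verbatim from the post):
        downtime = incidents * (time to detection + time to resolution)
    where both times are averages per incident.
    """
    if incidents < 0:
        raise ValueError("incidents must be non-negative")
    return incidents * (time_to_detection + time_to_resolution)


# Hypothetical figures: 10 incidents in a month, each taking on average
# 4 hours to detect and 2 hours to resolve.
monthly = data_downtime(10, timedelta(hours=4), timedelta(hours=2))
print(monthly.total_seconds() / 3600, "hours of data downtime")
```

Under this assumption, cutting either the detection time or the resolution time reduces downtime just as much as cutting the number of incidents does, which is why the metric is pitched as more informative than a simple incident count.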

9. The Future of the Hadoop Ecosystem

Do you remember when Hadoop was touted as the future of big data analytics? Kamil Bajda-Pawlikowski does remember, but he also knows that many large enterprises are now wondering what to do with their current Hadoop infrastructure, and how to prepare for the future. In a post on Starburst’s blog, he outlines what is going on and how Starburst Presto can help companies that still want to access and extract insights from their Hadoop data until further notice.

10. How Pixar and Others Are Using GANs for Better Resolution

A blog post on IBM’s Data Science Community forum highlights examples of how ML is used in production to improve visual resolution upscaling. For instance, Pixar implemented a GAN trained on a corpus of their own movies to reduce CPU usage. Meanwhile, Facebook is using neural supersampling for real-time rendering in a VR context, which comes with its own set of constraints. As a bonus, the post also brings up a corporate use case for deepfakes.

Have you created or enjoyed a post or podcast episode that you’d like to recommend to the data community? Make sure to let us know:
