Data Council Blog

Hot Data Tools pt. 2, End-to-End Data Scientists, and More: Top 10 Links From Across the Web

Here's our September 2020 roundup of good reads and podcast episodes that might be relevant to you as a data professional:

1. What Data Tools Don't Do (Data Council)

Our founder Pete Soderling co-authored a follow-on to his previous post with Abe Gong, core contributor to Great Expectations, and Sarah Catanzaro, Partner at Amplify Partners; together they interviewed the makers of some of the hottest data tools. The focus is still the same: rather than what these data tools can do, we hear about what they don't do, as a way to better understand how they fit together. From ApertureData to Xplenty, this new installment covers 21 new tools, and you can read it here.

2. Data Tools' Go-to-Market Strategies (Work-Bench)

Looking at hot new data tools inspired Priyanka Somrah to revisit the topic from a different angle. As a VC Analyst at Work-Bench, she decided to analyze the range of go-to-market and pricing strategies that makers have adopted. This led her to notice some very interesting trends in the enterprise space, such as the growing importance of open-source offerings and self-serve adoption. Click here to read her take.

3. Diving Into Delta Lake (Databricks)

Databricks published an online learning series about Delta Lake, its open-source storage layer for data lakes. The series consists of three workshops led by its engineering team: "Unpacking the Transaction Log," "Enforcing and Evolving the Schema," and "DML Internals: Delete, Update, Merge." Videos aside, the tutorials also include links to notebooks and slides to download.
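The transaction-log idea at the heart of Delta Lake can be sketched in a few lines: every commit is an ordered file of add/remove actions, and the current table snapshot is recovered by replaying those actions in order. This is a toy illustration of the concept only, not Delta Lake's actual implementation; the file names and action shapes below are invented for the example.

```python
import json

# Each commit is a JSON list of actions: "add" registers a data file,
# "remove" tombstones one. Replaying commits in order yields the set
# of files that make up the current table snapshot.
commits = [
    json.dumps([{"add": "part-0001.parquet"}, {"add": "part-0002.parquet"}]),
    json.dumps([{"remove": "part-0001.parquet"}, {"add": "part-0003.parquet"}]),
]

def replay(log):
    live = set()
    for commit in log:
        for action in json.loads(commit):
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live

print(sorted(replay(commits)))  # -> ['part-0002.parquet', 'part-0003.parquet']
```

Because old commits are never rewritten, this append-only structure is also what enables features like time travel: replaying only the first commit reconstructs the table as it looked at version 0.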

4. Notebook Pipelines in JupyterLab (IBM)

Patrick Titzler is a Developer Advocate at IBM's Center for Open-Source Data & AI Technologies (CODAIT) and recently published a straightforward tutorial for running notebook pipelines locally in JupyterLab. His method is based on the recently released v1.1 of Elyra, the open-source set of AI-centric extensions to JupyterLab Notebooks that already supported running pipelines remotely on Kubeflow Pipelines.

5. PyTorch Lightning Bolts (PyTorch Lightning)

The PyTorch Lightning team shared an article focused on PyTorch Lightning Bolts, "a collection of PyTorch Lightning implementations of popular models that are well tested and optimized for speed on multiple GPUs and TPUs." As we previously mentioned, PyTorch Lightning is a lightweight PyTorch wrapper that can be used for research and in production, and you can now take advantage of the new Lightning Bolts collection "to try crazy research ideas with just a few lines of code."

6. Iterative Sweeps for Deep Learning (Weights & Biases)

Here's a good read on hyperparameter search, including advice on how to run an effective hyperparameter search on deep learning models with iterative sweeps. It features visual examples from Weights & Biases, where author Stacey Svetlichnaya is a Deep Learning Engineer. Meant as an overview that "helps you tune deep learning models a bit faster," her post also concludes with links to resources to get started with Weights & Biases Sweeps.
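For a sense of what a sweep looks like in practice, here is a minimal Weights & Biases sweep configuration expressed as the Python dict you would hand to `wandb.sweep()`. The top-level keys follow W&B's documented sweep-config format; the hyperparameter names and ranges are illustrative, not taken from the post.

```python
# A minimal W&B sweep configuration. The hyperparameters and their
# ranges here are placeholders chosen for the example.
sweep_config = {
    "method": "bayes",  # search strategy: "random", "grid", or "bayes"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-4, "max": 1e-1},
        "batch_size": {"values": [32, 64, 128]},
        "dropout": {"values": [0.1, 0.3, 0.5]},
    },
}
```

You would register it with `wandb.sweep(sweep_config, project="my-project")` and launch workers with `wandb.agent(sweep_id, function=train)` (where `train` is your training function). The iterative approach the post describes amounts to running a broad sweep like this one, inspecting the results, and then narrowing these ranges in a follow-up sweep.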

7. Should Data Scientists Be End-to-End? (Amazon)

Should data scientists be more "end-to-end"? This is the self-proclaimed "unpopular opinion" defended by Eugene Yan in a great blog post. "While this is frowned upon (too generalist!), I've seen it lead to more context, faster iteration, greater innovation—more value, faster," he argued in a tweet linking to his post, which features examples from Netflix and Stitch Fix.

8. Large Scale Experimentation (Stitch Fix)

Stitch Fix proposed a model for large scale experimentation, based on "a Bayesian setting that implicitly captures the opportunity cost of having multiple interventions to test." The post is an intellectual exercise that can be applied in real-world conditions at companies like Stitch Fix that are always trying to make more discoveries in less time.
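As a toy illustration of that Bayesian flavor (not Stitch Fix's actual model), Thompson sampling keeps a Beta posterior per candidate intervention and allocates traffic by sampling from those posteriors, so weak arms are abandoned early and the opportunity cost of testing them stays low. The success rates below are invented for the example.

```python
import random

random.seed(0)

# True (unknown) success rates of three candidate interventions.
true_rates = [0.05, 0.11, 0.20]

# Beta(1, 1) priors: one [successes + 1, failures + 1] pair per arm.
posteriors = [[1, 1] for _ in true_rates]

for _ in range(5000):
    # Sample a plausible rate from each arm's posterior ...
    draws = [random.betavariate(s, f) for s, f in posteriors]
    # ... and play the arm whose sampled rate is highest.
    arm = draws.index(max(draws))
    if random.random() < true_rates[arm]:
        posteriors[arm][0] += 1  # observed a success
    else:
        posteriors[arm][1] += 1  # observed a failure

pulls = [s + f - 2 for s, f in posteriors]
print(pulls)  # the best arm should attract most of the traffic
```

The key property is that exploration happens implicitly: an arm is only played as often as its posterior makes it look competitive, which is one simple way to capture the opportunity-cost tradeoff the post formalizes.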

9. Simplified Data Architectures w/ Presto (Starburst)

Episode 149 of the Data Engineering Podcast focused on how to simplify your data architecture with the Presto distributed SQL engine. Podcast host Tobias Macey interviewed Starburst CTO Martin Traverso about the evolution of the Presto ecosystem over the last two years, the tradeoffs of running it on top of a data lake versus a vertically integrated warehouse, and several other topics that will be of interest if you are considering querying and combining data where it resides.

10. Data Mesh Advice (Monte Carlo)

Check out this post on Monte Carlo's blog if you are wondering what a data mesh is and why you should build one. This is a very good starting point if you are interested in this topic, with links to several posts and talks that will help you investigate further and learn how other companies such as Zalando managed this transition.

Have you created or enjoyed a post or podcast episode that you’d like to recommend to the data community? Make sure to let us know!

Data Science, Data Engineering, Startups, Data Pipelines, Machine Learning, Open Source, Analytics