Data Council Blog

Apache Airflow, Beyond Spreadsheets, and More: Top 10 Links From Across the Web

Here's our July 2020 roundup of relevant links for data professionals, from blog posts to podcast episodes:

1. The State of Airflow

Software Engineering Daily recently invited Apache Airflow's creator Maxime Beauchemin and Astronomer engineers Vikram Koka and Ash Berlin-Taylor to discuss the state of Airflow. Listen to the podcast episode or read the transcript to hear their comments on Airflow's use cases, its purpose, the open source ecosystem, and more.

2. Data Science Leadership Skills

The Data Science Podcast recently featured an interview with IBM data scientist Cesar Mierzejek, in which he shared three skills he didn't know he needed to be a successful data scientist. No full spoilers here, but the 18-minute conversation covers communication, prioritization, and learning, with an eye on leadership.


3. Avoiding Data Leakage

Jason Brownlee's Machine Learning Mastery is known for its tutorials, and the one published on June 22 is particularly relevant for anyone working on data preparation for ML models. It focuses on preventing data leakage, reminding readers that data preparation steps must be fit on the training set only. By the end of the tutorial, readers will know how to avoid data leakage for train-test splits and k-fold cross-validation in Python.
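To make the idea concrete, here is a minimal sketch in plain Python. It is not the tutorial's own code, and the helper names (fit_standardizer, apply_standardizer, train_test_split, k_fold_indices) and the toy data are assumptions made up for illustration. The point it tries to show is that scaling statistics are learned from the training portion only and then reused, unchanged, on the held-out portion.

```python
import random
import statistics


def fit_standardizer(values):
    """Learn mean and standard deviation from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero variance
    return mean, std


def apply_standardizer(values, mean, std):
    """Scale values with statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]


def train_test_split(values, test_fraction=0.25, seed=0):
    """Hypothetical helper: shuffle a copy, then cut off a test portion."""
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k):
    """Hypothetical helper: partition indices 0..n-1 into k disjoint folds."""
    return [list(range(i, n, k)) for i in range(k)]


# Toy data, invented for this example.
data = [float(x) for x in range(1, 21)]

# Train-test split: fit on the training set, apply to both sets.
train, test = train_test_split(data)
mean, std = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)

# The leaky variant would fit on everything, letting test rows
# influence the statistics used during training.
leaky_mean, leaky_std = fit_standardizer(data)

# k-fold cross-validation: refit the statistics inside every fold.
fold_stats = []
for held_out in k_fold_indices(len(data), 5):
    held = set(held_out)
    fold_train = [data[i] for i in range(len(data)) if i not in held]
    fold_test = [data[i] for i in held_out]
    fold_mean, fold_std = fit_standardizer(fold_train)
    fold_test_scaled = apply_standardizer(fold_test, fold_mean, fold_std)
    fold_stats.append((fold_mean, fold_std))
```

Each fold ends up with its own statistics because the training rows differ from fold to fold. Libraries such as scikit-learn offer pipeline utilities that handle this bookkeeping, but the principle is the same.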

4. Shopify's Data Science & Engineering Foundations

Shopify data science manager Marc-Olivier Arsenault published a post on the company's engineering blog in which he explains the founding principles that guide its data warehousing and analysis. From data consistency and open access to communication guidelines and required readings, this will be an interesting guide for teams operating at a similar scale.

5. Let's Write Some Code

"Long live code": this is the title of the well-written essay that CEO Barry McCardel published on Medium. Its key point is that "no code" tools may end up limiting their users. "Truly empowering users doesn’t mean getting rid of code, but embracing it," he argues.

6. Ray and ML Platforms

Anyscale is the company founded by the creators of open source project Ray, and its co-founder Ion Stoica recently co-authored a blog post with Ben Lorica on "five key features for a machine learning platform." Going through a list of elements that ML platforms should possess, such as ecosystem integration and easy scaling, it makes the case for Ray as the foundation of future ML platforms.

7. The Analytics Setup Guidebook

Full-stack data platform Holistics published a free 'Analytics Setup Guidebook' with relevant insights on building scalable analytics stacks. "This book is written for people who need a map to the world of data analytics," the introduction points out. An ungated version of the guide is available, and you can also check out the Hacker News discussion that one of the authors participated in.

8. Fiber Distributed

Fiber is a Python-based distributed computing library that Uber built and open-sourced. You can find out more about it on the Uber Engineering blog, with a post that tells the story behind Fiber and the key features that make it useful for modern computer clusters. The repository is also available on GitHub.

9. Beyond Spreadsheets

Jeff Sternberg from Google Cloud wrote a Forbes column titled "Beyond Spreadsheets", which was also the topic of a talk he recently gave at our virtual NY meetup (video available here). As he explained in a Twitter thread, spreadsheets are incredibly useful, but they have limits, which notebooks can bypass.

10. Eliminating PyTorch Bottlenecks

Towards Data Science is featuring a multi-part series on Efficient PyTorch, and part 1 focuses on an important aspect: eliminating I/O and CPU bottlenecks. Its author Eugene Khvedchenya is a computer vision and machine learning engineer who previously authored pytorch-toolbelt. Some previous knowledge of PyTorch is required, but the post is still accessible if you are early in your journey with data (thanks to our KL meetup organizers for the recommendation!).

Have you created or enjoyed a post or podcast episode that you'd like to recommend to the data community? Make sure to let us know.
