Here's our March 2021 roundup of links from across the web that we selected for you:
1. How to Build a Community (Fishtown Analytics)
Claire Carroll's first personal blog post on community-building is a must-read. As Fishtown Analytics' community manager for the last 2.5 years, she's arguably behind the success of the dbt community and its best-in-class practices, so we expected good advice… but she really hit it out of the park with this one! The key takeaway is that you should start by asking why you want to build a community. Make sure to read the full post to understand why it received so much praise.
2. Metadata Management (Acryl Data)
The co-founder & CTO of recently announced startup Acryl Data, Shirshanka Das, previously worked for over a decade at LinkedIn, where he built several data infrastructure projects. In a recently published OCS 2020 Breakout session, he shared key insights into metadata, Apache Gobblin, and DataHub (which we wrote about last November). This is a fascinating conversation about the fast-emerging metadata category, some of its expected challenges & opportunities, and more.
3. Deep Learning Baseline with PyTorch (Grid.ai)
PyTorch Lightning's creator Will Falcon shared a tutorial for 'setting a strong deep learning baseline in minutes with PyTorch.' As he points out, rapid baselining can help data scientists and ML engineers ship products faster, or let researchers speed up their processes, with the shared goal of building great end models iteratively, starting from a good baseline. According to Falcon, this can be achieved easily and quickly with the combination of Flash, PyTorch, and Lightning that he presents, code snippets included.
4. MLOps and Feature Stores (Tecton)
Former Data Council speaker Willem Pienaar was recently on the Data Stack Show for a great discussion of MLOps and feature stores. Give it a listen to learn the basics of feature stores, from what a feature is to a feature store's canonical elements and its integrations with data discovery tools. It is a space Pienaar knows very well: he previously worked at Go-Jek, which led the creation of Feast alongside Google Cloud, and he more recently joined Tecton, which is also a core contributor to the project.
5. Mapping Data Dependencies (Iteratively)
Iteratively published a thorough overview of dependency mapping. Penned by their head of growth, Franciska Dethlefsen, the blog post covers the benefits of mapping data dependencies, the different tools and methods available, and other important considerations. Its main point is that dependency mapping is not a waste of time, but rather a key step to improve data quality and security while making sure nothing breaks due to changes. In their words: "Take the time to map out your data dependencies; it will save you lots of time hunting down what other system is affected by the failure."
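To make the idea concrete, here is a minimal sketch of what a dependency map can buy you: once dependencies are written down as a graph, finding "what other system is affected by the failure" becomes a simple traversal. The system names and the `downstream_of` helper are hypothetical illustrations, not taken from Iteratively's post or any particular tool.

```python
from collections import deque

# Hypothetical dependency map: each key feeds the systems listed in its value.
DEPENDENCIES = {
    "events_raw": ["sessions", "orders"],
    "sessions": ["marketing_dashboard"],
    "orders": ["revenue_report", "crm_sync"],
    "revenue_report": [],
    "marketing_dashboard": [],
    "crm_sync": [],
}


def downstream_of(source, graph):
    """Return every system reachable from `source`, in breadth-first order."""
    seen = []
    queue = deque(graph.get(source, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already recorded via another path
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


if __name__ == "__main__":
    # Which systems should be checked if the raw events table breaks?
    print(downstream_of("events_raw", DEPENDENCIES))
```

In practice the graph would more likely be extracted from query logs, a tracking plan, or a lineage tool than maintained by hand, but the lookup stays the same.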
6. Reverse ETL (Redpoint Ventures)
VC Astasia Myers did a nice job of explaining the new trend known as 'reverse ETL': "the process of moving data from a data warehouse into third-party systems to make data operational." (Another post explaining it with helpful diagrams is here.) This category is now represented by startups and projects including Hightouch, Census, Grouparoo, Headsup, Polytomic, and Seekwell. She notes that these can help companies keep a consistent view of customer data across systems without having to write and maintain connectors, which is very useful for sales, marketing, and analytics teams. (Whether or not you think this is good news for data practitioners is another matter, discussed here.)
7. OSS Contributions (Airflow)
Here's an interesting video from the Apache Airflow Summit 2020: "Contributing to Airflow, Your OSS adventure," a joint talk by PMC members & committers Aizhamal Nurmamat kyzy and Jarek Potiuk. While Airflow is in itself a relevant topic for anyone interested in data orchestration, you may also want to watch the video for a broader perspective on how to contribute to open-source data projects. If you are involved with a project of your own, it is also worth watching as an example of how to communicate with potential contributors.
8. Change Data Capture (Shopify)
Senior data engineer John Martin and staff data developer Adam Bellemare have published a very solid long read on the process their team has followed to 'capture every change from Shopify's sharded monolith'. They did it with CDC ('Change Data Capture') and, more precisely, with Kafka and Debezium. This is a very good write-up, detailing the pros and cons of their decisions, as well as considerations related to Shopify's whopping scale, with "400TB+ of CDC data in [their] Kafka cluster [and] ~150 Debezium connectors across 12 Kubernetes pods."
9. Dataflow Optimization for Wrapped 2020 (Spotify)
If you are using Spotify to listen to music, there's a good chance that you spent some time a few weeks ago reviewing your '2020 Wrapped' personal data digest. In a blog post, a group of engineers are now sharing what happened behind the scenes to optimize 'Spotify's largest Dataflow job ever.' This involved a join technique known as Sort Merge Bucket (SMB), which the team credits with big wins: "By adopting SMB, we were able to perform extremely large joins that were previously either unfeasible or cost-prohibitive, or that required custom workarounds like Bigtable," they write.
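For readers unfamiliar with the technique, here is a toy, in-memory sketch of the general idea behind a sort-merge-bucket join: both datasets are bucketed by the same hash of the join key and sorted within each bucket ahead of time, so the join itself only has to merge matching buckets. This is an illustration under simplifying assumptions, not Spotify's implementation; all function names and the sample data are hypothetical.

```python
import zlib


def bucket_of(key, num_buckets):
    """Deterministically assign a string key to a bucket."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_buckets(records, num_buckets):
    """Split (key, value) records into buckets, each sorted by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])
    return buckets


def merge_join(left, right):
    """Inner-join two key-sorted lists with a single forward pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


def smb_join(left_buckets, right_buckets):
    """Join bucket i of the left side only with bucket i of the right side."""
    out = []
    for left, right in zip(left_buckets, right_buckets):
        out.extend(merge_join(left, right))
    return out


if __name__ == "__main__":
    plays = [("u1", "songA"), ("u2", "songB"), ("u1", "songC")]
    users = [("u1", "SE"), ("u3", "US"), ("u2", "BR")]
    joined = smb_join(write_buckets(plays, 4), write_buckets(users, 4))
    print(sorted(joined))
```

The point of the design is that the bucketing and sorting cost is paid once, when the data is written; later joins on the same key can skip the shuffle step because matching keys are guaranteed to sit in the same bucket number on both sides.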
10. Obstacles to Active Learning Adoption (Alectio)
If you have watched her DC_THURS interview and other talks, you know that Alectio's CEO Jennifer Prendki is very passionate about active learning. Yet her recent blog post takes a different perspective, pointing out '9 reasons why active learning is not widely adopted.' This is well worth reading if you are an ML practitioner who hasn't considered active learning yet, or who tried it but perhaps gave up too early: as the post insists in its conclusion, "it isn't an algorithm, but a process that requires it to be developed and tuned properly."
Have you created, read, or listened to anything that you’d recommend to the data community? Feel free to let us know: email@example.com