Data Council Blog

State of AI, Data Quality, and More: Top 10 Links From Across the Web

Here's our October 2020 roundup of good reads and podcast episodes that might be relevant to you as a data professional:

1. Multiplayer Editing: a Pragmatic Approach (Hex)

Data collaboration startup Hex published a great long read on its approach to live collaboration. Written by software engineer Mac Lockard, it takes a look at the respective pros and cons of Operational Transforms and Conflict-free Replicated Data Types (CRDTs), before explaining the solution that Hex adopted. Inspired by Figma's hybrid approach, that solution can be described as "Atomic Operations (AO), as all edits to application state are broken down to their smallest atomic parts." "If the application you are building can rely on last-writer-wins semantics, Atomic Operations might provide a more pragmatic approach," the post concludes. This is a highly recommended read if you are facing a similar decision.
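
To make the idea concrete, here is a minimal sketch of last-writer-wins semantics over atomic operations. It is not Hex's implementation: the `SetField` and `Document` names, and the assumption that a central server assigns each operation a sequence number, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class SetField:
    """Hypothetical atomic operation: set one field of one entity."""
    entity_id: str
    field: str
    value: Any
    seq: int  # ordering assigned by a central server (an assumption of this sketch)


class Document:
    def __init__(self) -> None:
        # (entity_id, field) -> (seq of the winning write, value)
        self._state: Dict[Tuple[str, str], Tuple[int, Any]] = {}

    def apply(self, op: SetField) -> bool:
        """Apply op if it is newer than the stored write; return True if it won."""
        key = (op.entity_id, op.field)
        current = self._state.get(key)
        if current is not None and current[0] >= op.seq:
            return False  # an equal or later write already won
        self._state[key] = (op.seq, op.value)
        return True

    def get(self, entity_id: str, field: str) -> Any:
        entry = self._state.get((entity_id, field))
        return None if entry is None else entry[1]


doc = Document()
# Two users edit different fields of the same cell: both edits survive.
doc.apply(SetField("cell-1", "title", "Revenue", seq=1))
doc.apply(SetField("cell-1", "source", "SELECT 1", seq=2))
# Two users edit the same field: the higher sequence number wins,
# even though the operations arrive out of order here.
doc.apply(SetField("cell-1", "title", "Revenue (Q3)", seq=4))
doc.apply(SetField("cell-1", "title", "Sales", seq=3))
print(doc.get("cell-1", "title"))   # Revenue (Q3)
print(doc.get("cell-1", "source"))  # SELECT 1
```

Because each operation touches a single field, concurrent edits only conflict when they target the same field of the same entity, and in that case one simply overwrites the other.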


2. Tools for ML Deployment (Arize AI)

Aparna Dhinakaran, co-founder and CPO of data startup Arize AI, wrote an information-rich post on the ML infrastructure tools available for the model deployment and serving stages of production workflows. Her premise? "The ML Infrastructure space is crowded, confusing, and complex" - which is why her post also includes a graph mapping key tools that can be used at each stage of ML in production. As for model deployment and serving specifically, she lists some key questions to consider in order to pick the right tool for one's organization, such as security and performance requirements. Check out the full post for more details, and note that it is part of a series - you can find the other installments via Aparna's author profile.

3. Multi-Armed Bandits for Experimentation (Stitch Fix)

It is fairly well-known that Stitch Fix has built a centralized experimentation platform for its teams, and also that it has been using multi-armed bandits as an alternative to traditional A/B testing. In a beautifully-illustrated blog post, Data Platform Engineer Brian Amadio presents Stitch Fix's experimentation platform architecture, as well as some of the theory behind multi-armed bandits, and shows how the company has now incorporated them as a first-class feature "which can provide more efficient optimization than standard A/B tests in some circumstances, and a more flexible method for long-term optimization."
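
For readers new to the topic, here is a minimal sketch of one common bandit policy, Thompson sampling with binary rewards. It is an illustration only, not a description of what Stitch Fix's platform runs; the class and function names, the success rates, and the number of rounds are all made up for the example.

```python
import random
from typing import List, Sequence


class BetaBernoulliBandit:
    """Thompson sampling for arms with binary (success/failure) rewards."""

    def __init__(self, n_arms: int, rng: random.Random) -> None:
        self.rng = rng
        # Beta(1, 1) prior per arm, i.e. uniform over the unknown success rate.
        self.successes = [1] * n_arms
        self.failures = [1] * n_arms

    def select_arm(self) -> int:
        # Draw one plausible success rate per arm and play the best draw.
        samples = [
            self.rng.betavariate(s, f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, reward: bool) -> None:
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_rates: Sequence[float], rounds: int, seed: int = 0) -> List[int]:
    """Return how many times each arm was played against simulated users."""
    rng = random.Random(seed)
    bandit = BetaBernoulliBandit(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = bandit.select_arm()
        reward = rng.random() < true_rates[arm]  # simulated conversion
        bandit.update(arm, reward)
        pulls[arm] += 1
    return pulls


# Hypothetical variants with 10%, 20% and 60% conversion rates.
pulls = simulate([0.1, 0.2, 0.6], rounds=2000)
print(pulls)
```

Unlike a fixed-split A/B test, the policy shifts traffic toward the better-performing variant as evidence accumulates, which is where the efficiency gain comes from.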

4. AI Trends and Predictions (State of AI)

UK-based AI investors Nathan Benaich and Ian Hogarth released the 2020 edition of their State of AI Report. The presentation covers AI from the research angle, but also from a talent, industry and political perspective. It ends with predictions for the next 12 months, including one that made headlines: Nvidia's acquisition of Arm could well get blocked. It is well worth reading in full, but if you don't have time to go through its 170+ slides, you can also check out the latest issue of Nathan Benaich's newsletter for selected highlights, from the inequality and lack of openness in AI research, to the enduring brain drain, and the explosion of AI-powered biology and military applications.

5. Data Quality for MLOps (Great Expectations)

The State of AI Report points out that many of 2020's fastest-growing GitHub projects are related to MLOps, a term that is also seeing an uptick in Google searches. In this context, Great Expectations published a blog post taking a look at a key factor for successful MLOps: data quality. The post includes a definition of MLOps, an explanation of how data testing and documentation fit into this flow, and a prediction that companies will move from in-house testing solutions to off-the-shelf tools and platforms. You might also want to read the second part for a closer look at how Great Expectations fits into MLOps.
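
As a rough illustration of what a data test can look like, here is a small sketch in plain Python. It deliberately does not use the Great Expectations API; the helper names (`expect_not_null`, `expect_between`, `validate`) and the sample batch are hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Check = Callable[[List[Row]], bool]


def expect_not_null(column: str) -> Check:
    """Build a check that passes when no row has a missing value in column."""
    return lambda rows: all(row.get(column) is not None for row in rows)


def expect_between(column: str, low: float, high: float) -> Check:
    """Build a check that passes when every value lies within [low, high]."""
    return lambda rows: all(
        row.get(column) is not None and low <= row[column] <= high
        for row in rows
    )


def validate(rows: List[Row], checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every named check and report pass/fail per check."""
    return {name: check(rows) for name, check in checks.items()}


# The human-readable names double as lightweight documentation of the data.
checks = {
    "user_id is never null": expect_not_null("user_id"),
    "age is between 0 and 120": expect_between("age", 0, 120),
}

batch = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 250},  # out-of-range value the test should catch
]

results = validate(batch, checks)
print(results)
if not all(results.values()):
    print("Batch failed validation; do not train on it.")
```

Running such checks on every incoming batch, before training or serving, is one way data testing can slot into an ML pipeline.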

6. Dataset Quality Improvement (Aquarium)

Dataset quality was also the topic of a recent episode of the Software Engineering Daily podcast, in which host Jeff Meyerson welcomed Aquarium CEO Peter Gao. "An ML model is a combination of code and data. […] When you look into a production application of ML, the vast majority of the improvement to a model's performance comes from the data, because that's something that you have a lot of control over," Gao explained, also referring to his former experience at Cruise. Listen to the full discussion or read the transcript for more details on how Aquarium helps users probe and improve their datasets.

7. Data Cleaning IS Analysis (Counting Stuff)

The perception of data cleaning is decidedly starting to change, and Randy Au's newsletter installment on this topic received quite a bit of praise. A Quantitative UX Researcher at Google Cloud Platform, Au argues that data cleaning is much more than "menial work that's somehow 'beneath' the sexy 'real' data science work." Instead, cleaning your data IS analysis, and allows you to know your data. That was the TL;DR, but his myth-debunking post is actually much longer, with great points on what you should clean and document (datasets), and a final recommendation: let's redefine data cleaning as "Building Reusable Transformations."

8. Getting Started with Applied ML Research (Elvis Saravia)

Elvis Saravia, an educator at Elastic and a researcher in NLP & ML, shared advice in his Substack newsletter on getting started with applied ML research without getting discouraged along the way. "These tips do not guarantee success but they give you a useful framework for enabling creativity and avoiding common pitfalls in research," he explained. With recommendations on picking a problem that matches your profile, when to start reading papers ("not yet"), and when to publish (not yet either!), this is "a very motivating, awesome write-up," as fellow researcher Sebastian Raschka described it on Twitter.

9. The Future of ML (fast.ai)

Jeremy Howard was recently a guest on the Weights & Biases podcast for a discussion that covered many topics, including fast.ai's past, present, and future. However, much of the conversation that ensued online focused on a comment he made in passing about Python "not [being] the future of ML." By saying so, Howard was also expressing his hopes for Julia as a language. He later added a correction on Twitter, based on additional data from Julia Computing's team, indicating that Julia might not be as reliant on venture capital as he initially thought - which is likely to reinforce his enthusiasm.

10. Data Deep Dive with Nemo (Facebook)

From Airbnb to Lyft, from Netflix to Uber, every company with terabytes of data has been looking to build solutions for data discovery at scale, and Facebook is no exception. Here's an interesting overview of its custom tool, Nemo, whose predecessor, an Elasticsearch cluster, was showing its limits. In contrast, Nemo stores data with Unicorn, the same search infrastructure used for Facebook's social graph, which resolves scalability issues and supports more sophisticated search, including in natural language. If you are wondering where to find the number of weekly active users on Instagram, just ask Nemo. As for its architecture and the lessons learned building it, read the post!

Have you published anything that you’d like to recommend to the data community? Make sure to let us know.

Data Science, Data Engineering, Startups, Machine Learning, Artificial Intelligence, Open Source, Analytics