Here's our monthly roundup of relevant links for data professionals, from blog posts and tutorials to podcast episodes:
1. Product Management for AI
Peter Skomoroch and Mike Loukides co-authored a very interesting post on what makes product management different in the context of AI. Drawing on the specifics of AI software development, they make a series of recommendations for a process that also takes business priorities into account. Their post ends with a list of relevant resources, so it is worth checking out.
2. A Federated Information Infrastructure that Works
Xavier Gumara wrote a Medium post based on the talk he gave at Data Council BCN '19. He is a Data Engineering Manager at Adevinta, which is present in 16 countries through multiple marketplaces, and he makes recommendations for what he dubs "a federated information infrastructure that works." That infrastructure is the result of his team's work on answering 3 key questions: how to find the right level of authority, how to govern data sets, and how to build common infrastructure as a platform. Their main achievement? The autonomy they have granted to data consumers thanks to this architecture.
3. Surfacing Insights
In this recent episode of Software Engineering Daily, podcast host Jeff Meyerson interviews Peter Bailis, CEO of Sisu Data. As the podcast description explains, "Sisu is a system for automatically surfacing insights from large data sets within companies. A user of Sisu can select a database column that they are interested in learning more about, and Sisu will automatically analyze the records in the database to look for trends and relationships between that column and the other columns." In the interview, Bailis gives examples of how clients such as Samsung have been using it. He also discusses its potential in the larger context of "every employee in almost every industry (…) becoming an analyst," a trend VC Ben Horowitz pointed out when announcing a16z's investment in Sisu.
4. Say No to 'Just One More Stratification'
Former Data Council speaker Sam Bail shared great insights on Great Expectations' blog about a problem she identified: "I noticed a difference between how stakeholders approached software engineering work compared to data work. It almost seemed as though they considered it to be… somewhat easier to “just run some numbers real quick”, whereas I hardly ever received requests to just “build a new feature real quick”." Luckily, she has some advice on how 'data people' can respond to these ad-hoc requests and "avoid getting stuck in an endless loop of “just one more stratification”."
5. AutoML and Its Place in Data Science
The prospect of automating ML has many people in our industry excited, and AutoML has become something of a buzzword. If you are curious about what it is and the actual impact it could have on your work, we recommend watching the replay of the presentation Paco Nathan gave on this topic at the IBM Developer online meetup (sign-in required). You may also want to check out host Upkar Lidder's dedicated repo for the presentation's slides, code examples and links to additional resources.
6. The New York Times' Digital Transformation
The New York Times' outgoing CTO Nick Rockwell wrote a widely praised account of his 4 years at the media company, which he recently left to run Engineering at Fastly. His post covers a wide range of topics, from data infrastructure changes to culture, team growth and key metrics, so it is well worth reading if you are curious about the challenges and lessons of such a role.
7. Baking Bread with Streamlit
We recently picked out app framework Streamlit as our 'Open Source Project of the Month', in part because we have been very impressed by how fast and how widely it has been adopted. While this early adopter community includes engineers from Uber, Twitter, Stitch Fix, and Dropbox, the post we are featuring here illustrates its use at the other end of the spectrum. Created by self-described neophyte data scientist Hamilton Chang, it is a step-by-step description of how he built a sourdough hydration calculator with Streamlit. You can find his lockdown-inspired tutorial here, and the resulting app here.
8. CD for ML (#CD4ML)
ML startup Booklet published a post by its co-founder and CTO Derek Haynes making a powerful case: "Machine learning deserves its own flavor of Continuous Delivery." His article lays out the differences between ML and software development that make it difficult to apply classical CD to ML projects. However, there are still ways to implement CD4ML, and not just at large companies: in Haynes' words, "it's possible to create a Machine Learning-specific flavor of Continuous Delivery (CD4ML) for non-enterprise organizations with existing tools (git, Cookiecutter Data Science, nbdime, dvc, and a CI server)."
9. NoSQL: After the Hype
Heroku published a blog post on Hacker Noon to take a closer look at NoSQL, which at some point was perhaps misguidedly presented by some as a replacement for relational data stores, rather than merely a potential complement. Now that the hype is mostly behind us, does NoSQL actually deliver on its key promises – namely scale, fault tolerance, and different data models? And should you adopt it? As you can imagine, the answer is a longer version of "it depends", so read the full post for more details.
10. Hot New Data Tools and What They DON'T Do
If you find yourself constantly wondering about the overlap between the latest data tools you are hearing about, you're not alone. This confusion is natural, but also damaging. This is why our very own Pete Soderling teamed up with Sarah Catanzaro (partner, Amplify Partners) and Abe Gong (co-founder, Superconductive/Great Expectations) to interview the makers of 25 of the hottest data tools, from Databand to Preset. Check out the post for their answers to these 2 questions: 1. What is your tool uniquely good at? 2. What does your tool NOT do?
Have you created or enjoyed a post or podcast episode that you'd like to recommend? Make sure to let us know: firstname.lastname@example.org