Data Council Blog

The Modern Data Stack, Metadata Architectures, and More: Top 10 Links From Across the Web

Here's our December 2020 roundup of links from across the web that could be relevant to you:

1. The Modern Data Stack (Fishtown Analytics)

This long-form post on the dbt blog is a must-read. Titled “The Modern Data Stack: Past, Present, and Future,” it answers the question that Tristan Handy has been asking himself for the past two years: “What happened to the massive innovation we saw from 2012-2016?” His carefully thought-out analysis covers the natural cycles of technological shifts, defines the phase we are in as a ‘deployment’ one, and points out high-impact opportunity areas for the next few years - which you might find particularly useful if you are considering launching a new product.

2. Metadata Architectures (LinkedIn)

Shirshanka Das published a post on the LinkedIn Engineering blog that you should check out if you are wondering which data discovery solution to implement. “The architecture of your data catalog will influence how much value your organization can truly extract from your data,” Shirshanka explains. According to his analysis, which is also summarized in graphs, existing data catalogs fall into three generations of architectures, each with its pros and cons: pull-based ETL, service with Push API, and end-to-end data flow. LinkedIn’s DataHub belongs to the third.

3. What’s Next When Data Test Fails (Great Expectations)

Great Expectations shared advice on its blog about what should happen next when your data tests fail. Implementing data testing in your pipeline is not enough to guarantee data quality: you need to have a process in place to address issues that emerge. The post covers the different steps to include and the questions to ask yourself: what should the system do next? Who will be alerted? Last but not least, you will want to check that you are actually testing for the right things; otherwise, the fact that your tests have passed will be meaningless.
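To make the "what happens next" question concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations API, and the names `DataTest`, `handle_results`, `owner`, and `blocking` are hypothetical, chosen only for illustration. The idea is that each test declares up front who gets alerted and whether a failure should stop the pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class DataTest:
    """A data test plus the decisions to take if it fails (hypothetical structure)."""
    name: str
    check: Callable[[List[Row]], bool]  # returns True when the data passes
    owner: str                          # who is alerted on failure
    blocking: bool                      # should a failure stop the pipeline?


def handle_results(rows: List[Row], tests: List[DataTest]) -> dict:
    """Run every test, collect alerts, and decide whether the pipeline should halt."""
    alerts = []
    halt = False
    for test in tests:
        if test.check(rows):
            continue
        alerts.append(f"{test.name} failed, notify {test.owner}")
        halt = halt or test.blocking
    return {"halt": halt, "alerts": alerts}


# Toy batch with one null id and one negative amount.
rows = [{"id": 1, "amount": 10.0}, {"id": None, "amount": -5.0}]
tests = [
    DataTest("ids_not_null",
             lambda rs: all(r["id"] is not None for r in rs),
             owner="data-eng", blocking=True),
    DataTest("amounts_non_negative",
             lambda rs: all(r["amount"] >= 0 for r in rs),
             owner="finance", blocking=False),
]
outcome = handle_results(rows, tests)
```

Both tests fail on this toy batch, so `outcome["alerts"]` holds two messages, and `outcome["halt"]` is true because one of the failing tests is marked as blocking.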

4. Funding the Next 1,000 Engineer-Founders (Data Community Fund)

Data Council founder Pete Soderling shared his journey into founding, supporting, and funding engineer-led startups. From his move to SF in 2010 to his debut as a mentor and angel investor, his Medium post provides the backstory that led to his latest venture: the Data Community Fund. Set to invest in 15–20 companies a year across North America and the E.U., it focuses on very early-stage, engineering-led B2B startups that have significant differentiation via applications of innovative data technology, and it supports Pete’s goal to help 1,000 engineers start companies.

Planning to launch a startup in 2021 and could use some help? Book office hours with Pete Soderling.

5. Building Successful Data Teams (Jesse Anderson)

Data engineer and author Jesse Anderson was a recent guest on the Data Engineering Podcast, for a conversation that centered on “proven patterns for building successful data teams.” Talking to host Tobias Macey, Jesse shared advice on team structures and hiring strategies that yield the best results, based on common challenges and useful patterns he has identified over the years. One key thought that also inspired his latest book, ‘Data Teams’: “Early success or failure isn’t the result of technology; it is the result of management.”

6. Multi-Task Learning to Recommend Related Products (Pinterest)

Pinterest published insights into its work to improve recommendations of related products for shopping discovery. Basing the ranking model’s architecture on a multi-task neural network proved a success, with better engagement metrics and “more interpretable outputs, which are not only useful for debugging purposes but also for model performance”. On the other hand, the Bayesian optimization it tested didn’t outperform other weight selection methods, possibly because it was run offline; online Bayesian optimization is set to be tested next.

7. When Data Disappears (Anomalo)

Startup founder Jeremy Stanley wrote about data disappearance, which he sees as the most common data quality issue. He gives examples of how data can go missing at each step of the data process, and of the consequences this may have, from inaccurate dashboards and data products to biased ML models. To prevent that, monitoring the systems that produce the data isn’t the right answer, he argues: “At Anomalo, we’ve found the only way to be certain your data is available is to test it independently from the systems producing it.”
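As a rough illustration of testing data independently from the systems that produce it, the sketch below queries the table itself and reports the days for which no rows arrived, instead of trusting the loader's own status. It is not Anomalo's method; the `missing_days` helper, the `events` table, and its columns are assumptions made up for this example, and an in-memory SQLite database stands in for a warehouse.

```python
import sqlite3
from datetime import date, timedelta


def missing_days(conn, table, date_column, start, end):
    """Return the dates in [start, end] for which the table holds no rows.

    Table and column names are interpolated directly, so they are assumed
    to come from trusted configuration, never from user input.
    """
    query = (
        f"SELECT DISTINCT {date_column} FROM {table} "
        f"WHERE {date_column} BETWEEN ? AND ?"
    )
    present = {row[0] for row in conn.execute(query, (start.isoformat(), end.isoformat()))}
    total_days = (end - start).days + 1
    expected = [start + timedelta(days=i) for i in range(total_days)]
    return [d for d in expected if d.isoformat() not in present]


# Toy data: the load for 2020-12-02 silently produced nothing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2020-12-01", 1), ("2020-12-01", 2), ("2020-12-03", 3)],
)
gaps = missing_days(conn, "events", "event_date", date(2020, 12, 1), date(2020, 12, 3))
```

Here `gaps` contains only December 2, 2020, the one day with no rows, even though nothing in the toy "pipeline" reported an error.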

8. DeepMind’s AlphaFold Beyond the Hype (a16z)

A recent episode of a16z’s 16 Minutes in the News podcast focused on Google’s DeepMind having won the CASP14 challenge with its AI-based protein structure prediction program AlphaFold. Talking to Sonal Chokshi, a16z general partner Vijay Pande explained why this challenge matters, and also what it means and doesn’t mean (yet) for drug discovery, startups, and more. Having founded the Folding@home project, he noted that he wouldn’t consider this win a breakthrough, as it was not a surprise for those who expected computational approaches to work. He nonetheless praised the AlphaFold team for dramatically accelerating that moment.

9. Test-Driving Elasticsearch (Instaclustr)

Evangelist Paul Brebner recently revised a post he wrote earlier this year about his test drive of Elasticsearch. His piece covers the basics of Elasticsearch, such as Documents, Mappings, and Indexing, as well as some of the computational morphological tricks that can be leveraged for inexact matching: Stemming, Lemmatization, Levenshtein Fuzzy Queries, N-grams, Slop, and Partial Matching. There is also a nice reference to Douglas Adams’ “Hitchhiker’s Guide to the Galaxy,” among other quotes, so don’t forget to bring a towel!
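For readers new to two of the inexact-matching ideas listed above, here is a small plain-Python sketch of Levenshtein edit distance and character n-grams. It illustrates the concepts only; it is not Elasticsearch's implementation and makes no Elasticsearch calls, and the function names `levenshtein`, `char_ngrams`, and `fuzzy_match` are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def char_ngrams(text, n=3):
    """All contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fuzzy_match(query, candidates, max_edits=2):
    """Candidates within max_edits of the query, ignoring case."""
    return [c for c in candidates
            if levenshtein(query.lower(), c.lower()) <= max_edits]


matches = fuzzy_match("towl", ["towel", "tower", "galaxy"], max_edits=1)
trigrams = char_ngrams("towel")
```

With a budget of one edit, the misspelled query "towl" matches only "towel"; raising `max_edits` to 2 would also let "tower" through. The trigrams of "towel" are "tow", "owe", and "wel", which is the kind of fragment an n-gram index can use for partial matches.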

10. Human-Centric Data Science at Spotify (WiDS Podcast)

Professor Margot Gerritsen was recently joined by Lillian Carrasquillo for an episode of the Women in Data Science (WiDS) Podcast. Having recently become an Insights Manager for Spotify’s Personalized Home, Lillian discussed stepping into a new team leadership role while working from home during the COVID-19 pandemic, mentorship, and diversity. The conversation also covered the ‘human-centric’ approach of her team of data scientists and user researchers, whose ultimate goal is to help listeners and creators connect.

Have you shared or enjoyed anything that you’d like to recommend to the data community? Feel free to let us know.
