Here's our February 2021 roundup of links from across the web that we picked for you:
1. dbt at Shopify (Data Engineering Podcast)
The Data Engineering Podcast recently featured a very interesting discussion about dbt at Shopify. Engineering manager Zeeshan Qureshi and senior data engineer Michelle Ark explained how dbt answered Shopify’s need for an SQL-based solution that its data scientists could use autonomously. They also mentioned some of the best practices they followed for staging, and cost considerations related to BigQuery. Last but not least, they touched on some extensions they are considering, such as implementing Great Expectations for data quality control.
2. ML and the Evaluation Store (Josh Tobin / Stealth Startup)
Former OpenAI researcher Josh Tobin gave a great online talk for the Utah Data Engineering Meetup on the current state of ML. After stating that ML is now a product engineering discipline, to be measured by its value-add, he expressed his hope for the impact it can have on the long tail of businesses. If you are considering starting a company in this space, you will also be interested in his survey of the tools across the infrastructure stack and of what he thinks is still missing – ending with his proposal for an evaluation store: “a central place to store and query online and offline ground truth and approximate model quality metrics.”
3. Reducing Costs with Active Learning (Alectio)
Alectio published a short case study of “how active learning can massively reduce aerial imagery labeling costs.” ML scientist Prateek Malhotra summed up an experiment whose goal was to determine whether randomly chosen data could be outperformed by a smaller amount of actively picked data – i.e., active learning. The answer was yes, which is good news for anyone concerned about the growing costs of ML resulting from ever-bigger datasets. Make sure to check out the details on the different methods used and other specs of the experiment.
If you’re interested in active learning and its use cases, don’t miss our Feb. 18 DC_THURS episode with Alectio CEO Jennifer Prendki. Just RSVP here!
4. Unsupervised Data Monitoring (Anomalo)
Anomalo founder & CTO Jeremy Stanley started a blog series about unsupervised data monitoring. In this first installment, he laid out the challenge of monitoring structured data at scale, and how Anomalo uses unsupervised learning to identify data chaos in the long tail of tables that can be found in large data warehouses. Future posts will compare this to other approaches, go into further detail about Anomalo’s architecture, and present the way it minimizes false positives, so stay tuned for the rest of the series.
5. A Framework for Building Your Data Stack (Datacoral)
Building your data stack is confusing… but it doesn’t have to be, writes Datacoral’s CEO Raghu Murthy. His post on Towards Data Science is geared toward anyone looking to uplevel their company’s data stack, whether at a startup or an established company. After presenting what a state-of-the-art modern data stack would look like, he proposes a 3-layer framework (data flow, metadata, and DevOps tooling) and shows how to evaluate well-known tools through this lens, as Datacoral does. A key takeaway: “a metadata-first implementation simplifies the processes.”
6. BI at Scale with Superset (Airbnb)
The Airbnb Engineering & Data Science blog featured a collective article on how Airbnb customized Superset for business intelligence at scale. The company has remained committed to Superset since open-sourcing it in 2016, with developments over the years to keep it at the core of its self-serve BI solution. From daily offline jobs to warehouse optimizations, the blog post gives several examples of features that Airbnb created in and around Superset to ‘supercharge’ it, while referencing the way other large tech companies have leveraged it too.
Breaking news: Our annual OSS Data Tools survey is going viral across the data community... make sure to vote now before Feb. 26!
7. The Rise of Metadata Management Systems (Gradient Flow)
Assaf Araki, an investment manager at Intel Capital (writing in a personal capacity), and Gradient Flow founder Ben Lorica co-authored an article on the growing importance of metadata management systems. Their prediction is that “metadata will be the foundation for data governance solutions, data catalogs, and other enterprise data systems.” The article also includes useful graphs on the metadata stack’s 3 layers – unified schema, data catalog, and governance – and the companies, tools, and OSS projects that are available in each of these layers.
8. 2020 Data Science Notebooks in Review (Deepnote)
The team at Deepnote ran research to identify 2020 trends for Jupyter notebooks, surfacing the most popular libraries, search trends, and other stats. Among notebooks on GitHub, Deepnote found that the two most starred notebook repositories of 2020 were Fast AI’s Fastbook and FastPages, while matplotlib, numpy, and pandas were the most popular Python libraries in this sample. Read the article for more details, and if you want to go beyond the summary, note that you can access the notebook and data sources with which the analysis was conducted.
9. Metric Standardization (Uber)
Members of Uber’s engineering team wrote a blog post to introduce uMetric, the company’s “unified internal metric platform that powers the full lifecycle of a metric from definition, discovery, planning, computation, and quality to consumption.” Such an engineering-based solution had become a necessity to prevent democratized access to metrics from causing discrepancies in business-critical information and in the decisions based on it. The article presents the pillars uMetric is built on, its ecosystem, and lessons learned along the way.
10. Scaling an ML Team (Aquarium)
Aquarium CEO Peter Gao shared advice on scaling an ML team from 0 to 10 people. Some tips to consider: “always try to utilize great open source or paid tooling rather than building something hacky yourself”; when hiring, consider looking for an ML Operations Manager and/or an ML Product Manager who can help your team ship faster; and keep on deploying new models. If you found this interesting, or are at a later stage, you will also want to read Part 2: “Scaling An ML Team (10–100+ People)”.
Have you published, read, or listened to anything that you’d recommend to the data community? Feel free to let us know: firstname.lastname@example.org