Data Council Blog

17/11/20 12:30 | by Data Council

NLP Heroes, Pinot, Data Testing, and More: Top 10 Links From Across the Web

Here's our November 2020 roundup of good reads and podcast episodes that might be relevant for your career in data:

1. Heroes of NLP: Quoc Le (Deeplearning.ai)

NLP researcher Quoc Le was recently Andrew Ng’s guest as part of the ‘Heroes of NLP’ video series. Their discussion covered Le’s impressive journey, from growing up in Vietnam and developing his first basic chatbot in high school to becoming Google Brain’s first intern, and everything that followed. This includes the ‘Google Cat’ experiment, the Meena chatbot project, and work on Seq2Seq models. Check out the conversation here, and consider subscribing to the series to hear from other guests such as Chris Manning, Kathleen McKeown, and Oren Etzioni.

2. Pinot at Uber’s Scale (Uber)

Here’s an interesting post on Uber Engineering’s blog about the way it operates Apache Pinot, the distributed OLAP datastore aimed at delivering scalable real-time analytics with low latency. The post highlights the importance of real-time analytics use cases at Uber and the scale of its data, before explaining how Pinot fits in, the contributions that Uber’s Pinot team made to improve its reliability and query flexibility, and some of the lessons learned along the way.

Pinot was a recent topic on DC_THURS - watch the episode here.

3. Identifying the Right Job Offers (Build a Career in Data Science)

Data scientists Jacqueline Nolis and Emily Robinson published the book “Build a Career in Data Science” earlier this year, and are now discussing the same topic in a podcast series. In a recent episode, they covered what newcomers to data science should consider when looking at job offers. From seeing through job titles to analyzing ‘required’ technical skills, there are some useful tips you might want to use yourself or pass along to others.

4. A Comparison of Data Version Control Tools (DAGsHub)

DAGsHub, a platform for data science collaboration, published a comparison of tools that bring version control to data and machine learning, such as DVC and Delta Lake. The post includes a summary and pros and cons for each tool, as well as a comparison table recapping whether each one is open source, data format agnostic, and more. DAGsHub also asks an important question: do you really need data versioning? In some cases, maybe not, but if you do, these tools will prove useful.

5. A Primer on Data Quality (Redpoint Ventures)

Data quality is decidedly a hot topic these days, and VC Astasia Myers wrote a good primer about it. Her blog post looks into why managing data quality is more painful than ever for those in charge, and the startups and tools that can help them manage these issues. The post also includes visual elements such as a flower-shaped diagram of data quality attributes, from accuracy and completeness to format, integrity, and consistency.

6. Testing, Testing, Testing (A.P. Moller - Maersk)

Towards Data Science promoted a post by data engineer Micha Kunze on practical recommendations to achieve data quality. His main exhortation is to “test your data until it hurts.” “If you do not test you do not know, simple as that,” Kunze explains. He then goes on to explain how his team continuously tests data thanks to Great Expectations, with some tangible results in preventing potentially costly errors.

7. Achieving Data Quality (Airbnb)

Still on data quality but from an organizational perspective, Airbnb shared insights into its Data Quality Initiative, which it started implementing in 2019. At the time, the company was definitely no longer a small startup, and needed to adjust its practices accordingly. The Initiative is still ongoing and focuses on 5 key points: data ownership, pipelines, validation, documentation, and discoverability. It also has an impact on hiring, architecture, and more.

8. Data Platform Scaling (DoorDash)

“How DoorDash is Scaling its Data Platform to Delight Customers and Meet our Growing Demand”: despite its somewhat boastful title, this is a long, interesting read that also covers the challenges the company’s engineering team faced along the way, while considering its top goals. It points out that there are many solutions to any given problem, with no one-size-fits-all answer, which is precisely why it’s interesting to find out more about the choices DoorDash made (build vs. buy, focus on SLAs…) and the rationale behind them.

9. What is a Feature Store? (Feast/Tecton)

Tecton’s CEO Mike Del Balso and Feast’s creator Willem Pienaar have co-authored a post defining what a feature store is. Yes, Feast and Tecton are both feature stores, but the post goes beyond that and covers what feature stores in general can do for data science teams putting ML into production. “We wrote this blog post to provide a common definition of feature stores as they emerge as a primary component of the operational ML stack. We believe the industry is about to see an explosion of activity in this space,” they predicted.

10. What’s to Love About Postgres (Crunchy Data)

The Changelog podcast recently featured a discussion between host Jerod Santo and guest Craig Kerstiens, a PostgreSQL aficionado who joined Crunchy Data a few months ago after previous roles at Heroku, Citus Data, and Microsoft. Also available as a transcript, the conversation focused on ‘what’s so exciting about Postgres’ -- from its best features and current use cases (including lesser-known ones) to its future. You may also want to check out comments about the episode on Hacker News, including some replies from Craig himself.

Have you published anything that you’d like to recommend to the data community? Make sure to let us know: community@datacouncil.ai