Data Council Blog

06/04/20 12:16 | by Data Council

PyTorch Lightning, ksqlDB and More: Top 10 Links from Across the Web

Here are 10 recent relevant links for data professionals, from blog posts and tutorials to podcast episodes:

1. PyTorch Lightning: a gentle introduction

Former Data Council speaker Will Falcon published an interesting post on PyTorch Lightning, the lightweight PyTorch wrapper born out of his Ph.D. AI research at NYU CILVR and Facebook AI Research (FAIR). Framed as "a gentle introduction", it includes a side-by-side comparison of building a simple MNIST classifier in PyTorch and in PyTorch Lightning, to illustrate how to refactor one into the other. This is highly recommended reading if you work on AI/ML, whether as a professional researcher, as a student or in production.

2. Data visualization and responsibility in times of COVID-19

Know what you don't know: this is the takeaway message of the special episode of the PolicyViz podcast on DataViz in the Time of COVID. Guests Alberto Cairo, Amanda Makulec, Jennifer Manganello and Kenneth Field had a word of warning: don't share potentially critical insights that haven't at least been vetted by experts. If you can't team up with experts but still want to have a go at coronavirus-related data, they suggest focusing on visualizations that are topical but have no life-or-death impact. From toilet paper calculators to interactive maps locating free meals for children during school closures, the choice is yours - so please, be responsible and don't fall victim to the Dunning-Kruger effect.

3. Lying to your model to understand class sensitivity

Community portal Analytics Vidhya is publishing a three-part series by Alectio on its research into labeling strategy and stress-testing ML models, which you can also follow on Alectio's Medium. In part 1, co-authors Akanksha Devka, ML engineer, and Jennifer Prendki, founder, present findings from a case study based on the CIFAR-10 dataset. Their premise: "To find out which classes are easy for a model to confuse, just try confusing the model and see how easily it gets fooled." This led them to two experiments, label noise induction and data reduction, with interestingly contrasting results.
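
For readers unfamiliar with the term, label noise induction can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `induce_label_noise`, the choice to corrupt a single class, and the rule of relabeling to a uniformly random other class are all assumptions made for the example.

```python
import random


def induce_label_noise(labels, target_class, noise_rate, num_classes, seed=0):
    """Return a copy of `labels` in which a fraction `noise_rate` of the
    samples labeled `target_class` are relabeled to a random other class.

    Hypothetical helper for illustration only; the relabeling rule is an
    assumption, not the procedure described in the series.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    # Positions of all samples that currently carry the target label.
    candidates = [i for i, y in enumerate(labels) if y == target_class]
    n_flip = int(round(noise_rate * len(candidates)))
    other_classes = [c for c in range(num_classes) if c != target_class]
    # Pick n_flip distinct samples and give each a wrong label.
    for i in rng.sample(candidates, n_flip):
        noisy[i] = rng.choice(other_classes)
    return noisy


# Toy labels for a 10-class problem (CIFAR-10 also has 10 classes).
labels = [i % 10 for i in range(1000)]
noisy = induce_label_noise(labels, target_class=3, noise_rate=0.3, num_classes=10)
```

Training the same model on `labels` and on `noisy`, then comparing per-class accuracy, is one way to see how sensitive the model is to confusion in a given class.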

4. Building streaming apps with SQL statements

In a recent episode of 'Ask Confluent', Gwen Shapira was joined by Vinoth Chandar, mostly to discuss ksqlDB, a project of which he is one of the main drivers. As you may remember, Vinoth previously worked at Uber, where he co-created Apache Hudi. As for ksqlDB, it is a Kafka-native event streaming database purpose-built for stream processing applications. To learn more about it and its potential use cases, make sure to watch the full video.

5. Democratizing analytics engineering

Fishtown Analytics' CEO Tristan Handy wrote a very insightful blog post to accompany the launch of the dbt IDE and explain the thinking behind it. In short, the IDE stems from the view that one of the main issues in data is the chasm between IT and business users. dbt aims to bridge this chasm, and is now going one step further with its Integrated Developer Environment. By lowering the bar for analytics engineering, the IDE in turn facilitates the creation and dissemination of knowledge inside the organization.

6. The rise of data engineering

Priyanka Somrah, VC analyst at Work-Bench, published a post on Medium about the solutions and people bridging the gap between data science and data engineering workflows, from dbt to Dagster and others. As an investor's take on this trend, it is worth reading in itself, but also for the resources it points to. These include the article by Tristan Handy that we mentioned above, a Data Council talk by Apache Calcite creator Julian Hyde and another one by Apache Airflow & Apache Superset creator Max Beauchemin, as well as a few other links you might want to check out.

7. Fighting COVID-19 with NLP and AI techniques

The Allen Institute for Artificial Intelligence (AI2) and others joined forces to create the COVID-19 Open Research Dataset (CORD-19), which consists of more than 44,000 scholarly articles related to the current pandemic. The next step is for researchers to "apply recent advances in natural language processing and other AI techniques [to the dataset] to generate new insights in support of the ongoing fight against this infectious disease," as Kaggle explains on the homepage of the related AI challenge. This is an area where our community is well placed to help - so if it's up your alley, please check it out!

8. Scaling data to the mesh with OSS

With telecommunications on our minds perhaps more than ever, you might want to listen to the 53rd episode of Heroku's Code[ish] podcast. Recorded during NodeConfEU, it features senior developer advocate Julián Duque interviewing Luca Maraschi, chief architect at TELUS Digital. As you may know, TELUS is one of Canada's main telcos, so it is quite refreshing to hear how Luca's team went "open source first and cloud native as much as [they could]," including adopting very recent developments like Kuma and Kong.

9. Federated learning techniques and applications

Courtesy of IBM, here's a two-part series on federated learning, which is going to be crucial to solving the conundrum of preserving data privacy without thwarting machine learning applications. Author Nathalie Baracaldo leads the AI Security and Privacy Solutions team and is a Research Staff Member at IBM's Almaden Research Center. In part 1, she presents a technique that goes beyond 'vanilla federated learning' to truly enable the collaborative training of ML models while preventing inference attacks. In part 2, which is more relevant if you're already familiar with the paper behind part 1, she details the difference between horizontal and vertical federated learning and explores potential applications, from manufacturing to healthcare.

10. 2019, an impressive year for NLP

With 2019 now feeling like a decade ago, Elvis Saravia's thorough recap comes in handy as a reminder of the year's most relevant developments in NLP. From publications to datasets, it's got you covered - with a special mention for consolidating trends such as explainability and health applications, as well as the growing conversation around ethics and fairness.

Have you created or stumbled upon a piece of content that you'd like to share? Make sure to let us know: community@datacouncil.ai