Data Council Blog (2)

02/02/21 14:15 | by Data Council | in Data Science, Data Engineering, big data, Data Pipelines, Machine Learning, Open Source, Data Discovery

Open Source Highlight: OpenLineage

OpenLineage is an API for collecting data lineage and metadata at runtime. Although initiated by Datakin, the company behind Marquez, it was developed with the aim of creating an open standard. As Datakin’s CTO Julien Le Dem explained in a blog post announcing the launch, OpenLineage is meant to answer the industry-wide need for data lineage while making sure efforts in that direction aren’t fragmented or duplicated.

21/01/21 10:56 | by Data Council | in Data Science, Data Engineering, Startups, Data Strategy, Machine Learning, Artificial Intelligence, Open Source, Analytics

Storing Cold Metadata, Snowflake Data Cloud, and More: Top 10 Links From Across the Web

Here's our January 2021 roundup of links from across the web that could be relevant to you:

1. Storing Cold Metadata with Alki (Dropbox)

Dropbox shared insights into Alki, the petabyte-scale metadata store it designed for infrequently accessed metadata (“cold data”). The post details how the one-size-fits-all database Edgestore was reaching capacity limits, and why audit logs were a good candidate to move off costly SSDs. After considering off-the-shelf options, the team settled on building its own solution, Alki, on top of AWS services, with DynamoDB as the hot store and S3 as the cold store. Like HBase or Cassandra, Alki is based on log-structured merge-trees (LSM trees), but it is better suited to handling hot-then-cold audit logs, as well as future use cases at Dropbox.
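To make the hot/cold split more concrete, here is a minimal LSM-style sketch in Python. It is not Dropbox's implementation: the class name, the `hot_limit` threshold, and the in-memory structures standing in for DynamoDB and S3 are all assumptions made for illustration. It only shows the general idea of absorbing writes in a small mutable tier, flushing them into sorted immutable runs, and merging those runs later.

```python
from bisect import bisect_left


class TieredLogStore:
    """Toy LSM-style store: a mutable hot tier plus immutable sorted cold runs."""

    def __init__(self, hot_limit=4):
        self.hot = {}               # stand-in for the hot store (DynamoDB in the post)
        self.cold_runs = []         # stand-in for the cold store (S3 in the post)
        self.hot_limit = hot_limit  # hypothetical flush threshold

    def put(self, key, value):
        """Writes always land in the hot tier first."""
        self.hot[key] = value
        if len(self.hot) >= self.hot_limit:
            self.flush()

    def flush(self):
        """Move the hot tier into one sorted, immutable cold run."""
        if not self.hot:
            return
        self.cold_runs.append(sorted(self.hot.items()))
        self.hot = {}

    def get(self, key):
        """Check the hot tier, then cold runs from newest to oldest."""
        if key in self.hot:
            return self.hot[key]
        for run in reversed(self.cold_runs):
            keys = [k for k, _ in run]
            i = bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all cold runs into one, letting newer values win."""
        merged = {}
        for run in self.cold_runs:  # oldest to newest
            merged.update(run)
        self.cold_runs = [sorted(merged.items())] if merged else []
```

In this sketch, recently written records stay cheap to update, while older records end up in large sorted runs that are rarely touched, which is the access pattern the post describes for audit logs.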

07/01/21 10:45 | by Data Council | in big data, Data Pipelines, Machine Learning, Open Source

Open Source Highlight: Orchest

Orchest is an open-source tool for creating data science pipelines. Its core value proposition is to make it easy to combine notebooks and scripts with a visual pipeline editor (“build”); to make your notebooks executable (“run”); and to facilitate experiments (“discover”).

21/12/20 11:52 | by Data Council | in Data Science, Data Engineering, Startups, Data Strategy, Machine Learning, Artificial Intelligence, Open Source, Analytics

The Modern Data Stack, Metadata Architectures, and More: Top 10 Links From Across the Web

Here's our December 2020 roundup of links from across the web that could be relevant to you:

1. The Modern Data Stack (Fishtown Analytics)

This long-form post on the dbt blog is a must-read. Titled “The Modern Data Stack: Past, Present, and Future,” it answers the question that Tristan Handy has been asking himself for the past two years: “What happened to the massive innovation we saw from 2012-2016?” His carefully thought-out analysis covers the natural cycles of technological shifts, defines the phase we are in as a ‘deployment’ one, and points out high-impact opportunity areas for the next few years, which you might find particularly useful if you are considering launching a new product.

03/12/20 11:02 | by Data Council | in big data, Data Pipelines, Machine Learning, Open Source, Audio Research

Open Source Highlight: Klio

Klio is a framework for easy large-scale processing and ML research on binary files, such as audio files -- its original use case. As a matter of fact, it was developed for audio intelligence at Spotify, which open-sourced it earlier this year at the 2020 International Society for Music Information Retrieval Conference.

17/11/20 12:30 | by Data Council | in Data Science, Data Engineering, Startups, Data Strategy, Machine Learning, Artificial Intelligence, Open Source, Analytics

NLP Heroes, Pinot, Data Testing, and More: Top 10 Links From Across the Web

Here's our November 2020 roundup of good reads and podcast episodes that might be relevant for your career in data:

1. Heroes of NLP: Quoc Le (Deeplearning.ai)

NLP researcher Quoc Le was recently Andrew Ng’s guest as part of the ‘Heroes of NLP’ video series. Their discussion covered Le’s impressive journey, from growing up in Vietnam and developing his first basic chatbot in high school to becoming Google Brain’s first intern, and everything that followed. This includes the ‘Google Cat’ experiment, the Meena chatbot project, and work on Seq2Seq models. Check out the conversation here, and consider subscribing to the series to hear from other guests such as Chris Manning, Kathleen McKeown, and Oren Etzioni.

04/11/20 11:58 | by Data Council | in Data Science, big data, Machine Learning, Open Source, Data Discovery

Open Source Highlight: DataHub

DataHub is a generalized metadata search & discovery tool. Originally created at LinkedIn, it was open sourced in February of this year, and has been adopted by other companies such as Expedia and Typeform, with the ambition to help connect employees to data that matters to them.

19/10/20 05:33 | by Data Council | in Data Science, Data Engineering, Startups, Machine Learning, Artificial Intelligence, Open Source, Analytics

State of AI, Data Quality, and More: Top 10 Links From Across the Web

Here's our October 2020 roundup of good reads and podcast episodes that might be relevant to you as a data professional:

1. Multiplayer Editing: a Pragmatic Approach (Hex)

Data collaboration startup Hex published a great long read on its approach to live collaboration. Written by software engineer Mac Lockard, it takes a look at the respective pros and cons of Operational Transforms and Conflict-free Replicated Data Types (CRDTs), before explaining the solution that Hex adopted. Inspired by Figma's hybrid approach, it can also be described as "Atomic Operations (AO), as all edits to application state are broken down to their smallest atomic parts." "If the application you are building can rely on last-writer-wins semantics, Atomic Operations might provide a more pragmatic approach," the post concludes. This is a highly recommended read if you are weighing a similar decision.
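As a rough illustration of what last-writer-wins over atomic operations can look like, here is a small Python sketch. It is not Hex's code: `AtomicOp`, `split_edit`, `Document`, the dotted field paths, and the `seq` ordering number are all hypothetical names and assumptions introduced for this example. The idea shown is that an edit touching several fields is split into one operation per field, and each field simply keeps the value from the latest operation it has seen.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AtomicOp:
    path: str      # hypothetical address of a single field, e.g. "cell1.title"
    value: object
    seq: int       # hypothetical ordering number; higher means later


def split_edit(edit, seq_start):
    """Break a multi-field edit into one AtomicOp per field."""
    return [AtomicOp(path, value, seq_start + i)
            for i, (path, value) in enumerate(sorted(edit.items()))]


@dataclass
class Document:
    fields: dict = field(default_factory=dict)    # path -> current value
    versions: dict = field(default_factory=dict)  # path -> seq of last applied op

    def apply(self, op):
        """Last-writer-wins: keep an op only if it is newer for its path."""
        if op.seq > self.versions.get(op.path, -1):
            self.fields[op.path] = op.value
            self.versions[op.path] = op.seq
            return True
        return False


# Two users edit the same cell; operations arrive out of order.
doc = Document()
ops_a = split_edit({"cell1.title": "Sales", "cell1.color": "red"}, seq_start=0)
ops_b = split_edit({"cell1.title": "Revenue"}, seq_start=2)
for op in ops_b + ops_a:
    doc.apply(op)
```

Because each operation touches exactly one field, the two edits only conflict on `cell1.title`, where the later write is kept, while the untouched `cell1.color` change from the first user survives.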

15/10/20 09:30 | by Data Council | in Data Pipelines, Open Source

Open Source Highlight: n8n

Created by Berlin-based developer Jan Oberhauser in 2019, n8n presents itself as “a free and open workflow automation tool”. Think of it as a locally hosted Zapier on steroids.

22/09/20 10:18 | by Data Council | in Data Science, Data Engineering, Startups, Data Pipelines, Machine Learning, Open Source, Analytics

Hot Data Tools pt. 2, End-to-End Data Scientists, and More: Top 10 Links From Across the Web

Here's our September 2020 roundup of good reads and podcast episodes that might be relevant to you as a data professional:

1. What Data Tools Don't Do (Data Council)

Our founder Pete Soderling co-authored a follow-on piece to his previous post with Great Expectations' core contributor Abe Gong and Partner at Amplify Partners Sarah Catanzaro, for which they had interviewed the makers of some of the hottest data tools. The focus is still the same: rather than what their data tools can do, we hear about what they don't do, as a way to better understand how they fit together. From ApertureData to Xplenty, this new installment covers 21 new tools, and you can read it here.
