Fivetran is a data pipeline that syncs data from apps, databases and file stores into our customers’ data warehouses. The question we get asked most often is “what data warehouse should I choose?” In order to better answer this question, we’ve performed a benchmark comparing the speed and cost of three of the most popular data warehouses — Amazon Redshift, Google BigQuery, and Snowflake.
Batch data processing — historically known as ETL — is extremely challenging. It’s time-consuming, brittle, and often unrewarding. Not only that, it’s hard to operate, evolve, and troubleshoot.
In this post, we’ll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process. This post distills fragments of wisdom accumulated while working at Yahoo, Facebook, Airbnb and Lyft, with the perspective of well over a decade of data warehousing and data engineering experience.
One might surmise that data "analysis" is, first and foremost, about data "access." It goes without saying that someone in the analyst's role must first obtain access to the data they wish to analyze. And with data being spread all over the inside, and now outside, of the enterprise (think of both your on-premises data stores, plus all the cloud and SaaS vendors you're currently using) modern-day analysts face deeper challenges than ever before in obtaining access to the data they need.
And of course, techno-philosophical concepts like "democratizing access to data" do nothing at all to help one overcome any of the actual technical integration challenges required to practically enable such unfettered access to one's data.
The past few years have been an interesting time for data science everywhere, and for the media in particular! We’ve seen some incredible new technologies emerge, like open-source machine learning platforms, as well as machine learning services. These developments have opened the door for new consumer products, like conversational AIs, and new technologies in the media and advertising industries.
Since our initial DataEngConf in 2015, The New York Times has been a key supporter of the conference. The very first ever DataEngConf talk was a keynote given by Chris Wiggins, the Times' Chief Data Scientist, who presented a broad yet fascinating perspective on "Data Science at The New York Times" (video here).
In the years since, we've had deeply technical talks from both data engineers and data scientists at the Times, and I'm excited that their involvement in DataEngConf this year is as large as it's ever been.
(Image source: http://arrow.apache.org/)
As data has proliferated and open-source software (OSS) has continued to dominate both the stacks and the business models of the top tech companies in the world, the number of different types of data platforms and tools we've seen emerge has accelerated.
Having a hard time keeping up with the differences between Kudu, Parquet, Cassandra, HBase, Spark, Drill and Impala? You're not alone, and obviously this is one of the reasons we bring together top OSS contributors to these platforms to share at DataEngConf.
But there's one new innovation that attempts to bind all the above projects together by enabling them to share a common memory format. It's a new top-level Apache project called Arrow that aims to dramatically decrease the amount of computation wasted on serializing and deserializing in-memory objects. That serialization pattern is common in analytics applications that pass data between systems, each of which has its own internal memory representation.
As an engineer turned founder, I've been passionate for years about helping other technical founders succeed. Founders face a unique set of challenges, and building support communities to help them overcome those obstacles moves innovation forward.
More broadly speaking, I'm also a proponent of bringing engineers together - hence our efforts in the data community via meetups, our conference series and via organizing other, smaller, events for engineers, data scientists and CTOs through Hakka Labs for the past 5 years.
This is why I'm so excited to bring these two efforts - supporting startups and supporting the data community - together at our upcoming DataEngConf NYC.
What do you do when you need to scale out your RDBMS to support greater data volumes than you originally anticipated? Traditionally, you would either scale your infrastructure vertically by putting your database on more powerful (costlier) machines, or shard your data across multiple workers.
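To make the sharding option concrete, here is a minimal sketch of hash-based sharding in Python. It is only an illustration, not how any particular database does it: the function names `shard_for` and `partition`, the `user_id` key, and the choice of three shards are all assumptions made for the example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a row key to a shard index using a stable hash (hypothetical helper)."""
    # hashlib gives the same result across processes and machines,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(rows, key_field, num_shards):
    """Group rows into num_shards lists according to the hash of key_field."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(str(row[key_field]), num_shards)].append(row)
    return shards


# Example data: ten rows keyed by an assumed "user_id" column.
rows = [{"user_id": i, "name": f"user{i}"} for i in range(10)]
shards = partition(rows, "user_id", 3)

for index, shard in enumerate(shards):
    print(index, [row["user_id"] for row in shard])
```

Each row always lands on the same worker, so a lookup by key touches only one shard. The trade-off with plain modulo hashing is that changing the number of shards moves most keys, which is one reason resharding a live database is painful.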
A re-wording of one of the key maxims for startup success could be "KISS" - "keep it simple, stupid." If you've ever run your own startup, you also know the mantras of "focus" and "fail fast," and the critical reminder of how your product should be a "pain-killer not a vitamin."