Traditional data architectures are not enough to handle the huge amounts of data generated by millions of users. In addition, the diversity of data sources is increasing every day: distributed file systems and relational, column-oriented, document-oriented, and graph databases.
Letgo has been growing quickly over the last few years. Because of this, we needed to improve the scalability of our data platform and endow it with further capabilities, such as dynamic infrastructure elasticity, real-time processing, and real-time complex event processing.
In this talk, we are going to dive deeper into our journey. We started with a traditional data architecture built on ETL and Redshift, and have arrived at an event-oriented, horizontally scalable data architecture. We will explain everything in detail, from event ingestion with Kafka / Kafka Connect to streaming and batch processing with Spark.
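To give a flavor of what Kafka Connect-based ingestion can look like, here is a minimal sketch of an S3 sink connector configuration that lands events from a topic into a data lake bucket. The connector name, topic, bucket, region, and sizing values are hypothetical, not Letgo's actual settings:

```json
{
  "name": "events-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "my-data-lake",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "tasks.max": "1"
  }
}
```

Once events are persisted this way, the same topics can feed both the streaming path and the batch path in Spark.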
On top of that, we will discuss how we have used Spark Thrift Server / Hive Metastore as glue to exploit all our data sources: HDFS, S3, Cassandra, Redshift, MariaDB ... in a unified way from any point of our ecosystem, using technologies like Jupyter, Zeppelin, Superset ... We will also describe how to build ETL pipelines with pure Spark SQL, using Airflow for orchestration.
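To make "ETL with pure Spark SQL" concrete, a single step might look like the sketch below. The table names are hypothetical, and the `{{ ds }}` placeholder assumes Airflow's Jinja templating injects the execution date when the statement runs as an orchestrated task:

```sql
-- Hypothetical daily aggregation step; table names are illustrative.
-- {{ ds }} is Airflow's templated execution date (YYYY-MM-DD).
INSERT OVERWRITE TABLE analytics.daily_event_counts
SELECT
    event_date,
    event_type,
    COUNT(*) AS events
FROM raw.events            -- registered in the shared Hive Metastore
WHERE event_date = '{{ ds }}'
GROUP BY event_date, event_type
```

Because the output table is registered in the shared Hive Metastore, the same result is immediately queryable through the Spark Thrift Server from Jupyter, Zeppelin, or Superset.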
Along the way, we will highlight the challenges we found and how we solved them, and share plenty of useful tips for anyone who wants to start this journey at their own company.
Ricardo Fanjul is a Data Engineer at Letgo, where he designs new data architectures. He specializes in highly scalable technologies like Spark, Flink, Hadoop, Kafka, Cassandra, and Akka. Previously, he developed highly scalable distributed systems at ING Bank, working on the bank's new architecture as part of the core team. He holds a Bachelor's degree in Computer Science and a Master's degree in Web Engineering.