This is an experience report on implementing and migrating to a scalable data ingestion architecture. The requirements were to process tens of terabytes of data from several sources, with refresh cadences ranging from daily to annual. The main challenge was that each provider has its own quirks in schemas and delivery processes. To address this, we use Apache Airflow to organize the workflows and schedule their execution, and we developed custom Airflow hooks and operators to handle similar tasks across different pipelines. We run on AWS, using Apache Spark to scale the data processing horizontally and Kubernetes for container orchestration.
We will explain the reasons for this architecture and share the pros and cons we have observed while working with these technologies. We will also describe how this approach has simplified bringing in new data sources and considerably reduced maintenance and operational overhead, as well as the challenges we faced during the transition.
Dr. Johannes Leppä is a Data Engineer building scalable solutions for ingesting complex data sets at Komodo Health. Johannes is interested in the design of distributed systems and the intricacies of the interactions between different technologies. He claims not to be lazy, but gets most excited about automating his work. Prior to data engineering, he conducted research in the field of aerosol physics at the California Institute of Technology, and he holds a PhD in physics from the University of Helsinki. Johannes is passionate about metal: wielding it, forging it and, especially, listening to it.