Technical Talks


Cloud-Native Stream Processing

Sid Anand | Hacker at Large, Co-chair @ QCon & Data Council, PMC & Committer @ Apache Airflow | Apache Software Foundation

Big Data companies (e.g. LinkedIn, Facebook, Google, and Twitter) have historically built custom data pipelines over bare metal in custom-designed data centers. While this affords greater control over performance and cost, it often creates a division between operations and development that reduces agility and velocity. Operations teams see their clients (developers) as internal customers and hence often don't invest in self-service tooling, preferring archaic human-backed ticketing systems with human-scale turnaround SLAs (e.g. hours, days, or even months to get new machines). The public cloud is a game changer because all users of infrastructure are treated as external customers - hence, all resources, from Kinesis or Pub/Sub streams to S3 or Cloud Storage buckets to DynamoDB or Bigtable tables, can be requested and provisioned on the fly!
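As a minimal sketch of what "provisioned on the fly" looks like in practice, the function below targets the shape of boto3's Kinesis API (`create_stream` / `describe_stream` are real boto3 calls); the `FakeKinesis` class is a hypothetical in-memory stand-in so the flow can run without AWS credentials, and is not part of any real SDK:

```python
# Sketch: requesting a Kinesis stream on demand, no ops ticket required.
# provision_stream mirrors the boto3 Kinesis client interface; FakeKinesis
# is an assumed stand-in for boto3.client('kinesis') used here for illustration.

def provision_stream(client, name, shards=1):
    """Request a stream, then poll until it reports ACTIVE."""
    client.create_stream(StreamName=name, ShardCount=shards)
    while True:
        desc = client.describe_stream(StreamName=name)
        if desc["StreamDescription"]["StreamStatus"] == "ACTIVE":
            return desc
        # Real code would sleep between polls, or use boto3's
        # built-in "stream_exists" waiter instead of this loop.

class FakeKinesis:
    """In-memory stand-in for a live Kinesis client (illustration only)."""
    def __init__(self):
        self.streams = {}

    def create_stream(self, StreamName, ShardCount):
        # A real stream passes through CREATING first; the stub
        # becomes ACTIVE immediately to keep the sketch runnable.
        self.streams[StreamName] = {
            "StreamDescription": {
                "StreamName": StreamName,
                "StreamStatus": "ACTIVE",
                "Shards": [{} for _ in range(ShardCount)],
            }
        }

    def describe_stream(self, StreamName):
        return self.streams[StreamName]

client = FakeKinesis()
desc = provision_stream(client, "email-telemetry", shards=2)
print(desc["StreamDescription"]["StreamStatus"])  # ACTIVE
```

The point of the sketch is the workflow, not the stub: the same `provision_stream` call works against a real `boto3.client('kinesis')`, collapsing a human-scale ticketing turnaround into a single API round trip.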

Thanks to recent cloud improvements, data infrastructure (i.e. databases, data pipelines, search engines, blob stores) can now be not only provisioned on the fly but also autoscaled and auto-healed without the developer even being aware. EC2 autoscaling, introduced roughly eight years ago, is increasingly being supplanted by serverless (e.g. AWS Lambda) and fully-hosted (e.g. Amazon ElastiCache, Elasticsearch, DynamoDB) offerings. Configuration-management tools such as Chef, Puppet, and Ansible are increasingly giving way to infrastructure-as-code tools such as Terraform. Developers of data pipelines, whether predictive or ETL, can focus on the differentiated aspects of their work, leaving the management of data infrastructure largely to AWS.

Agari, a leading email security company, applies big & fast data best practices to both the security industry and the cloud in order to secure the world against email-borne threats. We do this by building near-real-time, predictive stream-processing data pipelines & control systems in the AWS cloud that are highly scalable, highly available, low-latency, and easy to manage. Come to this talk to learn more.

Sid Anand
Hacker at Large, Co-chair @ QCon & Data Council, PMC & Committer @ Apache Airflow | Apache Software Foundation

Sid Anand recently served as PayPal's Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions, including Data Architect at Agari, Technical Lead in Search & Data Analytics at LinkedIn, Cloud Data Architect at Netflix, VP of Engineering at Etsy, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. When not working, Sid enjoys spending time with family and friends.