As exemplified by the success of GPT-4 and Midjourney, the future of AI is undoubtedly multi-modal. However, managing embeddings, images, audio, video, and other unstructured data types brings a host of new data management challenges that the current generation of the data stack cannot meet. These new data types are generally much larger, and AI datasets commonly have deeply nested schemas. The workloads also differ sharply from traditional analytical tasks: across training, evals, debugging, and EDA, core data infrastructure must be natively optimized for both scans and random access, on both scalar columns and large binary or tensor columns. In this talk, we'll walk through use cases spanning training, data management, and EDA of image datasets to illustrate how current data infrastructure fails for AI. We'll introduce the Lance columnar format, a critical open-source foundation for a new generation of AI data infrastructure, and show how it enables a new kind of Lakehouse optimized for multi-modal AI.
Chang She is CEO and co-founder of Eto Labs, where he is building modern data infrastructure for AI. Previously, he architected the ML and experimentation stack at TubiTV as VP of Engineering. In the mythical pre-pandemic epoch, Chang was the second major contributor to pandas, CTO and co-founder of DataPad, and a recovering financial quant.