Devoxx Ukraine 2018
from Friday 23 November to Saturday 24 November 2018.
Reza currently leads the Hadoop Platform team at Uber, where his team builds a reliable, scalable data platform that serves petabytes of data using technologies such as Hadoop, Hive, Kafka, Spark, and Presto. Reza is one of the founding engineers of the data team at Uber and helped scale Uber's data platform from a few terabytes to 100+ petabytes while reducing big data latency from 24+ hours to minutes. Reza holds a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign and previously worked at Twitter and Apple on similar infrastructure and big data platforms.
See also https://www.uber.com/
Data-driven decisions rely heavily on storing an ever-increasing amount of data while providing faster, more reliable, and more performant access to it. Building an effective big data platform that can serve 100+ PB of data with minute-level latency while minimizing hardware cost is not straightforward.
This talk outlines the design and architecture of Hudi: an open-source analytical storage system created at Uber to manage petabytes of data on HDFS-like distributed storage. Hudi provides near real-time ingestion and exposes different views of the data: a read-optimized view for batch analytics, a real-time view for driving dashboards, and an incremental view for powering data pipelines. Hudi also manages files on the underlying storage effectively to maximize operational health and reliability. In this talk, we'll dive into the technical aspects of how Hudi lowers data latency across the board while simultaneously achieving orders-of-magnitude efficiency gains over traditional batch ingestion. We will make the case for near real-time dashboards built on top of Hudi datasets that can be cheaper than pure streaming architectures.
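To make the three views concrete, here is a minimal in-memory sketch in Python. It is a toy model, not Hudi's API: the `ToyDataset` class, its method names, and the idea of a compacted base plus a log of pending upserts are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical toy stand-in for a dataset; this is not Hudi's actual API."""
    base: dict = field(default_factory=dict)  # key -> (commit, value), already compacted
    log: list = field(default_factory=list)   # pending (commit, key, value) upserts
    commit: int = 0                           # monotonically increasing commit counter

    def upsert(self, records):
        """Append a batch of key -> value upserts to the log as one commit."""
        self.commit += 1
        for key, value in records.items():
            self.log.append((self.commit, key, value))
        return self.commit

    def compact(self):
        """Fold all pending log entries into the base, oldest first."""
        for commit, key, value in self.log:
            self.base[key] = (commit, value)
        self.log.clear()

    def read_optimized_view(self):
        """Compacted data only: cheap to scan, but may lag recent upserts."""
        return {key: value for key, (_, value) in self.base.items()}

    def real_time_view(self):
        """Compacted data merged with pending upserts: the freshest state."""
        merged = self.read_optimized_view()
        for _, key, value in self.log:
            merged[key] = value
        return merged

    def incremental_view(self, since_commit):
        """Latest value of every record changed after `since_commit`."""
        changed = {k: v for k, (c, v) in self.base.items() if c > since_commit}
        for commit, key, value in self.log:
            if commit > since_commit:
                changed[key] = value
        return changed
```

In this sketch, a batch job would scan `read_optimized_view()`, a dashboard would query `real_time_view()`, and a downstream pipeline would remember the last commit it processed and call `incremental_view(last_commit)` to fetch only what changed since then.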