

Building an Open Data Lakehouse Using Apache Iceberg

Course Details
Code DENG-251

The Open Data Lakehouse is a modern data architecture that enables versatile analytics on streaming and stored data within cloud-native object stores. This architecture can span hybrid and multi-cloud environments. This course introduces Apache Ozone, a hybrid storage service addressing the limitations of HDFS. You'll also explore Apache Iceberg, an open-table format optimized for petabyte-scale datasets. The course covers Iceberg's benefits, architecture, read/write operations, streaming, and advanced features like time travel, partition evolution, and Data-as-Code. Over 25 hands-on labs and a capstone project will equip you with the skills to build an efficient, performant Open Data Lakehouse within your own environment.

Who Can Benefit

  • This course is designed for data professionals within organizations using Cloudera Data Warehouse or Cloudera Data Engineering solutions. If you're building an Open Data Lakehouse powered by Apache Iceberg, this course will provide the knowledge and skills you need. Ideal roles include Data Engineers, Hive/Impala SQL Developers, Kafka Streaming Engineers, Data Scientists, and CDP Admins.

Skills Gained

  • Open Data Lakehouse fundamentals: core concepts and benefits.
  • Introduction to Apache Ozone and its integration within the CDP Ecosystem.

Prerequisites

  • A basic understanding of HDFS and experience with Hive and Spark are prerequisites.

Course Content

Day 1

  • Iceberg Introduction
  • Data Lake Concepts
  • Open Lakehouse
  • Hive Architecture and Tables
  • Introduction to and working with Ozone
  • Transferring data between HDFS and Ozone
  • Ozone Application Integration
  • Iceberg Architecture
  • Iceberg Spark and SQL Setup (see the sketch after this list)
  • Iceberg Catalog Review
  • Iceberg Tables: Managed & External
  • Table Design and Practice
  • Tuning Iceberg Tables for Read vs. Write Workloads
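
To give a taste of the Day 1 setup labs, here is a minimal PySpark sketch of configuring an Iceberg catalog and creating a table with a hidden partition. It assumes Spark 3.x with the Iceberg Spark runtime jar on the classpath; the catalog name ("demo"), warehouse path, and table names are illustrative placeholders, not the course's lab values.

    # Minimal sketch: a Spark session with a Hadoop-backed Iceberg catalog.
    # Assumes the iceberg-spark-runtime jar is on the classpath; "demo" and
    # the warehouse path below are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("iceberg-setup-sketch")
        # Iceberg's SQL extensions enable DDL such as ALTER TABLE ... CREATE BRANCH
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate()
    )

    spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")

    # Create an Iceberg table with a hidden partition: queries filter on
    # event_ts directly and never reference a separate partition column.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS demo.db.events (
            id       BIGINT,
            event_ts TIMESTAMP,
            payload  STRING
        ) USING iceberg
        PARTITIONED BY (days(event_ts))
    """)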

Day 2

  • Schema Evolution: understand the data type issues that arise between Hive and Iceberg during migration
  • Hidden Partitioning: how partitioning works in Iceberg tables, compared with Hive partitioning
  • Time Travel: the various ways to time travel and how it helps with testing (see the sketch after this list)
  • Data-as-Code: write-audit-publish (WAP) for ETL; branching and tags as zero-copy clones for QA testing and ML
  • Iceberg Metadata for Maintenance
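
A brief, hedged sketch of these Day 2 features, continuing with the spark session and demo.db.events table from the Day 1 sketch. It assumes Spark 3.3+ and a recent Iceberg release (branch DDL is newer syntax); the timestamp and branch name are placeholders.

    # Schema evolution: Iceberg tracks columns by ID, so adding or renaming
    # a column is a metadata-only change -- no data files are rewritten.
    spark.sql("ALTER TABLE demo.db.events ADD COLUMN source STRING")

    # Time travel: query the table as of a past timestamp (placeholder value).
    spark.sql("""
        SELECT * FROM demo.db.events
        TIMESTAMP AS OF '2024-01-15 00:00:00'
    """).show()

    # Iceberg exposes metadata tables alongside each table, e.g. its snapshots.
    spark.sql(
        "SELECT snapshot_id, committed_at FROM demo.db.events.snapshots"
    ).show()

    # Data-as-Code: create a named branch as a zero-copy clone for QA testing.
    spark.sql("ALTER TABLE demo.db.events CREATE BRANCH qa_test")
    spark.sql("SELECT * FROM demo.db.events VERSION AS OF 'qa_test'").show()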

Day 3

  • Change Data Capture (CDC)
  • Rollback Data
  • Migration: practice various Hive-to-Iceberg migration approaches (see the sketch after this list)
  • Shallow Migration
  • In-Place Migration
  • Hybrid Migration
  • Snapshot migration for testing
  • Late-arriving data migration
  • Runbook Build
  • Table Maintenance
  • Streaming
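
To give a flavor of the migration and maintenance labs, here is a hedged sketch using Iceberg's Spark stored procedures. It assumes the Spark session catalog is configured as org.apache.iceberg.spark.SparkSessionCatalog so Hive tables and Iceberg procedures coexist; the table names and timestamp are placeholders.

    # Snapshot migration: create a trial Iceberg table over the Hive table's
    # existing data files, leaving the source untouched -- good for testing.
    spark.sql("""
        CALL spark_catalog.system.snapshot(
            source_table => 'db.hive_events',
            table        => 'db.hive_events_iceberg_test'
        )
    """)

    # In-place migration: convert db.hive_events itself to an Iceberg table.
    spark.sql("CALL spark_catalog.system.migrate('db.hive_events')")

    # Table maintenance: expire old snapshots to prune metadata and
    # unreferenced data files. The cutoff timestamp is a placeholder.
    spark.sql("""
        CALL spark_catalog.system.expire_snapshots(
            table      => 'db.hive_events',
            older_than => TIMESTAMP '2024-01-01 00:00:00'
        )
    """)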

Day 4

  • The capstone project builds a Type 2 table data flow, a pattern for managing historical changes to data in a database table (a Type 2 slowly changing dimension). Rather than overwriting a record, a Type 2 table keeps every prior version, allowing users to track changes over time. This is crucial for data warehousing and analytics, where historical data is often required for analysis and reporting. A sketch of the pattern follows below.
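
For orientation, here is one hedged sketch of how such a flow can be driven with Iceberg's MERGE INTO in the same PySpark setting as the earlier sketches. The table, column, and change-detection logic are illustrative assumptions, not the capstone's actual schema.

    # Type 2 pattern, step 1: close out the current version of any changed
    # record and insert rows for brand-new keys. Names are illustrative.
    spark.sql("""
        MERGE INTO demo.db.customers_scd2 AS t
        USING demo.db.customer_updates AS s
        ON t.customer_id = s.customer_id AND t.is_current = true
        WHEN MATCHED AND t.address <> s.address THEN
            UPDATE SET is_current = false, valid_to = s.update_ts
        WHEN NOT MATCHED THEN
            INSERT (customer_id, address, is_current, valid_from, valid_to)
            VALUES (s.customer_id, s.address, true, s.update_ts, NULL)
    """)
    # Step 2 (not shown): insert the new "current" rows for customers whose
    # previous version was just closed out, completing the history chain.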
