Build a Modern Enterprise-Grade Data Lakehouse with Open Source Tools
This 3-day intensive course guides participants through the creation of a modern, enterprise-grade data lakehouse architecture using only open-source technologies. The course adopts a fully hands-on, learning-by-doing approach, where participants build a complete, end-to-end data platform from scratch.
From ingestion and transformation to data storage, quality assurance, querying, and visualization, learners will design and deploy their own modular data stack centered on data lakehouse principles, combining the scalability of data lakes with the structure and governance of data warehouses.
Throughout the course, participants will work on practical, scenario-based projects using datasets from the Luxembourg open data ecosystem, simulating real-life challenges such as combining diverse data sources, maintaining data quality, and ensuring version control.
By the end of the training, participants will have gained real-world experience with industry-leading open-source tools including Airbyte, Apache Airflow, dbt, Soda, MinIO, Apache Iceberg, Nessie, Dremio, and Apache Superset. The course emphasizes governance, scalability, and cost-efficiency, equipping learners with the skills to implement future-proof, vendor-independent data platforms.
Content
- Data ingestion with Airbyte from files, databases, REST APIs, and open data sources
- Storage using Apache Iceberg & MinIO (sketched below)
- Data modeling & transformation with dbt
- Pipeline orchestration with Apache Airflow (sketched below)
- Data quality control with Soda (sketched below)
- Data versioning and branching with Nessie
- SQL-based querying via Dremio
- Interactive analytics with Apache Superset
- Local deployment of the full stack via Docker Compose
- Hands-on project work using datasets from the open data ecosystem
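To give a flavour of the stack, here is a minimal sketch of reading an Apache Iceberg table through Nessie's Iceberg REST endpoint, with the table files stored in MinIO. The endpoints, credentials, and the `bronze.air_quality` table are illustrative assumptions for a local Docker Compose setup, not course material:

```python
# Minimal sketch: read an Iceberg table via Nessie's Iceberg REST
# endpoint, with data files in MinIO. All URIs, credentials, and the
# table name are assumptions for a local Docker Compose setup.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://localhost:19120/iceberg",  # Nessie's Iceberg REST endpoint (assumed port)
        "s3.endpoint": "http://localhost:9000",   # MinIO (assumed port)
        "s3.access-key-id": "minioadmin",         # MinIO dev-default credentials
        "s3.secret-access-key": "minioadmin",
    },
)

# Load a (hypothetical) table and pull a small sample into pandas
table = catalog.load_table("bronze.air_quality")
print(table.scan(limit=10).to_pandas())
```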
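Pipeline orchestration is handled by Apache Airflow. The sketch below, under assumed paths and names, chains an Airbyte sync, a dbt run, and a Soda scan into one daily DAG:

```python
# Minimal sketch of a daily Airflow DAG: ingest with Airbyte, transform
# with dbt, validate with Soda. Script paths, the dbt project directory,
# and the Soda data source name are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="lakehouse_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Trigger the Airbyte connection sync (assumed helper script)
    ingest = BashOperator(
        task_id="airbyte_sync",
        bash_command="python /opt/scripts/trigger_airbyte_sync.py",
    )

    # Build the dbt models on top of the raw tables
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/lakehouse",
    )

    # Run the SodaCL checks against the transformed tables
    quality = BashOperator(
        task_id="soda_scan",
        bash_command="soda scan -d lakehouse -c /opt/soda/configuration.yml /opt/soda/checks.yml",
    )

    ingest >> transform >> quality
```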
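Soda checks can also be driven programmatically from Python via soda-core, which is handy inside an orchestrated task. A minimal sketch, assuming a configured `lakehouse` data source and a hypothetical `air_quality` table:

```python
# Minimal sketch: run SodaCL checks with the soda-core Python API.
# The data source name, configuration file, and checked table/columns
# are illustrative assumptions.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("lakehouse")                 # assumed data source
scan.add_configuration_yaml_file("configuration.yml")  # connection settings

# Inline SodaCL: basic completeness checks on a hypothetical table
scan.add_sodacl_yaml_str(
    """
checks for air_quality:
  - row_count > 0
  - missing_count(station_id) = 0
"""
)

exit_code = scan.execute()  # 0 means all checks passed
print(scan.get_scan_results())
```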
Learning Outcomes
Upon completion of the course, learners will be able to:
- Understand the architecture of a modern open-source data stack
- Set up and operate ingestion pipelines using Airbyte
- Build transformation workflows with dbt
- Implement data versioning using Nessie and Apache Iceberg
- Validate and test data using Soda
- Query and visualize data using Dremio and Apache Superset
- Deploy a fully functional data platform using Docker Compose
- Evaluate when and how to use these tools in a real enterprise context
Training Method
- Instructor-led, fully hands-on
- Learning-by-doing, guided project work
- A fictional enterprise scenario, powered by open data sources, serves as the backbone for building a real-world data lakehouse
Certification
Certificate of Participation
Prerequisites
- Familiarity with SQL and basic command-line usage
- (Optional) Some exposure to data pipelines and concepts such as ELT/ETL
- (Optional) Familiarity with Python and Docker is helpful
Planning and location
Day 1: 09:00 - 17:00
Day 2: 09:00 - 17:00
Day 3: 09:00 - 17:00
Your trainer(s) for this course
Bruno WOZNIAK
A seasoned tech executive turned expert in digital transformation, innovation, and data-driven sense-making, I am driven by a passion for giving back and empowering others through practical learning. I create hands-on training experiences focused on real-world application: with near-real-life cases and mini-projects, participants gain actionable insights and skills they can put into practice immediately.