The Data Lakehouse – Best of Data Lake and Data Warehouse

Almost every company today uses some kind of data warehouse or business intelligence solution for data analysis and reporting. These solutions are primarily based on relational data, ETL jobs and reporting. Although powerful, they are limited when it comes to very large data sets or real-time processing.

Some years ago the Data Lake paradigm emerged to handle very large data sets. Data Lakes are built around raw data processing, streaming data, ELT and machine learning.

What about combining the strengths of both into something even more powerful? This is what is called the Data Lakehouse, a term coined by Databricks.

Figure: Evolution of data storage, from data warehouses to data lakes to data lakehouses. Source: https://databricks.com/de/glossary/data-lakehouse

As the name suggests, it combines the strengths of Data Warehouses with the power of Data Lakes. Although the term Data Lakehouse was not yet widely used in 2020, we built exactly that for a logistics company back then.

One of the main datasets in this project comprised 16 years of freight offers plus live data. The historical data was migrated from Oracle databases into a new Data Lake. In addition, stream sources were set up to ingest live data directly from the source applications into the Data Lake. The result was a huge active archive of historical and live data based on Hadoop, Spark, Kafka and HBase. The raw data was stored and continuously transformed into a normalized form, ready to be processed by reporting and machine learning jobs. A logical structure, metadata and governance were added using Apache Atlas and Avro schemas. Reporting and end-user security were implemented with Microsoft Power BI.
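To make the streaming ingestion path a bit more concrete, here is a minimal sketch of how such a pipeline could look with Spark Structured Streaming reading from Kafka. The topic name, broker address, schema fields and paths are illustrative assumptions, not the actual project setup, and for simplicity the sketch appends Parquet files to the lake rather than writing to HBase:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Sketch: read raw freight offer events from Kafka, normalize them,
// and append them to the data lake for downstream reporting and ML jobs.
object FreightOfferIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("freight-offer-ingest").getOrCreate()
    import spark.implicits._

    // Assumed shape of a raw freight offer event (hypothetical fields).
    val offerSchema = StructType(Seq(
      StructField("offerId", StringType),
      StructField("origin", StringType),
      StructField("destination", StringType),
      StructField("priceEur", DoubleType),
      StructField("offeredAt", StringType)))

    // Subscribe to the live offers topic (topic and broker are assumptions).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "freight-offers")
      .load()

    // Normalize the JSON payload into a typed, tabular form.
    val offers = raw
      .select(from_json($"value".cast("string"), offerSchema).as("offer"))
      .select(
        $"offer.offerId".as("offer_id"),
        upper($"offer.origin").as("origin"),
        upper($"offer.destination").as("destination"),
        $"offer.priceEur".cast("decimal(12,2)").as("price_eur"),
        to_timestamp($"offer.offeredAt").as("offered_at"))
      .withColumn("offer_date", to_date($"offered_at"))

    // Append to the lake, partitioned by day, so batch jobs can read the
    // normalized live data together with the historical Oracle backfill.
    offers.writeStream
      .format("parquet")
      .option("path", "hdfs:///lake/normalized/freight_offers")
      .option("checkpointLocation", "hdfs:///lake/checkpoints/freight_offers")
      .partitionBy("offer_date")
      .start()
      .awaitTermination()
  }
}
```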

The result was something we would probably call a Data Lakehouse today. The combination of BI and Data Lake was very successful, so we wrote a success story to describe it.
To me it seems that the Data Lakehouse is a very useful concept. It is an evolutionary step towards an integrated solution for processing and analysing massive amounts of data while applying good practices in governance, security and reporting. Definitely something BI teams should keep an eye on.

Workshop: Big Data you can Touch

Today I released the brand new workshop: Big Data you can Touch.

If you start researching Big Data platforms, you will find an overwhelming number of possible technologies. But if you dig deeper, you'll find that many platforms are based on the same proven Open Source products.

This workshop teaches you how to set up your own Big Data platform using professional Open Source products. Together we'll build an end-to-end use case using a Lambda Architecture and Machine Learning.
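For readers unfamiliar with the pattern, here is a tiny, self-contained Scala sketch of the Lambda Architecture idea the workshop builds on: a batch layer recomputed from the full history, a speed layer for recent events, and a query that merges both. The data model and numbers are purely illustrative, not part of the workshop material:

```scala
// Minimal illustration of the Lambda Architecture pattern.
object LambdaSketch {
  final case class Event(key: String, amount: Long)

  // Batch layer: periodically recompute views from the complete, immutable history.
  def batchView(master: Seq[Event]): Map[String, Long] =
    master.groupBy(_.key).map { case (k, es) => k -> es.map(_.amount).sum }

  // Speed layer: incrementally fold in events that arrived since the last batch run.
  def speedView(recent: Seq[Event]): Map[String, Long] =
    recent.foldLeft(Map.empty[String, Long]) { (acc, e) =>
      acc.updated(e.key, acc.getOrElse(e.key, 0L) + e.amount)
    }

  // Serving layer: answer queries from the merged batch and real-time views.
  def query(batch: Map[String, Long], speed: Map[String, Long], key: String): Long =
    batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)

  def main(args: Array[String]): Unit = {
    val history = Seq(Event("route-A", 10), Event("route-B", 5), Event("route-A", 7))
    val live    = Seq(Event("route-A", 3))
    println(query(batchView(history), speedView(live), "route-A")) // prints 20
  }
}
```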

It is intended for anyone who is generally interested in Big Data platforms, e.g. developers, architects, analysts or decision makers who want to know how these technologies work together.

The workshop takes 4 hours and can be booked as an on-site training or an online webinar. Hope to see you there…

Upcoming events:

May 7, 13:00 – 17:00 Webinar: Big Data zum Anfassen

May 13, 13:00 – 17:00 Webinar: Big Data zum Anfassen

May 21, 13:00 – 17:00 Webinar: Big Data zum Anfassen