The Data Lakehouse – Best of Data Lake and Data Warehouse

Almost every company today uses some kind of data warehouse or business intelligence solution for data analysis and reporting. Those solutions are primarily based on relational data, ETL jobs and reporting. Although powerful, they are limited when it comes to very large data sets or real-time processing.

Some years ago the Data Lake paradigm emerged to process very large data sets. Data Lakes are based on the idea of raw data processing, streaming data, ELT and machine learning.

What about combining the strengths of both into something even more powerful? This is what is called the Data Lakehouse, a term coined by Databricks.

Figure: Evolution of data storage, from data warehouses to data lakes to data lakehouses. Source: https://databricks.com/de/glossary/data-lakehouse

As the name suggests, it combines the strengths of Data Warehouses with the power of Data Lakes. Although the term Data Lakehouse was not yet in common use in 2020, we had already built one for a logistics company back then.

One of the main datasets in this project comprised 16 years of freight offers plus live data. The historical data was transferred from Oracle databases to a new Data Lake. In addition, streaming sources were set up to ingest live data directly from the source applications into the Data Lake. The result was a huge active archive of historical and live data based on Hadoop, Spark, Kafka and HBase. The raw data was stored and continuously transformed into a normalized form, ready to be processed by reporting and machine learning jobs. A logical structure, metadata and governance were added using Apache Atlas and Avro schemas. Reporting and end-user security were implemented using Microsoft Power BI.
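
As a rough illustration, here is a minimal PySpark sketch of what the streaming part of such a pipeline might look like. The topic name, broker address, schema fields and HDFS paths are hypothetical placeholders, not the actual project setup, and the Kafka connector package needs to be available on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Build a Spark session; in a platform like this it would run on a Hadoop/YARN cluster.
spark = (SparkSession.builder
         .appName("freight-offer-ingest")
         .getOrCreate())

# Schema of the incoming freight offer events (hypothetical fields for illustration).
offer_schema = StructType([
    StructField("offer_id", StringType()),
    StructField("origin", StringType()),
    StructField("destination", StringType()),
    StructField("price_eur", DoubleType()),
    StructField("created_at", TimestampType()),
])

# Read the live offer stream from Kafka (broker and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "freight-offers")
       .load())

# Parse the JSON payload into a normalized, typed form.
offers = (raw
          .select(from_json(col("value").cast("string"), offer_schema).alias("offer"))
          .select("offer.*"))

# Continuously append the normalized records to the Data Lake (e.g. Parquet on HDFS),
# where reporting and machine learning jobs can pick them up.
query = (offers.writeStream
         .format("parquet")
         .option("path", "hdfs:///lake/normalized/freight_offers")
         .option("checkpointLocation", "hdfs:///lake/checkpoints/freight_offers")
         .outputMode("append")
         .start())

query.awaitTermination()
```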

The result was something we would probably call a Data Lakehouse today. The combination of BI and Data Lake was very successful, so we created a success story to describe it.
To me it seems that the Data Lakehouse is a very useful concept. It is an evolutionary step towards an integrated solution for processing and analysing massive amounts of data while applying good practices in terms of governance, security and reporting. Certainly something BI teams should keep an eye on.

JAX 2020: Big Data and Agile Culture

This year JAX is taking place from 7 to 11 September in Mainz. W-JAX is taking place from 2 to 6 November in Munich. Due to the Corona situation, it will be a special experience, as the conferences are going to be held in a hybrid manner (on-site and online). In my sessions I am going to talk about Big Data and Agile Culture.

In the Big Data session I am going to show you how to set up an Open Source Big Data platform from scratch. You will see how popular technologies such as Hadoop, Spark, Hive, Kafka and others work together. We are going to implement a typical end-to-end use case live together. You’ll get a solid understanding of what these technologies do and how they work together to form a platform.
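
To give an idea of how such an end-to-end flow can look, here is a minimal sketch that exposes data in the lake as a Hive table and runs a simple reporting query with Spark SQL. The table name, schema and path are hypothetical placeholders rather than the actual session material.

```python
from pyspark.sql import SparkSession

# Spark with Hive support, so tables registered here are visible to other tools on the platform.
spark = (SparkSession.builder
         .appName("end-to-end-demo")
         .enableHiveSupport()
         .getOrCreate())

# Expose normalized Parquet data in the lake as a Hive table (path is a placeholder).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS freight_offers (
        offer_id STRING,
        origin STRING,
        destination STRING,
        price_eur DOUBLE,
        created_at TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///lake/normalized/freight_offers'
""")

# A typical reporting query: average offer price per route.
report = spark.sql("""
    SELECT origin, destination, AVG(price_eur) AS avg_price
    FROM freight_offers
    GROUP BY origin, destination
    ORDER BY avg_price DESC
""")
report.show()
```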

The Agile session covers aspects of culture as a building block of agile organisation development. I am going to talk about what culture actually is, why it is an essential part of “being agile” and how to develop it. Moreover, I am going to share experiences and common pitfalls from the journey of agile culture development.

I am glad to be there and hope to meet you on-site or online.

Success Story: Big Data in Logistics

In 2019 and 2020 I had the pleasure of supporting TIMOCOM in the implementation of their brand new Big Data Platform.
TIMOCOM is an international logistics platform provider and a true champion in its field.

When we started the initiative, the company had an existing BI system to perform reporting and statistical analysis. The aim was to extend the company's capabilities to collect, store and analyse huge amounts of data. A Big Data solution comprising best-of-breed open source products was chosen. The new technology stack is able to scale not only technically but also business-wise, as it is completely free of license costs. It is based on technologies such as Java, Python, Hadoop, Hive, Kafka, Spark and HBase.
A major challenge in the beginning was that the staff had almost no knowledge of the applied technologies. To cope with this situation and to establish the solution quickly and in high quality, we set up a Creative Software Workbench (CSW). A CSW combines the areas of modern technology, agile methodology and team dynamics to create an environment in which digital products can be created in the best possible way. It is based on more than 25 years of practical experience from many successful and of course some not so successful projects. In this environment, agile engineering and active learning are important elements which helped us to master the Big Data ecosystem in a reasonable amount of time.

The new platform enables the company to gain new insights from their data today and tomorrow. It is an important step into the future, supporting their data-driven business model.
You can read about the project in the success story “Wissen aus Daten”. I am glad that I can add this story to our list of success stories. If you want to know more about it, don’t hesitate to contact me.


JAX 2019: Agile Team Architecture and Big Data

JAX is one of the best-known conferences for Java, architecture and software innovation in Germany. I am glad to have been invited this year to give a few sessions. Between 6 and 10 May 2019, JAX will be taking place at the Rheingold Halle in Mainz.

Agile product teams are becoming more and more mission critical. On the 6th I am going to give a presentation about how agile product teams can be built by applying software architecture principles such as resilience and performance to teams.

When people start learning Big Data technologies, the ecosystem often seems complex due to the sheer number of products involved. On the 8th I am going to show a simple Big Data stack to get started with. I am going to set up a working stack from scratch and implement a working lambda architecture.
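
For readers unfamiliar with the pattern: a lambda architecture combines a batch layer over the complete historical data with a speed layer over the live stream, and a serving layer merges both views at query time. The following minimal PySpark sketch illustrates the idea; the paths, topic and broker names are hypothetical placeholders and not the stack shown in the session.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.appName("lambda-demo").getOrCreate()

# Batch layer: precompute event counts over the complete historical data set (placeholder path).
batch_view = (spark.read.parquet("hdfs:///data/events")
              .groupBy("event_type")
              .agg(count("*").alias("total")))
batch_view.write.mode("overwrite").saveAsTable("batch_event_counts")

# Speed layer: keep an up-to-date count of events that arrived since the last batch run.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "events")                       # placeholder topic
          .load())

realtime_view = (stream
                 .select(col("value").cast("string").alias("event_type"))
                 .groupBy("event_type")
                 .agg(count("*").alias("total")))

# The serving layer would merge batch_event_counts with this in-memory real-time view at query time.
query = (realtime_view.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("realtime_event_counts")
         .start())
```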

You can see the timeslots on the JAX website. I look forward to seeing you there.