The Data Lakehouse – Best of Data Lake and Data Warehouse

Almost every company today uses some kind of data warehouse or business intelligence solution for data analysis and reporting. These solutions are primarily based on relational data, ETL jobs and reporting. Although powerful, they are limited when it comes to very large data sets or real-time processing.

Some years ago, the Data Lake paradigm emerged to process very large data sets. Data Lakes are based on the idea of raw data processing, streaming data, ELT and machine learning.

What about combining the strengths of both into something even more powerful? This is what is called the Data Lakehouse, a term coined by Databricks.

Figure: Evolution of data storage, from data warehouses to data lakes to data lakehouses. Source: https://databricks.com/de/glossary/data-lakehouse

As the name suggests, it combines the strengths of Data Warehouses with the power of Data Lakes. Although the term Data Lakehouse was not yet widely used in 2020, we already built one for a logistics company back then.

One of the main datasets in this project comprised 16 years of freight offers plus live data. The historical data was transferred from Oracle databases to a new Data Lake. In addition, stream sources were set up to ingest live data directly from the source applications into the Data Lake. The result was a huge active archive of historical and live data based on Hadoop, Spark, Kafka and HBase. The raw data was stored and continuously transformed into a normalized form, ready to be processed by reporting and machine learning jobs. A logical structure, metadata and governance were added using Apache Atlas and Avro schemas. Reporting and end-user security were implemented using Microsoft Power BI.
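To give an impression of what such a normalization step can look like, here is a minimal PySpark sketch. All paths, table and column names are made up for illustration and do not reflect the actual project:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative sketch: read raw freight offers from the data lake,
# normalize them and publish the result as a table for reporting and ML jobs.
# Reading Avro requires the spark-avro package on the classpath.
spark = (SparkSession.builder
         .appName("normalize-freight-offers")
         .enableHiveSupport()
         .getOrCreate())

raw = spark.read.format("avro").load("/datalake/raw/freight_offers")

normalized = (raw
              .dropDuplicates(["offer_id"])
              .withColumn("offer_date", F.to_date("offer_timestamp"))
              .select("offer_id", "offer_date", "origin", "destination", "weight_kg"))

(normalized.write
 .mode("overwrite")
 .format("parquet")
 .saveAsTable("analytics.freight_offers_normalized"))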

The result was something we would probably call a Data Lakehouse today. The combination of BI and Data Lake was very successful, so we created a success story to describe it.
To me it seems that the Data Lakehouse is a very useful concept. It is an evolutionary step towards an integrated solution for processing and analysing massive amounts of data while applying good practices in terms of governance, security and reporting. Surely something BI teams should keep an eye on.

Success Story: Big Data in Logistics

In 2019 and 2020 I had the pleasure of supporting TIMOCOM in the implementation of their brand-new Big Data platform.
TIMOCOM is an international logistics platform provider and a true champion in its field.

When we started the initiative, the company had an existing BI system for reporting and statistical analysis. The aim was to extend the company's capabilities to collect, store and analyse huge amounts of data. A Big Data solution comprising best-of-breed open source products was chosen. The new technology stack is able to scale not only technically but also business-wise, as it is completely free of license costs. It is based on technologies such as Java, Python, Hadoop, Hive, Kafka, Spark and HBase.
A major challenge in the beginning was that the staff had almost no knowledge of the applied technologies. To cope with this situation and to establish the solution quickly and in high quality, we set up a Creative Software Workbench (CSW). A CSW combines modern technology, agile methodology and team dynamics to create an environment in which digital products can be created in the best possible way. It is based on more than 25 years of practical experience from many successful and, of course, some not so successful projects. In this environment, agile engineering and active learning are important elements, which helped us master the Big Data ecosystem in a reasonable amount of time.

The new platform enables the company to gain new insights from their data, today and tomorrow. It is an important step towards supporting their data-driven business model in the future.
You can read about the project in the success story “Wissen aus Daten” (“Knowledge from Data”). I am glad that I can add this story to our list of success stories. If you want to know more about it, don’t hesitate to contact me.


Workshop: Big Data you can Touch

Today I released the brand-new workshop: Big Data you can Touch.

If you start researching Big Data platforms, you will find an overwhelming number of possible technologies. But if you dig deeper, you’ll find that many platforms are based on the same proven Open Source products.

This workshop teaches how to set up your own Big Data platform using professional Open Source products. Together we’ll build an end-to-end use case using a Lambda Architecture and Machine Learning.
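To give a first idea of the Lambda Architecture, its core concept can be sketched in a few lines of Python: a batch view computed over all historical data is merged with a speed view that covers only the most recent events. The numbers below are made up:

from collections import Counter

def merge_views(batch_view: Counter, speed_view: Counter) -> Counter:
    # Serving layer: combine the precomputed batch view with the real-time delta.
    return batch_view + speed_view

# Hypothetical example: freight offers per country
batch_view = Counter({"DE": 120_000, "PL": 45_000})  # recomputed nightly by the batch layer
speed_view = Counter({"DE": 42, "NL": 7})            # maintained by the streaming speed layer
print(merge_views(batch_view, speed_view))
# Counter({'DE': 120042, 'PL': 45000, 'NL': 7})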

It is intended for everyone generally interested in Big Data platforms, e.g. developers, architects, analysts or decision makers, who want to know how these technologies work together.

The workshop takes four hours and can be booked as on-site training or as an online webinar. Hope to see you there…

Upcoming events:

7 May, 13:00 – 17:00, Webinar: Big Data zum Anfassen

13 May, 13:00 – 17:00, Webinar: Big Data zum Anfassen

21 May, 13:00 – 17:00, Webinar: Big Data zum Anfassen

JAX 2019: Agile Team Architecture and Big Data

JAX is one of the best-known conferences for Java, architecture and software innovation in Germany. I am glad to have been invited to give some sessions this year. From 6 to 10 May 2019, JAX will take place at the Rheingold Halle in Mainz.

Agile product teams are becoming more and more mission-critical. On the 6th I am going to give a presentation about how agile product teams can be built by applying software architecture principles, such as resilience and performance, to teams.

When people start learning Big Data technologies, many find them complex due to the sheer number of products in the Big Data ecosystem. On the 8th I am going to show a simple Big Data stack to get started with: I will set up a working stack from scratch and implement a working lambda architecture.

You can see the timeslots on the JAX website. I look forward to seeing you there.

Free Avro Schema Viewer

Avro is a data serialization system. At its core is the Avro schema, which can be used to describe the structure of datasets, much like XML Schema or JSON Schema. Avro is primarily used in Big Data scenarios, for which it offers special features such as schema evolution. This is a typical Avro schema:

{"namespace": "net.pleus.domain",
    "name": "customer",
    "version":"1.0",
    "doc" : "Customer Dataset",
    "type": "record", 
    "fields": [
        {"name": "id", "type": "int","default":"-1", "doc":"Unique id of the customer"},
        {"name": "name", "type": ["string", "null"],"default":null,"aliases":["fullname"],"doc":"Customer's name (optional)"},
        {"name":"address", "default":null, "doc":"Address information",
            "type":{
                "type":"record",
                "name":"address",
                "fields":[
                    {"name": "street", "type" : "string","default":"unknown", "doc":"Street"},        
                    {"name": "city", "type": "string","default":"unknown", "doc":"City"}
                ]
            }
        },
        {"name": "contact", "default":null, "doc":"List of contact options", "type": {
            "type": "array",
            "items": {
                "type":"record",
                "name":"contact",
                "fields":[
                    {"name": "type", "type" : { "name":"values" , "type": "enum", "namespace" : "net.pleus.contacts", "symbols" : ["EMAIL", "PHONE", "MESSENGER"]}, "doc" : "Type of contact"},        
                    {"name": "url", "type": "string","default":"unknown", "doc":"The contact url"}
                ]
            }            
        }}
    ]
   }

As you can see, it is a JSON file structured according to the Avro specification. Although this verbose form might be suitable for technical people, such structure definitions often have to be discussed with non-technical people from the business domain. This is especially the case if you follow a domain-driven approach.
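For developers, on the other hand, the schema can be used directly to serialize and read data, for example with the fastavro library. The following is a minimal sketch; it assumes the schema above has been saved as customer.avsc, and the record content is made up:

import json
from fastavro import parse_schema, writer, reader

# Load the schema shown above (assumed to be stored as customer.avsc).
with open("customer.avsc") as f:
    schema = parse_schema(json.load(f))

records = [{
    "id": 1,
    "name": "Jane Doe",
    "address": {"street": "Main Street 1", "city": "Berlin"},
    "contact": [{"type": "EMAIL", "url": "mailto:jane.doe@example.com"}]
}]

# Serialize the record to an Avro container file and read it back.
with open("customers.avro", "wb") as out:
    writer(out, schema, records)

with open("customers.avro", "rb") as fo:
    for record in reader(fo):
        print(record)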

To make it easier to discuss and negotiate Avro structure definitions without creating redundant model representations, I’ve created an easy-to-use Avro Viewer. If you drop the file shown above, you will see a visual representation that shows just the essence of the schema.

You can see records, enums, arrays, defaults, aliases, documentation and so on without the JSON markup noise.
Avro Viewer is free to use. It is just plain HTML+JavaScript+CSS. If required, you can download the source and modify it to suit your needs.