by Kumar Singh, Research Director, Automation & Analytics, Supply Chain Management, SAPinsider
Data Hubs, Data Lakes and Data Warehouses: In This Case, Terminology Matters
I am an anti-jargon guy, but sometimes jargon does matter. This is one such case.
“We need a data lake in our data architecture” is something I have been hearing a lot recently. Probe a little and you will find that a significant percentage of those looking into data lakes do not actually need one if they leverage (drum roll) a data hub. Data hubs, data lakes, data warehouses: it all becomes confusing for many decision makers.
As organizations define their data architecture strategy, it is critical that they understand the difference between the architectural elements mentioned above. This becomes even more critical when you plan to use a range of analytics applications and platforms, since the type of data setup you need will depend on the type of analytics you plan to leverage. However, if you have an end-to-end analytics capability (i.e., you leverage multiple types of analytics approaches across your enterprise), you must have a data hub. As I have often indicated in my articles, once you build a data hub, you have already made a massive leap toward building any platform you envision. Building a robust underlying data infrastructure is the most challenging part.
Why organizations end up developing multiple data sources
The best way to understand why data hubs are important, using the example of analytics, is to start by understanding that at a very high level, we can categorize the type of analytics organizations need to do into three buckets:
- Business Intelligence (BI): Primarily descriptive analytics done typically on historical data
- Artificial Intelligence (AI): Advanced analytics, primarily machine learning based approaches, leveraged on both historical and near-real-time data. A significant percentage of this data is unstructured.
- Process Intelligence (PI): Near-real-time or low-latency descriptive operational analytics done on the data generated by business processes
Why is it important to understand the three types above? Because the type of analytics will essentially drive the data architecture strategy. To understand how, let us look at the type of data that these high-level analytics approaches ingest. Artificial intelligence algorithms leverage a significant amount of unstructured data. In many instances, these data files originate from systems in near real time and get consumed by the algorithms immediately, which means that not much data governance can be applied before the data gets consumed. Data lakes are frequently used in such applications. On the other end of the spectrum is the data leveraged by business intelligence tools, which has been processed (and hence has been through a data governance process) before it gets consumed in the analysis.
Understanding the Middleman: The Data Hub
As you can see, there is a clear gap in the quality and management of data between the two ends of the spectrum. Common sense dictates that this disparate, “multiple sources of truth” approach is not optimal when we think about building an analytics-driven intelligent enterprise. That is where the data hub comes into play. To simplify it to the extreme: at its core, a data hub centralizes all the enterprise data, creating a single source of truth. But it does not do so in a legacy way, as some other storage approaches do (a data lake, for example, since many data lakes leverage legacy technologies like DAS, or direct-attached storage).
A data hub is a centralized data store that shares and delivers data in near real time.
The data hub leverages a hub-and-spoke model, where the most commonly used data is centralized and then exchanged with several nodes on the spokes. Now let us revisit our three analytical approaches to see how this architecture fits there. The illustration below depicts how a data hub collects data from multiple sources, harmonizes it through data operations and makes it available to the nodes. The data generated and collected by these entities is mediated and governed by the data hub as it flows across the organization. Some of the key aspects of a data hub can be summarized as:
- Consolidate data from multiple sources and multiple formats
- Perform comprehensive data operations on the data (discover, refine, govern, orchestrate)
- Create a single source of truth for applications and platforms across the enterprise
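For readers who think in code, the hub-and-spoke idea above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how any real data hub product works: the `DataHub` class, its method names, the `id` key field and the sample material record are all hypothetical and chosen purely for this example.

```python
import csv
import io
import json
from typing import Callable, Dict, List

Record = Dict[str, str]


class DataHub:
    """Toy hub (hypothetical): consolidate, harmonize, and serve spokes."""

    def __init__(self) -> None:
        # The "single source of truth": one merged record per id.
        self._records: Dict[str, Record] = {}
        # Spokes are plain callables that receive every updated record.
        self._spokes: List[Callable[[Record], None]] = []

    def subscribe(self, spoke: Callable[[Record], None]) -> None:
        self._spokes.append(spoke)

    @staticmethod
    def _harmonize(raw: dict) -> Record:
        # Minimal "data operation": normalize field names, trim values.
        return {str(k).strip().lower(): str(v).strip() for k, v in raw.items()}

    def ingest(self, raw: dict) -> None:
        record = self._harmonize(raw)
        key = record["id"]  # assumption: every source carries an id field
        merged = {**self._records.get(key, {}), **record}
        self._records[key] = merged
        # Push a copy of the merged record out to every spoke.
        for spoke in self._spokes:
            spoke(dict(merged))

    def ingest_json(self, payload: str) -> None:
        for raw in json.loads(payload):
            self.ingest(raw)

    def ingest_csv(self, payload: str) -> None:
        for raw in csv.DictReader(io.StringIO(payload)):
            self.ingest(raw)

    def get(self, key: str) -> Record:
        return dict(self._records[key])


# Two spokes (say, a BI tool and an AI pipeline) fed by the same hub.
hub = DataHub()
bi_view: List[Record] = []
ai_view: List[Record] = []
hub.subscribe(bi_view.append)
hub.subscribe(ai_view.append)

# Two sources, two formats, one material.
hub.ingest_json('[{"ID": "M-100", "Plant": " Berlin "}]')
hub.ingest_csv("id,stock\nM-100,42\n")
print(hub.get("M-100"))
```

Both spokes end up with the same merged record, which is the point of the model: the consolidation and clean-up happen once, in the hub, rather than separately in every consuming application.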
What does this mean for SAPinsiders
By now you can tell that the data hub provides an essential ingredient for building the true intelligent enterprise of the future. Hence, we are seeing an influx of data hub solutions in the market, and the number of such products will increase in the future. SAP has its own data hub product in the form of its business data intelligence data management solution, a key component of the Business Technology Platform (BTP). Since BTP plays a key role in the “RISE with SAP” offering, and the data hub is an integral part of that platform's data architecture, it is clear that data hubs will play a key role in the intelligent enterprises of the future.
Sidebar on implementation: Implementing data hubs is a whole other complex topic in itself, ripe for simplification, and hence will be covered in another article, in the context of analytics.
Kumar Singh, Research Director, Automation & Analytics, SAPinsider, can be reached at firstname.lastname@example.org