An Introduction to SAP Data Hub

Marc Hartz, SAP Data Hub Product Manager, Provides an Overview of SAP's New DataOps Management Solution

Marc Hartz, SAP Data Hub Product Manager, joins SAPinsider for a podcast to discuss SAP’s new DataOps management solution. Topics covered include features and functionality, use cases, and how SAP Data Hub eases the challenge of orchestrating and monitoring data across enterprise systems and distributed data systems.

Below is a lightly edited transcript of the conversation.

Ken Murphy, SAPinsider: Hi, this is Ken Murphy with SAPinsider. I’m pleased to be joined on this podcast this morning by Marc Hartz, who is the Product Manager for SAP Data Hub. Marc, thanks for joining us this morning.

Marc Hartz, SAP: Thanks for having me.

Ken: Marc, before we get into SAP Data Hub, I was hoping you could introduce yourself to our listening audience and tell our listeners about your involvement with the creation of SAP Data Hub.

Marc: I have been with SAP for 10 years now, and these days I’m the lead Product Manager for the Data Hub. Previously, I did a lot of data warehousing and analytics. Out of that work we saw the need to develop something like a Data Hub, which we’ve now done. I was there from the beginning, leading it as the topic evolved; as it grew we brought in more people, and here we are, getting great feedback for a new product.

Ken: Let’s start with what is the SAP Data Hub? What does it do?

Marc: You have to think about customers’ data landscapes as getting increasingly complex. Data is everywhere; we have data warehouses, we have IoT data, we have data lakes and so on. And out of such a landscape come a lot of questions about how you integrate and how you orchestrate different data processes, and this is exactly what we would like to address with the Data Hub. So it’s a data integration and orchestration layer that works without bringing the data into the Data Hub itself; that the data has to move into it is a big false assumption a lot of people have. It’s a logical concept to handle all the integration, orchestration, and governance needs that arise in a complex landscape.

Ken: So that’s the market need, the vast volume of structured and unstructured data?

Marc: The market need, or what we have seen in many customer interactions, is that customers are overwhelmed with all of this data and all of this landscape. We now also have a big push to bring data into a lot of cloud environments, whether this is Amazon AWS, or Azure, or Google Cloud Platform; so we see that this is where some IoT data is stored, not only in Hadoop or something like Hadoop, but maybe just in something like an object store in this cloud environment. And what we got as feedback from so many customers was, “Can you help us provide a link between this environment and my enterprise data?” which from our perspective sits, hopefully, largely in SAP systems.

So how can I establish this link, this connectivity, this integration, to get the business value out of the data I’m collecting on the IoT side, on the sensor side? Because many customers are faced with the situation that they are collecting these kinds of data assets, but they do not really get the value out of them. The value truly comes if you can combine your sensor data or your weblog data with the enterprise data you have in a certain business process in what is, hopefully, an SAP system. And that’s exactly what we got as feedback a lot. Plus, if you try to do that already, there are a lot of different tools and technologies involved. Getting a scenario working where you do something in a data lake and want to integrate that into, for example, a data warehouse is very painful, because you have five to eight different tools and technologies which need to play together. That’s the situation we want to solve as well by having a unified offering and pricing these technologies, and by helping to get such scenarios implemented faster and operated more easily; basically, to solve this integration challenge.

Ken: How does SAP Data Hub differ then from other data landscape management tools that are out there?

Marc: Yes, that’s a good one. I think to get the true differentiation you always have to think about the landscape. What many tools and competitors are doing is focusing on a niche of the market. There are a lot of technical landscape management tools, like SAP Solution Manager, which does a lot of technical stuff: upgrades, how we provision patches in the landscape, and things like that. And there are a lot of scheduling tools, there are a lot of data refinement tools, there are a lot of orchestration tools. We believe that it is important to bring the functionalities which are spread across these tools together, because that is important to drive automation.

What you don’t want is to have five or six different transitions where, if something breaks, the whole scenario is stuck. So we believe that a portion of all of these scenarios and tools is needed in one tool to achieve true automation and to truly productize in the landscape. I think that’s the hard point to understand about SAP Data Hub, because it’s something new: we are combining a lot of functions in one tool instead of having separate tools. Of course, we can’t compare it with just one individual pillar; we have to think about this landscape, and we are really trying to do something different by combining. It’s a challenging thing we are trying, but we believe this is really needed to build new applications that are data-driven and have a strong focus on the data itself.

Ken: What is the Data Hub approach then for mapping data in that one tool?

Marc: I’m not so sure it’s about mapping the data; it’s more about overseeing the data and the attached processes in that tool. That’s actually what we want to do. There is a lot of data mapping and similar work happening in the respective connected systems, like a data warehouse or a data lake. What is missing is a way to find out whether this data can now be integrated with a data warehouse: whether the structures fit each other, what the data quality is, and more. All of these aspects come from a governance notion.

I think that’s the really interesting part: not the technical data mapping but more a semantic enrichment and refinement, and then, as a later step, being able to say, “OK, now the data is in a state where it fits semantically, from a structure perspective and from a quality perspective, to the later systems,” if you think of a kind of food chain or data loading chain, and then helping the customer make this mapping more seamless. But Data Hub itself is not a data modeling tool like you have in a data warehouse to build a data vault or a star schema; it’s very much about the provisioning of this data model and automating how the data flows through the landscape and fits together.

Ken: How does SAP Data Hub orchestrate and monitor across your enterprise systems and distributed data systems, and why is that important?

Marc: Let me start with why it’s important. I think architectures where a lot of components play together are essential if we want to tackle the problems in many of the scenarios we see nowadays. A good example is IoT and everything IoT is about: it’s about streaming, it’s about ingestion, it’s about refinement, all before you can actually do something with the data, whether that is a data science use case or some fancy machine learning, for example. I do believe that such initiatives will only be successful and mature if we are able to bring these enterprise systems closer together with, like you said, the distributed systems and the cloud environments.

And the monitoring is one aspect, but the processing and how you deal with data changes is another. When we say orchestration we mean exactly that: having the knowledge of what is happening where, what is running where, and how it fits into a process which runs in different areas of a whole landscape. And I think this will become even more important. I don’t think we will very soon have a centralization where everything goes back to one place; I believe that this separation and diversity in a landscape is beneficial, but we need to manage it, we need to orchestrate it, and we need to monitor it very, very carefully, because otherwise it gets a little bit like the Wild West.

Ken: Lastly, Marc, maybe you can address some key functionalities and use cases for SAP Data Hub?

Marc: If you are used to thinking in certain categories, it’s a challenge to think, “OK, now this SAP guy is saying we do all of that in one tool and we try to combine it.” But if you really nail it down, we have three main categories of functions.

The first is what we call the data pipeline. It’s a very modern way to bring certain operations in a landscape into a given execution order. For example, it could start with a Kafka queue: I take data from Kafka, I bring it into Hadoop, I process the data in Hadoop with maybe Python, maybe Scala, maybe some TensorFlow model; I take the result and I bring it into HANA. I have now touched three different landscape areas, and the processing of the data happens within the connected systems; so it happened in Hadoop, it happened in HANA, and so on. And that is basically what the pipeline is doing.

We push down, if you will, the processing of the data into the connected landscape, and we offer that in a predefined way in a visual modeling environment; that’s at the core of what we are doing. This core functionality is, if you will, wrapped by what we call orchestration, so that we can talk to technologies which are attached to that, like ETL tools, like Data Services, replication tools like SAP LT Replication Server (SLT), or third-party ETL tools, and have them contribute to a scenario which then again happens in the landscape.
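To make the pipeline idea concrete, here is a minimal sketch of the Kafka to Hadoop to HANA flow described above. It is not SAP Data Hub code and does not call any real Kafka, Hadoop, or HANA client: the `Step` and `Pipeline` classes, the step functions, and the reference strings such as `kafka:sensor-topic` are all hypothetical. The point it illustrates is that the pipeline only fixes the execution order and passes references along, while the work itself is assumed to run inside each connected system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    system: str                  # connected system the work is pushed down to
    run: Callable[[str], str]    # takes a data reference, returns a new reference


@dataclass
class Pipeline:
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def execute(self, ref: str) -> str:
        """Run the steps in order; only references pass through the pipeline."""
        for step in self.steps:
            ref = step.run(ref)
            self.log.append(f"{step.name} on {step.system} -> {ref}")
        return ref


# Hypothetical stand-ins for work done inside the connected systems.
def land_in_hadoop(ref: str) -> str:
    return "hadoop:/landing/" + ref.split(":", 1)[1]


def score_in_hadoop(ref: str) -> str:
    return ref.replace("/landing/", "/scored/")


def load_into_hana(ref: str) -> str:
    table = ref.rsplit("/", 1)[1].upper().replace("-", "_")
    return "hana:SCORED_" + table


def build_pipeline() -> Pipeline:
    return Pipeline(steps=[
        Step("read_queue", "kafka", land_in_hadoop),
        Step("score_model", "hadoop", score_in_hadoop),
        Step("load_result", "hana", load_into_hana),
    ])


if __name__ == "__main__":
    pipeline = build_pipeline()
    print(pipeline.execute("kafka:sensor-topic"))
    for line in pipeline.log:
        print(line)
```

The log doubles as a very small form of monitoring: it records which step ran on which system and what it produced, which is the kind of knowledge the orchestration layer is described as keeping.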

And then the third pillar for us, besides pipelining and workflow, is governance. So if you want to do what I just described across the landscape, you need to have a certain knowledge about which data is stored where, who is using it, what the structure of this data is, whether I can use it, whether it is useful, and whether I can trust it. These questions need to be answered before you pump the data through a prediction algorithm or a machine learning model. And these three pillars, from our point of view, need to get closer together, and that is why we want to offer them in one tool.
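The governance questions above (where is the data, who uses it, what is its structure, can it be trusted) can be pictured as a metadata lookup that runs before any model sees the data. The sketch below is only an illustration under assumptions: the `CatalogEntry` fields, the 0 to 1 quality score, the threshold, and the sample locations are all invented for this example and do not describe how SAP Data Hub stores metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    location: str        # which data is stored where
    owner: str           # who is using or responsible for it
    columns: Set[str]    # what the structure of the data is
    quality: float       # hypothetical 0..1 score for "can I trust it?"


def readiness_issues(entry: CatalogEntry,
                     required_columns: Set[str],
                     min_quality: float = 0.9) -> List[str]:
    """Return the reasons a dataset should not yet be fed to a model."""
    issues: List[str] = []
    missing = sorted(required_columns - entry.columns)
    if missing:
        issues.append("missing columns: " + ", ".join(missing))
    if entry.quality < min_quality:
        issues.append(f"quality {entry.quality:.2f} below {min_quality:.2f}")
    return issues


# Hypothetical catalog contents, for illustration only.
CATALOG: Dict[str, CatalogEntry] = {
    "sensor_events": CatalogEntry(
        "s3://lake/sensor", "iot-team", {"appliance_id", "function", "ts"}, 0.95),
    "weblogs": CatalogEntry(
        "hdfs:///raw/weblogs", "web-team", {"url", "ts"}, 0.60),
}

if __name__ == "__main__":
    for name, needed in [("sensor_events", {"appliance_id", "function"}),
                         ("weblogs", {"url", "user_id"})]:
        print(name, readiness_issues(CATALOG[name], needed) or "ready")
```

An empty list means the dataset passes these particular checks; anything else is a reason to refine the data first.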

The use cases. OK, that’s a good question. We started very early in development to work with customers, which I believe is a very good thing, speaking as a product manager of course: this product was completely built from what we heard from customers. And very early, in March (2017), we had the first POCs going on, the pilots, and they were on the one hand very much focused on IoT scenarios. One example: we had a very longstanding customer doing a lot of analytics and data warehousing with us, like sales reporting and so on. That was completely driven within the data warehouse: their customer master data, all of that stuff. Besides that, they started (it was an appliance manufacturer) to equip their appliances with sensors, and now these appliances were producing sensor records.

And what they did is, they dumped the sensor data they received into a data lake which was in Amazon; of course the user has to accept the privacy terms, and everything was handled. So they had this enterprise data warehouse with all of the customer data, and the sensor data in a very raw format, because the appliances all delivered different data in a different granularity and in a different layout, which is a challenge to harmonize. But you can’t move that sensor data out of the data lake; you need to harmonize it right there. And this is actually a very, very nice use case where we took over the ingestion part, meaning how we bring the sensor data into an object store in this cloud environment in Amazon. We did the transformation, the calculation, and the processing of the data there with the pipelines we mentioned, and then we connected the result to the data warehouse, so that at the end you can have a very nice dashboard showing, for example, in which region I sold what, and what the most used functions of my appliance are.

And to do that, I need to have full transparency of both worlds, the big data side with the sensor data and the enterprise side with my data warehouse, and I can only be successful if I can integrate and really bring these together. The pattern I’m describing is true for nearly all IoT scenarios: you have to ingest, to prepare and process, and then to integrate. We have seen that everywhere, whether it’s weblogs, IoT data, or maybe even some fancy stuff like videos, files, or pictures; it’s always the same three steps. And this is the pattern of the key use cases we want to address with the tool.
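The three-step pattern in the appliance example (ingest, prepare and process, integrate) can be sketched in a few lines. This is an illustration under assumptions, not the customer’s implementation: the two raw record layouts, the field names, and the appliance-to-region mapping are invented. It shows the raw records being harmonized and aggregated where they sit, and only the small aggregate being combined with enterprise master data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def harmonize(record: dict) -> Tuple[str, str, int]:
    """Map one raw sensor record onto (appliance_id, function, uses)."""
    if "device" in record:       # hypothetical layout A: one record per use
        return record["device"], record["fn"].lower(), 1
    if "serial" in record:       # hypothetical layout B: pre-aggregated counts
        name = record["function_name"].lower().replace(" ", "_")
        return record["serial"], name, int(record["count"])
    raise ValueError(f"unknown sensor layout: {sorted(record)}")


def aggregate_in_lake(raw: Iterable[dict]) -> Counter:
    """Prepare and process: harmonize and count uses per appliance and function."""
    totals: Counter = Counter()
    for record in raw:
        appliance_id, function, uses = harmonize(record)
        totals[(appliance_id, function)] += uses
    return totals


def integrate(aggregate: Counter, appliance_region: Dict[str, str]) -> Counter:
    """Integrate: combine the aggregate with master data from the warehouse."""
    by_region: Counter = Counter()
    for (appliance_id, function), uses in aggregate.items():
        region = appliance_region.get(appliance_id, "unknown")
        by_region[(region, function)] += uses
    return by_region


# Ingest: hypothetical raw records as they might land in the object store.
RAW = [
    {"device": "A1", "fn": "ECO_WASH"},
    {"device": "A1", "fn": "eco_wash"},
    {"serial": "B7", "function_name": "Quick Wash", "count": 3},
    {"device": "A2", "fn": "quick_wash"},
]
# Hypothetical master data: which region each appliance was sold in.
REGION = {"A1": "EMEA", "A2": "EMEA", "B7": "APJ"}

if __name__ == "__main__":
    for key, uses in sorted(integrate(aggregate_in_lake(RAW), REGION).items()):
        print(key, uses)
```

The resulting counts per region and function are the kind of figures a dashboard on most used appliance functions would be built from.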

Ken: Marc, certainly a lot of information there you’ve shared with us. I appreciate your joining us on this podcast to tell us all about the SAP Data Hub.

Marc: Thank you so much.

Ken: Again, this is Ken Murphy with SAPinsider and we’ve had the pleasure of speaking with Marc Hartz of SAP, who is the lead Product Manager for SAP Data Hub.