This was very useful and educative -- thanks a lot, Dave -- you're brilliant, mate
I'm planning to take the Salesforce Data Architect exam, and I wanted to understand in a simple way what master data management is all about and what problems it deals with. Your video really helped me, thank you.
This MDM split between Ops & Analytics just explained the confusion I’ve been experiencing 😂 thanks!
Good video. Some of the data management terms seem quite divorced from technical descriptions of software systems, but you've bridged that gap.
You mention most transactional OLTP systems don't have issues storing historical data - what do you think about the concept of event sourcing? It feels like we're already storing historical data in the write-ahead logs of our database engines.
Thank you for the high-quality video, it was really interesting and insightful
Thanks Dave, very well and simply explained!
Very clearly explained, thank you for the information.
Excellent, Dave. Many of my queries got resolved. Keep it up.
Thank you, Dave!
Well explained! Thank you so much.
You describe a map that's created on the operational side. I think you answered this in another comment, but it seems obvious to sync that map to the analytics tier, so analytics doesn't have to get into all this fuzzy matching for deterministic cases.
By the way, great video. This really helped me a lot. Content like this is somewhat hard to find because there's so much vendor-created content out there.
Not sure if an identity resolution deep dive is something you've considered doing. This video sort of is that, in a way, but I think it's really covering the MDM space for identity.
Thanks for the comment. Yes, in an ideal world the mappings would just work. In the real world the operational systems are never good or complete enough to just pull in, and often the work happens in parallel. For instance, after a merger it's more important to show merged analytics than to join up the operational systems, which will just carry on working separately. Often merging on the ops side is pointless since a migration will take place later on. Don't underestimate how much effort you might waste in trying to create a single thing - many have tried, and outside of Amazon, who wrote the whole stack from scratch, few have succeeded.
Identity mapping is more art than science in my opinion. If you own all the systems, just use your own IDs to link things. If you don't own all the systems, it's back to black magic and prayer that you're actually linking the right accounts in many cases. Beware of GDPR here: it's very hard to do this legally, and prosecutions are on the rise.
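To make those two regimes concrete, here's a rough sketch (my illustration only, not from the video - the record shapes, fields, and threshold are all made-up assumptions). Owning the keys makes linking trivial; without them you're scoring similarity and hoping:

```python
# Minimal sketch of the two identity-linking regimes: deterministic when
# you own the keys, fuzzy scoring when you don't. Illustrative only.
from difflib import SequenceMatcher

def link_deterministic(crm_row: dict, web_row: dict) -> bool:
    # If you own both systems, share a key and linking is trivial.
    return crm_row.get("customer_id") == web_row.get("customer_id")

def normalise(s: str) -> str:
    return "".join(ch for ch in s.lower() if ch.isalnum())

def link_fuzzy(crm_row: dict, web_row: dict, threshold: float = 0.8) -> bool:
    # No shared key: score similarity and hope. This is the "black magic
    # and prayer" case - keep an audit trail of every match decision.
    score = SequenceMatcher(
        None,
        normalise(crm_row["name"] + crm_row["email"]),
        normalise(web_row["name"] + web_row["email"]),
    ).ratio()
    return score >= threshold

crm = {"customer_id": "C042", "name": "Jane Smith", "email": "jane@example.com"}
web = {"customer_id": None, "name": "Smith, Jane", "email": "jane@example.com"}
print(link_deterministic(crm, web))  # False - no shared key between systems
print(link_fuzzy(crm, web))          # True - score ~0.83 clears the threshold
```

Note the fuzzy path is exactly where the GDPR risk lives: a wrong match links one person's data to someone else's account.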
Awesome. Thanks, Dave.
Please cover the rest of the data management topics, like data lineage, reference data management and metadata. Thanks in advance.
@dave
I am trying to build an analytical DB. Data will be collected from multiple sources, and each source is connected to the others with some master ID.
I have just learned about master data management, but I wonder how I should design my analytics system according to MDM.
Thanks for the comment. The how is pretty easy: you just need IDs to link things together. The difficult part is the business rules you use to master the data: deciding what to keep, what's overlap or duplication, and what format you want each column in. Start with the system of record generally and work back from there; if there's a valid record then use that and match other systems to it. If you have a record in another system that doesn't have a match, you can create a new record (so don't use the SoR ID key in analytics!). You then optionally feed back that there's a mismatch, while dealing with it gracefully in analytics. You may choose to drop such records and call them invalid, but make sure you document that this is happening so that the data is trustworthy. A lot of this won't be the data team's job to complete; you'll need to work with business owners to understand what they need to see in the end result.
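Here's a toy sketch of that flow (my illustration, not a real pipeline - the record shapes and the email-based matching rule are assumptions): match to the system of record first, mint analytics-owned surrogate keys rather than reusing the SoR key, and flag unmatched records instead of silently dropping them.

```python
# Illustrative mastering flow: match to the system of record (SoR), mint
# analytics-owned surrogate keys (never reuse the SoR key), and log
# unmatched records for feedback rather than dropping them silently.
from itertools import count

surrogate = count(1)           # analytics-owned key sequence
master: dict[str, int] = {}    # matching key (email) -> surrogate key
mismatches: list[dict] = []    # fed back to the system owners

def master_record(rec: dict, sor_index: set[str]) -> int:
    key = rec["email"].lower()
    if key not in master:
        master[key] = next(surrogate)
        if key not in sor_index:
            # Valid-looking record with no SoR match: keep it, flag it,
            # and document the rule so the data stays trustworthy.
            mismatches.append(rec)
    return master[key]

sor_emails = {"jane@example.com"}
rows = [
    {"source": "crm", "email": "jane@example.com"},
    {"source": "web", "email": "JANE@example.com"},  # same person, matched
    {"source": "web", "email": "bob@example.com"},   # no SoR match, flagged
]
for r in rows:
    r["customer_sk"] = master_record(r, sor_emails)

print(master)      # {'jane@example.com': 1, 'bob@example.com': 2}
print(mismatches)  # [{'source': 'web', 'email': 'bob@example.com', 'customer_sk': 2}]
```

The interesting decisions (which column wins, which records are invalid) sit in the business rules, not in this plumbing.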
Thanks Dave, I love your videos, they are very helpful. Keep it up!
Great insights! Thanks a lot!
Excellent, Dave. Thanks, I love your videos. Could you help me understand how I can use an open source MDM platform for my company?
Hi, thanks for the feedback. Unfortunately I'm only familiar with the Microsoft tooling, so I don't really know the open source options. They all work in a similar way though, so the skills are transferable.
@@DaveDoesDemos Thanks for your reply. Could you help me with how I can set up MDM in your manner (with an ESB)?
Is the architecture you define in the video called a Data Hub?
Hi, Very informative, I love this video
That's a very good talk. Just a question about 7:06, where you mention that an update only happens at a single point when there is a need to replace one of the services. I think that makes sense if the event payloads stay the same, but that's probably not the case most of the time, where you still have to update the logic in the other services in order to publish/consume the new event payloads. Or did I misunderstand what you were trying to describe?
Thanks for the question.
Usually you'd put a translation layer between each service and the ESB to make the data generic and usable. If you integrate directly then you need to make translation layers for all integrated services, but with the ESB you just write one translation layer to the bus, and the interfaces from the bus to the other services remain the same. Take a point of sale system: when a basket is processed it might have several fields, one being pID (product ID). We might have several systems with fields called p_ID, product_ID, product, or productName, but if we translate on the way IN to the service bus to our common productID, each of them will have a standard connector translating to its own language on the way OUT of the ESB. If we go direct, we need to rewrite them all if we replace the POS system. This is a very simple example, but the same is true of data structure/schema too: we can translate into something generic and extensible, then back again. Hope that makes sense?
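A minimal sketch of that adapter pattern (the field names, systems, and canonical schema here are all made-up assumptions, not anything from the video):

```python
# One inbound adapter per source system translates INTO the canonical
# model; one outbound adapter per consumer translates OUT again.
INBOUND = {
    "pos":       lambda msg: {"productID": msg["pID"],  "qty": msg["qty"]},
    "warehouse": lambda msg: {"productID": msg["p_ID"], "qty": msg["units"]},
}

OUTBOUND = {
    "stock":     lambda evt: {"product_ID": evt["productID"], "delta": -evt["qty"]},
    "analytics": lambda evt: {"product":    evt["productID"], "sold":  evt["qty"]},
}

def publish(source: str, msg: dict) -> dict[str, dict]:
    event = INBOUND[source](msg)  # translate IN to the canonical model once
    return {name: out(event) for name, out in OUTBOUND.items()}

# Replacing the POS system now means rewriting ONE inbound adapter;
# every consumer keeps its existing connector to the bus.
print(publish("pos", {"pID": "SKU-123", "qty": 2}))
```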
Very helpful !! nice video
It feels odd to see so much data duplicated (on the operations side). I wonder what the advantage is of having duplicated/synced data versus references to a single source of truth. It also has a familiar feel to domain-driven design (if I've understood that right). Thank you
Data is always replicated across the operational systems. If you were starting from scratch and writing your own software then maybe you'd get away with it, but in the real world that doesn't happen (Amazon might be an exception there, when they originally set up the book shop). As such, your warehouse system, stock system and POS system would all have their own lists of products, as an example, and they usually can't use an external source for this. The ESB then gets used to ensure they all get up-to-date information as it changes - update any one system and the others get the changes.

Single source of truth is more of a mantra than a reality, and it often causes more work than dealing with multiple copies of information. We may sometimes keep a reference set of data which would be the source of truth, but this is usually also updated by the ESB.

Some people then leap to the conclusion that systems should talk directly for updates, but this would multiply out the number of touchpoints and cause more work in the long run, hence we use an ESB to abstract each connection to a middleman system (the ESB) and then create a connector to each other system. We can then change out systems easily without rewriting code all over the place. The approach is also useful in larger businesses or after merger activity, where you may have several of each type of system - nobody ever tidies up an environment fully!
Hope that made sense, happy to add more detail.
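The "touchpoints multiply out" point is easy to put numbers on (back-of-envelope only):

```python
# Direct integration links every pair of systems; an ESB needs one
# connector per system. The gap widens quickly as systems are added.
for n in (4, 8, 12):
    point_to_point = n * (n - 1) // 2  # every pair linked directly
    via_esb = n                        # one connector per system
    print(f"{n} systems: {point_to_point} direct links vs {via_esb} ESB connectors")
# 4 systems: 6 direct links vs 4 ESB connectors
# 8 systems: 28 direct links vs 8 ESB connectors
# 12 systems: 66 direct links vs 12 ESB connectors
```

And each direct link typically needs translation logic at both ends, while each ESB connector only translates to and from the canonical model.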
@@DaveDoesDemos ..Thanks for this fantastic video. Talking about SSOT, could you help clarify on the below (quite a few queries...) 1) How is MDM different from SSOT? 2) Is MDM focussed only on master data such as Customers, Locations, Products etc ...whereas an SSOT can also contain transactional data? 3) I have come across articles mentioning SSOT as an aggregated version of the data. What does that mean exactly ? 4) If EDW was considered an SSOT earlier, why is it not so? 5) It would be great, if you could bring up a video on SSOT too in the future.. Thank you.
@@maheshkumarsomalinga1455 MDM means different things but ultimately it ends up with SSOT one way or another. Sometimes you may also see "master data" created from other sources as reference data separately to systems, and this is another valid use of the term, but generally this is used as a reference to check against or a more pure source rather than actively used. For instance you may have a master data list of your stores, which wouldn't include unopened new ones or ones that have permanently closed, but is a current master list of open active stores. You may choose to have multiple master data lists with different purposes too, so a store list including those that have closed or yet to open.
SSOT is not usually aggregated, it's just the single place you go to for the truth - that might mean aggregation sometimes, but it could mean that sales system 1 is the SSOT for subsidiary 1 and sales system 2 is the SSOT for subsidiary 2 while you may also have a data warehouse which is the SSOT for both subsidiaries for reporting purposes. In all scenarios the SSOT is the defined place which has the correct version of data for the defined use-case. As explained in my other video (truth about data), the sales system might not have "the truth" that a CFO is looking for when speaking to the markets, since sales data can and does change over time with returns, refunds etc.
EDW can be a SSOT for reporting purposes, but never make the mistake of thinking it's a single SSOT. The systems of record are SSOTs for current live data; the EDW is a SSOT for historical facts. Importantly, if you have an item returned in retail a month after purchase, your EDW data will change retrospectively and therefore the truth will change. EDW may also report different truths - if you have an item sold then returned, you did still make a sale, so marketing need to know a sale was made. You also had a return, so you'd want to know there was a return so you could do analytics on that. You also didn't make money, so did you make a sale or not? There are lots of truths in data depending on your perspective, but the sales system will only care about the truth right now - you didn't make a sale. Then there's the stock system - is the returned item in stock? It was sold, so no. It was returned, so yes. It may be damaged, so... maybe?
Check out my other video at ruclips.net/video/J9FdMuQutN8/видео.html&t
@@DaveDoesDemos Thanks Dave for the detailed explanation ! In a way, it has made me think differently (rather broadly) about SSOT now, leading to more doubts. Let me read the details again to digest further...Your efforts are greatly appreciated. I went through your other video (truth about data) too...Found it helpful...
Hi Dave, thanks for the great video. I am trying to conceptually understand MDM. I have one question though. In the video you mention that there is one system that is the single source of truth; in the case of your example, this is the CRM system. Is it possible to design the ESB in such a way that if a record is created in a system that is not the 'main' system, a record is created in the main system via the ESB? So, a person creates an account in the web system and this person is not yet in the CRM system. Is it possible that a record is automatically created in the CRM system? The CRM will also add the email and phone number from the web system to this record. I hope the question is clear :)
Hi Paul, thanks for the question. It sounds like you've fully understood the purpose of enterprise integration and ESBs already! Yes, absolutely, that's kind of the purpose. The CRM is the system that owns the truth, but we achieve that by making sure all updates go to it regardless of where they start. If I add a customer account in the web platform, the ESB takes that data and creates (or matches!) the account in the CRM system, which will then update any other systems that may need those details. If you do this well, then there is less work matching records when you ingest for analytics, since you already know the data is matched and consistent across systems. What we're avoiding here is the customer having different accounts in different systems, and the same for other data like sales, stock, product catalog etc.
In retail, product and offer SKUs in particular need to be consistent between systems, and this can be very challenging across logistics and distribution, which deal with a pallet of X, the stock system, which may deal with a tray of X, and the store system, which deals with a single X. All the same product SKU in theory, but plenty of work to do to make the numbers match up.
Long story short - your comment was spot on.
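For what it's worth, the create-or-match step can be sketched in a few lines (purely illustrative - the CRM "client" here is a plain dict standing in for a real API, and the event shape is an assumption):

```python
# Sketch of the create-or-match flow: a web signup event arrives on the
# bus and is upserted into the CRM, which owns the truth.
crm: dict[str, dict] = {}  # keyed by email for this toy example

def on_web_signup(event: dict) -> None:
    key = event["email"].lower()
    record = crm.setdefault(key, {"email": key})  # no match -> create
    # Match found (or just created) -> enrich with the web system's details.
    record.update({k: v for k, v in event.items() if k in ("name", "phone")})
    # A real bus would now publish the mastered record to other systems.

on_web_signup({"email": "Paul@example.com", "name": "Paul", "phone": "555-0100"})
on_web_signup({"email": "paul@example.com", "phone": "555-0199"})  # matched
print(crm)  # one customer record; phone enriched to 555-0199
```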
@@DaveDoesDemos Hi Dave, thanks for the response. So conceptually speaking, if you update data in system X this will also get updated in system Y, even though system Y is the system that 'owns the truth'. Does this also work in practice? We are currently implementing MDM in my organisation, and the ESB is being developed in such a way that only system Y can update data, and these updates will be communicated to the other systems. If you update a field in system X, this will be denied. I am not sure I agree with this method. Doesn't this go against the idea of MDM, or is it a viable solution?
Maybe my question is stupid, but wouldn't you plug your analytics into your ESB?
Many people try and fail. ESB is for operational live information. Analytics is for historical information, and the two are very different in terms of the answers they provide. Live data shows the current state, which is often different from what has happened. While it is possible to take that live feed and process it onto the lake, this often leads to errors in data and is very expensive since you end up replicating your business rules in the analytics solution, doubling the required processing (and therefore cost). As I said though, people continuously try to make this work but I've yet to see it done successfully at scale.