(Yes, that's me in the list of authors in the Databricks blog) While I might quibble with your use of Silver, and maintain my assertion that the Inmon data landscape can have a DV silver tier just like it can have a normalized silver... aside from those minor points, you are absolutely right. The main point is: if, and only if, you have this modeling, this approach for managing change and tracking history, and you do NOT want to learn a new way and teach new techniques, then a Lakehouse can accommodate you. But to your question at the end, would this be my first recommendation? NO! My biggest concern with DVs is the conceptual cost: the overhead and loss of clarity in a DV over a normalized model are not worth the technical benefits. Having a clean, understandable, change-tracked, normalized Silver DW allows you to better build (and rebuild, to your point) a star schema-based model in "Analytic Gold", or a DS/ML-friendly flat model in "DS Gold", or a denormalized single table for my app devs in "Web App Gold", etc. Thanks so much for your thoughts and clear explanations!!!
Hey Glenn, thanks for the input. I'd love to actually dig into how you folks use the silver layer one day - we don't adopt the medallion architecture, we have our own lake curation process that's similar but home-grown over the past 5 or so years, hence my definition of silver being slightly different! Appreciate the thoughts on DV and where (if) it fits!
I have to disagree with you. You put a lot of emphasis on the implementation side of data vault vs the medallion architecture, and you are right to say that the medallion architecture achieves some decoupling between the end-result star schema and the initial raw data, like data vault does. However, data vault aims for two things: 1. Decoupling business rules from persisted data, i.e. don't store business rules, store raw data and use views instead, for better agility. 2. Monitoring and auditing.
What I feel you didn't touch upon is the ability of hubs and links in DV to provide integration of data across many systems, which can scale linearly. A data lake is not integrated.
Man, show me a detailed business requirement to justify this amount of engineering before ever deciding to embark on such a mission. Then, on top of that, let's find enough staff out there who all understand the concept and have 5 years' experience implementing such a solution. Then, on top of that..... have fun keeping that team marching to the beat of the same drummer. Good luck!
@Simon "Do you need that level of rigor, process, and protection against occasionally having to reload a table?(14:16)" Yes, IMHO. The level of rigor is made easier these days with automation. DV2.0 allows us to reach the holy Grail of being accountable for our data assets. Change is inevitable not occasional. DV2.0 allows us to embrace change smartly w/o ditching the proven methods from Inmon or Kimball. I would emphasize 5NF for Satellites and Links instead of 3NF -taking inspiration from CJ Date as a modeling guideline. Yes it makes matters more complicated conceptually (not necessarily logically or physically) for the sake of truly getting rid of redundancy. This is important because compute & network are the most expensive cloud components. Those w/o a dedupe and anti-redundancy strategy are needlessly overpaying.
I think Data Vault is beneficial for very large organizations with extremely high dimensionality and regulation, such as the DoD in the USA. I wouldn't even limit it to Fortune 500 companies; I would say it's essential for entities that are highly regulated in every aspect. In such cases, you'll need a certificate to access the system, and your superior will need proof (e.g., a certificate) that you are qualified to handle it. I believe this is the use case where you can benefit from the methodology, process, and approach. For every other use case, I totally agree that modern data engineering is the way to go. The Python script in your example abstracts away the table metadata and maintains the EL(T) process with external table configuration - the "T" here represents the historization, if needed (bronze-silver) - and that's the right approach if you can use a technology like this. If not, that's another question. I'm not a Databricks user, but I like your channel - keep up the good work. Thanks, Darvi!
Great thoughts. Good to know I wasn't the only one thinking in this direction. Wondering if we can go one step further and only have a 2-tier architecture? 1. Staging/Bronze layer, 2. a combined Silver/Gold layer, directly building star schemas as Delta tables?
Years ago there was a complete metadata-driven DV solution built using Kettle. I imagine it wouldn't be too hard to migrate it to Apache Hop. If one wanted to :)
You can actually make DV perfectly metadata-driven. You could do this with one metadata-driven notebook that builds a whole DV2 for many sources, and thus hide the complexity, and with this you add the business/graph context to your data that the lakehouse simply does not provide as a methodology. Capturing this in the modelling also gives you the opportunity to apply it relatively easily in something else: Iceberg, Snowflake, etc.
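As a rough illustration of the idea - a minimal PySpark sketch with made-up metadata, schema and column names (real DV tooling also handles links, satellites, collision codes and so on):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical metadata: one entry per hub, listing which business key
# column(s) in each source map onto that business concept.
hub_metadata = {
    "vault.hub_customer": {"silver.crm_customers": ["customer_no"],
                           "silver.erp_clients":   ["client_id"]},
    "vault.hub_product":  {"silver.erp_products":  ["sku"]},
}

def load_hub(hub_table, sources):
    """One generic loader: hash the business key, union all sources,
    and append only the keys the hub hasn't seen yet."""
    frames = []
    for source_table, bk_cols in sources.items():
        frames.append(
            spark.table(source_table)
                 .select(F.concat_ws("||", *bk_cols).alias("business_key"))
                 .withColumn("hub_hash_key",
                             F.sha2(F.upper(F.trim(F.col("business_key"))), 256))
                 .withColumn("record_source", F.lit(source_table))
                 .withColumn("load_ts", F.current_timestamp()))
    staged = frames[0]
    for f in frames[1:]:
        staged = staged.unionByName(f)
    staged = staged.dropDuplicates(["hub_hash_key"])

    existing = spark.table(hub_table).select("hub_hash_key")  # hub assumed to already exist
    (staged.join(existing, "hub_hash_key", "left_anti")
           .write.format("delta").mode("append").saveAsTable(hub_table))

for hub, sources in hub_metadata.items():
    load_hub(hub, sources)
```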
I like how this video tries to explore a compelling reason to use Data Vault, or for that matter any ensemble data modelling technique that essentially normalizes data for further consumption. There is an underlying assumption that a data warehouse, lakehouse, or plain data lake is a technical solution, when it really is not. Data modelling in general exists to map business terms to standard entities, and this is a hard and painful process with very little technical solution. And think about it: if the DW does not do it, then the source has to, or worse, the consumer has to, and over time a mess emerges, while the industry will be busy selling a data condo that has data lake views*. Size of the company is rather a naïve assumption. Think what happens when company A acquires company B? Every data source, including customer data, is duplicated now. Think of working in a company where Marketing uses the term customer while Sales identifies them as clients. It also helps when different departments report numbers that paint different pictures because their data has not been reconciled in a central warehouse. I can go on and on with many such examples, but maybe, just maybe, data warehouse case studies are littered with exactly these use cases? Data modelling, much like SQL, is really old and has surprisingly resisted packaged solutions for enterprise-specific use cases. *Data Condos are due to debut when Lakehouses start to go the Data Lake way :D
This is so wrong. You are comparing a technology with a data modelling methodology. You have overlooked the key point of Data Vault: to integrate data across systems on business keys. You have to do this at some point in ANY warehouse or lakehouse; Data Vault offers an optimal way of modelling data to enable integration while supporting schema and data changes over time, with little to no impact on existing entities (up to the RDV layer anyway). No other modelling methodology does this. The "lakehouse" technologies just make it more efficient. And btw, DV does NOT require all tables to be split into hubs, links and satellites - this is a common misconception that reflects a poor understanding of Data Vault and data modelling in general.
300bn records in a dimension??!! I wonder what it would be?? People of the universe?? Or all living beings?? 25 years in DWH, seen a lot, but not that.
The discussion here is close to the difference whether you need a Data Vault or a Kimball style model. It depends. But in my opinion... a lake is a lake. It's not supposed to be a data warehouse.
The Delta Lake format is not magic. It keeps a log and new entries to the table. It's highly expensive as a "one table" approach. It's not intelligent in this regard and will blow up in your face if not understood. A table may in real life be 1 GB and in a few weeks be 50 GB, until you vacuum it and *lose* your history completely. Not even close to closing the gap with DV. Delta Lake as a concept is useful as a minor convenience to restore to last week, but we're talking days, certainly not years.
At 21:00 - you're arguing that you've got history in Delta Lake (not a lean feature!), but remember that the default is something like 7 days' retention, and this grows quite quickly from, say, 12 TB to 50 TB in a few weeks. You mention DV as isolating change, and that optimization is part of the solution. I don't think DV is suitable on Databricks, but this is where you make a false argument in presenting Delta Lake as a solution to the problem. The problem persists. It's not a solution, for sure. Your data is going to bloom 20x with a hefty price to pay for storage.
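For anyone who wants to see where those limits live, the knobs are roughly these (Spark SQL via PySpark; the table name is made up, and the property names/defaults are as I remember them from the Delta docs, so verify against your version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Time travel only reaches as far back as old data files are retained.
# Keeping a year of history means keeping (and paying for) a year of files.
spark.sql("""
    ALTER TABLE silver.customers SET TBLPROPERTIES (
        'delta.logRetentionDuration'         = 'interval 365 days',
        'delta.deletedFileRetentionDuration' = 'interval 365 days'
    )
""")

# VACUUM permanently removes files older than the retention window;
# after that, time travel beyond the window is gone for good.
spark.sql("VACUUM silver.customers RETAIN 8760 HOURS")  # ~365 days
```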
You're spot on in your analysis that creating another record because something was updated is expensive. That exact expense applies to the Delta Lake approach too. Of course, those tables act just like a big table in any database.
I just get the feeling the silver layer is not well explained and/or doesn't really have cross-industry consensus. It sure sounds good as a marketing sound bite to have Bronze/Silver/Gold rather than just Bronze/Gold. Aren't the silver layers you describe just the old Kimball-style staging -> star schema, just with some DQ and history? Couldn't you keep the raw data history in bronze and run the DQ into gold? What are you gaining by repeating source data, without any entity consolidation, in a separate silver layer?
A Silver layer guarantees a consistent schema even as your RAW sources evolve, and it's typically stored as Delta tables - so it's more performant, you can add columns, you have history and temporal queries, and you can make changes to the data without touching the RAW source. At least that's how we use it - our Bronze is a mix of virtual tables for files that can readily be virtualized, and persisted tables for those that can't. So even though we have different types of source files, they're all published as tables with the latest schema of the source, presenting a nice consistent interface to load Silver from.
So for clarity, I never build using "Bronze/Silver/Gold", we use "Raw/Base/Enriched/Curated" so my definitions for BSG might be off compared to how Databricks describe (hence the lack of consensus I'm sure!) The reason we store the cleansed (DQ), temporal (history) data as a separate layer is purely for reusability - we don't want every analyst, data scientist, BI dev etc to have to repeat the cleaning exercise (and maybe do it in a different way) so we simply clean once (where we can), use many times - we achieve that by storing as a separate layer in between our Raw and Curated objects. For us there's significant value in that, so it's a pattern we maintain.
Your argument comparing Data Lakes vs Data Vault misses the whole concept of Data Warehousing. A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process. Without Data Governance and Data Catalogs you create a Data Swamp. A Data Vault is platform-agnostic because it's a methodology.
I completely agree that any data platform without governance & cataloguing is doomed to failure - but to say that any other approach doesn't have those elements is incorrect. Also, that's one definition of a warehouse (Inmon), but we won't get into that :) Simon
Agree, Simon missed the point of integration. Simon talks about building the data warehouse on source tables, but what about multiple source systems with the same information about customers (business key integration)? In his proposed approach that would happen in the (conformed) dimension, a very complex one-step process. Data vault in your architecture is about separating concerns, doing the difficult stuff in different layers. But there are some good points in the vid to think about. From a lean perspective you could say it's an extra layer you have to build, and that takes time (hence automation). Oh, and the example of the Python script is not applicable to DV but to the tools of the vendor (how they built it), so don't blame data vault for that - you can build automation with Python on the fly. But again, if you build code in steps, and CI/CD it, you can control the process more easily and debug the code more easily (it could be an argument for generating object code).
@@rowanjohnstone9524, Kimball. Every man and their dog knows it, it’s time-proven, easy, well documented, scales well for 99% of organizations, has native Databricks support w/ SCD2 in DLT
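e.g. roughly like this inside a DLT pipeline, where `spark` is provided (API names from memory, table names made up - check the current DLT docs):

```python
import dlt
from pyspark.sql import functions as F

# Feed of customer change records already landed in the lakehouse.
@dlt.view
def customer_updates():
    return spark.readStream.table("silver.customer_changes")

dlt.create_streaming_table("dim_customer")

# Declarative SCD2: DLT maintains the validity columns for us.
dlt.apply_changes(
    target="dim_customer",
    source="customer_updates",
    keys=["customer_id"],
    sequence_by=F.col("change_ts"),
    stored_as_scd_type=2,
)
```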
Firstly, you seem to conflate lakehouse with Spark/Databricks, and as I understand it the lakehouse is a platform architecture, not a technology. Secondly, you focus too much, in my opinion, on the write-speed aspect of DV2; it also offers flexibility and aligns well with building an enterprise data model in an incremental and agile way. I agree using DV2 in a data lake doesn't make sense, but as part of the DW capability of a lakehouse it can.
300 billion row dimension.... LOL. Someone forgot their outriggers and/or mini dimensions to break off their more quickly changing data. Save yourself the headache of even engaging..... lol.
A hub is a non-changing PK????? Sorry... I will limit myself to one emoticon ☹! Forget about data vault. Understand what a business key means first. The concept of a primary key doesn't even hold significance in a discussion of business keys. When you can't mention business keys in the context of a hub.... you have explained enough about why you cannot appreciate data vault. Such half-baked ideas have held businesses hostage for a while now. Don't worry, very soon a very easy way to implement data vault is coming!
I totally agree with the idea of 'old school' vs 'new school' SQL automation. Just to clarify, I don't see the value of a Data Vault with a Delta tables/lakehouse approach either.
But I have something to add from the Data Vault world. DV is, mainly, an integration pattern. It means that the challenges it aims to ease are integration challenges - challenges that are difficult to sustain in the long term if you don't have proper and flexible data models.
An integration challenge is not having a couple of sources of data on the same business concepts; it is about having multiple sources (4, 10, 50, etc.) related to the same data that your company, usually for a business requirement, has to reconcile. So DV is not actually about data volume, it is about business complexity and long-term viability.
For example, many public organizations have to collect data from several other entities, where it is not trivial to set standards to receive the data in the same structure (or it is just impossible). You could have dozens of different datasets, each with its own structure and semantics and containing multiple business concepts, that you need to model so you can relate the information together, complementing and reconciling. Hence the importance of modelling the data and separating concerns into hubs, links and satellites.
I don't know how one can manage something like that in a lakehouse approach where there is no integration (business-driven) layer. How do you navigate through hundreds of data structures where there are no business elements identified at all before the dimensional (serving) layer? Only metadata-based? But then you have a heavy, tightly coupled integration process to create your dimensions, where the metadata is hard to follow. Then you have a problem, and you don't even need 1+ TB of data to feel it.
In these cases I actually don't know what lakehouse approach to follow, and I would choose a DV approach with relational databases instead. What could a lakehouse give me in those scenarios?
Thanks!
Aha! Thanks Rodrigo, that's really good insight & context to the drivers behind the methodology.
It's true that reconciling across multiple different sources describing the same entity (ie: we have 10+ sources defining a 'customer' each of which has different attributes) is a challenge. With a lakehouse, we would usually treat these different sources as separate data feeds into our Silver layer, then build an integration layer by mapping these disparate objects together into common models, either virtually (simply through decoupled views) or materialising it as an intermediary lake layer. But I happily admit that process gets convoluted and complex when we have more than a few sources that need mapping, each with their changes over time - I'd be curious how many DV adoptees have this level of integration complexity as the driver and what kinds of systems we're commonly seeing as the culprit!
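To make the 'virtual' option a bit more concrete, a rough sketch (schema, table and column names all made up) of a decoupled integration view mapping two customer feeds onto one common model:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each source keeps its own Silver table; the integration layer is just a
# view that maps both onto a shared set of columns plus a common key.
spark.sql("""
    CREATE OR REPLACE VIEW integration.customer AS
    SELECT
        upper(trim(customer_no))             AS customer_key,
        full_name                            AS customer_name,
        email                                AS email,
        'crm'                                AS record_source
    FROM silver.crm_customers
    UNION ALL
    SELECT
        upper(trim(client_id))               AS customer_key,
        concat(first_name, ' ', last_name)   AS customer_name,
        contact_email                        AS email,
        'erp'                                AS record_source
    FROM silver.erp_clients
""")
```

Materialising it instead is essentially the same query written as a create-table-as-select, refreshed on a schedule.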
You could certainly utilise the same process and patterns within the Lakehouse if that level of integration was the challenge you were facing, accepting some of the performance quirks. I'm curious if there is a compromise that can be made harnessing some of the tech but keeping some of the process benefits, specifically when facing that issue.
Simon
This is true. DV is just getting your data into 6th normal form - 5th, including time variance. Nothing more, nothing less. You only split up your data that far for one reason: integration. When you are a bank that has to report to a central bank and stakeholders, and you are buying other banks and selling off parts, then it makes sense. In all other cases, only do it when you are in love with data modeling.
@@AdvancingAnalytics I am glad we agree on that :) I thought I was missing something. Now, consider that the described scenario is basically the same whenever you need a long-term data model (platform and tools do not matter), because over 10, 20+ years your organisation will be changing its internal systems, connecting to different providers and different systems. Therefore, maintaining traceability and auditability of all that could be a business requirement.
The concept of a DWH is precisely that: a long-term data store that is resilient to change - and not only the change in some columns that Delta tables can manage, but continuous change in the data sources, which is a given for many.
And sure, not all organisations actually require a data warehouse, and I would recommend lakehouse approaches there (speed is more important). That would be the case for many small-medium companies, startups/unicorns, and even big companies without long-term data compliance requirements. But when you are required to trace data for dozens of years (banks, public companies, NGOs, etc.), you need a sustainable process. This is supported in Data Vault by graph theory and a sustainable complexity level. So, in the end, it is not just a small bunch of companies, is it?
Consequently, I wouldn't recommend a platform that cannot support a proper data warehouse (a sustainable data model) to these clients, even when Databricks has defined the lakehouse as the new data warehouse paradigm. I think they are not at that point yet. And yes, things will continue improving, since they are aware of that. Just consider that Delta tables are just the beginning of the real data warehouse journey.
On the other hand, we have the other scenario, where 3rd-generation data warehouses are adopting the software engineering paradigm and real automation could be possible. I won't focus on that here, but definitely, we are living in interesting times in the data industry!
Old school vs new school SQL automation is an interesting topic. In my experience (we did two projects), metadata-driven automation is not as simple as it looks. 'One record per table' only applies to hubs. There are multiple records per link, depending on how many relationships are stored in the link table. It is an even larger number of records for a satellite, depending on how many attributes you want to bring from source to the satellite. We use a semi-automated process to generate those metadata records, but it is still a lot of effort to maintain. Data lineage is also a problem, as it is hidden in the metadata tables. The bigger challenge is data orchestration, which is difficult to fully automate. Automated logging and error handling is also not an easy task. Currently we have switched back to the old-school method, using an automation tool to generate an end-to-end structure.
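To give a feel for why it isn't 'one record per table', the metadata ends up shaped something like this (a simplified sketch with made-up names; our real entries carry many more columns):

```python
# One metadata record drives one hub.
hub_meta = [
    {"hub": "hub_customer", "source": "stg.crm_customers", "business_key": "customer_no"},
]

# Links need one record per relationship stored in the link table.
link_meta = [
    {"link": "lnk_customer_order",   "source": "stg.orders",
     "hubs": ["hub_customer", "hub_order"],   "keys": ["customer_no", "order_no"]},
    {"link": "lnk_customer_address", "source": "stg.crm_addresses",
     "hubs": ["hub_customer", "hub_address"], "keys": ["customer_no", "address_id"]},
]

# Satellites blow up further: one record per attribute carried from source.
sat_meta = [
    {"sat": "sat_customer_crm", "source": "stg.crm_customers",
     "business_key": "customer_no", "attribute": col}
    for col in ["full_name", "email", "phone", "segment", "country"]
]
```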
For me, how easily it handles data integration (compared to other modelling approaches) and parallel processing is the main attraction of using DV.
Wouldn't having a good data catalog solve this issue entirely?
We are currently building DV 2.0 on a Lakehouse for a large Financial Services company with lots of sources. For us the key benefits are:
- It forces a light modelling of data on the way in by associating data to subject areas. This helps us understand what we have and also means users can query data without having to understand and traverse each source system's data model - the linkage is done for them.
- Use of tracking satellites to achieve performance and flexibility. By separating what we receive from when we received it, we can avoid storing data redundantly. More importantly, by performing ingestion and sequencing (CDC) as separate steps, we can load data in any order (rough sketch below). This is a game changer in terms of flexibility and operating the platform.
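Very roughly, the separation looks like this (a simplified sketch with made-up names; the details of a real tracking-satellite implementation differ):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Step 1 - ingestion: append whatever arrives, whenever it arrives. The feed
# is assumed to carry the source's own change sequence (source_change_seq),
# which is independent of the order we happened to receive files in.
incoming = (spark.read.table("bronze.crm_customer_feed")
                 .withColumn("hub_hash_key", F.sha2(F.col("customer_no"), 256))
                 .withColumn("hash_diff",
                             F.sha2(F.concat_ws("||", "full_name", "email"), 256))
                 .withColumn("load_ts", F.current_timestamp()))
incoming.write.format("delta").mode("append").saveAsTable("vault.sat_customer_crm")

# Step 2 - sequencing, as a separate step: resolve the effective order of
# changes per key from the source sequence, not from arrival time, so late
# or out-of-order loads still converge on the same history.
spark.sql("""
    CREATE OR REPLACE VIEW vault.sat_customer_crm_current AS
    SELECT * FROM (
        SELECT s.*,
               row_number() OVER (PARTITION BY hub_hash_key
                                  ORDER BY source_change_seq DESC) AS rn
        FROM vault.sat_customer_crm s
    ) WHERE rn = 1
""")
```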
Regarding "light modeling", what is in your opinion then the advantage of DV over 3NF modelling in lakehouse?
I'm in a financial services company that operates in multiple jurisdictions/countries. We also have multiple systems (3rd party and internal) for both current and legacy workloads.... so they overlap a lot in function while producing significantly different data. We gave medallion a real try, but have found that after ~6 months the silver layer has become a bit 'muddy' and complicates cases like multi-point-in-time audits (e.g. a report cutoff distinct from the production cutoff, with source updates in between), GDPR compliance, and mandatory compliance reporting in each country, which often has similar but different rules. Securing everything in a way that doesn't make access complicated was also a challenge (though Unity Catalog is making it a little easier.... where's my ABAC? :D)
Note: I did allow people to leverage DBT...and maybe that's part of the muddiness?
I'm not sure if full DV2 is the way to go to make things more efficient with fewer code paths between source and mart, but at least it does provide some fairly specific 'rules' to help consistently structure everything from source, to raw vault to business vault then mart. Maybe the raw to business vault transforms would become muddy just like the current "silver" staging tables? I'd love to know if you've doubled down on your perspective or broadened it over the last year Simon @AdvancingAnalytics
Hi Simon, a really well-argued case for doing things more simply without DV. Please keep doing the rants - they are very informative and entertaining!
A year ago I came to many of the same conclusions, but it took many discussions. Thank you for putting your opinion publicly and helping everyone!
We did implement star schemas for smaller department needs in the past. Later we adopted 3NF (Teradata to Synapse Analytics), and achieve star schemas with the help of views & materialisations as required. I have worked on many M&As and divestitures; the key is how future-proof your data model is - can it handle a new source and seamlessly unify the data?
To me a data vault is useful for creating consistent entities in your silver layer, such as a customer entity that gold layer products can use without being affected, even when you add a new CRM to your pipeline.
Interesting. AWS reckon you should plan to re-architect every 18 months if you're a startup. So in that case, you bring in your new CRM and your gold layer is unaffected.
I’m glad you said it.. cos I’ve been saying this for ages. Great videos
Interested in your views on an alternative data architecture that is platform-agnostic. DV2 does seem complex and could get out of hand if not implemented correctly, but I'm struggling to determine an alternative that satisfies the agility and automation capabilities it can bring.
This was incredibly well argued. From someone who was planning to do exactly that, i.e. implement DV on a lakehouse, I thank you very much for sharing your POV and expertise on the subject. Do keep making this kind of content, it is very, very useful.
Hi Simon,
Thanks so so much!!!
You just read my mind. Finally I found a better data architect friend like you 😊
Around 2016, I had already implemented that additional historical Dimensions technique in Silver (regardless of whether it is SCD1 or SCD2). So that, if we change the business scenario from SCD1 to SCD2 at any time, we can still go back and capture those historical records from Silver. It just requires reloading the Star schema, with minimal loading logic change.
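In today's lakehouse terms the idea looks roughly like this (a simplified sketch with made-up table and column names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Silver keeps every version of every record, regardless of whether the
# business currently wants SCD1 or SCD2 downstream.
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.customer_history (
        customer_id STRING, name STRING, segment STRING,
        valid_from TIMESTAMP, valid_to TIMESTAMP, is_current BOOLEAN
    ) USING DELTA
""")

# SCD1 dimension: just the current rows.
spark.sql("""
    CREATE OR REPLACE TABLE gold.dim_customer_scd1 AS
    SELECT customer_id, name, segment
    FROM silver.customer_history
    WHERE is_current
""")

# If the business later switches to SCD2, rebuild from the same history -
# no change to the Silver loading logic, only to this serving step.
spark.sql("""
    CREATE OR REPLACE TABLE gold.dim_customer_scd2 AS
    SELECT customer_id, name, segment, valid_from, valid_to, is_current
    FROM silver.customer_history
""")
```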
Hi everyone, I'm not a Data Lakehouse expert, but I know that Data Vault pays off as the Data Warehouse grows. There is no sense in building a DV model for an organization with several or a dozen data sources. However, if there are already hundreds of systems, they change quickly, and there are many teams developing the DV model, then in such situations it's worth thinking about DV. Business definitions are almost unchangeable, and the DV model gives us benefits in this situation. Of course, DV requires additional work, standards and knowledge. But look, a Data Warehouse also requires a lot of work, and we build it for specific reasons. Let's always do something for a specific effect: don't use a cannon to kill a fly, but also don't shoot a bear with a slingshot.
Data vault is a very good thing if, e.g., you are building your dimensions out of very many disparate data sources. Via the hub tables it becomes easy to assemble them. Also, you do not have to take into account effects like late-arriving dimensions in your loading process. It is all handled by DV already.
Hi Simon, as always I have really enjoyed your rant!
It's just that I am one of those passionate practitioners who vouch for the benefits of the DV2.0 methodology... so I see myself compelled to react.
One of the major benefits of adopting the DV2.0 methodology is 'integration by business object', and modeling the business (processes) instead of the source system, or instead of having pipelines that simply ingest files easily. A pity you left that one out. Just as you left out the whole raw vs business vault talk - sure, it popped up in the medallion architecture design, but this is a very important part when talking about the complexity of the automation. A good automation tool should be able to use metadata to automate all the code to get data into the raw vault, plus apply forced integration of business objects over different sources - similar to the generic Python script you dub 'modern automation'. The data vault integration of business objects adds, on top of that, enterprise-wide integration and historization, enablers for data governance, and versions of the facts (vs a single version of the truth), just to name a few.
Applying business rules to the data remains a more manual and hard-to-automate thing, with Python or without. The DV2.0 methodology has the concept of a business data vault; in the lakehouse approach the raw and business data vault layers would co-exist in the silver layer. I have always felt like there was a big leap between the silver and gold layers, a gap perfectly filled by DV2.0.
This brings me to your point of getting the data out. Sure, DV2.0 is write-optimised - did you know it aims at eliminating updates entirely? To everyone reading this, I would advise investigating further into the DV2.0 write-only architecture and PIT and BRIDGE tables (aimed at getting data towards the BI tools). People have actually come up with solutions to eliminate the joins from hell.
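For anyone curious, the PIT idea is roughly this (a simplified sketch with made-up names; real implementations add ghost records, snapshot management, and one resolved column per satellite):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A PIT (point-in-time) table pre-resolves, for each hub key and snapshot
# date, which satellite row was in effect at that moment.
spark.sql("""
    CREATE OR REPLACE TABLE vault.pit_customer AS
    SELECT
        d.snapshot_date,
        h.hub_customer_hash_key,
        max(CASE WHEN s.load_ts <= d.snapshot_date THEN s.load_ts END) AS sat_crm_load_ts
    FROM vault.hub_customer h
    CROSS JOIN gold.snapshot_dates d
    LEFT JOIN vault.sat_customer_crm s
           ON s.hub_customer_hash_key = h.hub_customer_hash_key
    GROUP BY d.snapshot_date, h.hub_customer_hash_key
""")

# Consumers then do cheap equi-joins on (key, load_ts) instead of
# BETWEEN-style temporal joins - that is how the "joins from hell" go away.
spark.sql("""
    SELECT p.snapshot_date, c.full_name, c.email
    FROM vault.pit_customer p
    LEFT JOIN vault.sat_customer_crm c
           ON c.hub_customer_hash_key = p.hub_customer_hash_key
          AND c.load_ts = p.sat_crm_load_ts
""").show()
```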
Other advantages, like virtual layers for the business data vault and star schemas, CDC, real-time, lambda, kappa, only benefit from cloud tech like Databricks....
In adopting DV2.0 there is only stuff to gain, little to lose - as you said, there is real value in the methodology.
Lastly, thanks for putting this on the radar and I hope this will open dialogue between lovers/haters, deliver sceptics from doubt, and correctly inform the gullible ;-)
Simon, what interesting talks we could have over beers on this topic ! CHEERS !!
reach out to me at eric.janssens@fooinc.be for hatemail, praise, or questions
Eric, it's the law of diminishing returns. Once your data lake matures, things take less time to build. Conformity promotes data trust, and the business-domain-specific elements can be surfaced in specific views. These views can be governed and managed. Inflating the code base increases the operational costs and code management processes, which Simon points out. Data engineering principles of encapsulating logic or patterns are applied not just by creating Python scripts but by wrapping this up in deployable libraries linked to metadata-driven pipelines. This helps standardise engineering across the enterprise and drives down dev costs, rework, bug fixing, duplication and processing costs. I can't see how developing more elements helps the DV2.0 argument. I've seen a properly developed Lakehouse architecture work with a framework that onboarded 16 decently sized projects in a year, delivered by an agile methodology.
Hey Eric - thanks for the thought-out, detailed response! The responses championing DV in these scenarios really highlight that "very complex integration" challenge. Without diving too much into the details of the methodology (which I'm certainly no expert in, hence only loosely covering it in the video!)...
It's interesting on the pushes for more write-optimised modelling. You can architect many data systems to be append-only, but that usually just pushes the work of filtering/deciphering the model to the reader. For me, the entire point of these systems is to enable analytical exploration, so pushing complexity/performance onto the reading party is the antithesis of what I'm designing for. If I take a hit on my write speed but enable users to easily query the platform, that's a compromise I'm willing to take. Are there use cases other than complex integration of many overlapping business systems that drive for this approach?
I'm curious about the other approaches you mentioned - PIT, BRIDGE tables etc which aim to get the data towards the business. If we're shaping the data as an integration exercise, then adding further elements to the model that help us reshape it for the business, that feels like a lot of extra steps. Anywhere you would point to with a quick overview of those processes?
Always up for a beer and a discussion if we end up in the same place ;)
Simon
I always thought that the flexibility data vault delivers (everything many-to-many, for example), creating extra satellite tables for different sources etc., is all pretty much handled by Delta Lake as a technology. Also, if the source systems suddenly have a relationship that changes cardinality, your data vault may be able to handle it, but your downstream dimensional models still break, as they are never always many-to-many. So the only thing you solve there is the data warehousing load, which is now more flexible, not the data mart delivery to reporting. So a change in cardinality still breaks your reports and star schemas.
In a data lakehouse, Delta Lake can handle changing schemas (OK, you might need to overwrite to a new version, but still, it's possible) to cope with changing source systems. So to me, the raw layer can be modelled after the sources. The integrated business layer could be 3rd normal form, if you need a more neutral, business-process-minded modelling approach. So in that case, the data warehousing side of things can also be dealt with.
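e.g. something roughly like this in Spark (table and path names made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

new_extract = spark.read.parquet("/landing/crm/customers/latest/")

# Source added a column? Append with schema merging so the Delta table
# picks up the new column without a manual ALTER TABLE.
(new_extract.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("raw.crm_customers"))

# For incompatible changes (dropped or retyped columns), the alternative is
# to overwrite the table to the new version instead.
(new_extract.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("raw.crm_customers"))
```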
What does data vault bring to the table in a Delta lakehouse world? It seems like added complexity to me? Would love to hear how you see this
and count me in for that beer!
@@AdvancingAnalytics At my job we're considering a light version of data vault. The main benefit I'm going after is modelling in a way where we can support multiple versions of the data at the same time so that we can release new versions without breaking what's already being used by other teams. The data vault enables you to do that as you can create new tables that work well with the existing ones and leave the existing things in place as is. We also don't need two entire pipelines to support two versions. We just add the bits that create the new tables/views. The new tables don't have to contain all of the columns since those already exist so we can run them over history much cheaper than rerunning the complete pipeline for everything.
What's your usual approach to managing multiple versions existing at the same time?
@@alexischicoine2072 We're also facing the problem of multiple versions (of the same data) existing at the same time. Denormalized datasets are good for analytics and data analysts, but introduce more problems around data quality, consistency and usability. Overuse of simple pipelines that only parse staging tables into denormalized datasets will make the data lake look like a "data mart data warehouse" (Bill Inmon's term). You need to manage more data pipelines to keep their run order, write quality rules for all datasets, and data lineage becomes hard to manage because column meanings are now duplicated across many datasets. For me, the silver zone should be like an integration layer where people can look in and understand the business model of the company, so it needs a data model. The gold layer should be a view or a materialized view based on this integration layer. (Sorry for my English)
Thanks for starting the conversation. I'm interested to understand a better methodology for implementing a data warehouse in the cloud, if something like DV2 isn't the ideal?
Going from staging to denormalized structures seems to be a recipe for disaster without a middle layer that provides agility and structure? I may be being naïve here, but I can't see an alternative method that provides ease of automation, agility, flexibility and isolation. I do get the added complexity, and that has to be carefully managed.
Are you suggesting there should be no structured Silver layer? I am genuinely open to learning the new world approach as I may be behind the curve here.
We follow more of an Inmon architecture because we want a 3NF abstraction layer to normalize the model and semantics. Maybe we could get by without that layer if all our sources were completely distinct, but because we need to do entity resolution and we don't want to duplicate the logic, we need a place to land that data. From there, it might go to an analytical data mart or a more operational data product, modeled for their varied performance requirements.
To me, DV just seems to be Inmon with an evolving model (but not evolving tables). That seems like a recipe for technical debt. The longer it lives, the more complex it would get. I would rather swallow the frog and refactor as needed, with the help of a robust DevOps testing and deployment process.
We built that modern metadata and configuration driven framework to abstract the shared code, but there's still significant business logic, after the Silver layer, that benefits from DevOps principles.
Completely agree with you Simon! Love the rants - keep'em coming!!!
If you need perfect history retention you will either have to design a DV (expensive) or buy a lot of disk space (also expensive). These are two viable approaches, but there's no in-between as presented in the video (as far as I know). If I'm wrong, I'd love to know how. The "perfect history retention" requirement is probably the key thing; most organisations don't need it, and that thought should be forgotten. That makes things cheaper (by a lot), and you might use Delta Lake for minor loading enhancements, such as rolling back a day or so. But for "perfect" time simulation, you're back to choosing between designing upfront and buying disk space.
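To be concrete, the kind of "minor loading enhancement" I mean is roughly this sort of thing - a sketch with a hypothetical table name, and only possible within the retention window:

```python
# Read a recent prior state of a Delta table, or roll the live table back to it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Query the table as of an earlier point in time...
yesterday = (spark.read
                  .option("timestampAsOf", "2024-06-01 00:00:00")
                  .table("silver.orders"))

# ...or restore the live table to that state entirely.
spark.sql("RESTORE TABLE silver.orders TO TIMESTAMP AS OF '2024-06-01 00:00:00'")
```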
On model agility, there doesn't seem to be a clear winner - to me at least.
Completely agreed! We are currently recovering from a vendor who was a blind advocate of DV 2.0 and convinced management to make a (significant) up-front investment to build this in our lake. We are now left high and dry with a mountain of convoluted scripts, terrible performance that requires frequent tuning, deployment conflicts, and a frustrated business - who NEVER realized any of the downstream benefits that were promised.
To be fair, lots of companies don't use a data vault architecture and suffer the exact same problems you're implying the DV introduced.
@gardnmi I didn't imply anything: DV 2.0 did introduce those problems at my company. Your point that the same problems exist elsewhere for other reasons isn't a real argument - it's a straw man. I'm speaking about issues that occur directly in the data vault layer and exist in my org as a result of that architectural decision.
@rudnickj I have often seen problems happen because people start to apply DV but don't actually follow its standards or use its automation strengths. In particular, individuals writing their own scripts and building ETL manually, step by step, is a recipe for problems. A detailed audit might (read: will) reveal that the method was not followed and the data landscape not understood.
The same applies to all methods - doing stuff wrong and calling it "method X" leads to problems.
I would not start a DV built by individual consultants scripting things by hand. A DV should be generated with tools, by people who understand both the details and the wider business context.
(Yes, that's me in the list of authors in the Databricks blog)
While I might quibble with your use of Silver, and would assert that the Inmon data landscape can have a DV silver tier just as it can have a normalized silver...
...aside from those minor points, you are absolutely right. The main point is: if, and only if, you already have this modeling - this approach for managing change and tracking history - and you do NOT want to learn a new way and teach new techniques, then a Lakehouse can accommodate you.
But to your question at the end, would this be my first recommendation, NO!
My biggest concern with DVs is the conceptual cost: the overhead and loss of clarity in a DV, compared with a normalized model, is not worth the technical benefits.
Having a clean, understandable, change-tracked, normalized Silver DW allows you to better build (and rebuild, to your point) a star schema-based model in "Analytic Gold", or a DS/ML-friendly flat model in "DS Gold", or a denormalized single table for my app devs in "Web App Gold", etc.
Thanks so much for your thoughts and clear explanations!!!
Hey Glenn, thanks for the input. I'd love to actually dig into how you folks use the silver layer one day - we don't adopt the medallion architecture, we have our own lake curation process that's similar but home-grown over the past 5 or so years, hence my definition of silver being slightly different!
Appreciate the thoughts on DV and where (if) it fits!
I have to disagree with you,
You put a lot of emphasis on the implementation side of data vault vs the medallion architecture, and you are right to say that the medallion architecture achieves some decoupling between the end-result star schema and the initial raw data, as data vault does.
However, data vault aims for two things:
1. Decoupling business rules from persisted data, i.e. don't store the business rule; store raw data and use views instead for better agility (see the sketch below).
2. Monitoring and auditing
What I feel you didn't touch upon is the ability of Hubs and Links in DV to provide integration of data across many systems in a way that scales linearly. A data lake is not integrated.
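To illustrate point 1 above - a sketch only, with hypothetical names and a made-up rule; the point is that the business rule lives in the view, not in the persisted rows:

```python
# The segmentation rule can change without reloading any stored data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
CREATE OR REPLACE VIEW business.customer_current AS
SELECT customer_bk,
       email,
       CASE WHEN total_spend >= 10000 THEN 'VIP' ELSE 'STANDARD' END AS segment
FROM (
    SELECT h.customer_bk, s.email, s.total_spend,
           ROW_NUMBER() OVER (PARTITION BY s.customer_hk
                              ORDER BY s.load_ts DESC) AS rn
    FROM   rawvault.hub_customer h
    JOIN   rawvault.sat_customer_details s
      ON   s.customer_hk = h.customer_hk
) latest
WHERE rn = 1
""")
```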
Man, show me a detailed business requirement that justifies this amount of engineering before ever deciding to embark on such a mission. Then, on top of that, let's find enough staff out there who all understand the concept and have 5 years of experience implementing such a solution. Then, on top of that..... have fun keeping that team marching to the beat of the same drummer. Good luck!
How about using dbt to address the automation issue?
Excellent discussion/rant!!! Keep it up.
@Simon "Do you need that level of rigor, process, and protection against occasionally having to reload a table?(14:16)" Yes, IMHO. The level of rigor is made easier these days with automation. DV2.0 allows us to reach the holy Grail of being accountable for our data assets. Change is inevitable not occasional. DV2.0 allows us to embrace change smartly w/o ditching the proven methods from Inmon or Kimball. I would emphasize 5NF for Satellites and Links instead of 3NF -taking inspiration from CJ Date as a modeling guideline. Yes it makes matters more complicated conceptually (not necessarily logically or physically) for the sake of truly getting rid of redundancy. This is important because compute & network are the most expensive cloud components. Those w/o a dedupe and anti-redundancy strategy are needlessly overpaying.
I think Data Vault is beneficial for very large organizations with extremely high dimensionality and regulations, such as the DoD in the USA. I wouldn't even limit it to Fortune 500 companies; I would say it's essential for entities that are highly regulated in every aspect. In such cases, you'll need a certificate to access the system, and your superior will need proof (e.g., a certificate) that you are qualified to handle it. I believe this is the use case where you can benefit from the methodology, process, and approach.
For every other use case, I totally agree that modern data engineering is the way to go. The Python script in your example abstracts away the table metadata and maintains the EL(T) process with external table configuration - the "T" here represents the historization, if needed (bronze-silver) - and that's the right approach if you can use a technology like this. If not, that's another question.
I'm not a Databricks user, but I like your channel - keep up the good work. Thanks Darvi!
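For illustration, the config-driven EL(T) pattern described above might look roughly like this - just a sketch; the paths, table names and config shape are all hypothetical, and the "T" here is simply historization by appending with a load timestamp:

```python
# One generic loader; all table-specific detail lives in external metadata.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

tables = [
    {"source": "/landing/crm/customers", "target": "silver.customers"},
    {"source": "/landing/erp/orders",    "target": "silver.orders"},
]

for cfg in tables:
    (spark.read.parquet(cfg["source"])
          .withColumn("_loaded_at", F.current_timestamp())
          .write.format("delta")
          .mode("append")
          .option("mergeSchema", "true")
          .saveAsTable(cfg["target"]))
```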
Great thoughts. Good to know I wasn't the only one thinking in this direction. Wondering if we can go one step further and have only a two-tier architecture?
1. Staging/Bronze layer
2. A combined Silver/Gold layer, directly building star schema Delta tables?
It would be very hard to disagree, and I am happy there is no need to! Thanks.
It's like you read my mind. I've had a DV-critical article sitting in my LinkedIn drafts for years - never dared to publish it 😂
Years ago there was a complete metadata-driven DV solution built using Kettle. I imagine it wouldn't be too hard to migrate it to Apache Hop, if one wanted to :)
You can actually make DV perfectly metadata-driven. You could do this with one metadata-driven notebook that builds a whole DV2 for many sources, and thus hide the complexity - and with this you add the business/graph context to your data that the Lakehouse simply does not provide as a methodology. Capturing this in the modelling also gives you the opportunity to apply it relatively easily somewhere else: Iceberg, Snowflake, etc.
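As a hedged sketch of what such a metadata-driven notebook could look like - hub and satellite loads generated from a small mapping config rather than hand-written per source. The names, config shape and hashing convention are all assumptions, and the target tables are assumed to already exist:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

mappings = [
    {"source": "staging.crm_customers",
     "business_key": "customer_no",
     "hub": "vault.hub_customer",
     "satellite": "vault.sat_customer_crm",
     "attributes": ["name", "email", "segment"]},
    # ...one entry per source feeding the vault
]

for m in mappings:
    src = (spark.table(m["source"])
                .withColumn("hub_hk", F.sha2(F.col(m["business_key"]).cast("string"), 256)))

    # Hub: insert only business keys not already present.
    existing = spark.table(m["hub"]).select("hub_hk")
    (src.select("hub_hk", F.col(m["business_key"]).alias("business_key"))
        .dropDuplicates(["hub_hk"])
        .join(existing, "hub_hk", "left_anti")
        .withColumn("load_ts", F.current_timestamp())
        .write.format("delta").mode("append").saveAsTable(m["hub"]))

    # Satellite: descriptive attributes plus load metadata, appended as-is.
    (src.select("hub_hk", *m["attributes"])
        .withColumn("load_ts", F.current_timestamp())
        .write.format("delta").mode("append").saveAsTable(m["satellite"]))
```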
We are currently trying to do this. Can you share some insights or resources on how you did the metadata-driven integration?
Could you elaborate on the steps required to achieve the new automation you spoke of please?
I like how this video tries to explore a compelling reason to use Data Vault - or, for that matter, any ensemble data modelling technique that essentially normalizes data for further consumption. The underlying assumption is that a data warehouse, lakehouse, or plain data lake is a technical solution, when it really is not. Data modelling exists to map business terms to standard entities, and that is a hard and painful process with very little in the way of technical solutions. And think about it: if the DW doesn't do it, then the source has to - or worse, the consumer has to - and over time a mess emerges, while the industry stays busy selling a data condo that has data lake views*
Using the size of the company as the criterion is rather naïve.
Think what happens when company A acquires company B? Every data source, including customer data is duplicated now.
Think of working in a company where Marketing uses the term customer while Sales identifies them as clients?
It also helps when dealing with numbers reported by different departments that present a different picture because their data has not been reconciled in a central warehouse.
I can go on and on with many such examples but maybe, just maybe data warehouse case studies are littered with many such use cases?
Data modelling much like SQL, is really old and has surprisingly resisted packaged solutions for enterprise specific use cases.
*Data Condos are due to debut when Lakehouses start to go the Data lake way :D
Brilliant argumentation
This is so wrong. You are comparing a technology with a data modelling methodology. You have overlooked the key point of Data Vault: to integrate data across systems on business keys. You have to do this at some point in ANY warehouse or lakehouse; Data Vault offers an optimal way of modelling data to enable integration while supporting schema and data changes over time, with little to no impact on existing entities (up to the RDV layer anyway). No other modelling methodology does this. The "lakehouse" technologies just make it more efficient. And btw, DV does NOT require all tables to be split into hubs, links and satellites - this is a common misconception that reflects a poor understanding of Data Vault and data modelling in general.
Can you give an example of what kind of tables don't have to be one of those types? Trying to learn here.
I liked your video so much, thanks.
Great video.
300bn records in a dimension??!! I wonder what it would be?? People of the universe?? Or all living beings?? 25 years in DWH, seen a lot, but not that.
Maybe something like 10 years of IoT smart meter data from the energy utility industry. Probably telco too.
@badass_omelette5166 In a fact table, yes. But in a dimension??
The discussion here comes close to the question of whether you need a Data Vault or a Kimball-style model. It depends. But in my opinion... a lake is a lake. It's not supposed to be a data warehouse.
The Delta Lake format is not magic. It keeps a log and writes new entries to the table. It's highly expensive as a "one table" approach - it's not intelligent in this regard and will blow up in your face if not understood. A table may in real life be 1 GB and in a few weeks be 50 GB, until you vacuum it and *lose* your history completely. Not even close to closing the gap with DV. Delta Lake as a concept is useful as a minor convenience to restore to last week, but we're talking days, certainly not years.
At 21:00 you're arguing that you've got history in Delta Lake (not a lean feature!), but remember that the default is something like 7 days of retention, and storage grows quite quickly - from, say, 12 TB to 50 TB in a few weeks. You mention DV as isolating change, and that optimization is part of the solution. I don't think DV is suitable on Databricks, but this is where you make a false argument, presenting Delta Lake as a solution to the problem. The problem persists. It's not a solution, for sure. Your data is going to balloon 20x, with a hefty price to pay for storage.
You're spot on in your analysis that creating another record because something was updated is expensive. That exact expense is what you get with the Delta Lake approach. Of course, those tables act just like a big table in any database.
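For reference, these are roughly the knobs behind the growth/retention trade-off described above (table name hypothetical): longer retention keeps more history but costs more storage, and VACUUM reclaims the storage along with any history older than the window:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep old data files (and hence time travel) for 30 days instead of the default ~7.
spark.sql("""
  ALTER TABLE silver.orders SET TBLPROPERTIES (
    'delta.logRetentionDuration'         = 'interval 30 days',
    'delta.deletedFileRetentionDuration' = 'interval 30 days'
  )
""")

# Reclaim storage; anything older than the retention window is gone (30 days = 720 h).
spark.sql("VACUUM silver.orders RETAIN 720 HOURS")
```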
I just get the feeling the silver layer is not well explained and/or doesn't really have cross-industry consensus. It sure sounds better as a marketing sound bite to have Bronze/Silver/Gold rather than just Bronze/Gold.
Aren't the silver layers you describe just the old Kimball-style staging -> star schema, with some DQ and history added? Couldn't you keep the raw data history in bronze and run the DQ on the way into gold? What do you gain by repeating source data, without any entity consolidation, in a separate silver layer?
A Silver layer guarantees a consistent schema even as your RAW sources evolve, and it's typically stored as Delta tables - so it's more performant, you can add columns, you have history and temporality, and you can make changes to the data without touching the RAW source. At least that's how we use it - our Bronze is a mix of virtual tables for files that can readily be virtualized, and persisted tables for those that can't. So even though we have different types of source files, they're all published as tables with the latest schema of the source, presenting a nice consistent interface to load Silver from.
So for clarity, I never build using "Bronze/Silver/Gold"; we use "Raw/Base/Enriched/Curated", so my definitions for BSG might be off compared to how Databricks describes it (hence the lack of consensus, I'm sure!).
The reason we store the cleansed (DQ), temporal (history) data as a separate layer is purely for reusability - we don't want every analyst, data scientist, BI dev etc to have to repeat the cleaning exercise (and maybe do it in a different way) so we simply clean once (where we can), use many times - we achieve that by storing as a separate layer in between our Raw and Curated objects. For us there's significant value in that, so it's a pattern we maintain.
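As a simplified sketch of "clean once, use many times", using our Raw/Base naming - the rules and names here are hypothetical, but the idea is that every consumer reads the cleansed Base table instead of re-implementing these fixes against Raw:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Apply the shared cleansing rules once...
cleansed = (spark.table("raw.crm_customers")
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .withColumn("country", F.coalesce(F.col("country"), F.lit("UNKNOWN")))
    .dropDuplicates(["customer_id"])
    .withColumn("_cleansed_at", F.current_timestamp()))

# ...and persist them as the reusable layer everyone reads from.
cleansed.write.format("delta").mode("overwrite").saveAsTable("base.crm_customers")
```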
Well done, this video really reveals your depth of knowledge and a practical approach to a solution.
Your argument comparing Data Lakes vs Data Vault misses the whole concept of Data Warehousing.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
Without Data Governance and Data Catalogs you create a Data Swamp
A Data Vault is platform-agnostic because it's a methodology.
I completely agree that any data platform without governance & cataloguing is doomed to failure - but to say that any other approach doesn't have those elements is incorrect. Also, that's one definition of a warehouse (Inmon), but we won't get into that :)
Simon
Agree - Simon missed the point of integration. Simon talks about building the data warehouse on source tables, but what about multiple source systems with the same information about customers (business key integration)? In his proposed approach that would land in the (conformed) dimension - a very complex, one-step process. Data vault in your architecture is about separating concerns, doing the difficult stuff in different layers. But there are some good points in the vid to think about. From a lean perspective you could say it's an extra layer you have to build, and that takes time (hence automation). Oh, and the example of the Python script is not a problem with DV but with the vendor's tools (how they built it), so don't blame data vault for that - you can build automation with Python on the fly. But again, if you build code in steps and put it through CI/CD, you can control the process more easily and debug the code more easily (which could be an argument for generating the object code).
Data vault: a solution to a problem we no longer have.
What is the replacement that you recommend?
@rowanjohnstone9524 Kimball. Every man and their dog knows it, it's time-proven, easy, well documented, scales well for 99% of organizations, and has native Databricks support with SCD2 in DLT.
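For reference, the SCD2-in-DLT support mentioned looks roughly like this, based on the Delta Live Tables Python API - exact function names and arguments may differ by runtime version, and the source/target names here are hypothetical:

```python
# Runs inside a DLT pipeline; "customers_cdc" is a change feed defined elsewhere.
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("dim_customer")

dlt.apply_changes(
    target="dim_customer",
    source="customers_cdc",
    keys=["customer_id"],
    sequence_by=col("change_ts"),
    stored_as_scd_type=2,   # DLT maintains the SCD2 validity columns for you
)
```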
Firstly, you seem to conflate lakehouse == Spark/Databricks; as I understand it, a lakehouse is a platform architecture, not a technology. Secondly, in my opinion you focus too much on the write-speed aspect of DV2 - it also offers flexibility and aligns well with building an Enterprise Data Model in an incremental and agile way. I agree that using DV2 in a data lake doesn't make sense, but as part of the DW capability of a lakehouse it can.
A 300 billion row dimension.... LOL. Someone forgot their outriggers and/or mini-dimensions to break off their more quickly changing attributes. Save yourself the headache of even engaging..... lol.
Good grief, data warehousing is so boring. I can't believe I'm watching this; I desperately need a change of career.
Change of career to what?
A hub is a non-changing PK????? Sorry... I will limit myself to one emoticon ☹! Forget about data vault - understand what a business key means first. The concept of a primary key doesn't even hold significance in a discussion of business keys. When you can't mention business keys in the context of a hub... you've explained well enough why you cannot appreciate data vault. Such half-baked ideas have held businesses hostage for a while now. Don't worry, a very easy way to implement data vault is coming very soon!