Behind the Hype - The Medallion Architecture Doesn't Work
- Published: 23 Oct 2023
- Since the dawn of the Data Lake, we've been arguing about what to call the different layers that we move data through. The Medallion Architecture has become supremely popular in recent years, and with Microsoft Fabric embracing the Lakehouse approach, it's never had more attention than it does currently!
But there's also a huge amount of confusion - what do the different layers mean? What processes do you do where? How do you translate the entirety of your data curation into just three steps... and does it always make sense?
In this video, Simon pulls apart the expected journey of data through a lakehouse architecture and maps the medallion architecture to the deeper, richer approach that has become the Advancing Analytics standard.
If you're setting out on a journey to build out a Medallion Architecture, whether in Databricks or Microsoft Fabric - get in touch with the experts from Advancing Analytics
Thank you!!! Finally, someone talks about the Medallion Architecture as a framework and not the rule that you need a bronze, silver, and gold layer in your data lakehouse. For years I received that basic explanation as the rule and "Best Practice", even from all of the so-called experts. After learning from experience with implementations, it's nice to hear affirmation - it's what you decide is appropriate to the context of your team and data product that determines the various landing zones of your data. You provide a good way to communicate that without just saying "It depends", which it does, and nobody likes that answer 😂
Finally somebody talking common sense. It's just the layer structure that has always existed. Whether you call it bronze/silver/gold, or stage/help/dim/fact etc., it's a potato-potato discussion. In other words, semantics.
I call it ETL: Extract (Bronze), Transform (Silver), Load (Gold)
@@Milhouse77BS What if you ELT? The point here is that BSG covers several transformation types and, in some cases, several copies of data.
BSG I think of "Battle Star Galactica" first :)
@@Milhouse77BS Why? ETL (and ELT) are processes.
ETL can take raw data from others systems and upload it into a Bronze layer.
ETL can take raw Bronze data, make cleaned Silver data out of it, then upload it to your Silver layer.
ETL can take cleaned Silver data, join it with other Silver datasets, to make a Gold dataset, then upload it to your Gold layer.
ETL can exist entirely outside of medallion architecture and take whatever data you want from a bunch of source systems and produce custom reports for a business unit, director, c-suite, etc.
Think of it like these layers are the nouns, and ETL (or ELT) is a verb that can turn parts of one layer into another. Or it can be used entirely outside of a medallion architecture.
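The nouns-and-verbs distinction above can be sketched in a toy example. This is plain Python standing in for a real engine like Spark, with invented table and column names; it only illustrates that "ETL" is the function moving data between medallion layers, not the layers themselves.

```python
# Toy illustration: medallion layers are the nouns (datasets),
# ETL functions are the verbs that turn one layer into the next.
# Rows are plain dicts; a real pipeline would use Spark/Delta.

bronze = [  # raw, as-landed data (duplicates, string types)
    {"order_id": "1", "amount": "10.50", "country": "uk"},
    {"order_id": "1", "amount": "10.50", "country": "uk"},  # duplicate
    {"order_id": "2", "amount": "7.25", "country": "US"},
]

def bronze_to_silver(rows):
    """ETL hop: deduplicate, fix types, conform values."""
    seen, silver = set(), []
    for r in rows:
        if r["order_id"] in seen:
            continue  # drop duplicate source rows
        seen.add(r["order_id"])
        silver.append({
            "order_id": int(r["order_id"]),
            "amount": float(r["amount"]),
            "country": r["country"].upper(),
        })
    return silver

def silver_to_gold(rows):
    """ETL hop: aggregate cleaned data for consumption."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

silver = bronze_to_silver(bronze)
gold = silver_to_gold(silver)
print(gold)  # {'UK': 10.5, 'US': 7.25}
```

The same two verbs could just as well run outside any medallion structure, which is the commenter's point.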
@@Milhouse77BS 😂😂😂, nice sense of humor
A key point omitted here is that the medallion architecture is also referred to as a 'multi-hop' architecture. I always talk about a multi-hop design approach with medallion as a starting point. In data warehousing we used to talk about 'staging'. This was never a single hop transformation to the dimensional model, but many smaller hops - in a similar way to medallion zones.
Simon! I am so glad you made this video. It's been a constant challenge at different clients explaining that medallion isn't strictly 3 layers. The Databricks blogs and documentation manage to be vague and prescriptive at the same time.
Thoughtful, articulate, and impactful - as usual! Thank you for this.
Brilliantly said Simon! Thanks for this, keen for more videos like this in the future.
Brilliant! Thanks for making a simple and clear video about the Medallion Architecture. 🙂
such a nice way of explaining this.
Always wondered why everyone nowadays is pushing for this "architecture", which can make sense in a lot of scenarios - but I think not as many as people tend to think.
Wise Words! I think "Make it your own" is the message. Great video.
This is the kind of content that deserves a love, insightful and celebrate button.
Awesome breakdown! I'm glad I understood the concept very early on and didn't end up stuck with that misconception.
The way the medallion architecture has been presented is oversimplified. It's okay to present it as-is as a start, BUT it being more of a logical layer should be emphasized. Business logic and needs are also a major consideration.
100% Agreed - I think the medallion "architecture" is really just an extremely poor analogy because it has caused confusion rather than simplifying a concept.
Bronze, Silver, Gold makes non-technical users think of Good, Great, Best - but data is not an Olympic event. Bronze, Silver, Gold is telling me what exactly? Is the Bronze data 3rd best? Is the Silver data only slightly less good than Gold?
Raw, Prep, Land, Stage, Star, Semantic, BI - words like this have semantic meanings that relate to the intended purpose and integrity of those layers. Thanks for illuminating the source of this abomination. 😁
yeah when I first heard a management consultant use those colors I wept a tear for the death of meaning and nuance I knew would follow
What a great video - again! You're a legend. Well spoken and explained.
In my projects I usually have Input/Raw, Enriched+Merged (merge small files into a bigger daily file), and any Gold layer is already in a use-case optimized data store (DWH, SQL DB, ElasticSearch, and so on). The concept of processing everything in object storage and then mirroring to data stores seems a bit surprising.
Thanks for this. Having worked on DW ETLs for decades, I wondered what was new about 'Medallion architecture'... answer: Not a lot. Same series of steps, with a different name.
The new bit - albeit not that new these days - is the reporting model sitting downstream of the DW / Analytical Model.
Thoroughly enjoyed that Simon, thanks!
Hi Simon. I found your video very insightful. I have a question, though. How do you model the Silver layer? Would you use a set of flat tables? If not Data Vault, what would you suggest for the conformed data?
Thank you! Appreciate the effort. It actually makes way more sense
Great great video. Every minute was to the point
Some 12 years ago, at the far forgotten times of Powercenter and Teradata, we had a Data Warehouse divided into three parts:
Staging area, Work area and DWH.
One for raw, flat files, other for curated data and the last one for aggregated data.
I have realized that, in this market, we recycle old concepts with new names, so we keep the environment fresh.
I have definitely built data solutions with this sort of thinking. The one other thing I'd include here, which wasn't covered, is heavy PII and especially PHI/HIPAA data, as you often have to have a governance structure limiting access to certain users and only certain environments/tools.
This is where you may have STG/DEV/TEST/PROD environments. It gets complex; the key is making sure it's clear where the data lives, however and wherever it sits, and that it can be found by users.
Thank you for sharing. The message I hear and champion is; governance 😊. Make sure it is clear what can be found where 😊
This is really helpful - thanks!
Excellent synthesis, thank you
Great overview. For something like Power BI, which prefers a star schema, would there be some overlap between the curated and semantic layer? Or for some items in the curated layer, would there be a duplication of the object into the semantic layer, with other objects extended by metrics added where needed?
Yes, for example your Date dimension in curated layer may be copied as is into the semantic layer. In Microsoft Fabric it would be a link or shortcut, so data not actually copied.
Ah yes shortcuts in Fabric! That’s a nice idea!
Excellent vid! Thank you.
Hi Advancing Analytics Team, At what layer do you start to use the delta format?
As soon as we can get the data into a coherent structure - so for us it's Delta from RAW onwards!
Regarding the names of the layers, I'm still on the fence between airlock code of the planet Druidia, or the album titles of Led Zeppelin
Joking aside, love this video, I've recommended it a bunch already!
While I agree in principle I think the value from using Bronze, Silver and Gold is that users understand generally what to expect of the data in each layer. I talk to users about Bronze data so often as a kind of warning - it's a mirror of source so don't expect it to be clean or conformed. So often we forget to define what is meant by various terms and a new person joining an organization from outside where they may have worked with these terms can immediately understand if we propagate their usage.
So yeah, it's nothing new, but using the terms consistently has value in itself.
Thanks for the video, as usual, your channel always has quality content - it's the Gold layer of analytics videos :-)
@simon does this also mean that we have 6 layers of the same data stored in the data lake (some raw, some cleaned, some aggregated)?
Yep, potentially. If you require it and there are business use cases for each materialised layer.
At which point are you doing any required de-duplication? Base?
So, when using UC in Databricks, do you create a catalog per layer or do you divide it using schemas? Always in doubt on what the best approach is, especially with DLT, where only one destination catalog is possible.
I generally see catalogs named dev, stg, prod with similar schemas and cloud storage folder structures in each. This makes it easier to embrace CICD processes since the catalog name can be set as a default in the workspace admin settings and the code can run unchanged as it moves across the environments. Sometimes people want to organize bronze tables in a special schema (iot_bronze) and the silver and gold objects in another (iot). This makes it easier to hide the bronze tables from analysts and dsml engineers if desired. There are also line of business strategies for catalogs. I think more can be written about the patterns and pros/cons to help architects setting these things up.
This usually comes down to a size thing - if you get to a certain size of business, having a single "prod" catalog can be too limiting trying to do all security at the schema level. We've seen dev/test/prod or dev_raw/test_raw/prod_raw where you then want to break raw down into further data product oriented schemas, rather than have a single catalog that encapsulates all layers for all data products. You could well extend that to the product level if you had large, org-wide security that needed to be secured at the workspace level using catalog-workspace bindings. Essentially - design it to the granularity of security you require, balanced with usability for your data consumers
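The catalog-per-environment pattern described in the replies above can be sketched as a tiny helper. All catalog/schema/table names here are hypothetical examples, not a prescribed convention.

```python
# Sketch of the catalog-per-environment pattern for Unity Catalog.
# The environment name becomes the catalog, so the same code can be
# promoted across dev/stg/prod by parameterising only `env`.

def qualified_name(env: str, schema: str, table: str) -> str:
    """Build a three-part name: catalog.schema.table."""
    return f"{env}.{schema}.{table}"

# Same notebook code, run unchanged across environments:
for env in ("dev", "stg", "prod"):
    print(qualified_name(env, "iot_bronze", "sensor_readings"))
# dev.iot_bronze.sensor_readings
# stg.iot_bronze.sensor_readings
# prod.iot_bronze.sensor_readings
```

In Databricks the `env` part can also come from the workspace's default catalog setting, which is what makes the CI/CD promotion story mentioned above work without code changes.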
Is the Semantic layer based off of the Curated or the Enriched layer?
In many cases, a file comes with perfect data. Would you still move it and store it in all layers, or skip a couple of those and have that file in, for example, Landing, Raw and Semantic?
If you have terabytes of data, it would not be efficient to move and store data through all layers. And if you decide to skip some layers, then you end up in a situation where you need to "join" two tables from different layers, because one of the tables does not exist in one of the layers. How do you deal with such an anomaly?
This aligns pretty well with how I have been designing lakes. I have had internal debates about whether structured data coming into the lake should be "unstructured", but I see your logic. My question relates to importing large quantities of data from an external OLTP source. Do you ETL only daily changes, or is Raw a full copy? (which doesn't work when receiving a few million records a day)
Main question: where do you first filter to stop inserting a few terabytes each day?
We try and ensure all loads from source systems are incremental, but this relies on us having the option - not all sources have change data capture / timestamps / change feeds etc. So we parameterise & config up the data extracts, so we can do incremental where possible, then date partition/full snapshot where we have no other choice!
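The parameterised, config-driven extract described in the reply above might look roughly like this. The source names, config keys, and watermark column are all invented for illustration.

```python
from datetime import date

# Sketch of config-driven extraction: incremental where the source
# supports change tracking, full date-partitioned snapshot otherwise.

SOURCES = {
    "crm_orders":  {"incremental": True,  "watermark_col": "updated_at"},
    "legacy_refs": {"incremental": False},  # no change feed available
}

def build_extract_query(source: str, last_watermark: str) -> str:
    cfg = SOURCES[source]
    if cfg["incremental"]:
        # Pull only rows changed since the last successful load.
        return (f"SELECT * FROM {source} "
                f"WHERE {cfg['watermark_col']} > '{last_watermark}'")
    # No choice: take a full snapshot, date-partitioned on landing.
    return f"SELECT * FROM {source}  -- snapshot {date.today().isoformat()}"

print(build_extract_query("crm_orders", "2023-10-01"))
print(build_extract_query("legacy_refs", "2023-10-01"))
```

The point is that the incremental/snapshot decision lives in config per source, not in per-source pipeline code.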
Simon, is there a set of file formats that you conform to in the raw layer? For example, if the data source is CSV or Excel, do you transform that into Delta in the pipeline from data source to the raw layer? Also, you store REST-related data as JSON in raw - do you transform it to Delta at a later stage?
Once you've picked it up from Landing, it's in a spark dataframe. When it's in a spark dataframe, land it as Delta, always (unless exporting data for a very specific integration)
@@AdvancingAnalytics so if I understand correctly you land the data in the data lake in its original file format and thereafter all data will be conformed to delta in the next layer?
@@patronspatron7681 Land the data in the landing zone in whatever format - JSON, Parquet, CSV. When writing your notebook, create a Spark DataFrame on the landing data and then save it as Delta in the raw or bronze zone.
This video makes me so so happy.
Amazing video - subbed
I have seen some designs where the gold layer is kept in a separate database rather than in Databricks. What can be the thinking behind this?
That's from the olden days when Spark wasn't great at joins & star-schema queries. These days I don't see a driver for adding another tech into the architecture so we keep everything lakehouse-based!
Thank you. You solved a million dollar question I had.
Brilliant rant - loved it
Neat and concrete! 🎉
Regarding the semantic layer: usually how do you implement it with lake house? It is purely PowerBi managed or you use Azure Analysis Services?
Power BI data model has pretty much replaced Azure Analysis Services. You shouldn't use AAS for anything new
@@nickmcdermaid3015 thanks for answer!
A very interesting perspective on consolidated analytics and ETL, although I may not completely agree with it. I must emphasize that the medallion architecture's contribution to data processing is invaluable, offering numerous benefits across the enterprise. It is simple and comprehensible, and, like any best practice, following it should be encouraged.
It offers nothing new. It's just layers. Layers is a concept that has been around for decades
I have a question: where data is really large, like truly billions of records, how many of these layers can be materialised vs how many should be logical?
Spot on buddy!
Simon, are all those hops really necessary? How do you manage your PII data with all those hops?
Nope - these are all of the hops a customer might go to, they are absolutely not ALL necessary to ALL organisations. Design the layers to your required processes.
PII - depends on your use case. If you don't need it in your lake, strip it out when first performing structural validation. If you need it but need to obfuscate it, then do that during data cleansing and lock down your RAW layer accordingly
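One common way to do the obfuscation-during-cleansing mentioned above is deterministic pseudonymisation: replace the raw value with a salted hash so the column still works as a join key, while the real value only exists in the locked-down RAW layer. This is a minimal sketch with an invented salt and field names; in practice the salt would come from a secret store.

```python
import hashlib

# Sketch: deterministic pseudonymisation of a PII column during
# cleansing. Same input always yields the same token, so downstream
# joins on the column still work without exposing the raw value.

SALT = b"example-only-salt"  # illustrative; use a secret store in reality

def pseudonymise(value: str) -> str:
    """Salted SHA-256, truncated to a 16-char hex token."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

row = {"customer_id": 42, "email": "jane@example.com"}
clean_row = {**row, "email": pseudonymise(row["email"])}
# clean_row carries a stable token instead of the real email;
# only the access-controlled RAW layer holds the original.
```

Note this is not anonymisation in the GDPR sense (the mapping is recoverable by anyone holding the salt), which is why the follow-up discussion about right-to-erasure still matters.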
@@AdvancingAnalytics Somebody already asked when you typically transition to Delta. However, there is also a third angle regarding PII specifically. One might need - and have a justification - for using the actual data, unobfuscated. However, there should still be a mechanism for "surgically" removing it when a legal basis for processing no longer exists (e.g. a person is exercising a right to erasure / right to be forgotten).
This becomes potentially problematic when persisting data across ALL the stages. How do you typically handle this? Transient landing, but persistent RAW - the latter in Delta to enable more targeted deletions / anonymization?
If you use Databricks, Unity Catalog offers capabilities to help manage PII data when you need to retain it for authorized people/purposes: Volumes let you control who can see raw data files containing PII before it's ingested into Delta; you can also create reusable functions in Unity Catalog to obfuscate and filter PII data from unauthorized people; and Delta and Deletion Vectors help you implement the GDPR "right to be forgotten" without introducing performance issues in your tables.
@@jimthorstad6226 Those are all good points and advertisement of features within Databricks, but Simon used the age-old argument of "retaining all data (in all stages), because storage is cheap in the cloud". However, cost is not the only factor when deciding what data to retain - particularly in its raw form.
Data engineers are not some special people who are privileged to handle PII, when the company as a whole does not have the right to process it (anymore).
@@vjraitila on top of that you need to remember information kept in previous versions of your delta table used for time travel.
The misconception here is that too many people think of Bronze, Silver, Gold as physical layers, when they should be seen as logical layers! And too many think about it a bit too literally.
I'm not sure even Databricks ever said you'll have only one step of refinement in your Silver layer or Gold layer.
The number of hops should be tied to business cases and not predefined in advance. All that can now be managed with tags on your tables or databases in most modern data warehouses.
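The tagging idea above - layers as metadata rather than physical zones - can be sketched as follows. In Databricks this would map to table tags (e.g. `ALTER TABLE ... SET TAGS`); here a plain dict stands in, and all table names are made up.

```python
# Toy sketch: logical layers expressed as tags on tables, so tooling
# can discover tables by layer instead of by folder path or naming
# convention. A real implementation would query catalog metadata.

TABLE_TAGS = {
    "sales.orders_raw":    {"layer": "bronze"},
    "sales.orders_clean":  {"layer": "silver"},
    "sales.orders_by_day": {"layer": "gold"},
}

def tables_in_layer(layer: str):
    """Return all tables tagged with the given logical layer."""
    return sorted(t for t, tags in TABLE_TAGS.items()
                  if tags.get("layer") == layer)

print(tables_in_layer("silver"))  # ['sales.orders_clean']
```

The upside of tags over physical zones is that a table can change layer (or carry extra classifications like `pii`) without moving any data.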
Initially this sounded a lot like the early-days Data Warehouse with landing/staging tables and main tables. With the Lakehouse we can create multiple tables/views from the same sources without much risk - they need not be bronze, silver, gold.
To me, even the gold layer could be the Kimball base star schema for ad hoc reporting. The Platinum layer can be fed by a semantic model. Then, the advanced Kimball concepts such as second-level consolidated fact tables or other materializations of drill-across results can feed dashboards... if indeed that is part of the requirements.
The reality is speed to market and skill sets of a team will short circuit this approach. That's when you have big trouble and an evolving mess ......
Oomph... love me a metric layer
Spot on
Anyone doing data mgmt for a while knows there are stages to any data management solution. It's a marketing tool. The key is defining all the sublayers and data management tasks that have to happen such as reference data management, ingestion, intermediate reshaping, etc.
The biggest issue I have with the medallion architecture? Most people that know metallurgy would more than slightly shudder if they were ever explained data engineering 😂😂
At the same time, one should not just dismiss someone else suggesting the architecture. It may just be the way they learned. My suggestion is to inquire about where folks think all the subtasks should be.
Preach it! Sober up everyone, this is real world.
So what do I call the delta between my Delta tables now 🙂
Alpha Delta Zeta. Pretend it's a fraternity.
It's just about logical concepts that require abstractions based on your business cases. Period!
Typically I just manually read the entire database and aggregate the results in my mind.
YES! It's not an architecture! I see so many old ideas rehashed with new names for marketing purposes. Layer approach, columnstores. These things have been around for decades
this is gold
*Curated ;)
I was asked about this medallion architecture in a job interview and I didn't know it beyond having heard the name. The interviewer explained it enthusiastically and I just commented that it's the same as any other process; he didn't seem to like my response. I didn't get picked for the role, and didn't want it either, as the interviewer seemed arrogant. It's not rocket science.
One request, can you please use a better mic? your content is so good but needs a better sound quality
There is so much confusion happening with the definition of what goes in the medallion layers, not least because people don't necessarily understand the different types of modelling you do, whether it's 3rd normal form or dimensional models. The aggregation aspect is confusing to folks; these days many data engineers don't really understand what a dimensional model is. In the old days a BI person was end to end, but now the raw-to-conformed work is done by data engineers - that's my recent experience - and they think you need to aggregate your facts in gold, when actually that's not true, because you would mostly always have detailed facts at the lowest grain possible, and that should sit in gold. Or is this just splitting hairs? At the end of the day, like you say, as long as we're all on the same page it doesn't matter, but as teams and companies get bigger I can see a lot of mismatches happening, just like I've seen so many data warehouses built incorrectly. The definition of what you're doing needs to be clearly understood and communicated. Your layers make more sense to me as a long-time end-to-end BI person because it's quite clear what they mean.
Agree. All this mention of "aggregates" is, to me, a waste of breath. Capture facts at the lowest grain and optionally add aggregates if you want. But you shouldn't add aggregates for performance; performance should be solved in the platform. IMHO the best case for aggregates is combining two facts of different grains to make analysis simpler.
Agree 100%
Show some examples. 16 min in and still talking hypothetical.