Behind the Hype - The Medallion Architecture Doesn't Work

  • Published: 23 Oct 2023
  • Since the dawn of the Data Lake, we've been arguing about what to call the different layers that we move data through. The Medallion Architecture has become supremely popular in recent years, and with Microsoft Fabric embracing the Lakehouse approach, it's never had more attention than it does currently!
    But there's also a huge amount of confusion - what do the different layers mean? What processes do you do where? How do you translate the entirety of your data curation into just three steps... and does it always make sense?
    In this video, Simon pulls apart the expected journey of data through a lakehouse architecture and maps the medallion architecture to the deeper, richer approach that has become the Advancing Analytics standard.
    If you're setting out on a journey to build out a Medallion Architecture, whether in Databricks or Microsoft Fabric - get in touch with the experts at Advancing Analytics

Comments • 92

  • @ParkerWoodson-ow3ol
    @ParkerWoodson-ow3ol 4 months ago +5

    Thank you!!! Finally, someone talks about the Medallion Architecture as a framework and not the rule that you need a bronze, silver, and gold layer in your data lakehouse. For years I received that basic explanation as the rule and "Best Practice", even from all of the so-called experts. After learning from experience with implementations, it's nice to hear affirmation - it's what you decide is appropriate to the context of your team and data product that determines the various landing zones of your data. You provide a good way to communicate that without just saying "It depends", which it does, and nobody likes that answer 😂

  • @MDevion
    @MDevion 9 months ago +37

    Finally somebody talking common sense. It's just the layer structure that has always existed. Whether you call it bronze/silver/gold or stage/help/dim/fact etc., it's a potato, potato discussion. In other words, semantics.

    • @Milhouse77BS
      @Milhouse77BS 9 months ago +3

      I call it ETL: Extract (Bronze), Transform (Silver), Load (Gold)

    • @FR-099
      @FR-099 9 months ago +2

      @@Milhouse77BS What if you ELT? The point here is that BSG covers several transformation types and, in some cases, several copies of the data.

    • @Milhouse77BS
      @Milhouse77BS 9 months ago

      @@FR-099 BSG makes me think of "Battle Star Galactica" first :)

    • @david0aloha
      @david0aloha 7 months ago

      @@Milhouse77BS Why? ETL (and ELT) are processes.
      ETL can take raw data from other systems and upload it into a Bronze layer.
      ETL can take raw Bronze data, make cleaned Silver data out of it, then upload it to your Silver layer.
      ETL can take cleaned Silver data, join it with other Silver datasets to make a Gold dataset, then upload it to your Gold layer.
      ETL can also exist entirely outside of a medallion architecture, taking whatever data you want from a bunch of source systems and producing custom reports for a business unit, director, C-suite, etc.
      Think of it like this: the layers are the nouns, and ETL (or ELT) is a verb that can turn parts of one layer into another. Or it can be used entirely outside of a medallion architecture.
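
The "layers as nouns, ETL as the verb" framing above can be sketched in plain Python; the table names, cleaning rules, and aggregate below are invented for illustration:

```python
# Illustrative sketch: the layers are just containers (nouns); the same
# generic ETL step (verb) produces each hop. All names/data are invented.

bronze = {}  # raw copies of source data
silver = {}  # cleaned, conformed data
gold = {}    # business-level, joined/aggregated data

def etl(extract, transform, load):
    """Generic ETL step: each layer hop is just another invocation."""
    load(transform(extract()))

# Hop 1: source -> Bronze (raw, untouched)
source_rows = [{"id": "1", "amount": " 10 "}, {"id": "2", "amount": "5"}]
etl(lambda: source_rows,
    lambda rows: rows,  # no transformation on the way in
    lambda rows: bronze.update({"orders": rows}))

# Hop 2: Bronze -> Silver (clean and type the data)
etl(lambda: bronze["orders"],
    lambda rows: [{"id": int(r["id"]), "amount": int(r["amount"].strip())}
                  for r in rows],
    lambda rows: silver.update({"orders": rows}))

# Hop 3: Silver -> Gold (aggregate for a business use case)
etl(lambda: silver["orders"],
    lambda rows: {"total_amount": sum(r["amount"] for r in rows)},
    lambda agg: gold.update({"order_summary": agg}))

print(gold["order_summary"])  # {'total_amount': 15}
```

Each hop is the same verb applied to different nouns, which is the commenter's point: ETL/ELT is the process, and the medallion layers are just where its inputs and outputs live.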

    • @VinayakKommana
      @VinayakKommana 2 months ago

      @@Milhouse77BS 😂😂😂, nice sense of humor

  • @matthewbradley2873
    @matthewbradley2873 8 months ago +5

    A key point omitted here is that the medallion architecture is also referred to as a 'multi-hop' architecture. I always talk about a multi-hop design approach with medallion as a starting point. In data warehousing we used to talk about 'staging'. This was never a single-hop transformation to the dimensional model, but many smaller hops - in a similar way to medallion zones.

  • @kaurivneet1
    @kaurivneet1 1 month ago

    Simon! I am so glad you made this video. It's been a constant challenge at different clients explaining that medallion isn't strictly 3 layers. The Databricks blogs and documentation get vague and prescriptive at the same time.

  • @Simondoubt4446
    @Simondoubt4446 7 months ago +1

    Thoughtful, articulate, and impactful - as usual! Thank you for this.

  • @Juan-PaulHynek
    @Juan-PaulHynek 9 months ago

    Brilliantly said Simon! Thanks for this, keen for more videos like this in the future.

  • @repsdorph01
    @repsdorph01 8 months ago

    Brilliant! Thanks for making a simple and clear video about the Medallion Architecture. 🙂

  • @panostanilian
    @panostanilian 8 months ago +1

    Such a nice way of explaining this.
    I always wondered why everyone nowadays is pushing for this "architecture", which can make sense in a lot of scenarios - but I think not as many as people tend to assume.

  • @geehaf
    @geehaf 9 months ago +1

    Wise Words! I think "Make it your own" is the message. Great video.

  • @LogicallySoundMM
    @LogicallySoundMM 4 months ago +1

    This is the kind of content that deserves a love, insightful and celebrate button.

  • @RaiOkami
    @RaiOkami 9 months ago

    Awesome breakdown! I'm glad I understood the concept very early on and didn't end up stuck with that misconception.
    The way the medallion architecture has been presented is oversimplified. It's okay to present it as-is as a start, BUT the fact that it is more of a logical layering should be emphasized. That, and business logic and needs are also a major consideration.

  • @calmhorizons
    @calmhorizons 9 months ago +12

    100% Agreed - I think the medallion "architecture" is really just an extremely poor analogy because it has caused confusion rather than simplifying a concept.
    Bronze, Silver, Gold makes non-technical users think of Good, Great, Best - but data is not an Olympic event. Bronze, Silver, Gold is telling me what exactly? Is the Bronze data 3rd best? Is the Silver data only slightly less good than Gold?
    Raw, Prep, Land, Stage, Star, Semantic, BI - words like this have semantic meanings that relate to the intended purpose and integrity of those layers. Thanks for illuminating the source of this abomination. 😁

    • @FR-099
      @FR-099 9 months ago +2

      yeah when I first heard a management consultant use those colors I wept a tear for the death of meaning and nuance I knew would follow

  • @julius8183
    @julius8183 2 months ago

    What a great video - again! You're a legend. Well spoken and explained.

  • @DollyBastard
    @DollyBastard 8 months ago +1

    In my projects I usually have Input/Raw, Enriched+Merged (merge small files into a bigger daily file), and any Gold layer is already in a use-case optimized data store (DWH, SQL DB, ElasticSearch, and so on). The concept of processing everything in object storage and then mirroring to data stores seems a bit surprising.

  • @pauloneufneufneuf
    @pauloneufneufneuf 6 months ago +1

    Thanks for this. Having worked on DW ETLs for decades, I wondered what was new about 'Medallion architecture'... answer: Not a lot. Same series of steps, with a different name.
    The new bit - albeit not that new these days - is the reporting model sitting downstream of the DW / Analytical Model.

  • @spillanemike
    @spillanemike 9 months ago

    Thoroughly enjoyed that Simon, thanks!

  • @JulioAcunaMoreno
    @JulioAcunaMoreno 4 months ago +1

    Hi Simon. I found your video very insightful. I have a question, though. How do you model the Silver layer? Would you use a set of flat tables? If not Data Vault, what would you suggest for the conformed data?

  • @liyashuklinejg
    @liyashuklinejg 1 month ago

    Thank you! Appreciate the effort. It actually makes way more sense

  • @xkhokokox
    @xkhokokox 5 months ago

    Great great video. Every minute was to the point

  • @andrewinkler9538
    @andrewinkler9538 26 days ago

    Some 12 years ago, in the long-forgotten times of PowerCenter and Teradata, we had a Data Warehouse divided into three parts:
    Staging area, Work area and DWH.
    One for raw, flat files, another for curated data, and the last one for aggregated data.
    I have realized that, in this market, we recycle old concepts with new names, so we keep the environment fresh.

  • @paul_devos
    @paul_devos 8 months ago

    I have definitely built "data solutions" with this sort of thinking. The one other thing I'd include here that wasn't covered is heavy PII, and especially PHI/HIPAA data, as you often have to have a governance structure limiting access to certain users and only certain environments/tools.
    This is where you may have STG/DEV/TEST/PROD environments. It gets complex; the key is making sure it's clear however and wherever the data is at, and that it can be found by users.

  • @IanSmith22101980
    @IanSmith22101980 9 months ago +1

    Thank you for sharing. The message I hear and champion is: governance 😊. Make sure it is clear what can be found where 😊

  • @mooxxmusic
    @mooxxmusic 9 months ago

    This is really helpful - thanks!

  • @davidroberts6774
    @davidroberts6774 9 months ago

    Excellent synthesis, thank you

  • @jonlunn4114
    @jonlunn4114 8 months ago +2

    Great overview. For something like Power BI, which prefers a star schema, would there be some overlap between the curated and semantic layers? Or would some items in the curated layer be duplicated into the semantic layer, with other objects extended by metrics added where needed?

    • @Milhouse77BS
      @Milhouse77BS 7 months ago

      Yes, for example your Date dimension in the curated layer may be copied as-is into the semantic layer. In Microsoft Fabric it would be a link or shortcut, so the data is not actually copied.

    • @jonlunn4114
      @jonlunn4114 7 months ago +1

      Ah yes shortcuts in Fabric! That’s a nice idea!

  • @JulesTuffrey
    @JulesTuffrey 9 months ago

    Excellent vid! Thank you.

  • @stephanjoubert8568
    @stephanjoubert8568 9 months ago +3

    Hi Advancing Analytics team, at what layer do you start to use the Delta format?

    • @AdvancingAnalytics
      @AdvancingAnalytics  9 months ago +5

      As soon as we can get the data into a coherent structure - so for us it's Delta from RAW onwards!

  • @andrekamman1971
    @andrekamman1971 8 months ago +1

    Regarding the names of the layers, I'm still on the fence between airlock code of the planet Druidia, or the album titles of Led Zeppelin
    Joking aside, love this video, I've recommended it a bunch already!

  • @gtoms
    @gtoms 8 months ago

    While I agree in principle, I think the value of using Bronze, Silver and Gold is that users generally understand what to expect of the data in each layer. I talk to users about Bronze data so often as a kind of warning - it's a mirror of source, so don't expect it to be clean or conformed. Too often we forget to define what is meant by various terms, and a new person joining the organization from outside, where they may have worked with these terms, can immediately understand if we propagate their usage.
    So yeah, it's nothing new, but using the terms consistently has value in itself.
    Thanks for the video; as usual, your channel always has quality content - it's the Gold layer of analytics videos :-)

  • @atulbansal8041
    @atulbansal8041 9 months ago +1

    @simon Does this also mean that we have 6 layers of the same data stored in the data lake (some raw, some cleaned, some aggregated)?

    • @AdvancingAnalytics
      @AdvancingAnalytics  9 months ago

      Yep, potentially. If you require it and there are business use cases for each materialised layer.

  • @user-rv5xe4nk9p
    @user-rv5xe4nk9p 8 months ago

    At which point are you doing any required de-duplication? Base?

  • @mladum
    @mladum 9 months ago +1

    So, when using UC in Databricks, do you create a catalog per layer or do you divide it using schemas? Always in doubt on what the best approach is, especially with DLT, where only one destination catalog is possible.

    • @jimthorstad
      @jimthorstad 9 months ago +2

      I generally see catalogs named dev, stg, prod with similar schemas and cloud storage folder structures in each. This makes it easier to embrace CICD processes since the catalog name can be set as a default in the workspace admin settings and the code can run unchanged as it moves across the environments. Sometimes people want to organize bronze tables in a special schema (iot_bronze) and the silver and gold objects in another (iot). This makes it easier to hide the bronze tables from analysts and dsml engineers if desired. There are also line of business strategies for catalogs. I think more can be written about the patterns and pros/cons to help architects setting these things up.

    • @AdvancingAnalytics
      @AdvancingAnalytics  9 months ago

      This usually comes down to a size thing - if you get to a certain size of business, having a single "prod" catalog can be too limiting trying to do all security at the schema level. We've seen dev/test/prod or dev_raw/test_raw/prod_raw where you then want to break raw down into further data product oriented schemas, rather than have a single catalog that encapsulates all layers for all data products. You could well extend that to the product level if you had large, org-wide security that needed to be secured at the workspace level using catalog-workspace bindings. Essentially - design it to the granularity of security you require, balanced with usability for your data consumers
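
The two granularities described in this reply can be sketched as simple name generators. All catalog, schema, and product names here are invented examples, not an official Unity Catalog convention:

```python
# Sketch of the two Unity Catalog layouts described above.
# All names are invented examples, not an official convention.

ENVS = ["dev", "test", "prod"]
LAYERS = ["raw", "base", "curated"]

def env_catalogs(envs, layers):
    """One catalog per environment; layers become schemas (smaller org)."""
    return [f"{env}.{layer}" for env in envs for layer in layers]

def env_layer_catalogs(envs, layers, products):
    """One catalog per environment+layer; schemas per data product (larger
    org, where schema-level security in a single catalog is too limiting)."""
    return [f"{env}_{layer}.{product}"
            for env in envs for layer in layers for product in products]

print(env_catalogs(["prod"], LAYERS))
# ['prod.raw', 'prod.base', 'prod.curated']
print(env_layer_catalogs(["prod"], ["raw"], ["sales", "iot"]))
# ['prod_raw.sales', 'prod_raw.iot']
```

The design choice, as the reply says, is to match the granularity of security you require against usability for consumers: the second layout gives catalog-level isolation (and catalog-workspace bindings) at the cost of many more catalogs.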

  • @DollyBastard
    @DollyBastard 8 months ago

    Is the Semantic layer based off of the Curated or the Enriched layer?

  • @rankena
    @rankena 8 months ago +1

    In many cases, a file comes with perfect data. Would you still move it and store it in all layers, or skip a couple of those and have that file in, for example, Landing, Raw and Semantic?
    If you have terabytes of data, it would not be efficient to move and store data through all layers. And if you decide to skip some layers, then you end up in a situation where you need to "join" two tables from different layers, because one of the tables does not exist in one of the layers. How do you deal with such an anomaly?

  • @Iangatley
    @Iangatley 7 months ago

    This aligns pretty well with how I have been designing lakes. I have had internal debates about whether structured data coming into the lake should be "unstructured", but I see your logic. My question relates to importing large quantities of data from an external OLTP source. Do you ETL only daily changes, or is Raw a full copy? (Which doesn't work when receiving a few million records a day.)
    Main question: where do you first filter to stop inserting a few terabytes each day?

    • @AdvancingAnalytics
      @AdvancingAnalytics  7 months ago +1

      We try and ensure all loads from source systems are incremental, but this relies on us having the option - not all sources have change data capture / timestamps / change feeds etc. So we parameterise & config up the data extracts, so we can do incremental where possible, then date partition/full snapshot where we have no other choice!
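
The parameterised, config-driven extract described above might look roughly like this. The source names, watermark column, and config shape are all invented, and a real implementation would read the last watermark from pipeline state rather than pass a literal:

```python
# Hypothetical config-driven extract: incremental where the source supports
# it, date-partitioned full snapshot where it doesn't. Names are invented.

from datetime import date

SOURCES = {
    "orders": {"incremental": True, "watermark_col": "updated_at"},
    "legacy": {"incremental": False},  # no CDC / timestamps / change feed
}

def build_extract_query(source, last_watermark=None, snapshot_date=None):
    cfg = SOURCES[source]
    if cfg["incremental"] and last_watermark is not None:
        # Incremental: only rows changed since the last successful load
        return (f"SELECT * FROM {source} "
                f"WHERE {cfg['watermark_col']} > '{last_watermark}'")
    # Fallback: full snapshot, written to a date partition
    day = snapshot_date or date.today().isoformat()
    return f"SELECT * FROM {source} /* full snapshot, partition dt={day} */"

print(build_extract_query("orders", last_watermark="2023-10-01"))
print(build_extract_query("legacy", snapshot_date="2023-10-23"))
```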

  • @patronspatron7681
    @patronspatron7681 9 months ago

    Simon, is there a set of file formats that you conform to in the raw layer? For example, if the data source is CSV or Excel, do you transform that into Delta in the pipeline from data source to raw layer? Also, you store REST-related data as JSON in raw - do you transform it to Delta at a later stage?

    • @AdvancingAnalytics
      @AdvancingAnalytics  8 months ago

      Once you've picked it up from Landing, it's in a Spark DataFrame. When it's in a Spark DataFrame, land it as Delta, always (unless exporting data for a very specific integration)

    • @patronspatron7681
      @patronspatron7681 8 months ago

      @@AdvancingAnalytics so if I understand correctly you land the data in the data lake in its original file format and thereafter all data will be conformed to delta in the next layer?

    • @ManCar1608
      @ManCar1608 2 months ago

      @@patronspatron7681 Land the data in the landing zone as whatever format - JSON, Parquet, CSV. When writing your notebook, create a Spark DataFrame on the landing data and then save it as Delta in the raw or bronze zone.
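
The pattern in this thread - land in the source's native format, converge on one format from Raw onwards - can be sketched with stdlib stand-ins (no Spark required). The payloads below are invented; in Databricks the two steps would be `spark.read.format(...)` followed by `df.write.format("delta")`:

```python
# Stdlib stand-in for the Spark pattern described above: read whatever
# format landed (CSV/JSON here), normalize to one representation, and
# write that to the bronze zone. Payloads and formats are invented.

import csv, io, json

def read_landing(payload: str, fmt: str) -> list[dict]:
    """Parse a landed file into rows, whatever format the source sent."""
    if fmt == "json":
        return json.loads(payload)
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(payload)))
    raise ValueError(f"unsupported landing format: {fmt}")

def write_bronze(rows: list[dict]) -> str:
    """Single target representation for bronze (Delta in real life)."""
    return json.dumps(rows, sort_keys=True)

csv_payload = "id,amount\n1,10\n2,5\n"
json_payload = '[{"id": "1", "amount": "10"}, {"id": "2", "amount": "5"}]'

# Both sources converge on the same bronze representation:
print(write_bronze(read_landing(csv_payload, "csv")) ==
      write_bronze(read_landing(json_payload, "json")))  # True
```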

  • @richardcroft3676
    @richardcroft3676 17 days ago

    This video makes me so so happy.

  • @carlmerritt4355
    @carlmerritt4355 2 months ago

    Amazing video - subbed

  • @manishkumar-hp1hw
    @manishkumar-hp1hw 7 months ago +1

    I have seen some designs where the gold layer is kept in a separate database rather than in Databricks. What can be the thinking behind this?

    • @AdvancingAnalytics
      @AdvancingAnalytics  7 months ago

      That's from the olden days when Spark wasn't great at joins & star-schema queries. These days I don't see a driver for adding another tech into the architecture so we keep everything lakehouse-based!

  • @sivakumarkanagaraj
    @sivakumarkanagaraj 3 months ago

    Thank you. You solved a million dollar question I had.

  • @DAVIDODONNELL-t9w
    @DAVIDODONNELL-t9w 21 days ago

    Brilliant rant - loved it

  • @ledinhanhtan
    @ledinhanhtan 1 month ago

    Neat and concrete! 🎉

  • @derkachm
    @derkachm 8 months ago

    Regarding the semantic layer: how do you usually implement it with a lakehouse? Is it purely Power BI-managed, or do you use Azure Analysis Services?

    • @nickmcdermaid3015
      @nickmcdermaid3015 8 months ago

      The Power BI data model has pretty much replaced Azure Analysis Services. You shouldn't use AAS for anything new

    • @derkachm
      @derkachm 8 months ago

      @@nickmcdermaid3015 thanks for the answer!

  • @DomGidziuSzymciu
    @DomGidziuSzymciu 8 months ago

    A very interesting perspective on consolidated analytics and ETL, although I may not completely agree with it. I must emphasize that the medallion architecture's contribution to data processing is invaluable, offering numerous benefits across the enterprise. It is simple, comprehensible, and, like any best practice, should be encouraged to be followed.

    • @nickmcdermaid3015
      @nickmcdermaid3015 8 months ago +1

      It offers nothing new. It's just layers. Layers is a concept that has been around for decades

  • @mimmakutu
    @mimmakutu 8 months ago

    I have a question: where data is really large, like truly billions of records, how many of these layers can be materialised vs how many should be logical?

  • @ra2i3
    @ra2i3 5 months ago

    Spot on buddy!

  • @jeanchindeko5477
    @jeanchindeko5477 9 months ago +1

    Simon, are all those hops really necessary? How do you manage your PII data with all those hops?

    • @AdvancingAnalytics
      @AdvancingAnalytics  9 months ago

      Nope - these are all of the hops a customer might go to, they are absolutely not ALL necessary to ALL organisations. Design the layers to your required processes.
      PII - depends on your use case. If you don't need it in your lake, strip it out when first performing structural validation. If you need it but need to obfuscate it, then do that during data cleansing and lock down your RAW layer accordingly
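
The "obfuscate during cleansing" option mentioned above can be sketched with deterministic hashing, so joins on the obfuscated column still work. The column names and salt below are invented, and a real system would keep the salt in a secret manager, not a source literal:

```python
# Hedged sketch of obfuscating PII columns during cleansing: deterministic
# hashing keeps values joinable while raw PII stays locked in RAW.
# Column names and the salt are invented examples.

import hashlib

PII_COLUMNS = {"email", "phone"}
SALT = "example-pepper"  # in practice: a managed secret, not a literal

def pseudonymize(row: dict) -> dict:
    out = {}
    for col, val in row.items():
        if col in PII_COLUMNS and val is not None:
            digest = hashlib.sha256((SALT + str(val)).encode()).hexdigest()
            out[col] = digest[:16]  # stable token: same input -> same token
        else:
            out[col] = val
    return out

row = {"id": 7, "email": "a@example.com", "amount": 10}
clean = pseudonymize(row)
print(clean["id"], clean["amount"])    # 7 10  (non-PII untouched)
print(clean["email"] != row["email"])  # True
```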

    • @vjraitila
      @vjraitila 9 months ago

      @@AdvancingAnalytics Somebody already asked when you typically transition to Delta. However, there is also a third angle regarding PII specifically. One might need - and have a justification - for using the actual data, unobfuscated. However, there should still be a mechanism for "surgically" removing it when a legal basis for processing no longer exists (e.g. a person is exercising a right to erasure / right to be forgotten).
      This becomes potentially problematic when persisting data across ALL the stages. How do you typically handle this? Transient landing, but persistent RAW - the latter in Delta to enable more targeted deletions / anonymization?

    • @jimthorstad6226
      @jimthorstad6226 9 months ago +2

      If you use Databricks, Unity Catalog offers capabilities to help manage PII data when you need to retain it for authorized people/purposes: Volumes let you control who can see raw data files containing PII before it's ingested into Delta, you can also create reusable functions in Unity Catalog to obfuscate and filter PII data from unauthorized people, and Delta and Deletion Vectors help you implement the GDPR "right to be forgotten" without introducing performance issues in your tables.

    • @vjraitila
      @vjraitila 9 months ago

      @@jimthorstad6226 Those are all good points and advertisement of features within Databricks, but Simon used the age-old argument of "retaining all data (in all stages), because storage is cheap in the cloud". However, cost is not the only factor when deciding what data to retain - particularly in its raw form.
      Data engineers are not some special people who are privileged to handle PII, when the company as a whole does not have the right to process it (anymore).

    • @jeanchindeko5477
      @jeanchindeko5477 9 months ago

      @@vjraitila on top of that you need to remember information kept in previous versions of your delta table used for time travel.
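
The "surgical" erasure discussed in this thread reduces, logically, to a targeted delete. In Delta that would be a `DELETE FROM ... WHERE` (accelerated by deletion vectors) followed by `VACUUM` to purge time-travel history; this stdlib sketch with invented ids shows only the row-level logic:

```python
# Stdlib sketch of a right-to-be-forgotten delete: remove every row for the
# subjects who exercised erasure. A plain list stands in for the Delta
# table; subject ids and tokens are invented.

def forget_subjects(table: list[dict], erased_ids: set) -> list[dict]:
    """Keep only rows whose subject has NOT requested erasure."""
    return [row for row in table if row["subject_id"] not in erased_ids]

table = [
    {"subject_id": 1, "email_token": "abc"},
    {"subject_id": 2, "email_token": "def"},
    {"subject_id": 1, "email_token": "ghi"},
]
print(forget_subjects(table, {1}))
# [{'subject_id': 2, 'email_token': 'def'}]
```

As the thread notes, the delete alone is not enough in Delta: prior table versions still hold the data until the retention window passes and `VACUUM` removes the old files.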

  • @jeanchindeko5477
    @jeanchindeko5477 9 months ago +1

    The misconception here is that too many people think of Bronze, Silver, Gold as physical layers, when they should be seen as logical layers! And too many think about it a bit too literally.
    I'm not sure Databricks ever mentioned you'll have only one step of refinement in your Silver layer or Gold layer.
    The number of hops should be tied to business cases and not predefined in advance. All that can now be managed with tags on your tables or databases in most modern data warehouses.

  • @Anbu_Sampath
    @Anbu_Sampath 8 months ago

    Initially this sounded a lot like the early days of Data Warehousing, with landing/staging tables and main tables. With a Lakehouse we can create multiple tables/views from the same sources without much risk - it need not be bronze, silver, gold.

  • @NeumsFor9
    @NeumsFor9 8 months ago

    To me, even the gold layer could be the Kimball base star schema for ad hoc reporting. The Platinum layer can be fed by a semantic model. Then the advanced Kimball concepts, such as second-level consolidated fact tables or other materializations of drill-across results, can feed dashboards... if indeed that is part of the requirements.
    The reality is that speed to market and the skill sets of a team will short-circuit this approach. That's when you have big trouble and an evolving mess...

  • @johnnywinter861
    @johnnywinter861 9 months ago +3

    Oomph... love me a metric layer

  • @fishsauce7497
    @fishsauce7497 4 months ago

    Spot on

  • @NeumsFor9
    @NeumsFor9 8 months ago

    Anyone doing data mgmt for a while knows there are stages to any data management solution. It's a marketing tool. The key is defining all the sublayers and data management tasks that have to happen, such as reference data management, ingestion, intermediate reshaping, etc.
    The biggest issue I have with the medallion architecture? Most people who know metallurgy would more than slightly shudder if data engineering were ever explained to them 😂😂
    At the same time, one should not just dismiss someone else suggesting the architecture. It may just be the way they learned. My suggestion is to inquire about where folks think all the subtasks should be.

  • @riedwaanbassadien
    @riedwaanbassadien 29 days ago

    Preach it! Sober up everyone, this is real world.

  • @allthingsdata
    @allthingsdata 8 months ago

    So what do I call the delta between my Delta tables now 🙂

    • @Marcus-sh8sq
      @Marcus-sh8sq 7 months ago

      Alpha Delta Zeta. Pretend it's a fraternity.

  • @zycbrasil2618
    @zycbrasil2618 9 months ago +2

    It's just about logical concepts that require abstractions based on your business cases. Period!

  • @MonsterManStan
    @MonsterManStan 5 months ago +2

    Typically I just manually read the entire database and aggregate the results in my mind.

  • @nickmcdermaid3015
    @nickmcdermaid3015 8 months ago

    YES! It's not an architecture! I see so many old ideas rehashed with new names for marketing purposes. Layer approach, columnstores. These things have been around for decades

  • @fpa89
    @fpa89 7 months ago

    this is gold

  • @samulimustonen2047
    @samulimustonen2047 2 months ago

    I was asked about the medallion architecture in a job interview and I didn't know it, except that I'd heard the name. The interviewer explained it enthusiastically and I just commented that it's the same as any other process; he didn't seem to like my response. I didn't get picked for the role - and didn't want it either, as the interviewer seemed arrogant. It's not rocket science.

  • @MelvinAbedini
    @MelvinAbedini 8 months ago

    One request: can you please use a better mic? Your content is so good, but it needs better sound quality.

  • @goldwhispers
    @goldwhispers 9 months ago +3

    There is so much confusion about what goes in the medallion layers, not least because people don't necessarily distinguish the different types of modelling you do, whether it's 3rd normal form or dimensional models. The aggregation aspect is confusing to folks. These days many data engineers don't really understand what a dimensional model is; in the old days a BI person was end to end, but now the raw-to-clean-conformed part is done by data engineers - that's my recent experience - and people think you need to aggregate your facts in gold, when actually that's not true, because you would mostly always have detailed facts at the lowest grain possible, and that should sit in gold. Or is this just splitting hairs? At the end of the day, like you say, as long as we're all on the same page it doesn't matter, but as teams and companies get bigger I can see a lot of mismatches happening, just like I've seen so many data warehouses built incorrectly. The definition of what you're doing needs to be clearly understood and communicated. Your layers make more sense to me as a long-time end-to-end BI person, because it's quite clear what they mean.

    • @nickmcdermaid3015
      @nickmcdermaid3015 8 months ago +1

      Agree. All this mention of "aggregates" to me is a waste of breath. Capture facts at lowest grain and optionally add aggregates if you want. But you shouldn't add aggregates for performance. Performance should be solved in the platform. IMHO the best case for aggregates is combining two facts of different grains to make analysis simpler.

    • @ManCar1608
      @ManCar1608 2 months ago

      Agree 100%

  • @mredmister3014
    @mredmister3014 1 month ago

    Show some examples. 16 min in and still talking hypotheticals.