Thanks very much! I'm glad you got a lot from it! This is the first in a number of videos (I think 9 planned out so far), so please like & subscribe to get notified when the next video is released next week!
Great Video! Can't wait to watch the srest of the series I have a couple of a quick questions based on reading some of the comments. 1. When you talk about storing business entities in the silver layer this makes total sense i.e using it to create you dimension tables (i.e a curated 'customer' table for example, reference data tables etc) - but what about 'Fact' type data that would sit in the heart of the gold layer star schemas - would this not really need to be part of the silver layer i.e youd be getting your raw 'event' data from bronze and then building out the star in gold, referencing you silver dims? 2. From some of the comments you've mentioned about lack of ability to group/separate tables in a lakehouse - as such I guess it makes sense to, in certain cases, have domain specific lakehouses in silver and gold - which would use shortcuts to share organisational curated data 'core' / ref data etc? Super video mate once again. Would really appreciate any insight on the above!
Thanks for your comment! Sorry for the delay in responding. 1. Yes - both your "dimension" and "fact" data will live in the Silver layer, but not necessarily in dedicated "dimensions" and "facts", if you see what I mean. Sorry - I probably could have been clearer in the video. Generally we don't tend think of the "Silver" layer in terms of "Facts and Dimensions" (i.e. a dimensional model). What I meant by "Business Entities" in the video was more along the lines of "data and entities your business receives and processes". This could be purely descriptive core entities like Customers and Products, but oftentimes your descriptive data arrives in the same flat file as your events and metrics, in a "denormalized" form. The point is that in Silver it's fine to keep these logical "dimensions" and "facts" in the same file/table, rather than doing any premature dimensional modelling. Ideally we want to minimize the amount of modelling we do, because modelling is often opinionated and the goal of the Silver layer is to provide a standardized foundation upon which arbitrary Gold projections can be built (one of which is oftentimes a "Star Schema" version of the Silver tables). Different Gold consumers may have different ways they want to structure the Silver data. There's no fixed recipe for the tables you create in the Silver layer, but mostly they'll be 1-1 with the dataset types you're receiving in bronze. Try to avoid generating "net new" tables for the reasons I've mentioned above. 2. Yes, absolutely. Not least because you'll likely have different access requirements per domain, which is easier to configure when you have separate Lakehouses. But also watch out for the upcoming SCHEMA support for Lakehouses: learn.microsoft.com/en-us/fabric/release-plan/data-engineering#schema-support. This might give you another option here.
Hi, very good and simple explanation! Thank you very much! One question here: How and why did you decide to create a lakehouse for each layer? I saw another video where only one Lakehouse was created and the different layers were created within Lakehouse folders. Do you have any experience with other instantiations of the medallion architecture onfabric?
Hi, thanks for the comment! The reality is that there's no one-size-fits-all approach to architecting solutions in Fabric. When it comes to your specific question about how the Medallion architecture maps to Lakehouses in Fabric: we default to one Lakehouse per layer. That's mainly because of two things - organizational flexibility and security flexibility. If we start combining the layers into a single Lakehouse, we lose flexibility on the organization of the managed Delta tables that we create. That's because within a single Lakehouse there's no way to group related tables other than by using a table naming convention - i.e. right now there's no equivalent of a T-SQL "SCHEMA" in a Lakehouse, nor is there a notion of sub-folders. Generally in the Silver and Gold layers we're creating Tables as outputs, rather than Files, so this flexibility is useful in order to separate Tables from one layer to the next layer. If you're purely dealing with "Files" then there's less of an obvious benefit of creating separate Lakehouses. But I would question why Silver/Gold datasets are being stored in the Files section rather than the Tables section - the Tables section is written in Delta format and heavily optimized for reporting purposes. From a security perspective, there's a clear separation of data when using separate Lakehouses, which can mean that different security provisions can be put in each layer. You might not want people to have access to `raw` (i.e. `Bronze` data), but you might want them to be able to access `Silver` data. Having everything in a single Lakehouse would make this a little tricky. In reality, though, "it depends". If you have a really simple use-case, one Lakehouse could be sufficient. I've also seen people suggest combining "Bronze and Silver" into a single Lakehouse and "Gold" into a separate Lakehouse, since in "Bronze" you're usually only dealing with "Files", and therefore you may as well utilize the "Tables" section for your "Silver" layer. You just need to determine what works best for your use-case, factoring in the above points alongside other concerns such as data residency and cost management/chargeback. And its likely that even internally, your Lakehouse architecture will differ from one project to the next. Remember in Fabric we have the power of "Shortcuts", which allows us to seamlessly combine data from other Lakehouses. So you can't go far wrong whatever architecture you choose!
This is a great walkthrough of Medallion architecture! I'm looking to implement a structure where the Gold Layer stores Dimension and Fact - type information but one thing I noticed is that Delta Tables in fabric cannot hold anything like a primary key which is integer and auto-incrementing as you would in on-prem SQL. Am I right in thinking that this mustn't be the best practice when loading Dimensions from Bronze data for such tables? I'm trying to visualise the best way to set up Primary and Foreign Keys based on the data.
Thanks for the kind words! Ah - you've hit the age old problem when building a Lakehouse - generating integer surrogate keys. Fortunately, Databricks has somewhat solved this problem by adding IDENTITY columns to their version of Delta Lake (see www.databricks.com/blog/2022/08/08/identity-columns-to-generate-surrogate-keys-are-now-available-in-a-lakehouse-near-you.html). Unfortunately for us, they haven't open-sourced this feature, despite branding Delta Lake 2.0 as the "fully open-source" version of Delta Lake. Microsoft Fabric (and Azure Synapse) relies on the "open-source" version of Delta Lake, hence can't benefit from this feature (unless they were to build it themselves, which I doubt is going to happen). So, what can you do? Well, the first thing I would say is.. do you actually need your Primary Key to be an integer? I know this goes against traditional Data Warehousing principles, but a Lakehouse is not a traditional Data Warehouse. And Power BI (/Analysis Services) isn't either. Delta Lake and Vertipaq both utilize columnar storage, where the main consideration on data size (and therefore performance) is the number of unique values (cardinality) within a column - not the data type. And people have tested this out in Power BI - see Max's blog here: www.maxwikstrom.se/performance/power-bi-data-types-in-relationships-does-it-matter/ - with the outcome being "it doesn't really matter". This aligns with behaviour I've seen as well. So, if Power BI is your primary downstream consumer, you might be able to get away with just using your table's unique key as the primary key - even if that's a string. If you do need to generate an integer surrogate key (maybe your dimensions are really large), then there's no elegant way of doing this right now in Fabric. There are a number of ways you can do this in Spark/T-SQL - see ruclips.net/video/aF2hRH5WZAU/видео.html&ab_channel=Databricks for some open-source Spark options. There are equivalents in T-SQL. But you should be aware of the issues around concurrent-writing & idempotency for these methods (i.e. they're not safe without additional checks). And given that multi-table transactions also aren't supported in Spark, you need to be careful when creating a star schema that depends on surrogate keys you've generated. In the past we've implemented a custom locking/transaction mechanism that makes sure that all the tables in a star schema have been successfully written, or they've all been rolled back if failed. It's certainly a pain. Although Multi-table transactions are supported in Fabric Data Warehouse, so if you're using that then you're good on that front. My recommendation would be to consider not bothering with integer PKs if at all possible. Test it out for your scenario and see whether it's feasible. It saves a whole world of pain maintaining them.
@@endjin Thankyou! That answer is comprehensive and really hits all the questions I would have had as follow ons. Regarding the size of the dimension tables, I don't think they're significant enough to warrant integer type and string may suffice. Thanks again!
Thanks a lot. Currently we have multiple folders for our tables in our gen 2 data lake. It helps multiple teams find groups of tables. It looks like this isn’t possible in the lakehouse..you can only have subfolders in the file section?
You're correct - you can only create subfolders in the Files section. Maybe consider creating separate Lakehouses to satisfy your "grouping" requirements? And then you can always use Shortcuts to point to the core table so that you're not duplicating any data.
What are the pros and cons of using a lakehouse for each piece of the medallion architecture vs using a single lakehouse and a file folder for each piece?
Hi, thanks for the comment! This question has been asked by other commentors too... so we'll just reshare our answer... The reality is that there's no one-size-fits-all approach to architecting solutions in Fabric. When it comes to your specific question about how the Medallion architecture maps to Lakehouses in Fabric: we default to one Lakehouse per layer. That's mainly because of two things - organizational flexibility and security flexibility. If we start combining the layers into a single Lakehouse, we lose flexibility on the organization of the managed Delta tables that we create. That's because within a single Lakehouse there's no way to group related tables other than by using a table naming convention - i.e. there's no equivalent of a T-SQL "SCHEMA" in a Lakehouse, nor is there a notion of sub-folders. Generally in the Silver and Gold layers we're creating Tables as outputs, rather than Files, so this flexibility is useful in order to separate Tables from one layer to the next layer. If you're purely dealing with "Files" then there's less of an obvious benefit of creating separate Lakehouses. But I would question why Silver/Gold datasets are being stored in the Files section rather than the Tables section - the Tables section is written in Delta format and heavily optimized for reporting purposes. From a security perspective, there's a clear separation of data when using separate Lakehouses, which can mean that different security provisions can be put in each layer. You might not want people to have access to `raw` (i.e. `Bronze` data), but you might want them to be able to access `Silver` data. Having everything in a single Lakehouse would make this a little tricky. In reality, though, "it depends". If you have a really simple use-case, one Lakehouse could be sufficient. I've also seen people suggest combining "Bronze and Silver" into a single Lakehouse and "Gold" into a separate Lakehouse, since in "Bronze" you're usually only dealing with "Files", and therefore you may as well utilize the "Tables" section for your "Silver" layer. You just need to determine what works best for your use-case, factoring in the above points alongside other concerns such as data residency and cost management/chargeback. And its likely that even internally, your Lakehouse architecture will differ from one project to the next. Remember in Fabric we have the power of "Shortcuts", which allows us to seamlessly combine data from other Lakehouses. So you can't go far wrong whatever architecture you choose!
@@rkneti That's also an option, though generally we find having all layers in a single workspace easier to manage all things considered. For example, when considering CICD (i.e. promoting from Dev -> Test -> Prod), having all artifacts as one "unit" contained in a single workspace is much simpler to manage when releasing new functionality. The "workspace per layer" approach is also less appealing now that Folders are in Preview in Fabric. This brings an additional level or organisation to your artifacts.
Thank you for watching, if you enjoyed this episode, please hit like 👍subscribe, and turn notifications on 🔔it helps us more than you know. 🙏
Very insightful. Thanks!
Best explanation so far! Many thanks!
Thanks very much! I'm glad you got a lot from it! This is the first in a number of videos (I think 9 planned out so far), so please like & subscribe to get notified when the next video is released next week!
@@endjin Already done :)
@@hannesw.8297 W00t!
That was a great vid. Subscribed!
tks! leant Medallion architecture from you !
Excellent! Glad you found it useful!
Great Video! Can't wait to watch the srest of the series I have a couple of a quick questions based on reading some of the comments.
1. When you talk about storing business entities in the silver layer this makes total sense i.e using it to create you dimension tables (i.e a curated 'customer' table for example, reference data tables etc) - but what about 'Fact' type data that would sit in the heart of the gold layer star schemas - would this not really need to be part of the silver layer i.e youd be getting your raw 'event' data from bronze and then building out the star in gold, referencing you silver dims?
2. From some of the comments you've mentioned about lack of ability to group/separate tables in a lakehouse - as such I guess it makes sense to, in certain cases, have domain specific lakehouses in silver and gold - which would use shortcuts to share organisational curated data 'core' / ref data etc?
Super video mate once again. Would really appreciate any insight on the above!
Thanks for your comment! Sorry for the delay in responding.
1. Yes - both your "dimension" and "fact" data will live in the Silver layer, but not necessarily in dedicated "dimensions" and "facts", if you see what I mean. Sorry - I probably could have been clearer in the video.
Generally we don't tend think of the "Silver" layer in terms of "Facts and Dimensions" (i.e. a dimensional model). What I meant by "Business Entities" in the video was more along the lines of "data and entities your business receives and processes". This could be purely descriptive core entities like Customers and Products, but oftentimes your descriptive data arrives in the same flat file as your events and metrics, in a "denormalized" form.
The point is that in Silver it's fine to keep these logical "dimensions" and "facts" in the same file/table, rather than doing any premature dimensional modelling. Ideally we want to minimize the amount of modelling we do, because modelling is often opinionated and the goal of the Silver layer is to provide a standardized foundation upon which arbitrary Gold projections can be built (one of which is oftentimes a "Star Schema" version of the Silver tables). Different Gold consumers may have different ways they want to structure the Silver data.
There's no fixed recipe for the tables you create in the Silver layer, but mostly they'll be 1-1 with the dataset types you're receiving in bronze. Try to avoid generating "net new" tables for the reasons I've mentioned above.
2. Yes, absolutely. Not least because you'll likely have different access requirements per domain, which is easier to configure when you have separate Lakehouses. But also watch out for the upcoming SCHEMA support for Lakehouses: learn.microsoft.com/en-us/fabric/release-plan/data-engineering#schema-support. This might give you another option here.
Part 8 - Good Notebook Development Practices - is now available: ruclips.net/video/UyS6ZUgh-Wc/видео.html
Hi, very good and simple explanation! Thank you very much! One question here: How and why did you decide to create a lakehouse for each layer? I saw another video where only one Lakehouse was created and the different layers were created within Lakehouse folders. Do you have any experience with other instantiations of the medallion architecture onfabric?
Hi, thanks for the comment!
The reality is that there's no one-size-fits-all approach to architecting solutions in Fabric. When it comes to your specific question about how the Medallion architecture maps to Lakehouses in Fabric: we default to one Lakehouse per layer. That's mainly because of two things - organizational flexibility and security flexibility.
If we start combining the layers into a single Lakehouse, we lose flexibility on the organization of the managed Delta tables that we create. That's because within a single Lakehouse there's no way to group related tables other than by using a table naming convention - i.e. right now there's no equivalent of a T-SQL "SCHEMA" in a Lakehouse, nor is there a notion of sub-folders.
Generally in the Silver and Gold layers we're creating Tables as outputs, rather than Files, so this flexibility is useful in order to separate Tables from one layer to the next layer. If you're purely dealing with "Files" then there's less of an obvious benefit of creating separate Lakehouses. But I would question why Silver/Gold datasets are being stored in the Files section rather than the Tables section - the Tables section is written in Delta format and heavily optimized for reporting purposes.
From a security perspective, there's a clear separation of data when using separate Lakehouses, which can mean that different security provisions can be put in each layer. You might not want people to have access to `raw` (i.e. `Bronze` data), but you might want them to be able to access `Silver` data. Having everything in a single Lakehouse would make this a little tricky.
In reality, though, "it depends". If you have a really simple use-case, one Lakehouse could be sufficient. I've also seen people suggest combining "Bronze and Silver" into a single Lakehouse and "Gold" into a separate Lakehouse, since in "Bronze" you're usually only dealing with "Files", and therefore you may as well utilize the "Tables" section for your "Silver" layer. You just need to determine what works best for your use-case, factoring in the above points alongside other concerns such as data residency and cost management/chargeback. And its likely that even internally, your Lakehouse architecture will differ from one project to the next.
Remember in Fabric we have the power of "Shortcuts", which allows us to seamlessly combine data from other Lakehouses. So you can't go far wrong whatever architecture you choose!
Thanks for a great explanation. Any chance you can share your Visio on your blog?
I think Ed is planning a post with a link to resources soon. He's just finishing the 4th video.
This is a great walkthrough of Medallion architecture! I'm looking to implement a structure where the Gold Layer stores Dimension and Fact - type information but one thing I noticed is that Delta Tables in fabric cannot hold anything like a primary key which is integer and auto-incrementing as you would in on-prem SQL. Am I right in thinking that this mustn't be the best practice when loading Dimensions from Bronze data for such tables? I'm trying to visualise the best way to set up Primary and Foreign Keys based on the data.
Thanks for the kind words! Ah - you've hit the age old problem when building a Lakehouse - generating integer surrogate keys.
Fortunately, Databricks has somewhat solved this problem by adding IDENTITY columns to their version of Delta Lake (see www.databricks.com/blog/2022/08/08/identity-columns-to-generate-surrogate-keys-are-now-available-in-a-lakehouse-near-you.html).
Unfortunately for us, they haven't open-sourced this feature, despite branding Delta Lake 2.0 as the "fully open-source" version of Delta Lake. Microsoft Fabric (and Azure Synapse) relies on the "open-source" version of Delta Lake, hence can't benefit from this feature (unless they were to build it themselves, which I doubt is going to happen). So, what can you do?
Well, the first thing I would say is.. do you actually need your Primary Key to be an integer? I know this goes against traditional Data Warehousing principles, but a Lakehouse is not a traditional Data Warehouse. And Power BI (/Analysis Services) isn't either. Delta Lake and Vertipaq both utilize columnar storage, where the main consideration on data size (and therefore performance) is the number of unique values (cardinality) within a column - not the data type. And people have tested this out in Power BI - see Max's blog here: www.maxwikstrom.se/performance/power-bi-data-types-in-relationships-does-it-matter/ - with the outcome being "it doesn't really matter". This aligns with behaviour I've seen as well. So, if Power BI is your primary downstream consumer, you might be able to get away with just using your table's unique key as the primary key - even if that's a string.
If you do need to generate an integer surrogate key (maybe your dimensions are really large), then there's no elegant way of doing this right now in Fabric. There are a number of ways you can do this in Spark/T-SQL - see ruclips.net/video/aF2hRH5WZAU/видео.html&ab_channel=Databricks for some open-source Spark options. There are equivalents in T-SQL. But you should be aware of the issues around concurrent-writing & idempotency for these methods (i.e. they're not safe without additional checks). And given that multi-table transactions also aren't supported in Spark, you need to be careful when creating a star schema that depends on surrogate keys you've generated. In the past we've implemented a custom locking/transaction mechanism that makes sure that all the tables in a star schema have been successfully written, or they've all been rolled back if failed. It's certainly a pain. Although Multi-table transactions are supported in Fabric Data Warehouse, so if you're using that then you're good on that front.
My recommendation would be to consider not bothering with integer PKs if at all possible. Test it out for your scenario and see whether it's feasible. It saves a whole world of pain maintaining them.
@@endjin Thankyou! That answer is comprehensive and really hits all the questions I would have had as follow ons. Regarding the size of the dimension tables, I don't think they're significant enough to warrant integer type and string may suffice.
Thanks again!
Thanks a lot. Currently we have multiple folders for our tables in our gen 2 data lake. It helps multiple teams find groups of tables. It looks like this isn’t possible in the lakehouse..you can only have subfolders in the file section?
You're correct - you can only create subfolders in the Files section. Maybe consider creating separate Lakehouses to satisfy your "grouping" requirements? And then you can always use Shortcuts to point to the core table so that you're not duplicating any data.
@@endjin ah ok yes, that makes sense. Thanks
What are the pros and cons of using a lakehouse for each piece of the medallion architecture vs using a single lakehouse and a file folder for each piece?
Hi, thanks for the comment! This question has been asked by other commentors too... so we'll just reshare our answer...
The reality is that there's no one-size-fits-all approach to architecting solutions in Fabric. When it comes to your specific question about how the Medallion architecture maps to Lakehouses in Fabric: we default to one Lakehouse per layer. That's mainly because of two things - organizational flexibility and security flexibility.
If we start combining the layers into a single Lakehouse, we lose flexibility on the organization of the managed Delta tables that we create. That's because within a single Lakehouse there's no way to group related tables other than by using a table naming convention - i.e. there's no equivalent of a T-SQL "SCHEMA" in a Lakehouse, nor is there a notion of sub-folders.
Generally in the Silver and Gold layers we're creating Tables as outputs, rather than Files, so this flexibility is useful in order to separate Tables from one layer to the next layer. If you're purely dealing with "Files" then there's less of an obvious benefit of creating separate Lakehouses. But I would question why Silver/Gold datasets are being stored in the Files section rather than the Tables section - the Tables section is written in Delta format and heavily optimized for reporting purposes.
From a security perspective, there's a clear separation of data when using separate Lakehouses, which can mean that different security provisions can be put in each layer. You might not want people to have access to `raw` (i.e. `Bronze` data), but you might want them to be able to access `Silver` data. Having everything in a single Lakehouse would make this a little tricky.
In reality, though, "it depends". If you have a really simple use-case, one Lakehouse could be sufficient. I've also seen people suggest combining "Bronze and Silver" into a single Lakehouse and "Gold" into a separate Lakehouse, since in "Bronze" you're usually only dealing with "Files", and therefore you may as well utilize the "Tables" section for your "Silver" layer. You just need to determine what works best for your use-case, factoring in the above points alongside other concerns such as data residency and cost management/chargeback. And its likely that even internally, your Lakehouse architecture will differ from one project to the next.
Remember in Fabric we have the power of "Shortcuts", which allows us to seamlessly combine data from other Lakehouses. So you can't go far wrong whatever architecture you choose!
@@endjin How about configuring different workspaces for each layer and securing them at the workspace level?
@@rkneti That's also an option, though generally we find having all layers in a single workspace easier to manage all things considered. For example, when considering CICD (i.e. promoting from Dev -> Test -> Prod), having all artifacts as one "unit" contained in a single workspace is much simpler to manage when releasing new functionality.
The "workspace per layer" approach is also less appealing now that Folders are in Preview in Fabric. This brings an additional level or organisation to your artifacts.
Thanks
You're most welcome. Check out some of the other videos on the channel. There are some really good talk about gathering requirements and testing.
Good Databricks video... I'm sorry, what?
Also good Snowflake, Azure Synapse, etc demo...