Microsoft Fabric: How to Build a Lakehouse with Medallion Architecture
- Published: Oct 8, 2024
- Microsoft Fabric is a new cloud computing solution that enables you to build a lakehouse, a unified data platform that combines the best of data lakes and data warehouses.
But how do you design and implement a lakehouse that meets your data analytics needs?
One of the popular approaches is the medallion architecture, where you have different layers of data quality and refinement: bronze, silver, gold, and diamond.
I have been working with Fabric and the medallion architecture for months and will share my insights and tips on how to create a lakehouse with medallion architecture using Fabric’s services and features.
We will also answer your questions and comments live, so don’t miss this chance to learn more about Microsoft Fabric and the medallion architecture.
This topic is for everyone
Great video. I was hoping you would continue the demo by pushing data between the Bronze, Silver, and Gold folders, and maybe show how to set access to one or more folders by security group. You might have addressed these points in other videos, so I will keep looking. Thanks.
Great feedback. Let me work on that.
Interesting demo. I was interested to learn how the data sources were being referenced and used between the Bronze, Silver, and Gold layers.
I have been using the medallion architecture but with actual tables as opposed to folders. I never considered doing it with folders. I suppose it makes sense if the data isn't in a format that can be loaded directly into a table. You still have to find a way to update the tables other than the right-click "load to" option. I just used notebooks to carry out all of the processing. I like the idea of the diamond layer representing the data model piece.
Thanks! I have also been thinking about separation in Gold - modeled/ Diamond- semantic/ ????? - flattened report or api. Gotta come up with a name for that.
Great session, Christopher. 👍🏻 Would it be possible to enable the "Live Chat" functionality in the final recording? Reid (Havens Consulting) does that, and it's valuable to be able to follow the discussion and questions from your community for those of us who were not able to attend the livestream. 😊 Thanks, keep it up.
Hello @cfosund - that should be on... I can see it on this video. I know that I missed the setting on previous livestreams, but I think that this has been sorted out.
Awesome, it's available now. Thanks!
Great content, Chris. I saw this live and then came back now and rewatched it as I start to use Fabric more.
Question, though, regarding medallion architecture. It was set up in this demo with clear labels within your OneLake. However, once you moved into the SQL Warehouse / Endpoint, the medallion naming convention wasn't used any longer. I know there was a time crunch to fit everything into this demo, but I wanted your thoughts on persisting the medallion structure on the SQL Warehouse side. Let me know if it was there and I missed it.
Good call out. In the Lakehouse or DW I would separate them out by schemas. Schemas were not working at the time of the recording. They work great now. Do that. :)
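One way to picture the schema-per-layer suggestion above is the small sketch below. It is only an illustration: the `qualified_name` helper and the `customers` table are hypothetical, and the layer list is an assumption based on the names used in this thread.

```python
# Medallion layers used as schema names (an assumption for this sketch).
LAYERS = ("bronze", "silver", "gold")


def qualified_name(layer: str, table: str) -> str:
    """Return a schema-qualified table name such as 'silver.customers'."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"{layer}.{table}"


# One CREATE SCHEMA statement per layer, to be run in the warehouse or a notebook.
create_statements = [f"CREATE SCHEMA {layer};" for layer in LAYERS]

print(create_statements)
print(qualified_name("silver", "customers"))
```

Keeping the layer in the schema name means the same table name can exist at each stage without collisions.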
Great video. Thank you. In a real-world scenario, the Bronze folder structure will follow a Year/Month/Day format.
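A Year/Month/Day landing path like the one described above could be built as in this minimal sketch. The `bronze_path` helper, the `Files/bronze` root, and the `sales` source name are hypothetical and not taken from the video.

```python
from datetime import date
from pathlib import PurePosixPath


def bronze_path(source: str, load_date: date, root: str = "Files/bronze") -> str:
    """Return a Year/Month/Day landing folder for one source and load date."""
    # Zero-pad month and day so folders sort chronologically as plain strings.
    return str(
        PurePosixPath(root)
        / source
        / f"{load_date.year:04d}"
        / f"{load_date.month:02d}"
        / f"{load_date.day:02d}"
    )


print(bronze_path("sales", date(2024, 10, 8)))  # Files/bronze/sales/2024/10/08
```

Partitioning by load date keeps each day's raw files separate, which makes it easier to reprocess a single day.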
Thanks for the video Chris!
I'm wondering, should each medallion layer have its own prod, dev, and quality stages? Or just two of them?
Great question. Keep your environment as simple as you need. If you need to have these additional layers, then do so. Bigger companies 100% have at least this many layers. Smaller companies may be fine without them.
Great video! One question. Is there a way to use the medallion architecture in situations where data needs to be near-real time? I know Direct Lake eliminates latency between the Delta tables and the semantic model, but everything upstream of that (dataflows, pipelines, etc.) requires scheduled refreshes. It seems like breaking those out into layers would only increase latency. Or am I missing something?
Great question. For real-time data, I set up the stream to write to Gold, then script it back to Bronze and Silver on a time-based cadence.
Thanks Chris, this is all great info for the growing citizen dev :)
I am glad you liked it!
Great video! Could you explain more about Diamond layer? Thanks in advance.
The Diamond layer is the semantic layer, usually an SSAS / AAS / Power BI dataset layer, where measures are created, RLS is applied, and users have access to consistent joins and data.
A dollar each time you say 'it depends', very GIAC-like 😊
Bingo. Charity should be a priority to everyone. :) Happy to promote a kinder world.
Hey, thank you for the video, it was really informative.
Quick Q on the 'load to table' within lakehouse. I inserted new data (row) into the .csv and wanted it to show in the table automatically which didn't work. How would I get this to work automatically without having to re-create the table each time? Is that even possible? Thank you!
Great question. The CSV becomes either a base for a table that is rewritten each time, or you build processes on top of that that load it into the Delta file in the lakehouse table. You can build processes that update the Delta file that was created by the CSV, but by just updating the CSV you have to reload (or choose your management technique).
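The "build processes on top" option above can be sketched as an incremental load that appends only rows not yet in the table. This is a plain-Python illustration, not Fabric's API: a list of dicts stands in for the Delta table, and the `append_new_rows` helper, the `id` key column, and the sample data are hypothetical. In Fabric the same idea would typically live in a notebook or pipeline.

```python
import csv
import io


def append_new_rows(csv_text: str, table: list[dict], key: str) -> int:
    """Append CSV rows whose key is not yet in the table; return the count added."""
    seen = {row[key] for row in table}
    added = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[key] not in seen:
            table.append(row)
            seen.add(row[key])
            added += 1
    return added


table: list[dict] = []
first_load = "id,name\n1,Ann\n2,Bob\n"
second_load = "id,name\n1,Ann\n2,Bob\n3,Cy\n"  # same file with one new row

added_first = append_new_rows(first_load, table, key="id")
added_second = append_new_rows(second_load, table, key="id")
print(added_first, added_second, len(table))
```

Re-running the load against the updated CSV adds only the new row, so the table does not have to be recreated each time.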
I'm curious how you can use dataflow as mentioned to land from an on prem file server to fabric? Won't you need data factory for that?
You do not
For now:
You have to do a Gen1 dataflow to the service
Then a Gen2 from the Gen1 dataflow.
Sorry I missed the live version - was on w/ Microsoft Support regarding a Dataflow Gen2 error!
DOH! I hope that gets sorted out.
Medallion Architecture - yet another term dropped on me after Fabric has already smothered me with a dizzying array of confusing terms already. When will it all end? When can I have peace like Dax? Honestly, it's getting to be too damn much. I'm not sure any of this will help the little guy in the small organization. My frustration is real.
I am SO sorry... this is something that has been around for 6 or 7 years (?) and was originated by Databricks (I believe). The ONLY reason I bring it up and use it is to hopefully connect with other documentation you may see in other technology.
Only use the tools that you need. Keep it simple. If you run into a problem, then think about switching.
@@ChrisWagnerDatagod, the problem is not the medallion architecture, it is how Fabric works with it! I have a bunch of questions about Fabric!
Hey Chris! A question about how fixed the boundaries between the layers are. Meaning, are the definitions of what the layers contain hard and fast?
@@alt-enter237 Good question. While not hard and fast, it's a very rare exception that did not cause problems nearly right away.
Is it Cake or is it Cake 2?🍰