Awesome, thanks buddy! I was a victim of the never-ending loop, and after setting up the account admin role I was able to enable Unity Catalog.
Good explanation!! I was missing the Global AD access step, which many other YouTube videos don't explain. Thank you!
Thank you for the clear and practical videos! So far they've been helping me build foundational knowledge in Databricks industry standards and how things are/should be done. Looking forward to watching more of your content!
Finally a good, solid video that explains it well. Thanks! I would love to see a follow-up where you actually land some data in Bronze and transform it to Silver in development. What does the data look like in the containers? How does the Catalog tab in Databricks show the data? How are they related? I want to know these things but can barely find anyone explaining them well. I basically want to build an enterprise lakehouse from scratch. Thanks!
Hi Simon, your videos are always on point and something tangible that I have always found very useful. Just one thing: since you have already assigned the Storage Blob Data Contributor RBAC role on the lake to the access connector, you do not need ACLs at the container level. Had you not granted the RBAC role on the lake, you would have needed the ACLs.
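(For reference, a minimal sketch of what that RBAC assignment can look like via the Azure Python SDK - the subscription, resource group, storage account and the connector's managed identity object ID are all placeholders; the GUID is the built-in role definition ID for Storage Blob Data Contributor.)

```python
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

subscription_id = "<subscription-id>"                      # placeholder
lake_scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/<rg>"
    "/providers/Microsoft.Storage/storageAccounts/<lake-account>"
)
# Built-in role definition ID for "Storage Blob Data Contributor"
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization/"
    "roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope=lake_scope,                                      # role scoped to the whole lake
    role_assignment_name=str(uuid.uuid4()),
    parameters={
        "role_definition_id": role_definition_id,
        "principal_id": "<access-connector-managed-identity-object-id>",
        "principal_type": "ServicePrincipal",              # managed identities use this type
    },
)
```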
Incredible video Simon, thanks for making it simple and clear!
Databricks should really come up with a way to enable Unity Catalog by default for a workspace without having to chase down Global Admins.
Yeah, this is a big mistake on their part, tbh. One of the huge advantages of Azure Databricks was that a team could just start building - no procurement process, no need for global admins. Making a change like this can take months in a large enterprise.
Databricks SA here; this feature is coming in due course. It is being designed as we speak!
Hi Simon, why do we need the ACLs as well as the RBAC for the access connector on the metastore ADLS?
Thx!! This was just what I needed. And not the first time you read my mind 😂.
Keep up the good work. Very much appreciated
Thanks Simon, engaging and crystal clear explanation as always!
I would like to ask you a question: how would you design workspaces/catalogs to avoid replicating the data 1:1 across DEV-STG-PRD environments, while still being able to handle the two following scenarios:
1 - A new requirement driven by the data consumer. For example, you need to develop a new gold table.
2 - A new requirement driven by the data producer. For example, you need to add a column to the bronze and silver tables.
For scenario 1, it would be nice to work in the DEV workspace but actually read prd_silver_table, in order to be sure the gold logic works properly (if not on 100% of the data, maybe just the data from the last month/week, with PII anonymized).
For scenario 2, since the column is new, it is of course not present in PRD, so it is necessary to import new data from the producers into dev_bronze_table and then dev_silver_table in the DEV workspace. In this case, if you want to run a regression test on a gold table, and once again would like to run it on a subset of PRD data, how would you approach it?
Thanks anyway for all the material!! :)
In most scenarios we come across, this is absolutely not allowed by InfoSec - whilst it would be great to cross-reference prod data for development & testing, we're rarely allowed to open up that access.
If you wanted to, you wouldn't enforce the workspace-catalog bindings; instead you would rely on table permissions - your developers can read from the prod schema, but they are denied write access. If only an elevated account can write to prod, you can then model your scenarios.
Pretty rare though, environment separation is much more common!
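For anyone wondering what that permission model looks like in practice, here's a minimal sketch from a Databricks notebook - the prod catalog, silver schema and the group/principal names are placeholders, and it assumes the notebook's built-in spark session:

```python
# Developers get discovery + read access on the prod catalog/schema...
spark.sql("GRANT USE CATALOG ON CATALOG prod TO `developers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA prod.silver TO `developers`")
spark.sql("GRANT SELECT ON SCHEMA prod.silver TO `developers`")

# ...but MODIFY is never granted to them, so writes from a dev workspace fail.
# Only the elevated deployment principal can write to prod:
spark.sql("GRANT MODIFY ON SCHEMA prod.silver TO `deploy-service-principal`")
```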
@@AdvancingAnalytics thanks for the quick feedback! I understand, and I agree this is the common scenario. I am trying to find a solution that is a win-win: avoid duplicating data, but at the same time have a secure infrastructure. It could be read-only access as you said, maybe with some masking of PII. I'll keep investigating this nonetheless and see where the future of the data world goes :)
Great video. Thank you! I believe the ACLs are not needed when you've assigned Storage Blob Data Contributor. I've been able to set it up successfully without it.
I thought so too, but then got errors during metastore creation! Might just be UI checks but it certainly thought it needed the ACLs!
@@AdvancingAnalytics Same here. ACLs are not required.
Great walkthrough, thanks!
I very much like the features that come with Unity Catalog. But at the same time I find it extremely challenging to implement in a big organization in its current form, due to the 1:1 relation to the AAD tenant. We have one AAD tenant used by multiple business groups that run multiple products. They are from different industries and have little to do with each other. I am an architect on one of those products. We have multiple envs with multiple lakes and DB workspaces. Sounds like a good use case for us, right? Well, not so fast.
There are organizational questions that are difficult to answer:
1) Who will be managing the "account"? Our AAD global admins know nothing about Databricks and don't want to manage this stuff (give permissions, create catalogs etc.). So it has to be delegated - but to whom? It could be me, but that means I would be able to control access to other business groups' catalogs. Will they agree to that? It also means I'll be dealing with their requests all the time. So some "company-wide Databricks admin" has to be nominated to manage all this stuff. Getting that done is not easy.
2) Who will be hosting and managing the metastore storage account and access connector? Since it's for the entire org, it falls into some "common infra / landing zone" bucket, usually managed by a central infra team. So you need to onboard them.
3) What about automation? I'd like to have an SPN that can, for instance, create catalogs, and use it for my CI/CD. But for now, there are no granular permissions at the metastore level - either you are an admin or you are not. Having an "admin" SPN that can create and control access to all catalogs in the metastore (which may belong to multiple business groups) - not only is that close to impossible, it's also stupid.
All these problems come down to one question - why does this have to be tied to the AAD tenant? Or why can't we have multiple metastores per region - each product/product group having its own? Then everyone would take care of their own stuff and everyone would be happy!
Mehhn...he's so excited!
thanks Simon.
Decision question: DB solution architects have recommended we use managed tables in UC. They mention a lot of benefits built into UC for query optimizations, AI-driven optimizations, etc. But the idea of having external tables live in the Azure subscriptions of the various data-producing domain teams seems like the best option. How would you decide which option to use? Or can you do some combination of both?
I don't recommend managed tables. If a managed table is dropped, the underlying data goes with it, so you have to undrop the table to avoid losing it. Migrating to a new workspace or to UC forces you to copy all the data. This is not the case with external tables.
So - you absolutely, definitively want your tables stored in external storage rather than the UC metastore location, as it's fairly likely you'll have dev/test/prod environments etc, and having everything in a single lake goes against every infosec rule!
However, you can override the managed table location when creating the catalog or schema, by adding the MANAGED LOCATION clause upon creation. That means you can have managed tables, but in your choice of lake location.
The decision then lies with whether you want UC to be the primary owner of the data. If you drop a managed table, you drop the underlying data. Historically, we've avoided this like the plague, as there's very little benefit, but as UC adds a ton more "we'll optimize it for you" features, I imagine there will be a strong push to switch over to managed tables in future
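(A minimal sketch of that override from a Databricks notebook - the catalog/schema names and abfss paths are placeholders, and it assumes an external location / storage credential already covers those paths:)

```python
# Point the catalog's and schema's managed storage at your own lake instead of
# the metastore root. Managed tables created underneath then land there.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS dev
    MANAGED LOCATION 'abfss://catalogs@devlake.dfs.core.windows.net/dev'
""")
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS dev.silver
    MANAGED LOCATION 'abfss://catalogs@devlake.dfs.core.windows.net/dev/silver'
""")

# This managed table's files are stored under the schema's managed location,
# but dropping it still deletes the underlying data - that trade-off remains.
spark.sql("CREATE TABLE IF NOT EXISTS dev.silver.customers (id BIGINT, name STRING)")
```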
@@AdvancingAnalytics liquid clustering provides optimization for external tables as well. The only part missing would be vacuuming. But that’s simple enough. I have a dedicated job for that purpose that will scan all databases and tables once a day and vacuum them
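(Roughly what such a job can look like - a hypothetical sketch, assuming a Databricks notebook with the built-in spark session, sufficient privileges, and the usual column names returned by the SHOW commands; the catalog exclusions are placeholders:)

```python
from pyspark.sql.utils import AnalysisException

# Walk every catalog/schema/table the job can see and VACUUM the Delta tables.
for cat in [r.catalog for r in spark.sql("SHOW CATALOGS").collect()]:
    if cat in ("system", "samples"):                      # skip built-in catalogs
        continue
    for sch in [r.databaseName for r in spark.sql(f"SHOW SCHEMAS IN {cat}").collect()]:
        for t in spark.sql(f"SHOW TABLES IN {cat}.{sch}").collect():
            full_name = f"{cat}.{sch}.{t.tableName}"
            try:
                spark.sql(f"VACUUM {full_name}")          # default retention period
            except AnalysisException as err:
                print(f"Skipped {full_name}: {err}")      # views, non-Delta tables, etc.
```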
Don't you think that having DEV, QA, and PROD data all in the same data lake could create performance issues? Usually DEV, QA, and PROD data have different lifecycles and data magnitudes, and the workspaces could be bound to different VNets, as could the ADLS accounts targeted as data lakes.
So I would be more comfortable having a separate ADLS data lake for every env.
If we used external data lakes, would we still have all the features of Unity Catalog available?
In the case of having 2 storage accounts, 1 for dev and 1 for prod, should I create 1 metastore in each storage account and assign the dev workspace to the dev metastore and the prod workspace to the prod metastore?
Hi Simon,
I am having trouble sending data to Domo using the pydomo lib. As I am using a UC external location as the source path, os.listdir and the other OS functions are not able to read the files at the abfss path. Is there any solution to this?
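(For anyone hitting the same thing - a rough sketch of the usual workaround, assuming a Databricks notebook where dbutils and the spark session are available; the paths, file name and format are placeholders:)

```python
import os

path = "abfss://bronze@mylake.dfs.core.windows.net/raw/"

# abfss:// is not a path on the driver's local filesystem, so the os module can't see it:
# os.listdir(path)                                  # -> FileNotFoundError / invalid path

# These go through the cloud storage APIs instead:
files = dbutils.fs.ls(path)                         # list files via Databricks utilities
df = spark.read.format("json").load(path)           # or read with Spark, then hand the data to pydomo

# If a library genuinely needs a local file, copy it down to the driver first:
dbutils.fs.cp(path + "export.json", "file:/tmp/export.json")
local_files = os.listdir("/tmp")                    # now visible to os
```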
I'm getting an "Internal Server Error" with this.
why on earth can you not add a group as account admin?
Who knows? Madness, I agree!
Informative, but for anyone watching: I don't recommend ever enabling UC through the UI. Use proper IaC tools like TF, or whatever.
Absolutely, we TF most of our client deployments - but if there are people out there who haven't come across any of these setup steps, it's important to know what's actually happening before you automate it!