Awesome, thanks buddy! I was a victim of the never-ending loop, and after setting up the account admin role I was able to enable Unity Catalog.
Good explanation!! I was missing the Global AD access step, which many other YouTube videos don't explain. Thank you!
Thank you for the clear and practical videos! So far they've been helping me build foundational knowledge in Databricks industry standards and how things are/should be done. Looking forward to watching more of your content!
Finally a good, solid video that explains it well. Thanks! I would love to see a follow-up where you actually land some data in Bronze and transform it to Silver in development. What does the data look like in the containers? How does the Catalog tab in Databricks show the data? How are they related? I want to know these things but can barely find anyone explaining them well. I basically want to build an enterprise lakehouse from scratch. Thanks!
Hi Simon, your videos are always on point and something tangible that I have always found very useful. Just one thing: since you have already assigned the Storage Blob Data Contributor RBAC role on the lake to the access connector, you do not need ACLs at the container level. Had you not granted the RBAC role on the lake, you would have needed the ACLs.
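(For reference, a minimal sketch of what that RBAC assignment can look like via the Azure Python SDK - the subscription, resource group, storage account and the connector's managed identity object ID are all placeholders; the GUID is the built-in role definition ID for Storage Blob Data Contributor.)

```python
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

subscription_id = "<subscription-id>"                      # placeholder
lake_scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/<rg>"
    "/providers/Microsoft.Storage/storageAccounts/<lake-account>"
)
# Built-in role definition ID for "Storage Blob Data Contributor"
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization/"
    "roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope=lake_scope,                                      # role scoped to the whole lake
    role_assignment_name=str(uuid.uuid4()),
    parameters={
        "role_definition_id": role_definition_id,
        "principal_id": "<access-connector-managed-identity-object-id>",
        "principal_type": "ServicePrincipal",              # managed identities use this type
    },
)
```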
Incredible video Simon, thanks for making it simple and clear!
Databricks should really come up with a way to enable Unity Catalog by default for a workspace without having to chase down Global Admins.
Yeah, this is a big mistake on their part, tbh. One of the huge advantages of Azure Databricks was that a team could just start building - no procurement process, no need for global admins. Making a change like this can take months in a large enterprise.
Databricks SA here; this feature is coming in due course. It is being designed as we speak!
Hi Simon, why do we need the ACLs as well as the RBAC for the access connector on the metastore ADLS?
Thx!! This was just what I needed. And not the first time you read my mind 😂.
Keep up the good work. Very much appreciated
Thanks Simon, engaging and crystal clear explanation as always!
I would like to ask you a question: how would you design workspaces/catalogs to avoid replicating the data 1:1 across DEV-STG-PRD environments, while still being able to handle the two following scenarios:
1 - A new requirement driven by the data consumer. For example, you need to develop a new gold table.
2 - A new requirement driven by the data producer. For example, you need to add a column to the bronze and silver tables.
For scenario 1, it would be nice to work in the DEV workspace but actually read prd_silver_table, in order to be sure the gold logic works properly (if not on 100% of the data, maybe just the data from the last month/week, with PII anonymized).
For scenario 2, since the column is new, it is of course not present in PRD, so it is necessary to import new data from the producers into dev_bronze_table and then dev_silver_table in the DEV workspace. In this case, if you want to run a regression test on a gold table, and once again would like to run it on a subset of PRD data, how would you approach it?
Thanks anyway for all the material!! :)
In most scenarios we come across, this is absolutely not allowed by InfoSec - whilst it would be great to cross-reference prod data for development & testing, we're rarely allowed to open up that access.
If you wanted to, you wouldn't enforce the workspace-catalog bindings; instead you would rely on table permissions - your developers can read from the prod schema, but they are denied write access. If only an elevated account can write to prod, you can then model your scenarios.
Pretty rare though, environment separation is much more common!
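For anyone wondering what that permission model looks like in practice, here's a minimal sketch from a Databricks notebook - the prod catalog, silver schema and the group/principal names are placeholders, and it assumes the notebook's built-in spark session:

```python
# Developers get discovery + read access on the prod catalog/schema...
spark.sql("GRANT USE CATALOG ON CATALOG prod TO `developers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA prod.silver TO `developers`")
spark.sql("GRANT SELECT ON SCHEMA prod.silver TO `developers`")

# ...but MODIFY is never granted to them, so writes from a dev workspace fail.
# Only the elevated deployment principal can write to prod:
spark.sql("GRANT MODIFY ON SCHEMA prod.silver TO `deploy-service-principal`")
```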
@@AdvancingAnalytics thanks for the quick feedback! I understand, and I agree this is the common scenario. I am trying to find a solution that is a win-win: avoid duplicating data, but at the same time have a secure infrastructure. It could be read-only access as you said, maybe with some masking of PII. I'll keep investigating this nonetheless and see where the future of the data world goes :)
Great video. Thank you! I believe the ACLs are not needed when you've assigned Storage Blob Data Contributor. I've been able to set it up successfully without it.
I thought so too, but then got errors during metastore creation! Might just be UI checks but it certainly thought it needed the ACLs!
@@AdvancingAnalytics Same here. ACLs are not required.
Great walkthrough, thanks!
I very much like the features that come with Unity Catalog. But at the same time I find it extremely challenging to implement in a big organization in its current form, due to the 1:1 relation to the AAD tenant. We have one AAD tenant used by multiple business groups that run multiple products. They are from different industries and have little to do with each other. I am an architect on one of those products. We have multiple envs with multiple lakes and DB workspaces. Sounds like a good use case for us, right? Well, not so fast.
There are organizational questions that are difficult to answer:
1) Who will be managing the "account"? Our AAD global admins know nothing about Databricks and don't want to manage this stuff (give permissions, create catalogs etc.). So it has to be delegated - but to whom? It could be me, but that means I would be able to control access to other business groups' catalogs. Will they agree to that? It also means I'll be dealing with their requests all the time. So some "company-wide Databricks admin" has to be nominated to manage all this stuff. Getting that done is not easy.
2) Who will be hosting and managing the metastore storage account and access connector? Since it's for the entire org, it falls into some "common infra / landing zone" bucket, usually managed by a central infra team. So you need to onboard them.
3) What about automation? I'd like to have an SPN that can, for instance, create catalogs, and use it for my CI/CD. But for now, there are no granular permissions at the metastore level - either you are an admin or you are not. Having an "admin" SPN that can create and control access to all catalogs in the metastore (which may belong to multiple business groups) - not only is that close to impossible, it's also stupid.
All these problems come down to one question - why does this have to be tied to the AAD tenant? Or why can't we have multiple metastores per region - each product/product group having its own? Then everyone would take care of their own stuff and everyone would be happy!
Mehhn...he's so excited!
thanks Simon.
Decision question: DB solution architects have recommended we use managed tables in UC. They mention a lot of benefits built into UC for query optimizations, AI-driven optimizations, etc. But the idea of having external tables live in the Azure subscriptions of the various data-producing domain teams seems like the best option. How would you decide which option to use? Or can you do some combination of both?
I don't recommend managed tables. If a managed table is dropped, the underlying data goes with it, so you have to undrop the table to avoid losing it. Migrating to a new workspace or to UC forces you to copy all the data. This is not the case with external tables.
So - you absolutely, definitively want your tables stored in external storage rather than the UC metastore location, as it's fairly likely you'll have dev/test/prod environments etc, and having everything in a single lake goes against every infosec rule!
However, you can override the managed table location when creating the catalog or schema, by adding the MANAGED LOCATION clause upon creation. That means you can have managed tables, but in your choice of lake location.
The decision then lies with whether you want UC to be the primary owner of the data. If you drop a managed table, you drop the underlying data. Historically, we've avoided this like the plague, as there's very little benefit, but as UC adds a ton more "we'll optimize it for you" features, I imagine there will be a strong push to switch over to managed tables in future
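(A minimal sketch of that override from a Databricks notebook - the catalog/schema names and abfss paths are placeholders, and it assumes an external location / storage credential already covers those paths:)

```python
# Point the catalog's and schema's managed storage at your own lake instead of
# the metastore root. Managed tables created underneath then land there.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS dev
    MANAGED LOCATION 'abfss://catalogs@devlake.dfs.core.windows.net/dev'
""")
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS dev.silver
    MANAGED LOCATION 'abfss://catalogs@devlake.dfs.core.windows.net/dev/silver'
""")

# This managed table's files are stored under the schema's managed location,
# but dropping it still deletes the underlying data - that trade-off remains.
spark.sql("CREATE TABLE IF NOT EXISTS dev.silver.customers (id BIGINT, name STRING)")
```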
@@AdvancingAnalytics liquid clustering provides optimization for external tables as well. The only part missing would be vacuuming. But that’s simple enough. I have a dedicated job for that purpose that will scan all databases and tables once a day and vacuum them
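(Roughly what such a job can look like - a hypothetical sketch, assuming a Databricks notebook with the built-in spark session, sufficient privileges, and the usual column names returned by the SHOW commands; the catalog exclusions are placeholders:)

```python
from pyspark.sql.utils import AnalysisException

# Walk every catalog/schema/table the job can see and VACUUM the Delta tables.
for cat in [r.catalog for r in spark.sql("SHOW CATALOGS").collect()]:
    if cat in ("system", "samples"):                      # skip built-in catalogs
        continue
    for sch in [r.databaseName for r in spark.sql(f"SHOW SCHEMAS IN {cat}").collect()]:
        for t in spark.sql(f"SHOW TABLES IN {cat}.{sch}").collect():
            full_name = f"{cat}.{sch}.{t.tableName}"
            try:
                spark.sql(f"VACUUM {full_name}")          # default retention period
            except AnalysisException as err:
                print(f"Skipped {full_name}: {err}")      # views, non-Delta tables, etc.
```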
Don't you think that having DEV, QA, and PROD data all in the same data lake could create performance issues? Usually DEV, QA, and PROD data have different lifecycles and data magnitudes, and the workspaces could be bound to different VNets, as could the ADLS accounts targeted as data lakes.
So I would be more comfortable having a separate ADLS data lake for every env.
If we used external data lakes, would we still have all the features of Unity Catalog available?
In the case of having 2 storage accounts, 1 for dev and 1 for prod, should I create 1 metastore in each storage account and assign the dev workspace to the dev metastore and the prod workspace to the prod metastore?
Hi Simon,
I am having trouble sending data to Domo using the pydomo lib. As I am using a UC external location as the source path, os.listdir and the other OS functions are not able to read the files at the abfss path. Is there any solution to this?
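(For anyone hitting the same thing - a rough sketch of the usual workaround, assuming a Databricks notebook where dbutils and the spark session are available; the paths, file name and format are placeholders:)

```python
import os

path = "abfss://bronze@mylake.dfs.core.windows.net/raw/"

# abfss:// is not a path on the driver's local filesystem, so the os module can't see it:
# os.listdir(path)                                  # -> FileNotFoundError / invalid path

# These go through the cloud storage APIs instead:
files = dbutils.fs.ls(path)                         # list files via Databricks utilities
df = spark.read.format("json").load(path)           # or read with Spark, then hand the data to pydomo

# If a library genuinely needs a local file, copy it down to the driver first:
dbutils.fs.cp(path + "export.json", "file:/tmp/export.json")
local_files = os.listdir("/tmp")                    # now visible to os
```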
I'm getting an "Internal Server Error" with this.
why on earth can you not add a group as account admin?
Who knows? Madness, I agree!
Informative, but for anyone watching: I don't recommend ever enabling UC through the UI. Use proper IaC tools like TF, or whatever.
Absolutely, we TF most of our client deployments - but if there are people out there who haven't come across any of these setup steps, it's important to know what's actually happening before you automate it!