Databricks CI/CD: Azure DevOps Pipeline + DABs

  • Published: Feb 6, 2025

Comments • 35

  • @gangadharneelam3107
    @gangadharneelam3107 5 months ago +2

    Hey Dustin,
    We're currently exploring DABs, and it feels like this was made just for us!😅
    Thanks a lot for sharing it!

  • @luiscarcamo8421
    @luiscarcamo8421 1 month ago +1

    Thanks, Dustin! You helped me a lot with a production pipeline!

  • @youssefzaki9142
    @youssefzaki9142 26 days ago

    Excellent. Thank you so much Dustin.

  • @valentinloghin4004
    @valentinloghin4004 25 days ago

    Thank you for the reply; the file is in the root. I will take a look at your suggestion. Thank you again and blessings!

  • @benjamingeyer8907
    @benjamingeyer8907 5 months ago

    Now do it in Terraform ;)
    Great video as always!

    • @DustinVannoy
      @DustinVannoy  5 months ago +1

      🤣🤣 it may happen one day, but not today. I would probably need help from build5nines.com

  • @sandipansaha6847
    @sandipansaha6847 1 month ago

    Superb video, very detailed
    Thanks @dustin....
    One question: do we need to create the job in each Databricks environment (QA, Prod) after the deployment, or is it created automatically when deploying from dev to the other environments?

    • @DustinVannoy
      @DustinVannoy  25 days ago

      You would either have a separate CI pipeline for each environment and swap out variables, or you can make this one more dynamic based on branch name or other factors.
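
      A minimal sketch of the branch-based approach, assuming a single Azure DevOps YAML pipeline and a bundle with dev and prod targets (names are illustrative, not from the video):
      steps:
        - script: |
            # pick the bundle target from the branch that triggered the pipeline
            if [ "$(Build.SourceBranchName)" = "main" ]; then
              TARGET=prod
            else
              TARGET=dev
            fi
            databricks bundle validate -t "$TARGET"
            databricks bundle deploy -t "$TARGET"
          displayName: Deploy bundle (target chosen from branch name)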

  • @KrisKoirala
    @KrisKoirala 21 days ago

    Thank you for the awesome videos, those are very helpful. When I do bundle run, it runs the workflow, and after the run is done it deletes the workflow completely and I get a message saying my workflow is deleted. I tried removing the bundle run command and it still deletes the workflow, so there is no workflow left. What is causing this?

  • @carlcaravantsz1869
    @carlcaravantsz1869 22 days ago

    Question: the notebooks are not defined in YAML, right? If not, how do they get deployed to the target workspace? Are they deployed just by referencing them in the jobs?

    • @DustinVannoy
      @DustinVannoy  21 days ago +1

      All the files and folders that are not ignored because of .gitignore are synced to the workspace on bundle deploy by default. You can control which ones are included or excluded by adding a sync -> include or sync -> exclude block. Search these docs if you need more examples of how to control that: docs.databricks.com/en/dev-tools/bundles/settings.html
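
      A minimal sketch of a sync block in databricks.yml (paths are illustrative): include pulls in paths that .gitignore would otherwise skip, exclude drops paths that would otherwise be synced.
      sync:
        include:
          - dist/**        # sync even though .gitignore excludes it
        exclude:
          - scratch/**     # never sync local scratch work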

  • @אופיראוחיון-ס8י
    @אופיראוחיון-ס8י 2 months ago +1

    Thank you!
    I have a few processes that are not related to each other. Do I need to create a separate DAB for each one? How can I make the process more dynamic?

    • @DustinVannoy
      @DustinVannoy  1 month ago +1

      The general guidance is: if you want to deploy together and the code can be versioned together, then put it in the same bundle (all using the same databricks.yml). If you want to keep things separate, then it's fine to have separate bundles, and you can either deploy them in separate CD pipelines or in the same one by calling `databricks bundle deploy` multiple times, once from each directory with a databricks.yml (see the sketch below).
      For making it more dynamic I suggest variables, especially complex variables, but usually that is just to change values based on the target environment. Using the SDK to create workflows is an alternative to DABs, and other things have been discussed which might eventually be more of a blend between the two options.
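
      A minimal sketch of the "one CD pipeline, several bundles" option, assuming each bundle lives in its own folder with its own databricks.yml (folder names are illustrative):
      steps:
        - script: databricks bundle deploy -t prod
          workingDirectory: bundles/ingest_process
          displayName: Deploy ingest bundle
        - script: databricks bundle deploy -t prod
          workingDirectory: bundles/reporting_process
          displayName: Deploy reporting bundle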

    • @אופיראוחיון-ס8י
      @אופיראוחיון-ס8י 1 month ago

      @ Thank you very much!

  • @moncefansseti1907
    @moncefansseti1907 3 months ago +1

    Hey Dustin, if we want to add more resources like ADLS bronze, silver, and gold storage, do we need to add them to the environment variables?

    • @DustinVannoy
      @DustinVannoy  1 month ago

      You can deploy schemas within Unity Catalog, but for external storage locations or volumes I would expect those to either happen from Terraform or as notebooks/scripts that you run in the deploy pipeline. Jobs to populate the storage would be defined in DABs, but not the creation of storage itself unless it's built into a job you trigger with bundle run.
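
      For the Unity Catalog schema part, a minimal sketch of a bundle-managed schema (catalog and names are illustrative); external locations and volumes would still come from Terraform or a setup notebook/script:
      resources:
        schemas:
          bronze_schema:
            catalog_name: main
            name: bronze
            comment: Created and managed by the bundle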

  • @BrianMurrays
    @BrianMurrays 1 month ago

    Thanks for the video; I had been looking into how to set this up for a while and this video finally got me to a place of having a working process. I just about have all of this set up in my environment, but the most recent issue I'm running into is that if I develop locally and run a DLT pipeline from VS Code, it sets everything up with my credentials. When I merge to my dev branch, that triggers the CI/CD pipeline (running as the service principal), and the step that runs the job throws an error that the tables defined in the DLT pipeline are managed by another pipeline (the one with my credentials). If I use DLT do I just never test from VS Code, or do I need to go clean those up each time? Is there a better way to manage this?

    • @BrianMurrays
      @BrianMurrays 26 days ago

      In case anyone else was running into this, I found a solution. First, a little more context. I have a dev and a prod workspace. Data engineers can pull the project down into VS Code, and using either the Databricks CLI or the VS Code Databricks extension, when they deploy bundles to the dev target it goes to their home folder. I have a release pipeline that is triggered when I merge into the dev branch. This deploys the project using a service principal account. It re-uses the dev workspace, and code is deployed to its home folder. All jobs and pipelines get named with the environment and the username as well.
      The part that I was missing is that everyone was using the same schemas. After talking to our Databricks expert, it was suggested to use dynamic schemas as well (a sketch of this follows below). This is all configured with variables in the databricks.yml file. In the dev target they are overridden to use the user name as part of the schema name. So, when I deploy the bundle to the dev target, the workflows and pipelines get prefixed with [dev myusername] and the catalog is my dev_bronze catalog, which is shared by everyone in the dev workspace. But the schemas that are created and written to are prefixed with myusername_. This was the key so that each dev has their own area in the workspace and we don't have multiple DLT pipelines trying to own the same underlying tables. Staging uses the service principal, and I have it set to use the default values to mimic how the process will look in the prod workspace.
      So, when developer1 deploys a bundle for ProjectA to dev, it will create the [dev developer1] ProjectA Job and the [dev developer1] ProjectA DLT Pipeline, the source is stored in the home folder of developer1, and the table it writes to will be in dev_bronze.developer1_schemaName.tableName. When developer2 deploys ProjectA, they get [dev developer2] ProjectA Job, [dev developer2] ProjectA DLT Pipeline and write to dev_bronze.developer2_schemaName.tableName. When the change gets merged to the dev branch, it gets deployed as [dev servicePrincipal] ProjectA Job, [dev servicePrincipal] ProjectA DLT Pipeline and writes to dev_bronze.schemaName.tableName.
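
      A minimal sketch of this pattern in databricks.yml, assuming substitutions are allowed in target-level variable overrides in your CLI version (names are illustrative):
      variables:
        catalog:
          default: dev_bronze
        schema_name:
          default: schemaName
      targets:
        dev:
          mode: development   # adds the [dev username] prefix to job/pipeline names
          variables:
            schema_name: ${workspace.current_user.short_name}_schemaName
        staging:
          mode: production    # service principal deploys with the default values
      A DLT pipeline in the bundle would then reference catalog: ${var.catalog} and target: ${var.schema_name}.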

    • @DustinVannoy
      @DustinVannoy  25 days ago

      @@BrianMurrays Thanks for sharing these details about how you got this working for your setup.

  • @valentinloghin4004
    @valentinloghin4004 26 days ago

    Thank you very much! I have an issue in the step Validate bundle for dev environment; I am getting the error message: Error: unable to locate bundle root: databricks.yml not found

    • @DustinVannoy
      @DustinVannoy  25 days ago

      The databricks.yml should be in the current working directory. Perhaps you need to set the working directory or add a change directory step before trying to run the deploy.
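
      For example, a minimal sketch of an Azure DevOps step that runs the CLI from the folder containing databricks.yml (the path is illustrative):
      steps:
        - script: databricks bundle validate -t dev
          workingDirectory: $(Build.SourcesDirectory)/my_bundle
          displayName: Validate bundle for dev environment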

  • @unilmittakola
    @unilmittakola 4 months ago

    Hey Dustin,
    We're currently implementing Databricks Asset Bundles using Azure DevOps to deploy workflows. The bundles we are using are stored in GitHub. Can you please help me with the YAML script for it?

  • @albertwang1134
    @albertwang1134 5 months ago

    Hi Dustin, have you tried to configure and deploy a single node cluster by using Databricks Bundle?

    • @DustinVannoy
      @DustinVannoy  5 months ago

      Yes, it is possible. It looks something like this:
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: m6gd.xlarge
            num_workers: 0
            data_security_mode: SINGLE_USER
            spark_conf:
              spark.master: local[*, 4]
              spark.databricks.cluster.profile: singleNode
            custom_tags: {"ResourceClass": "SingleNode"}

    • @albertwang1134
      @albertwang1134 4 months ago

      @@DustinVannoy Thanks a lot! This cannot be found in the Databricks documentation.

  • @thusharr7787
    @thusharr7787 5 months ago

    Thanks, one question: I have some metadata files in the project folder and I need to copy them to a volume in Unity Catalog. Is that possible through this deploy process?

    • @DustinVannoy
      @DustinVannoy  5 months ago

      Using the Databricks CLI, you can have a command that copies data up to a volume. Replace all the curly brace { } parts with your own values.
      databricks fs cp --overwrite {local_path} dbfs:/Volumes/{catalog}/{schema}/{volume_name}/{filename}
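
      As a sketch, an Azure DevOps step running that copy after the deploy could look like this (the local folder name is illustrative, and the curly brace parts are still placeholders):
      steps:
        - script: databricks fs cp --overwrite --recursive ./metadata dbfs:/Volumes/{catalog}/{schema}/{volume_name}/
          displayName: Copy metadata files to Unity Catalog volume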

  • @albertwang1134
    @albertwang1134 5 months ago

    I am learning DABs at this moment. So lucky that I found this video. Thank you, @DustinVannoy. Do you mind if I ask a couple of questions?

    • @DustinVannoy
      @DustinVannoy  5 months ago

      Yes, ask away. I'll answer what I can.

    • @albertwang1134
      @albertwang1134 5 months ago

      Thank you, @@DustinVannoy. I wonder whether the following development process makes sense, and if there is anything we could improve.
      Background:
      (1) We have two Azure Databricks workspaces, one for development and one for production.
      (2) I am the only Data Engineer in our team, and we don't have a dedicated QA. I am responsible for development and testing. Those who consume the data will do UAT.
      (3) We use Azure DevOps (repository and pipelines).
      Process:
      (1) Initialization
      (1.1) Create a new project by using `databricks bundle init`
      (1.2) Push the new project to Azure DevOps
      (1.3) On the development DBR workspace, create a Git folder under `/Users/myname/` and link it to the Azure DevOps repository
      (2) Development
      (2.1) Create a feature branch on the DBR workspace
      (2.2) Do my development and manual testing
      (2.3) Create a unit test job and the scheduled daily job
      (2.4) Create a pull request from the feature branch to the main branch on the DBR workspace
      (3) CI
      (3.1) An Azure CI pipeline (build pipeline) is triggered after the pull request is created
      (3.2) The CI pipeline checks out the feature branch and does `databricks bundle deploy` and `databricks bundle run --job the_unit_test_job` on the development DBR workspace using a Service Principal (a sketch of this step follows below)
      (3.3) The test result shows on the pull request
      (4) CD
      (4.1) If everything looks good, the pull request is approved
      (4.2) Manually trigger an Azure CD pipeline (release pipeline). Check out the main branch and do `databricks bundle deploy` to the production DBR workspace using a Service Principal
      Explanation:
      (1) Because we are a small team and I am the only person who works on this, we do not have a `release` branch, to simplify the process
      (2) For the same reason, we also do not have a staging DBR workspace
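
      A minimal sketch of step (3.2), assuming the Databricks CLI is already installed on the agent and the Service Principal credentials are stored as secret pipeline variables (variable names are illustrative); for Azure Repos the pipeline is attached to the pull request via a branch policy build validation:
      steps:
        - script: |
            databricks bundle deploy -t dev
            databricks bundle run the_unit_test_job -t dev
          displayName: Deploy to dev and run unit test job
          env:
            DATABRICKS_HOST: $(DATABRICKS_HOST)
            ARM_TENANT_ID: $(ARM_TENANT_ID)
            ARM_CLIENT_ID: $(ARM_CLIENT_ID)
            ARM_CLIENT_SECRET: $(ARM_CLIENT_SECRET)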

    • @DustinVannoy
      @DustinVannoy  5 months ago +1

      Overall the process is good. It's typical not to have a separate QA person. I try to use a YAML pipeline for the release step so the code looks pretty similar to what you use to automate the deploy to dev. I recommend having unit tests you can easily run as you build, which is why I try to use Databricks Connect to run a few specific unit tests at a time. But running workflows on all-purpose or serverless compute isn't too bad an option for quick testing as you develop.

  • @fb-gu2er
    @fb-gu2er 4 months ago

    Now do AWS 😂

    • @DustinVannoy
      @DustinVannoy  4 months ago

      Meaning AWS account with GitHub Actions? If not, what combo of tools are you curious about for the deployment?

  • @anindyabanerjee5733
    @anindyabanerjee5733 1 month ago

    @DustinVannoy Will this work with a Databricks Personal Access Token instead of a Service Connection/Service Principal?

    • @DustinVannoy
      @DustinVannoy  1 month ago

      Yes, but for deploying DABs to Staging/Prod you want to use the same user every time so they are consistently the owner. For GitHub Actions I use a token stored in a secret. I think you could pull it from Key Vault in a DevOps pipeline; I'm not positive on the best practice there.
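
      A minimal sketch of the Key Vault option in Azure DevOps (vault, secret, and service connection names are illustrative):
      steps:
        - task: AzureKeyVault@2
          inputs:
            azureSubscription: my-azure-service-connection
            KeyVaultName: my-key-vault
            SecretsFilter: databricks-pat   # fetched secret becomes the $(databricks-pat) variable
        - script: databricks bundle deploy -t prod
          displayName: Deploy bundle using PAT auth
          env:
            DATABRICKS_HOST: $(DATABRICKS_HOST)
            DATABRICKS_TOKEN: $(databricks-pat)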