Advancing Spark - Getting started with Repos for Databricks

  • Published: 26 Nov 2024

Comments • 29

  • @amateurvisser
    @amateurvisser 2 years ago

    Much better than notebook-based Git control. That was a mess, especially when moving notebooks to other folders. Thanks.

  • @chandanmishra8084
    @chandanmishra8084 3 years ago

    Wow, you solved my biggest problem while collaborating with other people. Thank you.

  • @umarhussain9334
    @umarhussain9334 3 years ago +1

    I did exactly the same: I built a wrapper on top of the CLI to extract notebooks and add them into my source control. I use artifacts to add my own custom classes and functions in PySpark.
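
    A minimal sketch of that kind of wrapper, assuming the legacy databricks-cli is installed and configured; the workspace folder and local paths below are placeholders:

      import subprocess

      # Export every notebook under a workspace folder as source files,
      # then commit them from the local clone with ordinary git commands.
      WORKSPACE_DIR = "/Users/me@example.com/etl"  # hypothetical folder
      LOCAL_DIR = "./notebooks"

      subprocess.run(
          ["databricks", "workspace", "export_dir", WORKSPACE_DIR, LOCAL_DIR, "-o"],
          check=True,
      )
      subprocess.run(["git", "add", LOCAL_DIR], check=True)
      subprocess.run(["git", "commit", "-m", "Sync notebooks from workspace"], check=True)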

  • @skaina00
    @skaina00 3 years ago

    This feature is very useful and I hadn't heard about it before. It meets my current needs :) Thanks for sharing your tips!

  • @rohsha6958
    @rohsha6958 3 years ago +1

    Thanks for the video. I am wondering what path we use now for calling other notebooks (using %run). It can be different within the repos and when it is actually deployed to the workspace.

  • @rebeccathorn6406
    @rebeccathorn6406 2 years ago

    Great video 😊 Would be great to see another video or blog talking through the patterns you're using with Databricks Repos!

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      I revisited this in a recent video on "files in repos", which goes through the library development approach. Take a look and let me know if that covers your questions! ruclips.net/video/nN-NPnfJLNY/видео.html
      Simon

  • @vap72a25
    @vap72a25 3 years ago +2

    Finally!

  • @crazybauns
    @crazybauns 2 years ago

    Thanks for this video, it's very useful! I've got two questions :)
    1. When you set it up, you already had a project in DevOps and the notebooks got pushed from DevOps to Databricks. Is it possible to do it the other way around? E.g. I have a bunch of notebooks in Databricks and I want to set up source control - I want all of the notebooks in Databricks to be pushed to DevOps automatically when source control is set up. Can that be done, or do I have to add them manually and move them to the repo?
    2. Also, when it comes to CI/CD in Azure DevOps - is there a simple way to not only deploy notebooks to different workspaces but also deploy actual Databricks objects, such as tables? E.g. a script that checks if a table exists and creates it if not, or alters it if there are fields missing (see the sketch below)?
    Thanks!
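
    A minimal sketch of the kind of script question 2 describes, assuming Delta tables and hypothetical schema, table and column names:

      # Idempotent table deployment: create the schema and table if missing,
      # then add any expected columns the live table doesn't have yet.
      expected_columns = {"id": "BIGINT", "name": "STRING", "loaded_at": "TIMESTAMP"}

      spark.sql("CREATE SCHEMA IF NOT EXISTS silver")
      spark.sql("""
          CREATE TABLE IF NOT EXISTS silver.customers (
              id BIGINT, name STRING, loaded_at TIMESTAMP
          ) USING DELTA
      """)

      existing = set(spark.table("silver.customers").columns)
      for col, dtype in expected_columns.items():
          if col not in existing:
              spark.sql(f"ALTER TABLE silver.customers ADD COLUMNS ({col} {dtype})")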

  • @lackshubalasubramaniam7311
    @lackshubalasubramaniam7311 3 years ago

    Great video! Works for my case even when our repo is integrated with another code base, due to how we deploy code via IaC. It would be nice if I could point to a subfolder in the repo. Cleaner.

  • @alphacharith
    @alphacharith 3 years ago

    Thank you for sharing the video.

  • @grabngoinfo
    @grabngoinfo 2 years ago

    Thank you for the video!

  • @sergiocoutinho6133
    @sergiocoutinho6133 2 years ago

    This is a wonderful video, but I have a question: does Databricks SQL Analytics work with Repos too? Is there a way to store all my queries and dashboards in a secure place?

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      Depressingly, no. There's no source control integration within Databricks SQL - it does maintain a history of your various query objects, but they're all held within Databricks itself rather than integrating back to Git. I guess it's the same thing we've seen with other "business facing BI tools", where the assumption is that data analysts don't "get" source control... would be nice to have the option though!

  • @marcocaviezel2672
    @marcocaviezel2672 3 years ago

    Thanks for this great video!
    If I understand correctly, the Repos don't replace the Workspace?
    Could you explain, or share some resources on, how to release the (main?) branch in the workspace?

    • @marcocaviezel2672
      @marcocaviezel2672 3 years ago

      Just as a short follow-up: I can see and access notebooks in ADF when I link the workspace as a source. Are there any objections to using notebooks in production like this? (I didn't try Auto Loader yet...)

  • @baatchus1519
    @baatchus1519 3 years ago

    Great video, I'm also wondering about the %run notebook functionality. How would that work inside the git repo? It will work with a relative path but not an absolute path?

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +1

      Hey - you can use either, but the absolute path will reference a path to the repo in DBFS, unlike the standard notebook paths, so it would change between deployments etc. - generally better to stick with relative if possible!
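
      For example (a sketch; the folder and notebook names below are made up):

        # Databricks notebook cell - %run must be the only command in its cell.
        # Relative path: resolved from this notebook's own location, so it keeps
        # working when the repo is deployed to a different workspace folder.
        %run ./utils/common_functions

        # Absolute path: pinned to one specific checkout, so it breaks on deployment.
        # %run /Repos/someone@example.com/my-repo/utils/common_functions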

    • @mationplays1500
      @mationplays1500 2 years ago

      @@AdvancingAnalytics So does that mean you can use this repo as if it was local on your PC, and import modules and such like a normal file structure?

  • @adityak204
    @adityak204 3 years ago

    Thanks for the video! In my project I have built a CI pipeline using databricks-cli & Azure DevOps. Using databricks-cli I can trigger the notebook present in my workspace (I have synced the workspace notebooks with Azure Repos), but how can I trigger notebooks present in Databricks Repos?

    • @sergiocoutinho6133
      @sergiocoutinho6133 2 years ago

      Hi Aditya,
      I don't know if my suggestion can help you, but you could create a "global parameter" in Azure Data Factory (ADF) pointing to the Repos folder where your notebook is.
      For example:
      A) Create a global parameter in ADF:
      DBC_NOTEBOOK_PATH = /Repos/myemail@sparkfans.com/notebooks/
      B) In the pipeline that runs this notebook, in the Notebook component, configure the Notebook Path parameter with this value:
      @concat(pipeline().globalParameters.DBC_NOTEBOOK_PATH, '/silver/my_notebook.dbc')
      With these steps, you only need to change the DBC_NOTEBOOK_PATH value to switch between the workspace and Repos.
      C) One piece of advice: when you are working with both Repos and the workspace, be careful with your notebooks' %run commands. You need to adopt relative paths instead of absolute paths.
      Example - rather than:
      %run "global/functions.dbc"
      use:
      %run "./../../../functions.dbc" (the number of "../" segments depends on where your "my_notebook.dbc" is located)
      Best regards
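
      On the original databricks-cli question: a notebook under /Repos can also be triggered directly through the Jobs runs-submit REST API. A sketch with placeholder host, token and cluster id:

        import requests

        HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
        TOKEN = "dapi..."  # read from a secret store, never hard-coded

        # Submit a one-off run of a notebook that lives in a Databricks Repo.
        resp = requests.post(
            f"{HOST}/api/2.0/jobs/runs/submit",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={
                "run_name": "repo-notebook-run",
                "existing_cluster_id": "0123-456789-abcdefgh",  # placeholder
                "notebook_task": {
                    "notebook_path": "/Repos/myemail@sparkfans.com/notebooks/silver/my_notebook"
                },
            },
        )
        resp.raise_for_status()
        print(resp.json()["run_id"])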

  • @mationplays1500
    @mationplays1500 2 years ago

    Can you also add non-notebook files (Python source files) to the repo and run them on the cluster?

    • @AdvancingAnalytics
      @AdvancingAnalytics  2 years ago

      You can now! You couldn't when I made this original video. Check out the video I put together on .py files in repos: ruclips.net/video/nN-NPnfJLNY/видео.html
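
      With files in repos, the repo root is automatically on sys.path for notebooks inside the repo, so a plain import works. A sketch assuming a hypothetical utils/helpers.py module in the same repo:

        # utils/helpers.py sits next to this notebook in the same repo checkout.
        from utils.helpers import clean_dataframe  # hypothetical module and function

        df = clean_dataframe(spark.table("bronze.events"))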

  • @miguelangelfernandezguerre1681
    @miguelangelfernandezguerre1681 3 years ago

    Have you noticed that if you make a commit to a notebook in the Repo, that change is ignored when you look at the same notebook in the Workspace? I'm facing this big problem: the notebooks in the workspace always ignore my last committed change...

    • @AdvancingAnalytics
      @AdvancingAnalytics  3 years ago +2

      Not seen it, but we've also changed the patterns we use. We no longer sync workspace notebooks with git; instead we use a DevOps pipeline to push notebooks to the workspace version when changes we've made in a feature branch are pulled into our Dev branch. Had no problems with that pattern!
      Simon
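
      In script form, that push step boils down to a call to the Workspace import API (a sketch with placeholder host, token and paths; in practice a DevOps script task or the databricks-cli wraps this):

        import base64
        import requests

        HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
        TOKEN = "dapi..."  # injected from a pipeline secret, never hard-coded

        # Push one notebook source file into the live workspace, overwriting
        # whatever is there - repeat per file, or loop over a folder.
        with open("notebooks/my_notebook.py", "rb") as f:
            content = base64.b64encode(f.read()).decode()

        resp = requests.post(
            f"{HOST}/api/2.0/workspace/import",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={
                "path": "/Live/my_notebook",  # target workspace path (placeholder)
                "format": "SOURCE",
                "language": "PYTHON",
                "overwrite": True,
                "content": content,
            },
        )
        resp.raise_for_status()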

    • @miguelangelfernandezguerre1681
      @miguelangelfernandezguerre1681 3 years ago

      @@AdvancingAnalytics Great tip, I'll do the same. Do you have any resource/git repo with examples for that pipeline? I'm a Data Scientist and I'm new to these things about DevOps/MLOps. I'd appreciate any book, YouTube channel or online course you could share. Thanks again, you have a fantastic channel, I love it ❤

    • @qweasdzxc2007
      @qweasdzxc2007 2 years ago

      @@AdvancingAnalytics Please share which Azure DevOps release pipeline step you use for pushing Git code into the ADB workspace.