Master Databricks and Apache Spark Step by Step: Lesson 3 - Databricks Demo

  • Published: 1 Jan 2025

Comments • 104

  • @destinyokwuosa1048
    @destinyokwuosa1048 10 months ago +5

    I needed to learn this in a hurry for an interview. Your series is priceless.

  • @ChrisUK70
    @ChrisUK70 6 months ago +10

    Changes I have noticed while following along in June 2024 on Azure Databricks and Databricks Community Edition:
    1. The Data option in the left-hand menu no longer exists; to create a table, navigate to (+) New and it is in the menu from there.
    2. On the create-table screen, the options Bryan shows are not all there for Azure Databricks; it seems to do things automatically, like deriving the schema and data types. Databricks Community Edition, however, has the same options.
    3. The icons at the top of the notebooks are now drop-down menus.
    4. The visualisation button at the bottom of the cell no longer exists; to get to the visualisation screen, click the + next to the result table at the top.
    Thought this might be of use if anyone is following along now in 2024. A big thank you to Bryan for providing sample data; it makes life a lot easier following along at home!

    • @sharad3877
      @sharad3877 5 months ago +1

      thanks a lot

    • @D4rkm00r
      @D4rkm00r 2 months ago

      I'm seeing this different in my UI. Left nav bar > Data Engineering > Data Ingestion > Create or Modify table card > drag and drop tutorial file

    • @anonim5052
      @anonim5052 1 month ago

      ty

  • @abbasstovewala2298
    @abbasstovewala2298 2 months ago +1

    Thanks Bryan. Wonderful way of training.

  • @Emmanuel-og7uc
    @Emmanuel-og7uc 3 years ago +21

    Thank you for this series Bryan. You have made my understanding much easier. Looking forward to more videos from you

  • @monsurahmed9547
    @monsurahmed9547 3 years ago +8

    This is a great series, and it's the best I've seen online. I'm only at video 3; looking forward to the rest, and I will be watching the entire series again.

    • @BryanCafferky
      @BryanCafferky  3 years ago

      Thanks. Yeah. Really excited to pull everything together into an end to end series.

  • @AveryAude
    @AveryAude 1 year ago +1

    Thank you Bryan! The series is beyond helpful; you make complex things so simple. Hard to find people who explain as well as you do!! You're the best. Will be back for more.

  • @ChrisUK70
    @ChrisUK70 6 months ago

    Thanks Bryan, you teach the subject in an easy to understand manner.

  • @derejedesta4787
    @derejedesta4787 3 years ago +3

    You're the best! Your way of teaching is easy to understand! I can't wait to finish this series

  • @rahulberry4806
    @rahulberry4806 4 years ago +1

    nice and crisp, eagerly waiting for the next video in the series

  • @kenvdm2577
    @kenvdm2577 2 years ago +1

    Thanks, Bryan. Seeing your passion and expertise, as well as listening to your clear explanations, is very helpful and motivating.

  • @saadullahkhanwarsi5853
    @saadullahkhanwarsi5853 3 years ago +1

    Doing great work, sir. Making it easier to understand these difficult topics that will play a big role in the future.

  • @MrSrinayak
    @MrSrinayak 4 years ago +4

    Bryan, I like your clean and simple walk-through of features/concepts. I have bought your book and by far it's the best book about Azure Databricks. Please keep it coming.

  • @danielkomorowski3776
    @danielkomorowski3776 1 year ago

    This is a very useful series.
    I appreciate the work you've done

  • @Delchursing
    @Delchursing 8 months ago

    Really enjoying the course! Thank you from the Netherlands.

  • @nokajaafa
    @nokajaafa 2 years ago +1

    This is a brilliant lecture, thanks

  • @samirghoudrani3659
    @samirghoudrani3659 2 years ago

    The best content on this Topic! Thank you.

  • @adanestradatoledano
    @adanestradatoledano 8 months ago

    Thank you Bryan! Nice and clear content.

  • @AlokMishra-zg7qe
    @AlokMishra-zg7qe 3 years ago +1

    Your videos are excellent. Thanks for explaining all the concepts in detail.

    • @BryanCafferky
      @BryanCafferky  3 years ago +2

      Thanks Alok. I know it makes my videos longer than other channels but hope it is worth it.

  • @peterkatongole5984
    @peterkatongole5984 1 year ago +2

    Great video Bryan. I come from Oracle Cloud and am currently learning Databricks, Spark & Azure. I would like to explore other cloud options for building quick data-driven applications.

  • @revanthisbnimmagadda
    @revanthisbnimmagadda 1 month ago

    this is really really awesome.

  • @ArvindDhiman-bs6dj
    @ArvindDhiman-bs6dj 7 months ago

    Excellent videos you have...

  • @pcl111
    @pcl111 2 years ago

    Thank you Bryan! Very very good training. I love how well you explain everything!

  • @garychen3037
    @garychen3037 3 years ago +1

    Great Course!! Thank you

  • @chatpeters1395
    @chatpeters1395 2 years ago +1

    The best series on databricks. Thanks for breaking things down clearly and covering every important detail. Also do you have a series covering Delta Lake?

    • @BryanCafferky
      @BryanCafferky  2 years ago

      Thanks, and please share with your friends. Delta Lake is on my list.

  • @satish1012
    @satish1012 1 month ago

    Great presentation.
    When we select a cluster, we give a lower limit and an upper limit.
    But if the data is huge and needs more than the upper limit, what will happen in that case?

  • @aks8496
    @aks8496 2 years ago +1

    Hey Bryan, can you please make a video on what Spark jobs run in the background while running Databricks jobs...

  • @fernandostahelin2972
    @fernandostahelin2972 1 year ago

    Great content

  • @sebajo6643
    @sebajo6643 6 months ago

    Great Lesson! Thank you Bryan!

  • @ajitchaudhary4444
    @ajitchaudhary4444 2 years ago

    Awesome video!!

  • @faicalammisaid3705
    @faicalammisaid3705 1 year ago

    This is amazing, thank you sir for putting in the effort.

  • @ashukol
    @ashukol 2 years ago +2

    Hi, I don't see any data file on GitHub for lesson 3. Could anyone help?

    • @BryanCafferky
      @BryanCafferky  2 years ago +2

      Hi, I just posted the link to the data files in the video description. I did not originally b/c I just saw this as a quick demo. Be sure to continue to the other videos where notebooks and data are included. Thanks

    • @ashukol
      @ashukol 2 years ago +1

      @@BryanCafferky thanks for quick reply !

  • @michaelcarrier5713
    @michaelcarrier5713 2 years ago

    Great Teacher! Thank you!

  • @rohitchakravarthi94
    @rohitchakravarthi94 1 year ago

    How lucky we are to live in this parallel universe where we get to learn from Michael Scott :D

  • @BillusTinnus
    @BillusTinnus 2 years ago

    Nice video!

  • @neostar3498
    @neostar3498 3 years ago +2

    Very good video class... Thanks for posting it... Can you please give links to download big datasets if possible, such as sample sales data for a year, for analysis purposes?

    • @BryanCafferky
      @BryanCafferky  3 years ago

      Thanks. That's covered in Lesson 10 - ruclips.net/video/wqCCWAa6mFA/видео.html

  • @vibhasgoel2067
    @vibhasgoel2067 9 months ago

    You should consider adding captions to some of the things which have changed. Just a suggestion!!

  • @rmehta26
    @rmehta26 9 months ago +1

    Are workers the same as nodes in the cluster?

    • @BryanCafferky
      @BryanCafferky  9 months ago

      Yes, but the driver is also a node; it is the entry point that sends the query to the cluster, orchestrates the work, and receives the results.

  • @seoexperimentations6933
    @seoexperimentations6933 3 years ago +2

    Hey Bryan, I was wondering why we couldn't use spot instances to dramatically reduce the cost of worker nodes.
    To my surprise, Databricks recently added a spot instance option on the clusters page. You should make a video covering that;
    there are huge cost benefits to doing that.

    • @BryanCafferky
      @BryanCafferky  3 years ago +1

      Hmmm... Interesting, and I suspect there are real trade-offs, such as Azure being able to evict you from your VMs whenever it needs to. Thanks for the idea.

  • @vukdjunisijevic173
    @vukdjunisijevic173 1 year ago +2

    I am not able to create even the smallest cluster in Databricks with a free trial Azure account. Does anyone face a similar issue? How do you get around it?

    • @BryanCafferky
      @BryanCafferky  1 year ago

      Replied in your other post on this.

    • @mainakdey3893
      @mainakdey3893 6 months ago

      I am facing the same problem too.

  • @cranthi
    @cranthi 2 years ago +1

    Hi Bryan, thanks a ton for putting together this video series; I really appreciate it. The Lesson 3 zip file is missing; if possible, can you upload it again?

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      Hi Attili, good catch. I fixed the link in the video comments. I only included the data files, as the example was simple live coding. Other lessons have the example notebooks too. The link for the data used in this video is github.com/bcafferky/shared/blob/master/MasterDatabricksAndSpark/Lesson_03.zip

    • @cranthi
      @cranthi 2 years ago

      @@BryanCafferky you are amazing, thank you so much.

  • @satish1012
    @satish1012 1 month ago

    Hi Bryan
    In one of the statements, it was mentioned that Azure Data Lake Gen2 emulates HDFS, but I have a different opinion on this. Can you clarify?
    ADLS Gen2 and HDFS differ fundamentally in architecture.
    ADLS Gen2: Built on object storage, leveraging rich metadata for file management while supporting a hierarchical namespace similar to file systems.
    HDFS: Uses block storage, managing data at the block level with centralized metadata through a NameNode, optimized for high-throughput data processing.
    While ADLS Gen2 emulates HDFS functionality (like directory structures and APIs), it's still fundamentally object storage at its core, not block-based.
    Thanks
    Satish

    • @BryanCafferky
      @BryanCafferky  1 month ago

      Hi Satish,
      Thanks for asking. Poor wording on my part. ADLS Gen2 provides an HDFS compatible interface to services like Databricks but the implementation is very different as you mentioned.
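
      As a minimal sketch of that HDFS-compatible interface: in a Databricks notebook, ADLS Gen2 is addressed through the ABFS driver with an abfss:// URI, just like any other Hadoop-style filesystem path. The storage account, container, and folder names below are made up, and authentication setup (service principal, credential passthrough, etc.) is omitted.

      ```python
      # Hypothetical container "raw" in hypothetical account "mystorageacct";
      # Spark reads it through the HDFS-compatible ABFS driver.
      df = spark.read.csv(
          "abfss://raw@mystorageacct.dfs.core.windows.net/sales/*.csv",
          header=True,
          inferSchema=True,
      )
      df.show(5)
      ```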

  • @uguider
    @uguider 10 months ago

    You are a magician.

  • @Raaj_ML
    @Raaj_ML 3 years ago +1

    Hi Bryan, how do you close an open notebook in Databricks?

    • @BryanCafferky
      @BryanCafferky  3 years ago +1

      I usually just open something else. Notebook changes are automatically saved.

  • @nguyenquanghuy7978
    @nguyenquanghuy7978 3 years ago +3

    Can you share the data used in the lesson?

    • @BryanCafferky
      @BryanCafferky  3 years ago +1

      You can get the files from video 9 in the series. The file names may be slightly different. This was just meant to be a short demo. ruclips.net/video/M89l4xLzEGE/видео.html

    • @nguyenquanghuy7978
      @nguyenquanghuy7978 3 years ago +1

      @@BryanCafferky Thank you

    • @BryanCafferky
      @BryanCafferky  3 years ago

      @@nguyenquanghuy7978 YW

  • @marcellobenedetti3860
    @marcellobenedetti3860 11 months ago

    Best series ever on data engineering! Have a question: do you recommend any other resources for configuring the storage part and the related integrations? Kudos from Italy.

    • @BryanCafferky
      @BryanCafferky  11 months ago

      Thank you and greetings from Boston, MA in the US. For deployments, the options are Databricks Asset Bundles, Python and the Databricks Python SDK, the Databricks Legacy CLI, and Terraform (or the new open source Terraform library called OpenTofu). Assuming this is what you are referring to.

    • @marcellobenedetti3860
      @marcellobenedetti3860 11 months ago +1

      ​@@BryanCafferky thanks for your kind answer. I'm referring to GCP, S3 or Blob 😊

    • @BryanCafferky
      @BryanCafferky  11 months ago +1

      @@marcellobenedetti3860 For Azure, ADLS Gen2 is the best storage option. Not sure for GCP or AWS, but you want storage that is optimized for HDFS and partitioning.
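
      A small sketch of what "optimized for partitioning" can mean in practice: write the data partitioned by columns you commonly filter on, so Spark can prune folders at read time. The path and columns here are illustrative only.

      ```python
      # Illustrative only: partition the lake files by year/month so queries
      # that filter on those columns read fewer files.
      df = spark.createDataFrame(
          [(2024, 1, 100.0), (2024, 2, 250.0)],
          ["year", "month", "amount"],
      )
      (df.write
         .mode("overwrite")
         .partitionBy("year", "month")
         .parquet("abfss://lake@mystorageacct.dfs.core.windows.net/curated/sales"))
      ```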

  • @samore11
    @samore11 2 months ago

    How do cores and workers relate? Like if you need more horsepower, do I increase the workers or the cores? Or both?

    • @BryanCafferky
      @BryanCafferky  2 months ago

      Cheating but ChatGPT says "In Apache Spark, a worker is a node in a cluster that can run application code, and an executor is a process that runs on a worker node to perform computations and data processing for an application:
      Worker
      Monitors resource availability and spawns executors when directed by the master. Workers also monitor the resource consumption and liveness of executors.
      Executor
      Runs tasks assigned by the Spark driver program in parallel on worker nodes. Executors divide tasks into smaller units and perform computations on the data.
      Here are some other things to know about Spark workers and executors:
      Each application has its own executors.
      The number of executors created for a worker depends on the number of cores the worker has. For example, if a worker has 16 cores, then 4 executors will be created.
      The memory for an executor is the sum of the JVM Heap memory and yarn overhead memory.
      The number of CPU cores and memory that each executor can consume can be constrained using the spark-defaults.conf file, spark-env.sh file, and command line parameters. However, these parameters are static and require restarting the Spark processes for changes to take effect. "
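
      To make the cores/executor relationship concrete, here is a hedged sketch of plain Spark executor sizing. On Databricks you normally just pick a worker VM size and worker count in the cluster UI, but the same knobs exist under the hood; the numbers below are illustrative.

      ```python
      from pyspark.sql import SparkSession

      # Illustrative sizing only -- on Databricks the cluster UI manages this.
      # With 4 cores and 8 GB per executor, a 16-core worker could host up to
      # 4 executors, each running up to 4 tasks in parallel.
      spark = (
          SparkSession.builder
          .appName("executor-sizing-sketch")
          .config("spark.executor.cores", "4")
          .config("spark.executor.memory", "8g")
          .getOrCreate()
      )
      ```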

    • @BryanCafferky
      @BryanCafferky  2 months ago

      See this blog raulsanmartin.me/big/data/understanging-apache-spark-runtime-architecture/

    • @samore11
      @samore11 2 months ago

      @@BryanCafferky Thanks.

  • @samikstream
    @samikstream 3 years ago +1

    When you run a query, does it bring all the data from the source table (from DBFS in this case) to the cluster nodes and then do the processing?

  • @mansah707
    @mansah707 5 months ago

    Completed lesson 3. Had a bit of a hard time understanding the first part, but the second half of the video was a breeze and it's etched in my mind. Maybe it was a little difficult due to the different layouts then and now... I mean the UI.

  • @satish1012
    @satish1012 1 month ago

    Suppose we are doing a sum of a million records and Databricks splits that across multiple worker nodes; if one node fails, the sum is obviously wrong. Does Databricks handle it?

    • @BryanCafferky
      @BryanCafferky  1 month ago

      Spelling "Databricks" does not capitalize the B. Spark is completely fault tolerant. If any node fails, it will rerun that work and get the data again. That's why the underlying data structure is called a Resilient Distributed Dataset.
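
      A toy sketch of that idea: the sum below is split into per-partition tasks, and if a worker dies partway through, Spark reschedules its tasks and recomputes the lost partitions from lineage, so the total still comes out right. The numbers are just for illustration.

      ```python
      from pyspark.sql import functions as F

      # Spark splits the aggregation into one task per partition; failed
      # tasks are rerun on another node, so the result stays correct.
      df = spark.range(1_000_000).repartition(8)
      total = df.agg(F.sum("id")).collect()[0][0]
      print(total)  # 499999500000
      ```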

  • @tonatiuhdeleon8236
    @tonatiuhdeleon8236 4 months ago +1

    Does Databricks charge money every time you run a notebook code chunk?

    • @BryanCafferky
      @BryanCafferky  4 months ago +1

      I think it is really the compute you use that costs the money. If it's 40 nodes running in parallel over 200 TBs of data for 3 hours, that will cost, whereas a single node running for a few minutes is cheap. I do it all the time on my personal account.

    • @tonatiuhdeleon8236
      @tonatiuhdeleon8236 4 months ago

      @@BryanCafferky thank you good sir

  • @zhangmr7955
    @zhangmr7955 2 years ago +2

    Hi Bryan, it seems costly before even becoming a qualified data engineer. Would you explain more specifically about the cluster expenses for a beginner? I still have no idea after reviewing the pricing page.

    • @BryanCafferky
      @BryanCafferky  2 years ago +1

      It is an art to estimate costs b/c there are many variables: driver and worker node VM sizes, number of worker nodes, query optimization (which is mostly automated but may need help), how data is partitioned, volume of data, on and on. If you can start with a portion of a large dataset and experiment with incrementally larger data subsets and queries, you'll be able to get a sense of what it will cost. The most important factor is compute, i.e. how many workers are running and for how long. The more workers, the longer they run, and the larger the node sizes, the more expensive. Databricks can scale up and down and automatically shut clusters down after non-use, so it really comes down to how much compute you use. Make sense?
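
      As a hedged illustration of the compute knobs mentioned above, here is roughly what a small, cost-conscious cluster spec could look like for the Databricks Clusters API; the version and node type strings are examples and should be checked against your workspace.

      ```python
      # Example values only -- verify spark_version and node_type_id
      # against what your workspace offers.
      cluster_spec = {
          "cluster_name": "dev-autoscale",
          "spark_version": "13.3.x-scala2.12",
          "node_type_id": "Standard_DS3_v2",              # small Azure VM size
          "autoscale": {"min_workers": 1, "max_workers": 4},
          "autotermination_minutes": 30,                  # shut down when idle
      }
      ```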

    • @vukdjunisijevic173
      @vukdjunisijevic173 1 year ago +1

      @@BryanCafferky I am not able to create even the smallest cluster in Databricks with a free trial Azure account. How do you get around it in 2023? Thank you for the great lectures, you are the best teacher in this field!

    • @BryanCafferky
      @BryanCafferky  1 year ago

      @@vukdjunisijevic173 Yeah, it's a problem. 1) I recommend you use the Databricks Community Edition, which is always free and automatically gives you a single-node cluster. 2) If not that, make sure all other Databricks resources are deleted from the free subscription; the VM limit applies across all services. 3) Select a single-node (driver node only) cluster in Databricks and make sure all other cluster definitions are deleted. They count even when not running.

    • @BryanCafferky
      @BryanCafferky  1 year ago

      DB Community Edition here docs.databricks.com/getting-started/community-edition.html

    • @BryanCafferky
      @BryanCafferky  1 year ago

      The link to get the Community Edition is hidden, so look for a tiny font. They don't want you to find it and prefer you get the trial instead.

  • @benjia4612
    @benjia4612 3 years ago +2

    Hi Cafferky, I really like this Master Databricks lesson series. One question here: you mentioned the notebook will be stored to blob; may I know if it is visible to us? And where can I find it?
    One more thing, may I share your videos on other websites? As you may know, most people in China cannot access YouTube.

    • @BryanCafferky
      @BryanCafferky  3 years ago +1

      Hi Ben, I think I answered this elsewhere, but the notebook, when you create it, gets stored to blob storage; you access it via the Databricks UI workspace icon, under whatever folder you like.

  • @arturgrover2306
    @arturgrover2306 3 years ago +1

    Would it be possible to use a Databricks notebook as an option to convey information in a better way than Power BI?

    • @BryanCafferky
      @BryanCafferky  3 years ago

      For basic visualizations, it could be used, but Power BI supports interactive slicing and dicing and dynamic data linking. A good use case for Databricks dashboards would be for the management/leadership of data science and research teams, especially if the developers can use custom Python visualizations, which now support Plotly.
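
      A minimal sketch of such a custom Plotly visualization in a notebook cell (the data is made up; how the figure renders inline depends on the runtime version):

      ```python
      import pandas as pd
      import plotly.express as px

      # Tiny made-up dataset; recent runtimes render fig.show() inline,
      # older ones may need displayHTML(fig.to_html()).
      pdf = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [120, 95, 140]})
      fig = px.bar(pdf, x="month", y="sales", title="Sample sales by month")
      fig.show()
      ```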

  • @alfredsfutterkiste7534
    @alfredsfutterkiste7534 1 year ago

    Great!