I needed to learn this in a hurry for an interview. Your series is priceless.
Changes I have noticed following along in June 2024 on Azure Databricks and Databricks Community Edition:
1. The Data option in the left-hand menu no longer exists; to create a table, navigate to (+) New and it is in the menu from there.
2. On the create-table screen, not all the options Bryan shows are there for Azure Databricks; it seems to do things automatically, like deriving the schema and data types. Databricks Community Edition, however, still has the same options.
3. The icons at the top of the notebooks are now drop-down menus.
4. The visualisation button at the bottom of the cell no longer exists; to get to the visualisation screen, you click the + next to the result table at the top.
Thought this might be of use if anyone is following along now in 2024. A big thank you, Bryan, for providing the sample data; it makes life a lot easier following along at home!
thanks a lot
I'm seeing this differently in my UI: left nav bar > Data Engineering > Data Ingestion > Create or modify table card > drag and drop the tutorial file.
ty
Thanks Bryan. Wonderful way of training.
Thank You!
Thank you for this series Bryan. You have made my understanding much easier. Looking forward to more videos from you
Glad to help. Thanks for watching.
This is a great series, and this is the best I've seen online, I'm only at video 3 looking forward to the rest and will be watching the entire series again
Thanks. Yeah. Really excited to pull everything together into an end-to-end series.
Thank you, Bryan! The series is beyond helpful; you make complex things so simple. It's hard to find people who explain as well as you do!! You're the best. Will be back for more.
Thanks Bryan, you teach the subject in an easy to understand manner.
Thank You.
You're the best! Your way of teaching is easy to understand! I can't wait to finish this series
nice and crisp, eagerly waiting for the next video in the series
Thanks
Thanks, Brian. Seeing your passion and expertise, as well as listening to your clear explanations, is very helpful and motivating.
Thanks. Glad my videos are helping.
Doing great work, sir. Making it easier to understand these difficult topics that will play a big role in the future.
Thanks!
Bryan, I like your clean and simple walk-through of features/concepts. I have bought your book, and it's by far the best book about Azure Databricks. Please keep it coming.
Thanks so much Sridhar!
This is a very useful series.
I appreciate the work you've done
Thanks!
Really enjoying the course! Thank you from Netherlands
This is a brilliant lecture, thanks
The best content on this Topic! Thank you.
Thank you Bryan! Nice and clear content.
Your videos are excellent. Thanks for explaining all the concepts in detail.
Thanks Alok. I know it makes my videos longer than other channels but hope it is worth it.
Great Video Bryan. I come from Oracle cloud and currently learning databricks, Spark & Azure. I would like to explore other cloud options for building quick data-driven applications.
this is really really awesome.
Excellent videos you have...
Thank you Bryan! Very very good training. I love how well you explain everything!
You're welcome. Glad it is helpful.
Great Course!! Thank you
The best series on databricks. Thanks for breaking things down clearly and covering every important detail. Also do you have a series covering Delta Lake?
Thanks, and please share with your friends. Delta Lake is on my list.
Great presentation.
When we select a cluster, we give a lower limit and an upper limit on workers.
But if the data is huge and needs more than the upper limit, what happens in that case?
Hey Bryan, can you please make a video on what Spark jobs run in the background while Databricks jobs are running...
Great content
Great Lesson! Thank you Bryan!
Awesome video!!
This is amazing; thank you, sir, for putting in the effort.
You're Welcome!
Hi, I don't see any data file on GitHub for lesson 3. Could anyone help?
Hi, I just posted the link to the data files in the video description. I did not originally b/c I just saw this as a quick demo. Be sure to continue to the other videos where notebooks and data are included. Thanks
@@BryanCafferky thanks for quick reply !
Great Teacher! Thank you!
YW
How lucky we are to live in this parallel universe where we get to learn from Michael Scott :D
Nice video!
Very good video class... Thanks for posting it... Can you please give links to download big datasets if possible, such as sample sales data for a year, for analysis purposes?
Thanks. That's covered in Lesson 10 - ruclips.net/video/wqCCWAa6mFA/видео.html
You should consider adding captions to some of the things which have changed. Just a suggestion!!
Are workers the same as nodes in the cluster?
Yes, but the driver is also a node; it is the entry point that sends the query to the cluster, orchestrates the work, and receives the results.
Hey Bryan, I was wondering why we couldn't use Spot Instances to dramatically reduce the cost of worker nodes.
To my surprise, Databricks recently added a Spot Instance option on the clusters page. You should make a video covering that;
there are huge cost benefits to doing so.
Hmmm... Interesting, and I suspect there are real trade-offs, such as Azure being able to evict you from your VMs whenever it needs them back. Thanks for the idea.
I am not able to create even the smallest cluster in Databricks with a free trial Azure account. Does anyone face with a similar issue? How do you override it?
Replied in your other post on this.
I am facing the same problem too.
Hi Bryan, thanks a ton for putting together this video series and I really appreciate it. Lesson 3 Zip file is missing, if possible can you upload it again?
Hi Attili, Good catch. I fixed the link in the video comments. I only included the data files as the example was a simple live coding. Other lessons have the example notebooks too. Link for the data used in this video is github.com/bcafferky/shared/blob/master/MasterDatabricksAndSpark/Lesson_03.zip
@@BryanCafferky you are amazing, thank you so much.
Hi Bryan
In one of the statements, it was mentioned that Azure Data Lake Gen2 emulates HDFS, but I have a different opinion on this. Can you clarify?
ADLS Gen2 and HDFS differ fundamentally in architecture.
ADLS Gen2: Built on object storage, leveraging rich metadata for file management while supporting a hierarchical namespace similar to file systems.
HDFS: Uses block storage, managing data at the block level with centralized metadata through a NameNode, optimized for high-throughput data processing.
While ADLS Gen2 emulates HDFS functionality (like directory structures and APIs), it's still fundamentally object storage at its core, not block-based.
Thanks
Satish
Hi Satish,
Thanks for asking. Poor wording on my part. ADLS Gen2 provides an HDFS compatible interface to services like Databricks but the implementation is very different as you mentioned.
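To make that "HDFS-compatible interface" concrete, here is a minimal PySpark sketch of reading a file from ADLS Gen2 in a Databricks notebook; the storage account, container, and file path are hypothetical placeholders, and the built-in spark session that notebooks provide is assumed:

# Read a CSV from ADLS Gen2 via its HDFS-compatible (abfss://) interface.
# Account, container, and path below are made-up placeholders.
df = spark.read.csv(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/raw/sales.csv",
    header=True,        # first row holds the column names
    inferSchema=True,   # let Spark derive the column data types
)
df.show(5)              # peek at the first five rows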
You are a magician.
Hi Bryan, how do you close an open notebook in Databricks?
I usually just open something else. Notebook changes are automatically saved.
Can you share the data used in the lesson?
You can get the files from video 9 in the series. The file names may be slightly different. This was just meant to be a short demo. ruclips.net/video/M89l4xLzEGE/видео.html
@@BryanCafferky Thank you
@@nguyenquanghuy7978 YW
Best series ever on Data Eng! Have a question: do you recommend any other resources for configuring the storage part and the related integrations? Kudos from Italy!
Thank you and greetings from Boston, MA in the US. For deployments, the options are Databricks Asset Bundles, Python and the Databricks Python SDK, the Databricks legacy CLI, and Terraform (or OpenTofu, the new open source fork of Terraform). Assuming this is what you are referring to.
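If it helps, here is a minimal sketch using the Databricks Python SDK mentioned above; it assumes pip install databricks-sdk and that authentication is already configured (e.g. via environment variables or a Databricks config profile):

# List the clusters defined in the workspace using the Databricks Python SDK.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up credentials from the environment/config
for cluster in w.clusters.list():
    print(cluster.cluster_name, cluster.state)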
@@BryanCafferky thanks for your kind answer. I'm referring to GCP, S3 or Blob 😊
@@marcellobenedetti3860 For Azure, ADLS Gen2 is the best storage option. Not sure for GCP or AWS, but you want storage that is optimized for HDFS and partitioning.
How do cores and workers relate? Like if you need more horsepower, do I increase the workers or the cores? Or both?
Cheating but ChatGPT says "In Apache Spark, a worker is a node in a cluster that can run application code, and an executor is a process that runs on a worker node to perform computations and data processing for an application:
Worker
Monitors resource availability and spawns executors when directed by the master. Workers also monitor the resource consumption and liveness of executors.
Executor
Runs tasks assigned by the Spark driver program in parallel on worker nodes. Executors divide tasks into smaller units and perform computations on the data.
Here are some other things to know about Spark workers and executors:
Each application has its own executors.
The number of executors created for a worker depends on the number of cores the worker has. For example, if a worker has 16 cores, then 4 executors will be created.
The memory for an executor is the sum of the JVM Heap memory and yarn overhead memory.
The number of CPU cores and memory that each executor can consume can be constrained using the spark-defaults.conf file, spark-env.sh file, and command line parameters. However, these parameters are static and require restarting the Spark processes for changes to take effect. "
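For what it's worth, the executor sizing knobs that quote mentions can also be set from code when you build the session yourself; a minimal sketch follows, where the values are illustrative assumptions only (on Databricks these are normally set at the cluster level for you):

# Constrain executor resources when creating a Spark session directly.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-sizing-demo")
    .config("spark.executor.cores", "4")    # CPU cores per executor
    .config("spark.executor.memory", "8g")  # JVM heap per executor
    .getOrCreate()
)
print(spark.conf.get("spark.executor.memory"))  # confirm the setting took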
See this blog raulsanmartin.me/big/data/understanging-apache-spark-runtime-architecture/
@@BryanCafferky Thanks.
When you run a query, does it bring all the data from the source table (from DBFS in this case) to the cluster nodes and then do the processing?
Yes.
Completed lesson 3. Had a bit of a hard time understanding the first part, but the second half of the video was a breeze, and it's etched in my mind. Maybe it was a little difficult due to the different layouts then and now... I mean the UI.
Suppose we are doing a sum of a million records. DBricks splits that across multiple worker nodes; if one node fails, the sum is obviously wrong. Does DataBricks handle it?
Spelling "Databricks" does not capitalize the B. Spark is completely fault tolerant. If any node fails, it will rerun that work and get the data again. That's why the underlying data structure is called a Resilient Distributed Dataset.
Does Databricks charge $ every time you run a notebook code chunk?
I think it is really the compute you use that costs the money. If it's 40 nodes running in parallel over 200 TB of data for 3 hours, that will cost, whereas a single node running for a few minutes is cheap. I do it all the time on my personal account.
@@BryanCafferky thank you good sir
Hi Bryan, it seems cost is a concern before even becoming a qualified data engineer. Would you explain the cluster expenses more specifically for a beginner? I still have no idea after reviewing the pricing page.
It is an art to estimate costs b/c there are many variables: driver and worker node VM sizes, number of worker nodes, query optimization (which is mostly automated but may need help), how data is partitioned, volume of data, on and on. If you can start with a portion of a large dataset and experiment with incrementally larger data subsets and queries, you'll get a sense of what it will cost. The most important factor is compute, i.e. how many workers are running and for how long. The more workers, the longer they run, and the larger the node sizes, the more expensive. Databricks can scale up and down and automatically turn clusters off after non-use, so it really comes down to how much compute you use. Make sense?
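To put shapes around those levers, here is a sketch of the kind of cluster settings that drive the bill, using field names from the Databricks Clusters API; the runtime version, node type, and counts are illustrative assumptions:

# The main cost levers: node size, worker count range, and auto-termination.
cluster_spec = {
    "cluster_name": "demo-autoscaling",
    "spark_version": "13.3.x-scala2.12",   # Databricks runtime version
    "node_type_id": "Standard_DS3_v2",     # Azure VM size for each node
    "autoscale": {"min_workers": 2, "max_workers": 8},  # worker count range
    "autotermination_minutes": 30,         # shut the cluster down when idle
}

Capping max_workers, choosing smaller node types, and keeping the auto-termination window short are the most direct ways to bound the compute cost.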
@@BryanCafferky I am not able to create even the smallest cluster in Databricks with a free trial Azure account. How do you override it in 2023? Thank you for the great lectures; you are the best teacher in this field!
@@vukdjunisijevic173 Yeah. It's a problem. 1) Recommend you use the Databricks Community Edition, which is always free and automatically gives you a single-node cluster. 2) If not that, make sure all other Databricks resources are deleted from the free subscription; the VM limit applies across all services. 3) Select a single-node (driver node only) cluster in Databricks and make sure all other cluster definitions are deleted. They count even when not running.
DB Community Edition here docs.databricks.com/getting-started/community-edition.html
The link to get the Community Edition is hidden, so look for a tiny font. They don't want you to find it and prefer you get the trial instead.
Hi Cafferky, I really like this Master Databricks lesson series. One question here: you mentioned the notebook will be stored to blob; is it visible to us? And where can I find it?
One more thing: may I share your videos to another website? As you may know, most people in China cannot access RUclips.
Hi Ben, I think I answered this elsewhere, but notebooks, when created, get stored to blob storage; you access them via the Databricks UI workspace icon, under whatever folder you like.
Would it be possible to use a Databricks notebook as an option to convey information in a better way than Power BI?
For basic visualizations it could be used, but Power BI supports active slicing and dicing and dynamic data linking. A good use case for Databricks dashboards would be for the management/leadership of data science and research teams, especially since developers can use custom Python visualizations, which now support plot.ly.
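As a quick sketch of such a custom Python visualization, assuming the plotly package is available on the cluster and using a small made-up DataFrame:

# Build a simple Plotly bar chart; in a Databricks notebook it renders inline.
import pandas as pd
import plotly.express as px

df = pd.DataFrame({"team": ["A", "B", "C"], "experiments": [12, 7, 19]})
fig = px.bar(df, x="team", y="experiments", title="Experiments per Team")
fig.show()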
Great!