Master Databricks and Apache Spark Step by Step: Lesson 9 - Creating the SQL Tables on Databricks

  • Published: 22 Jan 2025

Comments • 23

  • @BryanCafferky
    @BryanCafferky  2 years ago +10

    This is a re-edited upload. I cleaned up the sound, removing some annoying background noise, and I cut out some parts that seemed unnecessary.

  • @codyjackson92
    @codyjackson92 1 year ago +1

    I've been watching these videos for a couple of days, and they are great. I have a Udemy account through my employer, but the videos available there are lacking. They don't necessarily explain why you want or need to do something, and they completely ignore the background of Databricks and Spark, jumping straight into how to use notebooks. Bryan spends a lot more time explaining how and why you do something, which means you are more likely to figure out how to do what you want rather than simply memorizing commands.
    TL;DR: This video series is much more valuable than the paid-for content on Udemy and, possibly, similar sites.

  • @arghyaroy8539
    @arghyaroy8539 8 days ago

    @brian - I do not see the Notebook in the GitHub download link; I see the CSVs and the PDF slides. Am I missing something?

  • @itsshehri
    @itsshehri 1 year ago

    Hey @BryanCafferky, when I am adding the factinternetsalesreason table it also loads the header as the first row, even though I have set header = "false"
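
    A note on the behavior above: in Spark's CSV options, header "false" declares that the file has no header, so the first line is read as data; to have the first line consumed as column names, set header "true". A minimal sketch, assuming the file sits at the /FileStore/tables path used in the lesson:

    ```python
    # Sketch: header "true" tells Spark the first line holds column names,
    # so it is not loaded as a data row. header "false" would read it as data.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS factinternetsalesreason
        USING CSV
        OPTIONS (
            path "/FileStore/tables/factinternetsalesreason.csv",
            header "true",
            inferSchema "true"
        )
    """)
    ```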

  • @jeremyturner4327
    @jeremyturner4327 4 months ago +1

    The Data tab appears to be absent in the new UI?

  • @DM-py7pj
    @DM-py7pj 1 year ago +1

    These schema-on-read "tables" are persisted by Hive; what does this mean for user visibility? I.e., who can view and/or use this stored info?

    • @BryanCafferky
      @BryanCafferky  1 year ago

      It is accessible within the Databricks Workspace. Each workspace has a Hive Catalog.
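
      To make that concrete, a small sketch (assuming a cluster in the same workspace; table access controls, if enabled, can restrict this further): any user on a cluster can list and query what is registered in the workspace's Hive metastore.

      ```python
      # Sketch: list the tables registered in the default database of the
      # workspace's Hive metastore; any workspace user on a cluster can do
      # this (subject to table access controls, if enabled).
      spark.sql("SHOW TABLES IN default").show()
      print(spark.catalog.listTables("default"))
      ```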

  • @Lee12238z
    @Lee12238z 1 year ago +1

    @Bryan Cafferky the updated UI does not let you load more than 10 files at a time (your practice above has 19); any workaround or suggestions?

    • @BryanCafferky
      @BryanCafferky  1 year ago +1

      I assume you mean upload to the Databricks workspace. I suggest doing it in two batches: 1-10, then 11-19. In practice, files would probably be put on cloud storage outside of Databricks; this method is really more for learning.

    • @rugbymatt
      @rugbymatt 1 year ago +3

      I just went through this and the main issue is that the Databricks UI has changed, so getting to the right place to upload the files is a bit trickier now. This works as of Aug. 17, 2023.
      1) Go to Admin Settings --> Workspace Settings, and enable "DBFS File Browser" under Advanced (and then refresh the browser window if it was not already enabled)
      2) In the sidebar, go to the Data category
      3) Click on the Browse DBFS button that appears towards the upper right (between the "+ Add" button and the cluster button)
      4) Click the Upload button at the top of the DBFS browser
      5) Click the Select button, browse to the "tables" folder, and click the Confirm button
      6) Now you can drag in the files that you want into the grey box, or click the grey box and browse to the folder where you have the files on your local computer and select the ones you wish to add. This brings up the same view as you see in the video with each file being loaded in parallel with progress bars.
      7) When you have uploaded all of the files that you want, click the Done button
      NOTES:
      Using this method does not have the 10 file limit.
      Clicking the "+ Add" button seems to try to add all of the files to one single table.
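
      As an alternative to the UI steps above, a hedged sketch: if you have the Databricks CLI installed and configured, a whole local folder can be copied to DBFS in one command, with no 10-file limit (dbfs:/FileStore/tables matches the destination used in the lesson; ./tables is an assumed local folder name).

      ```
      databricks fs cp --recursive --overwrite ./tables dbfs:/FileStore/tables
      ```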

  • @phemasundar
    @phemasundar 9 months ago +1

    It's better to use a Python script with a for loop to create all the tables in one go, compared to writing multiple SQL statements that do more or less the same job, right?

    • @jamalkamar1210
      @jamalkamar1210 5 months ago

      I believe he mentioned that all Python scripts need to be converted back to SQL for better efficiency
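
      For what it's worth, a minimal sketch of the loop described above (the table names and the /FileStore/tables path are assumptions based on the lesson's files). SQL issued from Python via spark.sql() runs through the same Catalyst optimizer as a SQL cell, so the loop is a convenience rather than an efficiency change.

      ```python
      # Sketch: create one schema-on-read table per CSV in a loop.
      # Table names and paths are illustrative, following the lesson's layout.
      tables = ["dimdate", "dimproduct", "factinternetsales"]  # extend as needed

      for name in tables:
          spark.sql(f"""
              CREATE TABLE IF NOT EXISTS {name}
              USING CSV
              OPTIONS (
                  path "/FileStore/tables/{name}.csv",
                  header "true",
                  inferSchema "true"
              )
          """)
      ```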

  • @santicodaro
    @santicodaro 1 year ago +1

    Hi Bryan, thanks for the content. I have a question:
    Thinking of a real-life scenario, what would be the advantage of creating tables within Databricks/Spark rather than reading files from blob storage as dataframes and then writing the output to blob storage or an OLAP database?
    Wouldn't having tables within Databricks add complexity to the storage layer?
    I guess I am missing the use cases here.
    Thanks a lot!

    • @BryanCafferky
      @BryanCafferky  1 year ago

      It's just schema on read in this case so it creates meta information in the Hive catalog. No additional complexity. The advantage is you get SQL language support. When you get to Delta Tables, you get full CRUD (create, read, update, and delete) operations.
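
      To illustrate, a small sketch (the dimdate table name follows the lesson): DESCRIBE EXTENDED shows that such a table is only catalog metadata pointing at the CSV, and dropping it would remove the catalog entry, not the file.

      ```python
      # Sketch: a schema-on-read table is just an entry in the Hive catalog.
      # The Location row in the output points back at the underlying CSV file.
      spark.sql("DESCRIBE EXTENDED dimdate").show(truncate=False)
      ```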

  • @ashukol
    @ashukol 2 years ago

    How do you avoid duplicate CSV file uploads? What would the cost impact on Azure cloud be from uploading duplicate files?

    • @BryanCafferky
      @BryanCafferky  2 years ago +2

      I don't think that question relates to this video.

    • @ashukol
      @ashukol 2 years ago

      @BryanCafferky I asked out of curiosity

  • @hemalpbhatt
    @hemalpbhatt 5 months ago

    I am getting this error: UnityCatalogServiceException: [RequestId=4a3d6ef7-7b72-4487-b53b-b08bad1f0894 ErrorClass=INVALID_PARAMETER_VALUE] GenerateTemporaryPathCredential uri /FileStore/tables/DimDate.csv is not a valid URI. Error message: INVALID_PARAMETER_VALUE: Missing cloud file system scheme.
    I added the dbfs scheme and the syntax still won't run: [UC_FILE_SCHEME_FOR_TABLE_CREATION_NOT_SUPPORTED] Creating table in Unity Catalog with file scheme dbfs is not supported.
    Instead, please create a federated data source connection using the CREATE CONNECTION command for the same table provider, then create a catalog based on the connection with a CREATE FOREIGN CATALOG command to reference the tables therein. SQLSTATE: 0AKUC

    • @BryanCafferky
      @BryanCafferky  5 months ago +1

      Looks like you have Unity Catalog enabled and it is conflicting with this code. UC came out after I created this video.

    • @AnandChalapati
      @AnandChalapati 3 days ago

      @BryanCafferky I face the same challenge too. Any workaround would help me with my learning. Thanks in advance!
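
      One possible workaround, hedged: on a Unity Catalog workspace, CREATE TABLE cannot point at a dbfs:/ path, but you can often still read the uploaded CSV into a DataFrame and save it as a managed table (the DimDate.csv path is the one from the error above; the table name is illustrative).

      ```python
      # Sketch of a UC-era workaround: read the CSV from DBFS into a DataFrame
      # and write it out as a managed table, instead of using CREATE TABLE
      # with a dbfs:/ location.
      df = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("dbfs:/FileStore/tables/DimDate.csv"))
      df.write.mode("overwrite").saveAsTable("dimdate")
      ```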

  • @muskangupta735
    @muskangupta735 1 year ago

    @BryanCafferky I am getting this error: Error in SQL statement: AnalysisException: Unable to infer schema for CSV. It must be specified manually.

    • @jasonbellis9128
      @jasonbellis9128 1 year ago

      Upload the CSV files before running the statement. See minute 8:00 in the video.
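
      As a quick check, a sketch (dbutils and display are available in Databricks notebooks): that AnalysisException usually means the path is empty or wrong, so list the folder to confirm the files are actually there before creating tables.

      ```python
      # Sketch: verify the CSVs were uploaded; an empty or missing path is the
      # usual cause of "Unable to infer schema for CSV".
      display(dbutils.fs.ls("/FileStore/tables/"))
      ```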