This is a re-edited upload. I cleaned up the sound, removing some annoying background noise, and I cut out some parts that seemed unnecessary.
I've been watching these videos for a couple of days, and they are great. I have a Udemy account through my employer but the videos available there are lacking. They don't necessarily give a rhyme or reason why you want/need to do something, and they completely ignore the background of Databricks and Spark, just jumping straight into how to use notebooks. Bryan spends a lot more time explaining how and why you do something, which means you are more likely to figure out how to do what you want to do rather than simply memorizing commands.
TL;DR: This video series is much more valuable than the paid-for content on Udemy and, possibly, similar sites.
@brian - I do not see the notebook in the GitHub download link. I see the CSVs and the PDF slides; am I missing something?
Hey @BryanCafferky, when I add the factinternetsalesreason table, the header also shows up as the first row even though I have set header = "false".
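A likely explanation, sketched in plain Python rather than Spark: with the CSV header option set to "false", Spark treats the first line of the file as data, so a header line shows up as the first row. Setting it to "true" makes the reader consume that line as column names instead. The sample rows and column names below are made up for illustration, and `read_rows` is a hypothetical helper that only mimics the meaning of the option.

```python
import csv
import io

# Made-up sample standing in for an uploaded CSV file that has a header line.
SAMPLE = "OrderNumber,ReasonKey\nSO100,1\nSO101,5\n"


def read_rows(text, header):
    """Mimic the meaning of the CSV 'header' option (hypothetical helper).

    header=False: every line, including the first, is returned as data.
    header=True: the first line is consumed as column names.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        columns, data = rows[0], rows[1:]
    else:
        # No header declared, so use placeholder names in the style of
        # Spark's default column names (_c0, _c1, ...).
        columns = [f"_c{i}" for i in range(len(rows[0]))]
        data = rows
    return columns, data


cols_false, data_false = read_rows(SAMPLE, header=False)
cols_true, data_true = read_rows(SAMPLE, header=True)

print(cols_false, data_false[0])  # header line appears as the first data row
print(cols_true, data_true[0])    # header line became the column names
```

If the file really does start with a header line, header "true" is probably the setting wanted here.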
The Data tab appears to be absent in the new UI?
These schema-on-read "tables" are persisted by Hive; what does this mean for user visibility? I.e., who can view and/or use this stored info?
It is accessible within the Databricks Workspace. Each workspace has a Hive Catalog.
@Bryan Cafferky the updated UI does not let you load more than 10 files at a time (your practice above has 19). Any workaround or suggestions?
I assume you mean upload to the Databricks workspace. I suggest just doing it in two steps then, 1 - 10, 11 - 19. In practice, files would probably be put on cloud storage outside of Databricks. This method is really more for learning.
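The two-step split suggested above can be sketched as a small batching helper. This is only an illustration of the arithmetic (19 files, at most 10 per upload); the file names are placeholders, not the real CSV names, and `batches` is a hypothetical helper.

```python
def batches(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Placeholder names standing in for the 19 CSV files; real names will differ.
files = [f"file_{n:02d}.csv" for n in range(1, 20)]

for number, group in enumerate(batches(files, 10), start=1):
    print(f"upload {number}: {len(group)} files, {group[0]} .. {group[-1]}")
```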
I just went through this and the main issue is that the Databricks UI has changed, so getting to the right place to upload the files is a bit trickier now. This works as of Aug. 17, 2023.
1) Go to Admin Settings --> Workspace Settings, and enable "DBFS File Browser" under Advanced (and then refresh the browser window if it was not already enabled)
2) In the sidebar, go to the Data category
3) Click on the Browse DBFS button that appears towards the upper right (between the "+ Add" button and the cluster button)
4) Click the Upload button at the top of the DBFS browser
5) Click the Select button, browse to the "tables" folder, and click the Confirm button
6) Now you can drag in the files that you want into the grey box, or click the grey box and browse to the folder where you have the files on your local computer and select the ones you wish to add. This brings up the same view as you see in the video with each file being loaded in parallel with progress bars.
7) When you have uploaded all of the files that you want, click the Done button
NOTES:
Using this method does not have the 10 file limit.
Clicking the "+ Add" button seems to try to add all of the files to one single table.
It's better to use a Python script with a for loop to create all the tables in one go than to write multiple SQL statements that do more or less the same job, right?
I believe he mentioned that all Python scripts need to be converted back to SQL for better efficiency.
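For what the loop idea might look like, here is a minimal sketch that only builds the SQL text, so the statements that run are the same kind you would write by hand. It assumes each CSV was uploaded as /FileStore/tables/<Name>.csv (the DimDate.csv path appears elsewhere in this thread; the other file name is an assumption), and the statement shape is standard Spark SQL, not necessarily the exact statement used in the video. `create_table_sql` is a hypothetical helper.

```python
# Table names taken from this thread; the matching CSV file names under
# /FileStore/tables are an assumption.
TABLES = ["DimDate", "FactInternetSalesReason"]


def create_table_sql(name, folder="/FileStore/tables"):
    """Build one schema-on-read CREATE TABLE statement over a CSV file."""
    return (
        f"CREATE TABLE IF NOT EXISTS {name.lower()} "
        f"USING CSV "
        f"OPTIONS (path '{folder}/{name}.csv', header 'true', inferSchema 'true')"
    )


statements = [create_table_sql(name) for name in TABLES]

for stmt in statements:
    print(stmt)
    # In a Databricks notebook, spark.sql(stmt) would execute the statement.
```

Either way the work is done by the same SQL; the loop mainly saves typing when there are many tables.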
Hi Bryan thanks for the content. I have a question:
Thinking of a real-life scenario, what would be the advantage of creating tables within Databricks/Spark rather than reading files from blob storage as dataframes and then writing the output to blob storage or an OLAP database?
Wouldn't having tables within Databricks add complexity to the storage layer?
I guess I am missing the use cases here.
Thanks a lot!
It's just schema on read in this case so it creates meta information in the Hive catalog. No additional complexity. The advantage is you get SQL language support. When you get to Delta Tables, you get full CRUD (create, read, update, and delete) operations.
How can I avoid uploading duplicate CSV files? What would the cost impact on Azure be if duplicate files are uploaded?
I don't think that question relates to this video.
@BryanCafferky out of curiosity I asked
I am getting this error: UnityCatalogServiceException: [RequestId=4a3d6ef7-7b72-4487-b53b-b08bad1f0894 ErrorClass=INVALID_PARAMETER_VALUE] GenerateTemporaryPathCredential uri /FileStore/tables/DimDate.csv is not a valid URI. Error message: INVALID_PARAMETER_VALUE: Missing cloud file system scheme.
I added the dbfs scheme and the statement still won't run: [UC_FILE_SCHEME_FOR_TABLE_CREATION_NOT_SUPPORTED] Creating table in Unity Catalog with file scheme dbfs is not supported.
Instead, please create a federated data source connection using the CREATE CONNECTION command for the same table provider, then create a catalog based on the connection with a CREATE FOREIGN CATALOG command to reference the tables therein. SQLSTATE: 0AKUC
Looks like you have Unity Catalog enabled and it is conflicting with this code. UC came out after I created this video.
@BryanCafferky I face the same challenge too. Any workaround would help me with my learning. Thanks in advance!
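To make the two error messages in this thread easier to read, here is a small sketch of what "scheme" means for the paths involved. It uses only Python's standard URL parser and simply mirrors the two messages quoted above; it is not Databricks' own validation and it does not offer a fix. `explain` is a hypothetical helper.

```python
from urllib.parse import urlparse


def path_scheme(path):
    """Return the URI scheme of a path, or an empty string if there is none."""
    return urlparse(path).scheme


def explain(path):
    """Map a path to the messages quoted in this thread (illustration only)."""
    scheme = path_scheme(path)
    if scheme == "":
        return "missing cloud file system scheme"
    if scheme == "dbfs":
        return "dbfs scheme not supported for Unity Catalog table creation"
    return f"scheme '{scheme}' present"


for p in ["/FileStore/tables/DimDate.csv", "dbfs:/FileStore/tables/DimDate.csv"]:
    print(p, "->", explain(p))
```

The first path has no scheme at all, which matches the first error; adding dbfs: gives it a scheme, but one that the second error says is rejected when creating a table in Unity Catalog.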
Error in SQL statement: AnalysisException: Unable to infer schema for CSV. It must be specified manually. I am getting this error @BryanCafferky
Upload the CSV files first before running the statement. See 8:00 in the video.