Thank you for sharing this great tutorial. One of the cool things is that at the end of each video, you review the content that was taught earlier in the video. ✅💯👍💖
Yeah. I realized at some point that I needed that recap at the end, so I thought others might benefit too. Thanks
@@BryanCafferky Thank you for your great job!
Someone who has already watched Lesson 9 can jump directly to 5:05.
In Databricks and HDInsight, you don't need to install Spark separately since they already come with Spark. What about on-premises (say, a laptop)? How do we install Spark? Is installing PySpark the same as installing Spark?
You can download open source Apache Spark here: spark.apache.org/downloads.html. It comes with a PySpark shell in addition to a Scala shell.
@@BryanCafferky Thanks. What if I just do "pip install pyspark" in Anaconda? Is that equivalent to installing Spark, including Spark Core, etc.? I ask because I am able to get a SparkContext, SparkSession, etc. if I just install pyspark.
@@Raaj_ML No. PySpark is a separate library for Python on Spark. If you install Spark, you will get PySpark too.
@@BryanCafferky OK, but in my Anaconda environment I just installed pyspark and could still get a SparkContext etc. to do DataFrame analysis. So perhaps installing pyspark pulls in Spark Core and the rest?
@@Raaj_ML Not according to the documentation. I have not tried it.
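For anyone following along locally, here is a minimal sketch of the quick check the commenter describes: after running pip install pyspark, you can try starting a SparkSession in local mode and building a small DataFrame. The app name and the local[*] master setting are just illustrative choices, and a local Java installation is assumed since PySpark needs a JVM.

# Minimal local check after: pip install pyspark
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession running in local mode on all cores.
spark = (SparkSession.builder
         .appName("local-check")
         .master("local[*]")
         .getOrCreate())

# Quick sanity check: build a tiny DataFrame and show it.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

# Print the Spark version the session is running against.
print(spark.sparkContext.version)

spark.stop()

If this runs, you at least have a working local Spark session; whether that covers everything a full Apache Spark download provides is a separate question, as discussed above.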
Hi Bryan, an important question:
For Spark and all that SQL code, can I use a Jupyter notebook instead of Zeppelin? Zeppelin is not free, and I would like to stick with free tools. Please let me know if setting up HDInsight is really necessary.
Thanks!
Actually, Zeppelin Notebook is free and open source, and you can download it here:
zeppelin.apache.org/download.html
I use HDInsight for convenience; you can use any Apache Spark installation you like, and Jupyter with it if you prefer. Zeppelin is more powerful but not as easy to install.
Very good, useful video.
Thanks!
Where can I download all the data used in this lesson (all the .csv files) as well as the .dbc file?
Thank you so much for sharing!!!
Hi Yizheng, the link is in the video description. I always put it there. Copied here: github.com/bcafferky/shared/blob/master/MasterDatabricksAndSpark/Lesson_10_AW_Create_Tables_On_Spark.zip
Hi Bryan, I was wondering why the data engineering job market is growing so much more than ML jobs... You probably answered it in this video: "EDA is the place where many companies end the process"... they don't go past that to create predictive models. :)
I'm not sure whether data scientists are less in demand than data engineers, but it does seem so. If you think about data science, it relies on the same stages as data analysis and business intelligence, so you need data engineering for both. Many organizations are struggling just to get a handle on their data and extract business insights, so machine learning may be a later priority. Also, BI has been around for a while, whereas machine learning is still pretty new. Data science has a steep learning curve and getting a return is uncertain, so management may be slow to adopt it.
@@BryanCafferky I agree completely. That does seem to be the case.
Thanks Bryan. Sorry, another question: when a table is created, does it lock the file so it cannot be deleted from the file system?
In the case of this video's topic, no, because you are only creating a schema definition on top of a file, i.e., schema on read. Mind you, the file system is Azure Data Lake Storage, which is like a drive, so it does not lock up. However, if you create a Delta table (not discussed here because it was very new and not in GA at the time of this video), that would create a new Parquet file and related logs, and those should be locked until the process is complete. Make sense?
@@BryanCafferky Perfect, thanks. It really is a different way of thinking compared to an RDBMS.
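To make the schema-on-read point above concrete, here is a hedged PySpark sketch. The table name, columns, and data lake path are hypothetical placeholders: the CREATE TABLE statement only registers metadata over files that already exist, and dropping that table removes the metadata, not the underlying CSV files.

from pyspark.sql import SparkSession

# Reuse or create a session (on Databricks/HDInsight one usually exists already as `spark`).
spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Define a table over existing files: schema on read, nothing is copied or locked.
# The path below is a made-up example location in the data lake.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_ext (
        order_id INT,
        amount   DOUBLE
    )
    USING CSV
    OPTIONS (header 'true')
    LOCATION '/mnt/datalake/sales/'
""")

# Queries read the underlying CSV files at query time.
spark.sql("SELECT COUNT(*) AS row_count FROM sales_ext").show()

# Dropping the table removes only the metadata for this LOCATION-based table;
# the CSV files in the data lake remain untouched.
spark.sql("DROP TABLE sales_ext")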