Master Databricks and Apache Spark Step by Step: Lesson 10 - Creating the SQL Tables on Spark

  • Published: Feb 2, 2025

Comments • 22

  • @tinaxu4214 3 years ago +3

    Thank you for sharing this great tutorial. One of the cool things is that at the end of each video, you review the content taught earlier in that video. ✅💯👍💖

    • @BryanCafferky 3 years ago +1

      Yeah. I realized at some point that I need that recap at the end so I thought others might benefit too. Thanks

    • @tinaxu4214 3 years ago

      @BryanCafferky Thank you for the great work!

  • @sharad3877 6 months ago +1

    Anyone who has already watched Lesson 9 can jump directly to 5:05.

  • @Raaj_ML 3 years ago +2

    In Databricks and HDInsight, you don't need to install Spark separately because they come with Spark already. What about an on-premises machine (say, a laptop)? How do we install Spark there? Is installing pyspark the same as installing Spark?

    • @BryanCafferky 3 years ago

      You can download open source Apache Spark here: spark.apache.org/downloads.html. It comes with a PySpark shell in addition to a Scala shell.

    • @Raaj_ML 3 years ago +1

      @BryanCafferky Thanks. What if I just run "pip install pyspark" in Anaconda? Is that equivalent to installing Spark, including Spark core and so on? I ask because I can get a Spark context, session, etc. with only pyspark installed.

    • @BryanCafferky 3 years ago +1

      @Raaj_ML No. PySpark is a separate library for Python on Spark. If you install Spark, you will get PySpark too.

    • @Raaj_ML 3 years ago

      @BryanCafferky OK, but in my Anaconda environment I installed only pyspark and could still get a Spark context and do DataFrame analysis. So does installing pyspark perhaps pull in Spark core as well?

    • @BryanCafferky 3 years ago +2

      @Raaj_ML Not according to the documentation. I have not tried it.
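
Editor's note on the question in this thread: one rough way to check what "pip install pyspark" actually placed on a machine is to look inside the installed package for bundled Spark JAR files. The sketch below uses only the Python standard library. It assumes the pip package keeps its JARs in a `jars` folder inside the `pyspark` package directory, which is how the PyPI builds I am aware of are laid out; `describe_pyspark_install` is a hypothetical helper name, not part of PySpark. A Java runtime is still needed to start a session either way.

```python
import importlib.util
from pathlib import Path


def describe_pyspark_install():
    """Report whether pyspark is importable and how many JARs ship inside it.

    Assumption: a pip-installed pyspark stores Spark's JARs in <package>/jars.
    """
    spec = importlib.util.find_spec("pyspark")
    if spec is None or not spec.submodule_search_locations:
        # pyspark is not installed in this environment.
        return {"installed": False, "package_dir": None, "bundled_jars": 0}

    package_dir = Path(list(spec.submodule_search_locations)[0])
    jars_dir = package_dir / "jars"
    # Count JAR files only if the folder exists; otherwise report zero.
    jar_count = len(list(jars_dir.glob("*.jar"))) if jars_dir.is_dir() else 0
    return {
        "installed": True,
        "package_dir": str(package_dir),
        "bundled_jars": jar_count,
    }


if __name__ == "__main__":
    print(describe_pyspark_install())
```

If `bundled_jars` is greater than zero, the Python package carries Spark's own runtime files with it, which would explain getting a working Spark context in local mode without a separate Spark download.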

  • @amaytrivedi3482 3 years ago +2

    Hi Bryan, an important question.
    For Spark and all the SQL code, can I use Jupyter Notebook instead of Zeppelin? As far as I know, Zeppelin is not free, and I would like to stick with free tools. Also, please let me know whether setting up HDInsight is really necessary.
    Thanks!

    • @BryanCafferky 3 years ago +1

      Actually, Zeppelin Notebook is free and open source, and you can download it here:
      zeppelin.apache.org/download.html
      I use HDInsight for convenience, but you can use any Apache Spark installation you like, and Jupyter with it if you prefer. Zeppelin is more powerful but not as easy to install.

  • @srinivasansaripalli9498 3 years ago +1

    Very good, useful video.

  • @yizhengfeng2450 2 years ago +1

    Where can I download all the data used in this lesson (all the .csv files), as well as the .dbc file for this lesson?
    Thank you so much for sharing!!!

    • @BryanCafferky 2 years ago

      Hi Yizheng, the link is in the video description. I always put it there. Copied here: github.com/bcafferky/shared/blob/master/MasterDatabricksAndSpark/Lesson_10_AW_Create_Tables_On_Spark.zip

  • @Raaj_ML 3 years ago +3

    Hi Bryan, I was wondering why the data engineering job market is growing larger than the market for ML jobs. You probably answered it in this video: "EDA is the place where many companies end the process." They don't go past that to create predictive models. :)

    • @BryanCafferky 3 years ago +6

      I'm not sure whether data scientists are less in demand than data engineers, but it does seem so. If you think about data science, it relies on the same stages as data analysis and business intelligence, so you need data engineering for both. Many organizations are struggling just to get a handle on their data and extract business insights, so machine learning may be a later priority. Also, BI has been around for a while, whereas machine learning is still pretty new. Data science has a steep learning curve and the return is uncertain, so management may be slow to adopt it.

    • @Raaj_ML 3 years ago

      @BryanCafferky I agree completely. That seems to be the case.

  • @ChrisUK70 7 months ago

    Thanks, Bryan. Sorry, another question: when a table is created, does it lock the file so that it cannot be deleted from the file system?

    • @BryanCafferky 7 months ago +1

      In the case of this video's topic, no, because you are only creating a schema definition on top of a file, i.e., schema on read. Mind you, the file system is Azure Data Lake Storage, which is like a drive, so it does not lock up. However, if you create a Delta table (not discussed here because it was very new and not in GA at the time of this video), that would create a new Parquet file and related logs, and these should be locked until the process is complete. Make sense?

    • @ChrisUK70 7 months ago

      @BryanCafferky Perfect, thanks. It really is a different way of thinking from an RDBMS.
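
Editor's note: the "schema on read" idea from the reply above can be sketched as a table definition that only points at an existing CSV file. This is a minimal sketch, not the lesson's actual notebook code: the table name `dim_customer` and the path `/mnt/demo/DimCustomer.csv` are made up for illustration, `build_csv_table_ddl` and `create_table` are hypothetical helper names, and `create_table` assumes it is handed an already running SparkSession.

```python
def build_csv_table_ddl(table_name, csv_path, header=True):
    """Build a Spark SQL statement that defines a table over an existing CSV file.

    The statement only records where the file is and how to read it;
    the data stays in the file and is parsed when the table is queried.
    """
    header_flag = "true" if header else "false"
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING CSV "
        f"OPTIONS (path '{csv_path}', header '{header_flag}', inferSchema 'true')"
    )


def create_table(spark, table_name, csv_path):
    """Run the DDL on a SparkSession supplied by the caller (assumed to exist)."""
    spark.sql(build_csv_table_ddl(table_name, csv_path))


if __name__ == "__main__":
    # Hypothetical table name and path, for illustration only.
    print(build_csv_table_ddl("dim_customer", "/mnt/demo/DimCustomer.csv"))
```

Because the definition is only metadata over the file, this matches the point made in the thread: unlike loading rows into an RDBMS table, nothing here takes ownership of the underlying CSV.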