Master Databricks and Apache Spark Step by Step: Lesson 21 - PySpark Using RDDs

  • Published: Nov 19, 2024

Comments • 23

  • @anthonygonsalvis121
    @anthonygonsalvis121 3 years ago +2

    Can't wait for more of your videos on PySpark!

    • @BryanCafferky
      @BryanCafferky 3 years ago +1

      Hi Anthony, You've been busy this weekend! Good for you. I took the 4th weekend off. :-) Working on the next PySpark video currently. Thanks

  • @Raaj_ML
    @Raaj_ML 3 years ago +3

    Bryan, thanks for the series. But I was expecting more explanation of parallelize, partitions, etc., which seem to be the very purpose of using Spark. Many training videos just explain the PySpark code for reading and parsing DataFrames, but how do you really parallelize big data? What are partitions, and how do you partition? Can you please explain these more?

    • @BryanCafferky
      @BryanCafferky 3 years ago +4

      When you use Spark SQL and the DataFrame/Dataset API, Spark parallelizes the work for you automatically. If you want to force partitioning, you can save the data as parquet organized by partition (a short sketch follows the links below). I think this topic needs a series of its own and agree it is worth covering. Here are some blogs you may find useful on this.
      towardsdatascience.com/3-methods-for-parallelization-in-spark-6a1a4333b473
      luminousmen.com/post/spark-partitions
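
      A minimal PySpark sketch of both ideas from the reply above, as run in a Databricks notebook (where spark is predefined); the file path and the "region" column are hypothetical:

          # Hypothetical input; any DataFrame works here
          sales_df = spark.read.csv("/FileStore/tables/sales.csv",
                                    header=True, inferSchema=True)

          # In-memory partitioning: check how Spark split the data by
          # default, then force a different number of partitions
          print(sales_df.rdd.getNumPartitions())
          repartitioned = sales_df.repartition(8)

          # On-disk partitioning: save parquet organized by a column so
          # later reads can skip irrelevant partitions (partition pruning)
          repartitioned.write.mode("overwrite") \
              .partitionBy("region").parquet("/FileStore/tables/sales_by_region")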

    • @Raaj_ML
      @Raaj_ML 3 years ago

      @@BryanCafferky Thanks a lot. You are doing great work. Waiting for more.

  • @ammarahmed5981
    @ammarahmed5981 4 months ago

    Awesome series.

  • @sadeshsewnath6298
    @sadeshsewnath6298 3 years ago +1

    Like the explanation!

  • @annukumar7500
    @annukumar7500 3 years ago +2

    Golden Content and a grand series!
    Quick question: what would be the difference between a simple SQL statement and PySpark's spark.sql statement? Both seem to launch Spark jobs when executed in Databricks. Would they both leverage distributed computing?

    • @BryanCafferky
      @BryanCafferky 3 years ago +2

      Spark SQL is an API that can be called from languages like Python, R, and Scala. Databricks notebooks expose SQL directly, so you can execute SQL statements without using a different language. When you execute spark.sql('select * from mytable') in a Python cell, it's just running Spark SQL. When you use PySpark methods, they use the same Spark DataFrame classes as SQL, so they really use the same code under the covers (sketched below). I even suspect the PySpark methods are translated into SQL prior to executing but have not confirmed this. All three forms run on the cluster nodes. Make sense?
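
      A rough sketch of that equivalence, runnable in a Databricks Python cell (the table name comes from the reply; the "id" column is assumed for illustration):

          # Spark SQL called from Python
          sql_df = spark.sql("SELECT * FROM mytable WHERE id > 10")

          # The same query via PySpark DataFrame methods
          api_df = spark.table("mytable").filter("id > 10")

          # Both go through the same engine; compare the query plans
          sql_df.explain()
          api_df.explain()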

    • @annukumar7500
      @annukumar7500 3 years ago +2

      @@BryanCafferky Makes perfect sense.
      This exact piece was missing from my lego blocks!
      Thank you!

  • @Pasdpawn
    @Pasdpawn 1 year ago

    You are the best, Bryan.

  • @ayeshasarwar615
    @ayeshasarwar615 2 years ago

    Great job!

  • @itsshehri
    @itsshehri 1 year ago

    Hey Bryan, thank you so much for this series. I have a question: what's the difference between a Spark session and a Spark context?

    • @BryanCafferky
      @BryanCafferky 1 year ago

      I took this from a blog, but there were so many pop-up ads I'll not give the link: "Since earlier versions of Spark or PySpark, SparkContext (JavaSparkContext for Java) is an entry point to Spark programming with RDD and to connect to a Spark cluster. Since Spark 2.0, SparkSession has been introduced and became an entry point to start programming with DataFrame and Dataset."
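
      A short sketch of the two entry points (standalone PySpark; in a Databricks notebook, both spark and sc are already defined for you):

          from pyspark.sql import SparkSession

          # Spark 2.0+ entry point
          spark = SparkSession.builder.appName("demo").getOrCreate()

          # The older RDD entry point still exists, exposed by the session
          sc = spark.sparkContext

          rdd = sc.parallelize([1, 2, 3])                  # RDD API
          df = spark.createDataFrame([(1,), (2,)], ["n"])  # DataFrame API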

    • @itsshehri
      @itsshehri 1 year ago

      @@BryanCafferky Thank you so much for your reply. So we use SparkSession since Spark 2.0 and not SparkContext anymore?

  • @sudipbala9647
    @sudipbala9647 3 years ago +1

    How would I upload a txt file?

    • @BryanCafferky
      @BryanCafferky 3 years ago +2

      See video 9 in the series. It covers that but with CSV files. Same thing.
      ruclips.net/video/M89l4xLzEGE/видео.html

    • @sudipbala9647
      @sudipbala9647 3 years ago +1

      @@BryanCafferky Yes sir, I have been watching your series. I see no option to upload .txt files; there are only CSV, JSON, and Avro file types. I am practicing in Community Edition.

    • @BryanCafferky
      @BryanCafferky 3 years ago +1

      @@sudipbala9647 How about renaming the file to have a csv extension? When I do the walkthrough via the Databricks Community Edition GUI, I just get a window to upload any file on my system, no filters. Have you watched video 9 again? It shows you how to do this. You go to Data, Create Table, then click on Drag File to Upload or Click to Browse. Note this uploads the file but does not create a table from it automatically (a sketch for reading the uploaded file follows below).
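
      Once uploaded, the file lands in DBFS and can be read directly; the path below assumes the default upload location and a hypothetical file name:

          # In a Databricks notebook, spark and sc are predefined
          rdd = sc.textFile("/FileStore/tables/mydata.txt")     # as an RDD (this lesson's topic)
          df = spark.read.text("/FileStore/tables/mydata.txt")  # as a DataFrame
          print(rdd.count(), df.count())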

    • @sudipbala9647
      @sudipbala9647 3 years ago +2

      @@BryanCafferky Thank you sir, done.

  • @AnisIMANI-r9y
    @AnisIMANI-r9y 9 months ago

    19:28 so funny 😂