How to Submit a PySpark Script to a Spark Cluster Using Airflow!

  • Published: 22 Oct 2023
  • Master the intricacies of deploying PySpark scripts on Spark clusters with our comprehensive guide, leveraging the power of Airflow. Delve into step-by-step procedures, best practices, and common pitfalls to ensure a smooth execution. Understand the synergy between PySpark's distributed computing capabilities and Airflow's robust workflow management. This tutorial is designed for both beginners and seasoned professionals looking to optimize their big data processing tasks. Subscribe for in-depth insights into Spark, Airflow, and the dynamic world of data engineering!
    registry.astronomer.io/provid...
    airflow.apache.org/docs/apach...

Comments • 49

  • @vladimirkotoff4260
    @vladimirkotoff4260 9 months ago +2

    Thank you so much for the vid! Both the format and the subject are awesome!

  • @JP-zz6ql
    @JP-zz6ql 9 months ago +1

    Love this new format

  • @Airaselven
    @Airaselven 3 months ago +1

    Thank you very much for the useful video.

  • @anikethdeshpande8336
    @anikethdeshpande8336 22 days ago +1

    This worked well!
    Thank you so much 😃

  • @sandeepnarwal8782
    @sandeepnarwal8782 4 months ago +1

    Best Tutorial

  • @QuangHungLe
    @QuangHungLe 10 days ago +1

    Nice vid bro, but if I've got my Airflow installed in Docker, how can I get the SparkSubmitOperator package?

    • @thedataguygeorge
      @thedataguygeorge  4 days ago

      Add it to your requirements file and restart your environment
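
      For anyone following along, a minimal sketch of that fix, assuming an Astro CLI project (the provider's PyPI package name and the Astro CLI command are real; your setup may differ):

          # requirements.txt: add the Spark provider so SparkSubmitOperator is importable
          apache-airflow-providers-apache-spark

          # then rebuild and restart the local environment
          astro dev restart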

  • @SoumilShah
    @SoumilShah 5 months ago +2

    Where are you running the Spark cluster? Is it running locally?
    Why did we use spark://master in the connection?
    I'm confused how the Docker container is connecting to a Spark cluster running on your laptop.

    • @thedataguygeorge
      @thedataguygeorge  5 months ago

      Yes, I'm running Spark locally! That's why I'm able to connect to it that way!

    • @gautamrishi5391
      @gautamrishi5391 2 months ago +1

      @thedataguygeorge How can I check the port the way you did? Maybe it's different for me?
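
      A note for anyone stuck here: with a local standalone cluster, the master URL (including the port, 7077 by default) is shown at the top of the Spark master web UI, usually at localhost:8080. A minimal sketch of a DAG that uses the connection (the script path and connection values are assumptions, not from the video):

          from datetime import datetime

          from airflow import DAG
          from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

          with DAG(
              dag_id="spark_submit_example",
              start_date=datetime(2023, 10, 1),
              schedule=None,
              catchup=False,
          ) as dag:
              # conn_id refers to an Airflow connection of type Spark,
              # e.g. Host: spark://localhost, Port: 7077 (hypothetical values)
              submit_job = SparkSubmitOperator(
                  task_id="submit_job",
                  conn_id="spark_default",
                  application="/usr/local/airflow/include/my_job.py",  # hypothetical path
              )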

  • @sildistruttore
    @sildistruttore 1 month ago +1

    How am I supposed to send data directly into Snowflake? I mean, if I'm processing petabytes, how am I supposed to retrieve them on my local machine and then push them into Snowflake?

    • @thedataguygeorge
      @thedataguygeorge  29 days ago

      You could chunk it by ingesting smaller pieces of the larger data set, or spin up a remote Spark cluster with a ton of compute!
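
      Worth noting: with the Spark Snowflake connector, the write happens from the cluster's executors, so the data never has to land on your local machine. A rough sketch (the connector must be on the cluster's classpath; all connection values and the source path below are placeholders):

          from pyspark.sql import SparkSession

          spark = SparkSession.builder.appName("to_snowflake").getOrCreate()
          df = spark.read.parquet("s3://my-bucket/big-dataset/")  # hypothetical source

          sf_options = {
              "sfURL": "myaccount.snowflakecomputing.com",  # placeholder account
              "sfUser": "USER",
              "sfPassword": "PASSWORD",
              "sfDatabase": "DB",
              "sfSchema": "PUBLIC",
              "sfWarehouse": "WH",
          }

          # Append directly from the executors into the target table
          (df.write
              .format("net.snowflake.spark.snowflake")
              .options(**sf_options)
              .option("dbtable", "TARGET_TABLE")
              .mode("append")
              .save())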

  • @gpmahendra7129
    @gpmahendra7129 7 months ago +2

    Hi Data Guy, the video is so insightful. I am just stuck on understanding the Host & Port details you provided while creating the connection in the Airflow web interface. How do I configure those while bringing up the Airflow container to submit a PySpark script? Kindly help me with this one.

    • @thedataguygeorge
      @thedataguygeorge  7 months ago

      Thanks! What specifically are you having trouble with? Is the Host/Port combo for the Spark connection not working?
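
      For reference, one combination that works when Airflow and Spark share a Docker network (a sketch, not from the video; the service name spark-master is hypothetical, 7077 is Spark standalone's default master port):

          Conn Id:   spark_default
          Conn Type: Spark
          Host:      spark://spark-master
          Port:      7077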

  • @pyramidofmerlinii4368
    @pyramidofmerlinii4368 4 months ago +2

    I followed you but got one error:
    JAVA_HOME is not set
    So do I have to install Java in my container?
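
    Most likely yes: spark-submit needs a JVM, so the container running the Airflow task has to have Java installed. A hedged Dockerfile sketch for a Debian-based Airflow image (the base image tag and JAVA_HOME path are assumptions; adjust for your image):

        # Hypothetical base image and tag; use your own Airflow image
        FROM quay.io/astronomer/astro-runtime:9.0.0
        USER root
        # Install a JDK and point JAVA_HOME at the default-jdk symlink
        RUN apt-get update && apt-get install -y --no-install-recommends default-jdk \
            && rm -rf /var/lib/apt/lists/*
        ENV JAVA_HOME=/usr/lib/jvm/default-java
        USER astro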

  • @kenskyschulz1979
    @kenskyschulz1979 6 months ago +1

    I am wondering about the consequences of using the PythonOperator directly, calling PySpark code inside it, and filling in all the ins and outs within a given task. Do you think this works?

    • @thedataguygeorge
      @thedataguygeorge  6 months ago +1

      Tentatively yes, but what do you mean by fill all the ins and outs of a given task?

    • @kenskyschulz1979
      @kenskyschulz1979 6 months ago

      @thedataguygeorge Thanks for the reply. I meant the input args and in/out data referenced by a DAG / connected tasks.
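
      For what it's worth, a minimal sketch of the PythonOperator approach discussed in this thread (all paths and the master URL are hypothetical). The main consequence is that the Spark driver runs inside the Airflow worker's Python process, so the worker needs pyspark and Java installed:

          from datetime import datetime

          from airflow import DAG
          from airflow.operators.python import PythonOperator

          def run_pyspark_job(input_path: str, output_path: str) -> None:
              # Import inside the task so only the worker needs pyspark
              from pyspark.sql import SparkSession

              spark = (
                  SparkSession.builder
                  .master("spark://localhost:7077")  # hypothetical master URL
                  .appName("python_operator_job")
                  .getOrCreate()
              )
              df = spark.read.csv(input_path, header=True)
              df.write.mode("overwrite").parquet(output_path)
              spark.stop()

          with DAG(
              dag_id="pyspark_via_python_operator",
              start_date=datetime(2023, 10, 1),
              schedule=None,
              catchup=False,
          ) as dag:
              # op_kwargs carries the "ins and outs" into the task
              PythonOperator(
                  task_id="run_job",
                  python_callable=run_pyspark_job,
                  op_kwargs={"input_path": "/data/in.csv", "output_path": "/data/out"},
              )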

  • @naveenkonda395
    @naveenkonda395 1 month ago

    How can I connect to 2 remote servers at the same time when I have to run multiple jobs?
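
    One common pattern (a sketch; the connection IDs and paths are hypothetical): define one Airflow connection per cluster and point each SparkSubmitOperator at its own conn_id, so different tasks in the same DAG submit to different clusters:

        from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

        # Each task targets a different Spark cluster via its own Airflow connection
        job_a = SparkSubmitOperator(
            task_id="job_on_cluster_a",
            conn_id="spark_cluster_a",  # hypothetical connection ID
            application="/usr/local/airflow/include/job_a.py",
        )
        job_b = SparkSubmitOperator(
            task_id="job_on_cluster_b",
            conn_id="spark_cluster_b",  # hypothetical connection ID
            application="/usr/local/airflow/include/job_b.py",
        )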

  • @user-ob6zr2px9q
    @user-ob6zr2px9q 7 months ago +2

    Thank you.
    Plz show how to set up a Spark cluster.

    • @thedataguygeorge
      @thedataguygeorge  7 months ago

      Would you like OSS Spark locally or cloud-based?

    • @KathirVel-fb2sf
      @KathirVel-fb2sf 6 months ago

      @thedataguygeorge Kubernetes-based Spark job submission is best for real production-grade setups; there are very few videos on dynamic Spark + Airflow + Kubernetes job submission.

    • @anikethdeshpande8336
      @anikethdeshpande8336 22 days ago

      @thedataguygeorge Locally

  • @MadaraUchiha0155
    @MadaraUchiha0155 3 months ago +1

    Can we connect Databricks Community Edition to Airflow?

    • @thedataguygeorge
      @thedataguygeorge  3 months ago

      I believe so but not 100% sure since I know CE is pretty limited!

  • @ManishJindalmanisism
    @ManishJindalmanisism 9 months ago +1

    Hi, can you add another episode describing how to submit PySpark on a cloud provider cluster like Databricks, AWS EMR, or GCP Dataproc? Because from what I have seen, just like Astronomer for Airflow, people use clusters on cloud providers rather than installing/running Spark on bare metal or VMs.

    • @thedataguygeorge
      @thedataguygeorge  9 months ago +1

      Yeah, for sure! I have an Azure Databricks instance available, would that work for ya?

  • @alielzahaby3315
    @alielzahaby3315 2 months ago +1

    Wow man, that astro dev init killed me.
    I've been making the Docker image myself every time lol.
    Is there any video you have to help with the astro package and its tips and tricks?

    • @thedataguygeorge
      @thedataguygeorge  2 months ago

      Even better than a video, check out this guide to getting started with Astro!

    • @yaaror7373
      @yaaror7373 17 days ago

      @thedataguygeorge Hi there! It seems the link for the guide wasn't attached to your comment for some reason... Can you please re-attach it? Thanks! 🙏
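
      Until the link resurfaces, the basic Astro CLI flow mentioned in this thread is (these are real Astro CLI commands; run them from an empty project directory):

          astro dev init      # scaffold an Airflow project (Dockerfile, dags/, requirements.txt)
          astro dev start     # build the image and run Airflow locally in Docker
          astro dev restart   # rebuild after changing requirements.txt or the Dockerfile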

  • @SoumilShah
    @SoumilShah 7 months ago +1

    Where can I get the code?

  • @vittalshanbag2892
    @vittalshanbag2892 9 months ago +1

    Hi Data Guy,
    please make a knowledge-transfer session on Spark Structured Streaming with Kafka integration.
    Thank you 😅

  • @adeyinkaadegbenro9645
    @adeyinkaadegbenro9645 7 months ago

    Thank you for the video. Kindly assist with this error: Cannot execute: spark-submit --master yarn --num-executors 1 --total-executor-cores 1 --executor-cores 1 --executor-memory 2g --driver-memory 2g --name arrow-spark --queue root.default sparksubmit_basic.py. Error code is: 1.; 96442

    • @thedataguygeorge
      @thedataguygeorge  6 months ago

      Could you please give me a little more context?

  • @gautamrishi5391
    @gautamrishi5391 2 months ago

    Getting this error: Error when trying to pre-import module 'airflow.providers.apache.spark.operators.spark_submit' found in test_dag.py: No module named 'airflow.providers.apache
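
    That error usually means the Spark provider isn't installed in the Airflow environment. Installing it (the PyPI package name is real) and restarting Airflow typically resolves it:

        pip install apache-airflow-providers-apache-spark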

  • @79texx
    @79texx 9 months ago +2

    Hey man, as a fellow data analyst I really enjoy your videos, but you gotta stop those premieres… it's really annoying to see an interesting video in my sub feed and then it's just a premiere for like 5 days… I mean, of course you can use this feature once in a while, but please not 5 times a week; it's really frustrating.

    • @thedataguygeorge
      @thedataguygeorge  9 months ago +2

      Ok good to know, will stop! Sorry about that, I just did it so people could see the upcoming schedule but didn't realize it was so prominent in your feed!

    • @79texx
      @79texx 9 months ago +1

      @thedataguygeorge Thanks man!

    • @viralstingray5590
      @viralstingray5590 9 months ago +1

      I totally agree!
      Love the content, and the short, no-BS, 10-minute format is just amazing. But the scheduling thing is very annoying.

    • @thedataguygeorge
      @thedataguygeorge  9 months ago

      No problem, thanks for letting me know!

    • @pattekhaashwak8167
      @pattekhaashwak8167 4 months ago

      Hi, how do we set up a connection between two Docker containers so that Airflow can trigger a task in the Spark container? Can you make a video on it?
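
      Until then, a rough sketch of the usual setup: put both containers on the same Docker network and use the Spark service's name as the host in the Airflow connection. For example, with Docker Compose (image, version tag, and service names are assumptions):

          services:
            spark-master:
              image: bitnami/spark:3.5  # hypothetical image/tag
              environment:
                - SPARK_MODE=master
              ports:
                - "7077:7077"
                - "8080:8080"
            spark-worker:
              image: bitnami/spark:3.5
              environment:
                - SPARK_MODE=worker
                - SPARK_MASTER_URL=spark://spark-master:7077

      If Airflow runs in a separate Compose project (e.g. an Astro project), you can attach its containers to the Spark network with docker network connect, then set the Airflow connection host to spark://spark-master and port 7077.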