Running Spark jobs on Amazon EMR Serverless

  • Published: Dec 10, 2024

Comments • 17

  • @AnGELsPearhead 2 years ago

    Amazing Demo!!!

  • @disrupcao4674 1 year ago

    great video

  • @kingsleywen3889 2 years ago

    Amazing. Could you do a tutorial about using Step Functions with EMR Serverless? Thanks.

    • @dacort 2 years ago +1

      EMR Serverless is not natively supported with Step Functions today, but there is a way to do it using Lambda functions.
      We have a blog post about it here, if it's helpful! aws.amazon.com/blogs/big-data/run-a-data-processing-job-on-amazon-emr-serverless-with-aws-step-functions/
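
A minimal sketch of the Lambda approach mentioned above, not the code from the blog post. It assumes a Step Functions Task state invokes this Lambda, and that the event keys (`application_id`, `role_arn`, `entry_point`, `args`) are hypothetical names chosen for illustration. The only AWS call used is the boto3 `emr-serverless` client's `start_job_run`.

```python
def start_spark_job(client, application_id, role_arn, entry_point, args=None):
    """Submit a Spark job to an EMR Serverless application and return its run ID."""
    response = client.start_job_run(
        applicationId=application_id,
        executionRoleArn=role_arn,
        jobDriver={
            "sparkSubmit": {
                "entryPoint": entry_point,                # e.g. an s3:// path to a PySpark script
                "entryPointArguments": list(args or []),  # passed to the script as argv
            }
        },
    )
    return response["jobRunId"]


def lambda_handler(event, context):
    """Lambda entry point; the event keys below are hypothetical, not a fixed schema."""
    import boto3  # imported here so the helper above can be tested without AWS access

    client = boto3.client("emr-serverless")
    job_run_id = start_spark_job(
        client,
        event["application_id"],
        event["role_arn"],
        event["entry_point"],
        event.get("args"),
    )
    # Returning both IDs lets a later state look the run up again.
    return {"application_id": event["application_id"], "job_run_id": job_run_id}
```

A second Lambda in a wait-and-poll loop could call `get_job_run` with the returned IDs to check the job state before the state machine moves on.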

  • @viewermm1588 3 months ago

    Does anyone here know whether Spark can read multiple Parquet files from an S3 bucket (all in the "ABC" folder) and combine them into a single Parquet file in a "DEF" folder in the same location? If so, what is the code? Thanks.
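
One way this could be done in PySpark, as a minimal sketch: read the whole source prefix into one DataFrame, then `coalesce(1)` before writing so that Spark emits a single part file. The bucket name and the function name `compact_parquet` are placeholders, and the files are assumed to share a compatible schema.

```python
def compact_parquet(spark, source_path, target_path):
    """Read every Parquet file under source_path and rewrite them as one file."""
    # Pointing the reader at a folder loads all Parquet files inside it.
    df = spark.read.parquet(source_path)
    # coalesce(1) routes all rows through a single task, so only one part file is written.
    df.coalesce(1).write.mode("overwrite").parquet(target_path)


# Example call inside a Spark job (bucket name is a placeholder):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("compact-parquet").getOrCreate()
#   compact_parquet(spark, "s3://my-bucket/ABC/", "s3://my-bucket/DEF/")
```

Note that Spark writes a folder: `DEF/` will contain one `part-*.parquet` file, typically alongside a `_SUCCESS` marker, not a file literally named `DEF`. Because a single task writes everything, this suits modest data volumes better than very large ones.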

  • @ManishBhandari-df2xf 1 year ago

    Hi, great video. Can you please also show the steps for installing external libraries on EMR, as a replacement for bootstrap scripts?

    • @dacort 1 year ago

      Assuming you're talking about EMR Serverless, there are a couple of different options. You can use custom images ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html ) to install OS-level dependencies. If you're just talking about PySpark dependencies, you can also bundle a virtual environment ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html ).

    • @srirajvasireddy2615 10 months ago

      For PySpark dependencies like pandas or Kafka, how do I bundle a virtual environment?
      I'm new to Python, so any help or suggestions are greatly appreciated.
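
A rough sketch of the virtual-environment route, based on my reading of the linked user guide page, so treat the details as assumptions to verify there. The usual flow is: create a venv on an Amazon Linux compatible machine, `pip install` the libraries plus `venv-pack`, run `venv-pack -f -o pyspark_venv.tar.gz`, and upload the archive to S3. The job then needs Spark settings that unpack the archive and point Python at it. The helper name `venv_spark_params` and the archive path are hypothetical.

```python
def venv_spark_params(archive_uri, env_dir="environment"):
    """Build a sparkSubmitParameters string that makes a job use a packed venv."""
    # The archive is unpacked into a folder named env_dir in each container.
    python_bin = f"./{env_dir}/bin/python"
    confs = {
        "spark.archives": f"{archive_uri}#{env_dir}",
        "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": python_bin,
        "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": python_bin,
        "spark.executorEnv.PYSPARK_PYTHON": python_bin,
    }
    return " ".join(f"--conf {key}={value}" for key, value in confs.items())


# Placeholder S3 path; substitute the location of your own archive.
params = venv_spark_params("s3://my-bucket/artifacts/pyspark_venv.tar.gz")
```

The resulting string would be passed as `sparkSubmitParameters` when starting the job run. Libraries that ship as JARs, such as Spark's Kafka connector, are handled separately from Python packages.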

  • @bariowd 2 years ago

    Amazing video.
    Do you know if there is any way to send parameters from an Airflow DAG to the notebook it calls?
    For example, the DAG receives a random date and number, and when you trigger the DAG it sends those parameters to the notebook.
    Thank you! :)

    • @dacort 2 years ago

      I didn't use notebooks in this video, but the EMR StartNotebookExecution API allows you to pass parameters to notebook runs.
      We have a blog post about that here: aws.amazon.com/blogs/big-data/orchestrating-analytics-jobs-on-amazon-emr-notebooks-using-amazon-mwaa/
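
As an illustration of the API mentioned above, here is a small sketch using the boto3 `emr` client's `start_notebook_execution`, where `NotebookParams` is a JSON string. It is not the blog post's code. The function name and all IDs are hypothetical, and in an Airflow task the `params` dict would presumably come from the DAG run's configuration.

```python
import json


def start_notebook_with_params(client, editor_id, notebook_path, cluster_id,
                               service_role, params):
    """Start an EMR notebook execution and pass it a dict of parameters."""
    response = client.start_notebook_execution(
        EditorId=editor_id,                  # ID of the EMR notebook
        RelativePath=notebook_path,          # notebook file path within the editor
        NotebookParams=json.dumps(params),   # parameters are sent as a JSON string
        ExecutionEngine={"Id": cluster_id, "Type": "EMR"},
        ServiceRole=service_role,
    )
    return response["NotebookExecutionId"]


# Example call with placeholder values (needs AWS credentials and boto3):
#   import boto3
#   start_notebook_with_params(
#       boto3.client("emr"), "e-EXAMPLE", "demo.ipynb", "j-EXAMPLE",
#       "EMR_Notebooks_DefaultRole", {"run_date": "2024-01-01", "number": 7})
```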

  • @julsgranados6861 1 year ago

    Great video! Is there any way to run a dbt project using EMR Serverless? I have seen that there is a Thrift option for connecting to EMR on EC2, but I am not sure whether it is possible to connect to EMR Serverless :(

    • @dacort 1 year ago

      Unfortunately not as of today. :(

  • @JayanthNaidu-w5e 1 year ago

    Is there a way to install custom Java versions without creating custom images?

    • @dacort 1 year ago

      We now support Java 17 ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-java-runtime.html ). Unfortunately, there is no other way to use custom Java versions without custom images.
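
For reference, selecting Java 17 appears to come down to setting `JAVA_HOME` for the driver and executors. The sketch below builds those settings. The Corretto path is my recollection of the linked page for x86_64 and should be treated as an assumption; check the page for the exact path for your release and architecture. The helper name `java17_spark_params` is hypothetical.

```python
# Assumed install path for Java 17 on x86_64; verify against the linked documentation.
JAVA_17_HOME = "/usr/lib/jvm/java-17-amazon-corretto.x86_64/"


def java17_spark_params(java_home=JAVA_17_HOME):
    """Build sparkSubmitParameters that point the driver and executors at a JVM."""
    return (
        f"--conf spark.emr-serverless.driverEnv.JAVA_HOME={java_home} "
        f"--conf spark.executorEnv.JAVA_HOME={java_home}"
    )


java_params = java17_spark_params()
```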

  • @subhomoysikdar 1 year ago

    Is there a way to run EMR Serverless with GPUs? I want to run PySpark jobs with NVIDIA RAPIDS.

    • @dacort 1 year ago

      Not as of today. For that you'll still need EMR on EC2 or EMR on EKS.

    • @subhomoysikdar 1 year ago

      @dacort OK, thank you.