AWS Tutorials - AWS Glue Job Optimization Part-3

  • Published: 12 Mar 2022
  • AWS Glue Job and Lake Formation Crash Course - • AWS Tutorials Crash Co...
    Building AWS Glue Job using PySpark - • Building AWS Glue Job ...
    AWS Tutorials - AWS Glue Job Optimization Part-1 - • AWS Tutorials - AWS Gl...
    Job Code - github.com/aws-dojo/analytics...
    Data File - github.com/aws-dojo/analytics...
    Optimizing an AWS Glue job is a popular and frequently asked-about topic. There are many ways to optimize a Glue job, such as tuning memory or capacity. In this video, you learn how to control parallelism across workers and Spark tasks by grouping input files based on size; a short sketch of these grouping options follows below.
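
For reference, here is a minimal sketch of the file-grouping options discussed in the video, using glueContext.create_dynamic_frame.from_options to read from S3. The bucket path, format, and group size are placeholder assumptions, not values taken from the video.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue job bootstrap
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read many small S3 files as larger groups, so each Spark task works on
    # roughly groupSize bytes instead of a single tiny file.
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={
            "paths": ["s3://my-example-bucket/input/"],  # placeholder path
            "groupFiles": "inPartition",
            "groupSize": "10485760",  # 10 MB per group, expressed in bytes as a string
        },
        format="csv",
        format_options={"withHeader": True},
    )

    # Fewer, larger partitions means fewer Spark tasks competing for executor slots.
    print("Partitions after grouping:", dyf.toDF().rdd.getNumPartitions())

    job.commit()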

Comments • 16

  • @arunasingh8617 · 2 years ago

    Simply explained.

  • @VishalSharma-hv6ks · 2 years ago · +1

    I am forwarding your videos to my office Telegram group as well.

    • @AWSTutorialsOnline · 2 years ago

      Many thanks for the appreciation. If you have any specific requirement, please let me know; I would love to cover it in a video if not already done.

  • @gds86-discoveries · 2 years ago

    Also, if I use groupSize, as I understand it, shouldn't the output have larger files and fewer of them, because the input is being grouped?

  • @nikhilgupta110 · 2 years ago

    Can we connect on LinkedIn? Your content goes beyond most of the paid courses. Thanks to the workflow video, I was able to create a scalable system for my ETL.

  • @gds86-discoveries · 2 years ago

    How can I achieve this using the from_catalog() method? It seems that if I add the groupFiles and groupSize settings there, it does not work. Please advise.
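
    No reply appears in the thread, but for reference, AWS guidance on Glue memory management shows passing the grouping settings through additional_options when reading from the Data Catalog. The sketch below assumes placeholder database and table names and that this pass-through applies to your S3-backed source.

        from awsglue.context import GlueContext
        from pyspark.context import SparkContext

        glueContext = GlueContext(SparkContext.getOrCreate())

        # Grouping settings forwarded to a catalog-backed S3 source via additional_options.
        # "sales_db" and "raw_orders" are hypothetical names.
        dyf = glueContext.create_dynamic_frame.from_catalog(
            database="sales_db",
            table_name="raw_orders",
            additional_options={
                "groupFiles": "inPartition",
                "groupSize": "10485760",  # 10 MB per group, in bytes
            },
        )

    Another documented route is to set groupFiles and groupSize as parameters on the catalog table itself, so every job that reads the table picks up the grouping behaviour.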

  • @AshishGolchha47 · 2 years ago · +1

    This is a great video. I am also facing issues loading data while performing some transformations with PySpark SQL. How can we improve performance in Glue for data with more than 60 million records? Currently using 10 DPUs.

    • @AWSTutorialsOnline · 2 years ago

      I need more info, such as the source type, the number of tables, and whether the load is a full dump or incremental.

  • @vijaymani83 · 2 years ago · +1

    Wonderful videos, and highly useful for learning concepts that are not widely discussed elsewhere. I need to create a table in Snowflake (dynamically) based on the schema definition from the Glue catalog (which crawls a few Parquet files). Is that possible?

    • @AWSTutorialsOnline · 2 years ago

      Hi, sorry, but I don't know much about Snowflake.

    • @vijaymani83 · 2 years ago

      @AWSTutorialsOnline Not a problem. Once again, thanks for your enlightening videos with valuable content.

  • @hash510 · 1 year ago

    Nice! But note: "The AWS Glue Parquet writer has historically been accessed through the glueparquet format type. This access pattern is no longer advocated." Use the classic "parquet" format instead.
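
    For reference, a minimal sketch of writing with the classic parquet format plus the Glue Parquet writer option. The output path is a placeholder, and the frame is assumed to come from an earlier read such as the grouping example above.

        # Write a DynamicFrame with format="parquet" and the Glue Parquet writer enabled,
        # instead of the older format="glueparquet" shortcut.
        glueContext.write_dynamic_frame.from_options(
            frame=dyf,  # DynamicFrame produced by an earlier read
            connection_type="s3",
            connection_options={"path": "s3://my-example-bucket/output/"},  # placeholder
            format="parquet",
            format_options={"useGlueParquetWriter": True},
        )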

  • @rajeshkumarnimmagadda2547 · 1 year ago

    As per the AWS documentation, groupFiles is not supported for the Parquet format.

  • @venugupta7809 · 1 year ago · +1

    Can we control parallelism if we are reading only one huge file, like a text file with 3 million records?

    • @AWSTutorialsOnline · 1 year ago

      Parallelism does not work well for a single large file. I recommend writing an ETL step to break large files into smaller ones.
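
      For reference, a minimal Spark sketch of that splitting step, assuming an uncompressed text file and placeholder S3 paths (the partition count is an arbitrary example):

          from pyspark.sql import SparkSession

          spark = SparkSession.builder.appName("split-large-file").getOrCreate()

          # Read the single large text file, then repartition so the write produces
          # many smaller files that downstream jobs can process in parallel.
          # Note: a gzip-compressed source is not splittable and loads as one partition.
          df = spark.read.text("s3://my-example-bucket/input/big_file.txt")  # placeholder path
          df.repartition(40).write.mode("overwrite").text(
              "s3://my-example-bucket/split/"  # placeholder output prefix; about 40 output files
          )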