Building AWS Glue Job using PySpark

  • Published: 30 Jul 2024
  • The workshop URLs
    Part1- aws-dojo.com/workshoplists/wo...
    Part2- aws-dojo.com/workshoplists/wo...
    AWS Glue jobs are used to build ETL jobs that extract data from sources, transform the data, and load it into targets. A job can be written in languages like Python and PySpark. PySpark is the Python API for Spark and is used for big data processing. It can perform data transformations on large-scale data in a fast and efficient way. This workshop is covered in two parts.
    Part-1: You learn about setting up a data lake, creating a development environment for PySpark, and finally building a Glue job using PySpark.
    Part-2: You learn about using PySpark for various types of transformations, especially with an AWS Lake Formation and Glue based data lake. These transformations can be used in an AWS Glue job.
  • Science

Comments • 116

  • @ducnt40 · 2 years ago +8

    Hi, I'm just commenting here to say thank you for creating this awesome workshop. It helped me a lot while getting acquainted with the data lake concept. Once again: thank you so much and keep up the good work. :)

  • @atsource3143 · 3 years ago +1

    Thank you so much for such a wonderful workshop.

  • @melaniecummings4117 · 3 years ago

    This was helpful. I am a new Data Engineer and this helped with my job.

  • @eddiepalomino4614 · 1 year ago

    Thanks so much for all the learning!

  • @timmyzheng6049 · 2 years ago +1

    Thank you for uploading this video, very helpful.

  • @deepakrawat5065 · 2 years ago

    Great tutorial. Thanks for creating and sharing.

  • @Blessings_Kaira_Yati · 2 years ago

    Thank you for such concise information. Please make more videos about AWS Glue jobs; I am totally new to Amazon Glue, so this video is very useful for me.

  • @ManojKumar-cg7ft · 2 years ago +1

    Thanks for the workshop document.

  • @anandsingh7271 · 3 years ago +1

    Thanks a lot, it was a nice session. God bless you 😊

  • @aswintummala · 2 years ago

    Awesome tutorial!

  • @sivaranganath396 · 3 years ago +1

    Thanks for the videos. They're so detailed.

  • @krishnasanagavarapu4858 · 2 years ago +1

    Love your videos.

  • @josemanuelgutierrez4095 · 1 year ago +1

    It'd be a great step if you did this not only by showing images but also by doing it in the AWS console.

  • @arumugams4673 · 1 year ago

    Can we use plain PySpark syntax to filter data from the DataFrame in Glue?

  • @aishwaryawalkar5950 · 2 years ago

    My question is: can we use a local Jupyter notebook, since a SageMaker notebook carries a high cost? And at the same time, how do we schedule these notebooks?

  • @dawnofzombies6583 · 3 years ago +3

    Hi! Thank you for the great video. When I ran the Glue job at the end of the part 1 tutorial, the job created over 80 JSON objects containing two rows each. I'd expect a single JSON file, which was the result I got when running the PySpark script from SageMaker. Did I miss anything?

    • @AWSTutorialsOnline · 3 years ago

      Not sure what went wrong without looking at the code, but you can force a repartition by using the "repartition" method. Here is one link which will give you the details - stackoverflow.com/questions/56294720/aws-glue-output-one-file-with-partitions
      Hope it helps. A sketch of the idea is below.
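
      A minimal sketch of that idea, assuming placeholder names (database "salesdb", table "sales", and the output bucket) rather than the exact workshop code:

      from awsglue.context import GlueContext
      from pyspark.context import SparkContext

      glue_context = GlueContext(SparkContext.getOrCreate())

      # Read the catalog table as a DynamicFrame (placeholder names).
      dyf = glue_context.create_dynamic_frame.from_catalog(
          database="salesdb", table_name="sales"
      )

      # Collapse to one partition so the write emits a single JSON object
      # instead of one file per Spark partition.
      single = dyf.repartition(1)

      glue_context.write_dynamic_frame.from_options(
          frame=single,
          connection_type="s3",
          connection_options={"path": "s3://my-output-bucket/sales-json/"},
          format="json",
      )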

    • @dawnofzombies6583 · 3 years ago

      @AWSTutorialsOnline thank you!

  • @mirirshadali33 · 3 years ago +2

    Great video. Could you please advise on some performance improvement measures for AWS Glue? Looking forward to your valuable input.

    • @AWSTutorialsOnline · 3 years ago +2

      My first recommendation is to use Glue version 2.0, which has improved the start-up time in a big way. Memory management is very critical for the Glue job. Please refer to this article for that - aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/
      Glue generally uses Athena queries for data. Optimizing Athena queries improves the performance of the Glue job. Please refer to this link for Athena query optimization - aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
      Hope it helps.

    • @mirirshadali33 · 3 years ago

      @AWSTutorialsOnline Thanks a lot

  • @pk-wanderluster · 3 years ago +1

    Hello, thank you for the amazing workshop. I have a specific scenario: I am getting files from Oracle via a DMS migration task into an S3 raw bucket. The initial load brings all the columns as-is, but for the incremental CDC load I get an extra column, OP, with values U, I and D, which tells me whether a record was updated, inserted or deleted. When I create a crawler, it detects the columns, but when I look at the data in Athena it is completely disorganized. Any suggestion how to address this issue? Should I be creating 2 crawlers, then use a PySpark Glue job to merge based on the OP column?

    • @AWSTutorialsOnline · 3 years ago +1

      Thanks for the mail. Let's isolate the problem first. Without CDC, when you catalog the data and query it in Athena, does it work? If yes, then the CDC column is probably messing things up. In that case, I would keep the CDC data separate and use a PySpark job to merge it back. Hope it helps.

    • @pk-wanderluster · 3 years ago +1

      @AWSTutorialsOnline As indicated in the workshop, I created a Glue job with bookmarks enabled. It wrote the expected file to the target, but when I ran the job a second time, it created the file again. I was under the impression that turning on bookmarks would not create another file, as it would save the state from the previous run.

    • @AWSTutorialsOnline · 3 years ago +1

      @pk-wanderluster Hi, I think the problem is with the transformation context. In my example, I didn't use a transformation context. As per the AWS documentation: "For job bookmarks to work properly, enable the job bookmark parameter and set the transformation_ctx parameter. If you don't pass in the transformation_ctx parameter, then job bookmarks are not enabled for a dynamic frame or a table used in the method." Please try it with a transformation context and it should solve the problem; a sketch is below. Please let me know.
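
      A minimal sketch of a bookmarked read and write, assuming a placeholder catalog table ("salesdb"."sales") and target path; the transformation_ctx strings give the bookmark stable keys to track:

      import sys
      from awsglue.context import GlueContext
      from awsglue.job import Job
      from awsglue.utils import getResolvedOptions
      from pyspark.context import SparkContext

      args = getResolvedOptions(sys.argv, ["JOB_NAME"])
      glue_context = GlueContext(SparkContext.getOrCreate())
      job = Job(glue_context)
      job.init(args["JOB_NAME"], args)

      # transformation_ctx lets the bookmark remember which files this read
      # has already processed, so reruns skip them.
      dyf = glue_context.create_dynamic_frame.from_catalog(
          database="salesdb",              # placeholder database
          table_name="sales",              # placeholder table
          transformation_ctx="read_sales",
      )

      glue_context.write_dynamic_frame.from_options(
          frame=dyf,
          connection_type="s3",
          connection_options={"path": "s3://my-target-bucket/sales/"},  # placeholder
          format="json",
          transformation_ctx="write_sales",
      )

      # Committing the job persists the bookmark state for the next run.
      job.commit()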

    • @pk-wanderluster · 3 years ago

      @AWSTutorialsOnline Perfect! As you said, I added the parameter and it worked!

  • @guillermorodriguez8916 · 3 years ago +1

    Thanks for the tutorial, it was really helpful.
    However, I have a problem. When granting permissions on the customer and sales tables in AWS Lake Formation, I get this error message: Resource does not exist or requester is not authorized to access requested permissions.
    Can someone help me?

    • @AWSTutorialsOnline · 3 years ago +1

      It seems you have missed some of the previous steps. Your logged-in account should be added as a Lake Formation administrator. You might want to go through the Lake Formation configuration step again.

    • @yogenderpal · 2 years ago

      Were you able to solve this issue?

  • @prakashkaraka · 3 years ago +1

    Your videos are super useful in building my knowledge of AWS. Thank you, sir. Could you provide part 2 of this session?

    • @AWSTutorialsOnline · 3 years ago

      Both parts are together in this video. You can see the two labs in the description.

  • @Amarjeet-fb3lk · 1 year ago +1

    Hi, can't we use native PySpark code instead of GlueContext? If there is a way, can you explain that as well?

    • @AnishBhola · 9 months ago

      You can use native PySpark code, but for that you'd have to use a Spark DataFrame instead of AWS Glue's DynamicFrame. DynamicFrame is better optimized for Glue ETL, as it works performantly with GlueContext methods. There are some downsides too, as the Spark DataFrame offers a better variety of methods (the Glue methods are very limited). So if you can process your source data just by using methods from GlueContext, use a DynamicFrame; else convert to a Spark DataFrame and use all the native PySpark methods (see the sketch below).
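
      A minimal sketch of switching between the two, assuming a placeholder catalog table ("salesdb"."sales") and a hypothetical "amount" column; toDF() and DynamicFrame.fromDF() are the standard conversion helpers:

      from awsglue.context import GlueContext
      from awsglue.dynamicframe import DynamicFrame
      from pyspark.context import SparkContext
      from pyspark.sql import functions as F

      glue_context = GlueContext(SparkContext.getOrCreate())

      # Read through the Glue catalog as a DynamicFrame (placeholder names).
      dyf = glue_context.create_dynamic_frame.from_catalog(
          database="salesdb", table_name="sales"
      )

      # Convert to a Spark DataFrame to use native PySpark methods.
      df = dyf.toDF()
      filtered = df.filter(F.col("amount") > 100)   # hypothetical column

      # Convert back to a DynamicFrame before handing it to Glue writers.
      result = DynamicFrame.fromDF(filtered, glue_context, "result")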

  • @purabization · 1 year ago

    Superb workshop.
    But whenever I try to query this S3 data lake data using Athena, I get an error: COLUMN_NOT_FOUND: line 1:8: SELECT * not allowed from relation that has no columns. So please help me out with this. Thanks in advance.

  • @ryanyue5159 · 3 years ago

    Hi. The first time, I used the root user; when running the crawler, the tables were created but not found anywhere. Then I created an IAM user, logged in, and retried, and the crawler could not create tables. Can you tell me why?

    • @AWSTutorialsOnline · 3 years ago

      Hello Ryan. The crawler runs with a role's authorization. The role should have permission on the S3 bucket where the data is and also create/alter permission on the database in the data lake. This permission could be one reason. The other reason could be a wrong configuration of the crawler. Many times, people (including me) configure the crawler to point at the actual data file, but the crawler should be configured for the folder where the data file is. This is another reason the crawler may not be able to create the table.
      Try these options and let me know if it worked for you. Otherwise we will find some way to help you fix it.

    • @ryanyue5159 · 3 years ago

      @AWSTutorialsOnline Thanks a lot. I finally removed everything and recreated it, and it works. I will continue following your tutorial.

  • @DanielWeikert · 2 years ago

    Can you dive deeper into Glue and PySpark code? E.g., do we need GlueContext even though plain PySpark scripts also work? What are the most important functions and actions? ... Thanks!

  • @sakinafakhri1320 · 2 years ago +1

    Can you please let me know if this is a programming error or if it's related to the underlying infra?
    The error: an error occurred while calling o137.pyWriteDynamicFrame, exception thrown in awaitResult

    • @AWSTutorialsOnline · 2 years ago

      What are you trying to do when you get this error?

    • @sakinafakhri1320 · 2 years ago +1

      Adding new columns in Redshift and subsequently updating the Glue code with the same.

    • @AWSTutorialsOnline · 2 years ago

      @sakinafakhri1320 Can you please share the code snippet?

  • @aishwaryawalkar5950 · 2 years ago

    Can I use this method for upsert purposes?

  • @prmurali1leo · 3 years ago

    Good one. How do you set up a Spark UI history server for a Glue job using Docker to view them locally?

    • @AWSTutorialsOnline · 3 years ago

      Sorry for coming back late. Do you want local monitoring of the Glue jobs?

    • @prmurali1leo · 3 years ago

      Yes, monitoring and seeing the generated graphs as we do on-prem.

  • @Viral_Crazy_shots · 2 years ago

    Thank you for such concise information! I am totally new to Amazon Glue, so I just wanted to know: can we convert a CSV file to a .txt file?

    • @AWSTutorialsOnline · 2 years ago

      When we talk about conversion, we talk about format conversion like CSV to TSV to JSON to Parquet. So what would the format inside the .txt file be?

  • @sukanyas2651 · 3 years ago +1

    Can you please tell me how to read data directly from S3 through a job with a Spark script?

    • @AWSTutorialsOnline · 3 years ago

      To read data directly from S3, you can create a Glue job with a Python shell. In the job, you can use the Python boto3 SDK to read data directly from the S3 bucket (a sketch is below).
      Have a look at this lab provided by AWS. It has an example of the Python shell script you are looking for - lakeformation.aworkshop.io/40-beginner/402-etl/4022-python-container.html
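
      A minimal sketch of such a Python shell script, with a placeholder bucket and key:

      import csv
      import io

      import boto3

      s3 = boto3.client("s3")

      # Placeholder bucket/key; point these at your data lake location.
      obj = s3.get_object(Bucket="my-datalake-bucket", Key="sales/sales.csv")
      body = obj["Body"].read().decode("utf-8")

      # Parse the CSV payload in memory and iterate over the rows.
      for row in csv.DictReader(io.StringIO(body)):
          print(row)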

    • @sukanyas2651 · 3 years ago

      Thank you.
      Can you explain how to do the row partitioning using a config file in S3 and through a Glue job?

    • @AWSTutorialsOnline · 3 years ago

      Hi, this link will help - docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html

  • @vivekjacobalex · 3 years ago +1

    I have done the part 1 and part 2 workshops. Thank you for helping us learn to build ETL jobs. This dojo is really useful for getting hands-on experience and the basic steps for building a production ETL. I have a suggestion: after "Create Developer Endpoint", there is currently no option in AWS to create a SageMaker notebook. Could you add a video on how to use Zeppelin? Otherwise I will go through the AWS Glue Studio video.

    • @AWSTutorialsOnline · 3 years ago

      It could be because of SageMaker notebook support in a particular AWS Region. Which region are you using?

    • @vivekjacobalex · 3 years ago

      @AWSTutorialsOnline Paris eu-west-3

    • @vivekjacobalex · 3 years ago

      I have a doubt. When we run the crawler, the date field shows as a string. Is there a way to show the date as a date? Do we need to convert 'dd-mm-yyyy' to a timestamp?

    • @divyathanikonda3666 · 2 years ago

      Could you help me in doing the workshop? It is urgent for me.

    • @veerachegu · 2 years ago

      @divyathanikonda3666 I am also looking into this; my project also uses AWS Glue.

  • @RwSkipper007 · 3 years ago

    Why did you emphasize creating the S3 bucket in the Frankfurt region and not any other?

  • @josemanuelgutierrez4095 · 1 year ago +1

    Hi bro, I was working normally until creating my notebook using SageMaker, but the error you can see below appeared and I cannot create my notebook. What can I do?
    Platform identifier (notebook-al1-v1) is not supported in this region

  • @veerachegu · 2 years ago +1

    Can we use another bucket to upload the data to after the Glue job is done?

    • @AWSTutorialsOnline · 2 years ago

      Sorry, I could not get the question. Can you please explain what you want to achieve?

  • @macklonfernandes7902 · 3 years ago +1

    The same dataset was used to run 3 jobs with different workers using Glue 2.0:
    G2.X with 10 workers executed in 15 minutes and consumed 4.97 DPU hours
    G2.X with 149 workers executed in 15 minutes and consumed 76.52 DPU hours
    G2.X with 149 workers executed in 13 minutes and consumed 62.98 DPU hours
    If it's consuming more resources, it should reduce execution time, right? May I know the reason why this is happening?

    • @AWSTutorialsOnline · 3 years ago +1

      Thanks for sharing. This is interesting. A couple of things to investigate:
      1) Compare the first run vs. second run duration. The first run includes start-up time as well, and it can get higher for a high number of worker nodes.
      2) The code may not be able to leverage parallelism. We would need to see the code to identify improvements, but the following links can help -
      aws.amazon.com/blogs/big-data/best-practices-to-scale-apache-spark-jobs-and-partition-data-with-aws-glue/
      aws.amazon.com/blogs/big-data/optimizing-spark-applications-with-workload-partitioning-in-aws-glue/
      aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/
      Hope it helps.

    • @macklonfernandes7902 · 3 years ago +1

      @AWSTutorialsOnline Start-up time was similar for the 3 jobs, i.e. 7 secs.
      And I did not run the jobs in parallel; they ran in sequential order.
      I was testing the jobs with different workers and got that result.
      Can you help me understand why that happened?

    • @AWSTutorialsOnline · 3 years ago +1

      Can you please share some details of the transformation you are performing, and possibly share the code? I can then try to investigate.

    • @divyathanikonda3666 · 2 years ago

      Could you help me in doing this workshop? It would help me crack an interview.

  • @swethadharmappa · 3 years ago

    Hello. Could you please show the steps for loading data into a table which is part of the data catalog? Before loading, the target table should be emptied (truncated). Thanks in advance!

    • @AWSTutorialsOnline · 3 years ago

      By table, do you mean a relational database table, or a data file in S3?

    • @swethadharmappa · 3 years ago

      @AWSTutorialsOnline It's a table created using a crawler whose source is an S3 file.

    • @AWSTutorialsOnline · 3 years ago

      @swethadharmappa Then you can use the Python boto3 library to delete the S3 files before the data load (a sketch is below).
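
      A minimal sketch, with a placeholder bucket and prefix, of emptying the table's S3 folder with boto3 before the load:

      import boto3

      s3 = boto3.resource("s3")

      # Deleting every object under the prefix effectively "truncates" the
      # crawled table that is backed by this folder (placeholder names).
      bucket = s3.Bucket("my-target-bucket")
      bucket.objects.filter(Prefix="curated/sales/").delete()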

    • @swethadharmappa · 3 years ago

      @AWSTutorialsOnline Thanks a lot! I tried using boto3 and was able to drop the table from the catalog (a truncate option was not available).

  • @cssensei610 · 3 years ago

    Can you do one combining AWS Wrangler, Glue, Athena & MWAA using boto3?

  • @redolfmahlaule9893 · 3 years ago +1

    Hi sir, great video. How do you READ data from a database (PostgreSQL) into Glue and then to S3?

    • @AWSTutorialsOnline · 3 years ago +1

      Hi, please check this lab - aws-dojo.com/workshoplists/workshoplist33/
      Hope it helps.

  • @prakashsuthar4388 · 3 years ago +2

    Any suggestions on how I can list all the tables under a DB in the Glue Data Catalog so I can iterate over each table individually?

    • @AWSTutorialsOnline · 3 years ago +1

      You can use the Glue API for that (a sketch is below). Please check this out - docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-GetTables
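
      A minimal sketch using the boto3 Glue client's GetTables paginator; the database name is a placeholder:

      import boto3

      glue = boto3.client("glue")

      # Page through every table in the catalog database and print its name.
      paginator = glue.get_paginator("get_tables")
      for page in paginator.paginate(DatabaseName="salesdb"):
          for table in page["TableList"]:
              print(table["Name"])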

    • @prakashsuthar4388 · 3 years ago

      Thanks @AWSTutorialsOnline, any better way to do that from a Glue job?

    • @AWSTutorialsOnline · 3 years ago

      There could be a better way, but that depends upon your requirement. Please tell me your requirement - why do you need to iterate over the tables in the database?

    • @prakashsuthar4388 · 3 years ago

      @AWSTutorialsOnline My ETL job needs to be generic, so that if you specify a catalog database name as a job arg, it should load all the tables from it and perform the needed transformations on them. In short, the same .py file should be reusable in multiple jobs without any code change.

    • @AWSTutorialsOnline · 3 years ago

      OK, but the transformation per table would be specific, right? Or would it be standard?

  • @1UniverseGames · 2 years ago +1

    How can we load trained deep learning model files in .pt format?

    • @AWSTutorialsOnline · 2 years ago +1

      Hi, I have limited knowledge about .pt files, but I understand .pt files are used to package PyTorch models that can be served inside C++ applications. It seems we need C++ libraries to access a .pt file. Since Glue jobs support PySpark/Python and Scala only, I assume it would not be possible to access .pt files from the job.

    • @1UniverseGames · 2 years ago

      @AWSTutorialsOnline Thanks for your response. One question: I have a deep learning model PyTorch code base; how can I integrate it with PySpark? I checked many resources online but can't figure it out, as most work with a CSV file, not a model. Any help regarding this, like any videos or documentation on deploying a PyTorch model with PySpark?

    • @AWSTutorialsOnline · 2 years ago

      @1UniverseGames Again, I'm not an expert in this area, but I understand how ML works in AWS. I recommend deploying your PyTorch model in AWS SageMaker as a SageMaker endpoint. The endpoint can then be called using Python in the Glue job (a sketch is below).
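
      A minimal sketch, assuming a placeholder endpoint name and a JSON payload shape, of calling a SageMaker endpoint with boto3 from Python in a Glue job:

      import json

      import boto3

      runtime = boto3.client("sagemaker-runtime")

      # Placeholder endpoint name and input format; match them to your model.
      payload = {"features": [1.0, 2.5, 3.7]}

      response = runtime.invoke_endpoint(
          EndpointName="my-pytorch-endpoint",
          ContentType="application/json",
          Body=json.dumps(payload),
      )

      prediction = json.loads(response["Body"].read())
      print(prediction)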

  • @divyathanikonda3666 · 2 years ago +1

    I am unable to load tables into Lake Formation.
    I followed all the steps and tried several times.
    Could you help me with this?

    • @AWSTutorialsOnline · 2 years ago

      What part are you not able to do? What errors are you getting?

    • @divyathanikonda3666 · 2 years ago

      @AWSTutorialsOnline I created a crawler and, once it was done, clicked on run crawler; when it was ready, I checked the tables in the data lake database, but the tables were not there.

    • @divyathanikonda3666 · 2 years ago

      Could I get your mail ID?

    • @AWSTutorialsOnline · 2 years ago

      @divyathanikonda3666 Did you give the crawler role permission on the database to create tables?

    • @divyathanikonda3666 · 2 years ago

      @AWSTutorialsOnline Yes, I did.

  • @9000revsyt · 3 years ago +1

    Thanks for the very informative tutorial. Do you have any tutorials on how a job can write data to Snowflake?

    • @AWSTutorialsOnline · 3 years ago

      Hi - sorry, I don't have one for Snowflake, but I found an article which might help you - www.snowflake.com/blog/how-to-use-aws-glue-with-snowflake/

    • @9000revsyt · 3 years ago +1

      @AWSTutorialsOnline Thanks. Also, are the sample datasets used in the workshop available?

    • @AWSTutorialsOnline · 3 years ago

      Hi, I used two sample datasets - sales and customer. They are available for download in part 1 of the workshop. I am not sure if that is what you were asking for.

    • @9000revsyt · 3 years ago

      @AWSTutorialsOnline Yes, I could not find them in the Data section.

    • @AWSTutorialsOnline · 3 years ago

      They are at step 4, where you create the S3 bucket and the data.