Why Data Engineers Should Develop AWS Glue Jobs Locally

  • Published: 5 Sep 2023
  • If you're a data engineer, developer, or anyone working with AWS Glue, you'll know that the process of building and testing ETL jobs can be complex and resource-intensive. However, there's a solution that offers more control, faster feedback, and greater flexibility in your development process - developing your AWS Glue jobs locally. I will cover the top reasons I think data engineers will benefit from developing their Glue jobs locally rather than using the UI in the AWS Glue service.
    Buy Me a Coffee: www.buymeacoffee.com/dataengu
    My Tutorials for Running AWS Glue Locally
    Configure AWS Glue with Docker - PyCharm: • Develop AWS Glue Jobs ...
    Configure AWS Glue with Interactive Glue Sessions: • Author AWS Glue jobs w...

Comments • 20

  • @julianromero3359
    @julianromero3359 3 months ago +2

    Awesome and valuable information. Great option to develop locally.

  • @andrzejkozielec139
    @andrzejkozielec139 10 months ago +1

    great video!

  • @sjvr1628
    @sjvr1628 10 months ago

    Keep doing more 😊

    • @DataEngUncomplicated
      @DataEngUncomplicated  10 months ago

      Thanks! I will 😊 I have a lot of video ideas in the pipeline

  • @user-vb7im1jb1b
    @user-vb7im1jb1b 8 months ago

    Great informative video. Thanks for sharing. By the way, do you also have a tutorial showing how to work with interactive sessions in JupyterLab/notebooks (Anaconda)?

  • @user-nv9pq9ex8y
    @user-nv9pq9ex8y 7 months ago

    Hi, can you make a video on data migration from on-premises to the AWS cloud, covering the end-to-end process and the tools used?

  • @AdamAdam-oq4fy
    @AdamAdam-oq4fy 10 months ago

    Well, my way of building Glue jobs:
    - using a Glue notebook/Zeppelin to build all the logic
    - using VS Code/PyCharm to wrap things up into classes/modules/methods, with all the extensions of VS Code
    - using CDK to deploy the Glue job: using the scripts created above and linking to the correct folder structure when deploying
    - once deployed, I should have my Glue job ready on the console
    - run/test/modify when needed, but I encourage making the changes through code
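
    The CDK step in the workflow above could be sketched roughly like this. This is an assumption-laden sketch, not the commenter's actual code: the job name, script bucket, and worker settings are all hypothetical, and it assumes `aws-cdk-lib` v2 is installed.

    ```python
    from aws_cdk import App, Stack
    from aws_cdk import aws_glue as glue
    from aws_cdk import aws_iam as iam
    from constructs import Construct

    class GlueJobStack(Stack):
        def __init__(self, scope: Construct, id: str, **kwargs):
            super().__init__(scope, id, **kwargs)

            # Execution role for the job, using the AWS-managed Glue service policy
            role = iam.Role(
                self, "GlueJobRole",
                assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
                managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name(
                    "service-role/AWSGlueServiceRole")],
            )

            # The Glue job itself; script_location points at the script
            # produced in the earlier steps (bucket/path are hypothetical)
            glue.CfnJob(
                self, "MyEtlJob",
                name="my-etl-job",
                role=role.role_arn,
                command=glue.CfnJob.CommandProperty(
                    name="glueetl",
                    python_version="3",
                    script_location="s3://my-etl-bucket/scripts/etl_job.py",
                ),
                glue_version="4.0",
                worker_type="G.1X",
                number_of_workers=2,
            )

    app = App()
    GlueJobStack(app, "GlueJobStack")
    app.synth()
    ```

    `cdk deploy` on a stack like this creates the job on the console, matching the "once deployed, I should have my Glue job ready" step.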

    • @tello9504
      @tello9504 5 months ago

      Do you have a tutorial?

    • @mickyman753
      @mickyman753 2 months ago

      My team also does the same. I think if you have an established CI/CD setup, then this is the only way to add new Glue jobs.

  • @wilsonwaigant4827
    @wilsonwaigant4827 10 months ago +1

    Nice video! I'm currently working on a project, but I was worried about the cost of working on AWS. Now I have a question: if I start working locally, where could I store the data that I'd generate in the process? And how and when should I migrate the whole work to AWS?
    Thank you!

    • @DataEngUncomplicated
      @DataEngUncomplicated  10 months ago +1

      Thanks Wilson. If you configure your AWS credentials, you can store the data you generate in Amazon S3 if you need to keep it.
      You should migrate your process to AWS when you are done developing and are ready to run your job on the actual data. I'm assuming your data is large, and that's why you might want to use PySpark and a larger cluster to process it all.
      The best way to migrate it to AWS is with infrastructure as code like CDK or Terraform. I am going to make a video on how to do this with Terraform soon.
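
      The "develop locally, write to S3 when deployed" split described in this reply can be handled with a small path helper. A minimal sketch, assuming a hypothetical `GLUE_ENV` environment variable and bucket name:

      ```python
      import os
      from typing import Optional

      def output_path(dataset: str, env: Optional[str] = None) -> str:
          """Return where job output should land: a local folder while
          developing, an S3 prefix once deployed. The GLUE_ENV variable
          and bucket name are hypothetical, not from the video."""
          env = env or os.environ.get("GLUE_ENV", "local")
          if env == "local":
              return f"./data/output/{dataset}"
          return f"s3://my-etl-bucket/output/{dataset}"

      # The same Glue script then writes to whichever target applies, e.g.:
      # df.write.parquet(output_path("orders"))
      ```

      With configured AWS credentials, the S3 branch works unchanged from a local machine, which is what makes the local-first workflow practical.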

    • @wilsonwaigant4827
      @wilsonwaigant4827 10 months ago

      @DataEngUncomplicated Thank you! I'm waiting for your video to learn more about it.

  • @ColdBlkPenguin
    @ColdBlkPenguin 9 months ago

    Great video! Thank you for making this. My only feedback would be that it feels like you are reading a script (which I am sure you probably are). The information you are providing is great, but the delivery can feel a bit "lecturer reading off the PowerPoint slides"-y. The video would also feel more engaging if you were "making eye contact" with the camera. Keep up the good work!

    • @DataEngUncomplicated
      @DataEngUncomplicated  9 months ago

      Thanks for the valuable feedback. I don't do too many talking-head videos; it's definitely something I could improve on!

  • @harshadk4264
    @harshadk4264 3 months ago

    How do we orchestrate these AWS Glue jobs? Do we create the Python code for EventBridge, Lambda, and Step Functions?

    • @DataEngUncomplicated
      @DataEngUncomplicated  2 months ago

      You have many options for orchestrating Glue jobs. Glue has a built-in orchestration section in which you can orchestrate your Glue jobs. You can also orchestrate them in Airflow if your company is already using it. If your jobs are more complex and require triggering other AWS services along the way, it would probably be a good idea to leverage Step Functions.
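
      For the Step Functions option mentioned in this reply, the state machine definition is just JSON. A minimal sketch built as a Python dict (the job name is hypothetical; the `glue:startJobRun.sync` integration runs the job and waits for it to finish):

      ```python
      import json

      # Minimal Amazon States Language definition: one Task state that
      # starts a Glue job synchronously and retries if too many runs
      # of the job are already in flight.
      definition = {
          "Comment": "Run a single Glue job (job name is hypothetical)",
          "StartAt": "RunEtlJob",
          "States": {
              "RunEtlJob": {
                  "Type": "Task",
                  "Resource": "arn:aws:states:::glue:startJobRun.sync",
                  "Parameters": {"JobName": "my-etl-job"},
                  "Retry": [{
                      "ErrorEquals": ["Glue.ConcurrentRunsExceededException"],
                      "IntervalSeconds": 60,
                      "MaxAttempts": 3,
                  }],
                  "End": True,
              }
          },
      }

      print(json.dumps(definition, indent=2))
      ```

      The serialized JSON can then be passed to the Step Functions `CreateStateMachine` API or embedded in infrastructure-as-code, so the orchestration itself stays in version control, in line with the "do the changes through code" advice earlier in the thread.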

  • @externalbiconsultant2054
    @externalbiconsultant2054 19 days ago

    Wondering if watching costs is really a data engineer's activity?

    • @DataEngUncomplicated
      @DataEngUncomplicated  19 days ago

      Yes, cost optimization is part of every role when working in a cloud environment. If you work for a large, well-funded organization that isn't cracking down on costs, you might not feel it as much as a startup that freaks out over an extra $100 in cloud costs.