DE Zoomcamp 2.2.3 - ETL Pipelines with Postgres in Kestra

Поделиться
HTML-код
  • Опубликовано: 7 фев 2025
  • In this video, we'll cover how you can ingest the Yellow Taxi data from the NYC Taxi and Limousine Commission (TLC) and load it into a Postgres database. We'll cover how to extract data from CSV files, and load them into a local Postgres database running in a Docker container.
    Check out the GitHub Repository here: go.kestra.io/d...
    Check out the FAQ here: • DE Zoomcamp FAQ - Post...
    Kestra is an open-source, event-driven orchestration platform that makes both scheduled and event-driven workflows easy. By bringing Infrastructure as Code best practices to data, process, and microservice orchestration, you can build reliable workflows directly from the UI in just a few lines of YAML.
    The course will cover the basics of workflow orchestration, why it's important, and how it can be used to build data engineering pipelines.
    Chapters
    0:00 - Introduction
    0:58 - Create Workflow
    1:46 - Add Inputs
    2:58 - Create Dynamic Variables
    3:59 - Set Labels at Execution
    4:10 - Download and Uncompress CSV file
    4:33 - Execute Extract Process
    4:57 - Set up Postgres DB
    5:46 - Create Postgres DB Table
    9:55 - View DB with pgAdmin
    10:47 - Load Data into Table
    12:45 - Add Unique ID
    15:53 - Truncate Staging Table
    16:47 - Merge Data from Staging Table
    19:31 - Purge Output Files
    20:24 - Add If Statement to process Yellow or Green datasets
    27:24 - Summary
    ----------
    📖 Read the documentation: go.kestra.io/docs
    ⭐ Start your journey with Kestra: go.kestra.io/g...
    🚀 Join the Kestra Community: go.kestra.io/s...
    For more information, visit Kestra's Website: go.kestra.io/

Комментарии • 37

  • @kestra-io
    @kestra-io  24 дня назад +4

    If you’re having difficulty configuring your PostgresDB and getting it set up with pgAdmin, then check out this FAQ: ruclips.net/video/ywAPYNYFaB4/видео.html
    If you’re running pgAdmin in Docker and you’re having difficulties connecting it to your database, check out this video: ruclips.net/video/S1MDvdAo3oM/видео.html

  • @TheCheeseWizard94
    @TheCheeseWizard94 4 часа назад

    Hey Will! Thanks for the explanation. I enjoy your teaching style. I think the MD5 hash is a cool concept for generating a unique ID. I think we could have added an additional step here to check that the staging tables, themselves, won't have a duplicate in them. For instance, Green 2019-02 has some duplicates based off the MD5 hash unique id column. The merge statement only checks for clashes between the staging table and the master table and therefore duplicates can be ingested.

  • @cengytech497
    @cengytech497 12 дней назад

    this is sooo powerful, we need more real-life examples like this one 🔥🔥🔥🔥🔥🔥🔥🔥

  • @abdelali4004
    @abdelali4004 21 день назад

    Great explanation! Thank you Will

  • @shanhtetaung3827
    @shanhtetaung3827 2 дня назад

    It is really powerful tool, but i need to watch three time to understand what you are writing in kestra .Peace ya thank

  • @krishnachaitanya7374
    @krishnachaitanya7374 4 дня назад

    Great explanation. But in some cases you seem to use inverted commas for variables and in other cases you just render without them. So when should we use inverted commas.

    • @kestra-io
      @kestra-io  2 дня назад

      If you're putting the expression in a field on its own, it usually won't work without quotes around it. If you try it, the editor will prompt you to add them if you forget. Some fields contain more than just the expression so these vary

  • @hoanghainguyen8014
    @hoanghainguyen8014 15 дней назад +1

    Hei, many thanks for your content, Will! That was mind-blowing with this tool, however, it seems like you already established the script for the whole session? I imagine that how could I make it by myself 😂? Or we can use human language and translate it by chatGPT?

    • @kestra-io
      @kestra-io  15 дней назад

      The workflows are all available on GitHub in the description! I’d also recommend checking our tutorial series to get you up to speed with Kestra!

    • @glaswasser
      @glaswasser 7 дней назад

      yeah hope chatgpt has enough training data to create those for us. tired of learning the syntax of thousands of new tools popping up every day lol

  • @__sarik
    @__sarik 6 дней назад

    you didnt show the part where you added if statement to choose if it is green or yellow dataset. At 15:37, you are saying "we need to add this to green one as well" and your topology is way different than at 13:02

    • @kestra-io
      @kestra-io  5 дней назад

      Good spot, check out 20:24 where I explain about having an If statement

  • @Hashindata
    @Hashindata 12 дней назад +1

    Wow I am impressed by the way you use Kestra via yml. But Kestra also has the ability to call a python script. Is there any other tutorial which will use ETL python scripts to do the task and then get the outputs in each tasks using Kestra?

    • @kestra-io
      @kestra-io  12 дней назад +2

      We have a video coming soon to show case ETL with Python

  • @krishnavamshithumma7377
    @krishnavamshithumma7377 22 дня назад +1

    If you are using Linux and running pgAdmin in docker, you probably might be getting an error when you execute the flow, in that case, add extra_hosts:
    - "host.docker.internal:host-gateway" to your docker-compose.yml file and the problem is solved

  • @whitetiger1810
    @whitetiger1810 6 дней назад

    what exactly are the steps for creating a server for viewing the database with pgadmin? The steps at 10:14 are massively unclear

    • @kestra-io
      @kestra-io  5 дней назад +1

      Check out the FAQ ruclips.net/video/ywAPYNYFaB4/видео.htmlsi=63zaY7fC8BynxPRg&t=150
      Right click on Servers, press Register and add the info in there

    • @whitetiger1810
      @whitetiger1810 5 дней назад

      @kestra-io thanks a lot

  • @gregwilson007
    @gregwilson007 25 дней назад +1

    Are you running PGAdmin from a local install? In the first module of the DE Zoomcamp, there is a whole video showing how to run PG Admin from a container. My assumption is that most participants will be doing this. As explained in the video (ruclips.net/video/hCAIVe9N0ow/видео.html), when running PG Admin from a container, you can't reference localhost as the host, as this will be referring to the localhost of the container, not the host machine. Might be worth an annotation if you're diverging from the setup used in the first module.

    • @kestra-io
      @kestra-io  24 дня назад

      This is useful context - thank you! I'll add a note!

    • @franciscolemos1674
      @franciscolemos1674 17 дней назад +1

      if you are running PGAdmin from a container instead of localhost in the host name just used the docker's container name instead, that worked for me

    • @saydee8010
      @saydee8010 9 дней назад

      Hey I actually think the workaround is to connect via DBeaver/database client to the postgres database :)

  • @TuanLe-g3v4h
    @TuanLe-g3v4h 7 дней назад

    a switch instead of an if here is more comprehensive I think

  • @Someone-yl8sb
    @Someone-yl8sb День назад

    This is like learning a new language again. Really hard to understand, it would be nice if most of the tasks are done in python then just orchestrate in kestra.

    • @kestra-io
      @kestra-io  День назад

      You can also do that if you want! Write in any language you like and then have Kestra orchestrate your Python code!

  • @itzcallmepro4963
    @itzcallmepro4963 6 дней назад +1

    a very annoying orchestration tool for me actually , which adds a ton of unnecessary and complications to the code , we are literally writing the code in YAML instead of python , i understand the idea of wanting to make a universal orchestrator that's independant of the programming language , but in that process we have literally made YAML something like a programming language with all it's complications , but it's annoying as for python alot of libraries , batteries and a full programming language with ide support and eco-system is being used , here we are writing a very long and annoying Yaml code in a small tab

    • @kestra-io
      @kestra-io  6 дней назад

      Appreciate your feedback. You don’t have to have all the logic inside of YAML. You can have Python do the heavy lifting and then have Kestra run it using a schedule or with an event driven trigger. Allow Python to do the complicated programming while the YAML just dictates what’s going to happen and when

  • @moshoodolanrewaju5384
    @moshoodolanrewaju5384 Месяц назад

    Can we get the link to the source code?

    • @kestra-io
      @kestra-io  Месяц назад

      We’ll be sharing a full GitHub repository soon!

    • @kestra-io
      @kestra-io  24 дня назад

      It’s available in the video description!

  • @gregwilson007
    @gregwilson007 29 дней назад

    Thanks for the vids, Will. Myself and at least one other person are not seeing any output when running the first task. The logs are returning the following error: `No serializer found for class io.kestra.core.models.tasks.runners.TaskRunnerDetailResult and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS) (through reference chain: io.kestra.plugin.scripts.exec.scripts.models.ScriptOutput["taskRunner"])`. I've verified that our Flow files match yours. I'll post any guidance you provide back in the DE Zoomcamp slack channel.

    • @kestra-io
      @kestra-io  29 дней назад +1

      Hey! Can you confirm that you're using the Docker image with the tag `latest` to make sure you've got all the plugins correctly installed? Feel free to ping me on Slack too!

    • @afsnozdemir2019
      @afsnozdemir2019 29 дней назад +2

      I had the same issue, I didn't give me output as well. The code in the github for docker-compose is downloading the image: kestra:kestra/develop, changing it to kestra:kestra/latest solve the problem for me

    • @gregwilson007
      @gregwilson007 25 дней назад

      @@kestra-io this resolved my issue, thanks.