Common Table Expressions vs Subqueries vs Views vs Temp Tables for data engineers

Поделиться
HTML-код
  • Опубликовано: 24 ноя 2024

Комментарии • 72

  • @anandahs6078
    @anandahs6078 8 месяцев назад +34

    fist time learning that materialized can be run concurrently with unique index. thanks for great content.

    • @EcZachly_
      @EcZachly_  8 месяцев назад +4

      Thanks! Anything else you’d want to learn from me?

    • @pavelk.1558
      @pavelk.1558 8 месяцев назад +1

      Yep, it's really helpful function.

    • @ssjdonkey4489
      @ssjdonkey4489 8 месяцев назад +1

      Great video! How do you go about learning complex data engineer concepts? What's your learning algorithm? Thanks

    • @chrishabgood8900
      @chrishabgood8900 7 месяцев назад

      If you want them to be current then yep

  • @rishiraj1616
    @rishiraj1616 5 дней назад +1

    That's a nice suggestion to use materialized views or Indexed views when the sub query is taking longer time than the actual query.

  • @stanislawcronberg3271
    @stanislawcronberg3271 8 месяцев назад +11

    When a DE at my work showed me how to rewrite some of my queries with indexed temp tables to join all the model features into a final table I was so happy

  • @karlnorberg7768
    @karlnorberg7768 7 месяцев назад +4

    Huge thanks for this informative video! Picking up some handy stuff even after +10 years of data engineering or whatever we call it now 🙂

  • @mikefenn00
    @mikefenn00 8 месяцев назад +4

    Subscribed on TikTok when I came across your videos a couple of months ago. Now RUclips has recommended you as well. Seems like a good sign for you!

  • @peekknuf
    @peekknuf 8 месяцев назад +9

    Man you're on fire, producing content at the pace I can't consume it :D
    Nice job!

  • @Kyriaeus
    @Kyriaeus 6 месяцев назад +5

    The CROSS APPLY or LATERAL JOIN clause is also nice if you’re doing multiple steps of transformations on the same relation and don’t want to break out a separate CTE for each layer.

  • @apibarra
    @apibarra 8 месяцев назад +2

    Love the video! I am a huge proponent of CTEs. Nothing worst when you open a table’s source code and see on massive block of join to multiple tables and sub queries all with their own unique filtering logic at the bottom of the joins or nested in the sub query. I feel like I can understand the logic and read the code much quicker with the CTEs at the top.

  • @HichamelBouazzaoui-y8s
    @HichamelBouazzaoui-y8s 8 месяцев назад +3

    Zach…sir…this video is effin awesome!!! Like the speed of explanation and the examples!!! 👊

  • @zmadni1
    @zmadni1 Месяц назад +1

    Not sure about other DB systems but i do know that Oracle does offer Materilze hints for CTE that basically store the query results in a GTT (Global Temp Table) like structure and then drops it once the query finishes. I have used this strategy a lot at my work. Thanks for the explanation though Zach.

  • @jhonnafg
    @jhonnafg 8 месяцев назад +1

    this is so helpful Zach, we get table apis and do a bunch of dataset out of it to build up our sass applications. This sheds light since my current role as a DE in this company is really heavy on SQL but not that part of the etl, they kind of separated that and is more on the backend de. Im trying to bridge that since this is also my first role as de

  • @ingenieroriquelmecagardomo4067
    @ingenieroriquelmecagardomo4067 8 месяцев назад +4

    Awesome video! Would love to see you cover Change Data Capture - with all your breadth and experience, i'm sure it would result in an amazing video.

    • @EcZachly_
      @EcZachly_  8 месяцев назад

      Thanks for the suggestion!

  • @SQLBucketHatGirl
    @SQLBucketHatGirl 8 месяцев назад +1

    If only this was posted earlier id probably pass my technical interview on materialized transformation. 😢 thank you for the great explanation and use cases.

  • @chrism3790
    @chrism3790 2 месяца назад

    You mentioned troubleshooting as a potential issue with temp tables at around 13:00, but I would say that if you want to troubleshoot, you can insert statements in the middle of your script that clone the tempt tables, so you can take a look at the intermediary steps.
    Something like
    CREATE TEMP TABLE staging_data AS
    {some sql logic};
    -- Dump data into a sandbox table for debugging, delete snippet once you're done.
    CREATE TABLE
    CLONE staging_data;
    MERGE INTO target_schema.table_name
    {some sql logic}

  • @sagarjoshi7097
    @sagarjoshi7097 8 месяцев назад +3

    Hey Zach, thank for such informative videos.
    Next can you please explain something about dbt usecases. Points like how it tests sql, orchestration, pipline. Other alternativer if any.

  • @DookyButter
    @DookyButter 4 месяца назад

    I’m a BI developer who occasionally has to dabble in some heavy SQL scripting and your content is hitting some nails on the head that I’ve always been curious about. Humorously, I arrived at the same idea that CTEs are (most often) the best use case scenario. We have another developer that uses subqueries heavily and the performance is really bad.
    Great content. You def earned a sub!

  • @AntMak-p3c
    @AntMak-p3c 4 месяца назад +1

    Great video! What could have been briefly included, in my opinion, is the comparison between materialized views vs (normal) tables, as they seem pretty similar. When to use which one.
    Keep rocking the great work👍

    • @Ps5Statie
      @Ps5Statie 3 месяца назад

      Mv is updated simply by refresh and you can keep the log in mv_stats. Tables can be updated theough complex etl

  • @dima13693
    @dima13693 4 месяца назад +2

    Delivery tip: if something is awesome, you don't have to say so. It will be apparent. Having said that, seeing the 1 to 1 translation between different concepts WAS awesome.

  • @diaconescutiberiu7535
    @diaconescutiberiu7535 8 месяцев назад +1

    Great content, as ever! I'd like to see a video on Data Vault (with hashes) or at least if you can recommend some resources for it... a practical example would be best. Keep up the good work

  • @madhanv160
    @madhanv160 8 месяцев назад +1

    This is new to me. Great comparison.

  • @MikeHaynes-b1y
    @MikeHaynes-b1y 3 месяца назад

    Please make more videos like this. Can you just do a series on query optimization

  • @chandanjha3205
    @chandanjha3205 7 месяцев назад +2

    Important thing to note here is that at 6:30, if you are creating a permanent view in your DB, you need to do it only once as its a DDL statement adding a new object to your schema. It does not have to be repeated everytime else your code will start throwing error.
    Modern DBs may let you create temporary views whose scope is only during the execution and is not global. You may choose to use that.

  • @subhasishsarkar5106
    @subhasishsarkar5106 8 месяцев назад +4

    Great work, thank you so much Zach! Very informative and concise. Keep these coming!

    • @EcZachly_
      @EcZachly_  8 месяцев назад

      I appreciate your support! Any topics in particular you'd want me to cover?

    • @subhasishsarkar5106
      @subhasishsarkar5106 8 месяцев назад +1

      @@EcZachly_ I'd be interested to know your thoughts on the different architecture patterns like the lakehouse, vault, mesh, etc. Looking forward to more from you! :D

  • @MrMe77-u4y
    @MrMe77-u4y 8 месяцев назад +3

    I was reviewing some SQL code today, and I came across an example of nested SQL query aka an infinite series of subqueries, and I was asking myself: WHY?

  • @antovfukov9005
    @antovfukov9005 2 месяца назад

    If an MV is not too complex, they can be incrementally (delta) refreshed on some RDBMSes. MVs are great for rollup ( counts, min,max, histograms). Sadly pivots usually are "complex" and require a full rebuild.

  • @md.jannatultasnim2661
    @md.jannatultasnim2661 8 месяцев назад +2

    i used CTEs , views , materialized views, temp tables but never found any detail in depth explanations how they works internally and performance comparison. Now everything is clear to me Thank you for the contents you shared. Another thing, as you described here 15:14 does this happens only in the relational DBMS ? or in both relational and data warehouses like snowflake BigQuery ?

    • @EcZachly_
      @EcZachly_  8 месяцев назад +1

      Great question! I haven’t tested that exact case. My guess is it’s still blocking based on how refreshes work but that might be something that varies with data warehouses

  • @tsveti_p
    @tsveti_p 8 месяцев назад +1

    Great explanation Zach! This is one of the best comparisons I've heard between all these - with all their pros and cons! 🌟🌟🌟
    On this topic, I would be really thankful for an advice about performance between temporary tables and using staging tables, which are truncated and reloaded on every execution of the ETL.
    My use case is a complex transformation in which we have multiple stages of preparing the dataset moving it from one temporary table to another after applying some logic, joining a table or deleting some rows. It all happens within one script/session, so we don't need to keep the intermediary results, only the final dataset. We use Redshift and the performance is definitely not great.
    Do you think we can get any performance gains if we turn these temp tables into actual staging tables and we truncate and reload them? Also, do you know if defining DISTRIBUTION/SORT key on a temporary table in Redshift is as efficient as dist/sort key on an actual table?

    • @sf-spark129
      @sf-spark129 6 месяцев назад

      Personally I use staging tables over temp tables for debugging purposes. When your end/user-facing table is returning incorrect or duplicate data, you can always trace back to an upstream staging table layer by layer to identify the root cause of the issue. Also, it's a lot easier to fine-tune the performance with staging table because you can test and validate different optimization techniques on them.

  • @shiroyasha7178
    @shiroyasha7178 5 месяцев назад +1

    So good. Thanks!

  • @dima13693
    @dima13693 4 месяца назад +2

    9:29 can you explain why this is slow in more detail? I assumed that everything would run once and get reused in-memory. What exactly re-ran and why?

  • @muhammadraza3290
    @muhammadraza3290 8 месяцев назад +1

    Hey Zach, been following you almost religiously now, cheers for the clear conceptual content. What I am not clear on is why the computation for the CTE 9:55 teammates and enemies took that long. Shouldn't the creation of a CTE and the creation of a materialized table take the same amount of time? Does it have anything to do with the complex JOIN condition?

    • @EcZachly_
      @EcZachly_  8 месяцев назад

      No because the CTE also had an aggregation step afterward

    • @muhammadraza3290
      @muhammadraza3290 8 месяцев назад

      @@EcZachly_ so temp table + aggregation was far lower time than CTE with aggregation. would it always be the case though?

    • @EcZachly_
      @EcZachly_  8 месяцев назад

      @@muhammadraza3290 Temp Tables can be useful for improving performance though. It's very difficult to say something will always be the case

  • @punpompur
    @punpompur 8 дней назад +1

    Wow I never heard about Jane Table API before. Something new to look into. Does it work together with any other database platform like MySQL or PostgreSQL? I have only worked with MySQL so far and never used combinations of databases in projects so I am curious to know

  • @arulkumarankumaraswamipill2516
    @arulkumarankumaraswamipill2516 8 месяцев назад

    Timeseries databases like TimescaleDB has continuous aggregates on materialiazed views , which is incremental.

  • @strawhatSEO
    @strawhatSEO 8 месяцев назад +1

    Great video thank you Zach! Where do stored procedures come in with this conversation? Are sprocs materialized or non-materialized? Or are sprocs used for a purpose besides storing transformations?

    • @EcZachly_
      @EcZachly_  8 месяцев назад

      They are mostly non-materialized

  • @alialfaily7829
    @alialfaily7829 8 месяцев назад +1

    thanks for great content.

    • @EcZachly_
      @EcZachly_  8 месяцев назад

      Glad you enjoy it!

  • @byte_easel
    @byte_easel 8 месяцев назад +1

    This is really great. Can you make videos about SQL aggregate functions, triggers, and transactions? especially aggregate functions and transactions if you can

    • @EcZachly_
      @EcZachly_  8 месяцев назад

      Thanks for the suggestion!

  • @sanooosai
    @sanooosai 3 месяца назад

    thank you sir, you rock !!!

  • @hanpanBR
    @hanpanBR 6 месяцев назад +1

    The spark "registerAsTempTable" goes into what type of those?
    It works like a temporary table, saving the data just for that session?
    I guess I've been using it wrong =p

  • @NeumsFor9
    @NeumsFor9 8 месяцев назад +1

    Materialized views are oftentimes like the fine print or the auction speak at the end of a commercial..... SO many limitations (depending on the vendor or edition or featureset).
    It's like dating the perfect 10...... as long as you pay the 5k dinner bill and have the person home by 8pm because that person has a data engineering conference to speak at in the morning.... 😂😂😂😂

  • @tangy_ra
    @tangy_ra 8 месяцев назад +1

    I have a question Zach. In snowflake, you cannot use join in materialized view. So, does materialized views vary from software to software? Also, there are many restrictions on using materialized view on snowflake.

  • @dn9416
    @dn9416 8 месяцев назад +1

    Intresting

  • @fernandoangelrodriguezgonz1393
    @fernandoangelrodriguezgonz1393 8 месяцев назад +2

    Good

  • @FedericoSanchez-q4k
    @FedericoSanchez-q4k 8 месяцев назад +1

    we have a really long sql statement that uses a sequence of CTE. Each CTE is then used in the next CTE through a join condition. Do you think we would benefit from going to temp tables instead?

    • @EcZachly_
      @EcZachly_  8 месяцев назад

      Probably. Try it out!

  • @kevinwang5225
    @kevinwang5225 4 месяца назад

    Can u create a video on airflow best practices with sql / python?

  • @JPY_Data
    @JPY_Data 8 месяцев назад +1

    hi Zach, great video. What would be the difference between creating a materialized view and creating a table of the results? Would it be the same thing performance wise?

    • @EcZachly_
      @EcZachly_  8 месяцев назад

      If the materialized view has the same indices as the table, they’ll be the same.
      The main difference is you can automatically refresh the materialized view. But you can incrementally build a table

  • @FranLegon
    @FranLegon 6 месяцев назад +1

    I don't really get the benefits of using a materialized view instead of a regular table other that it's a little easier to update (with regular table you would need tp update using keys and check for changed rows, probably using hash)

    • @Ps5Statie
      @Ps5Statie 3 месяца назад

      If the scripy is not complex, i prefer mv

  • @NeumsFor9
    @NeumsFor9 8 месяцев назад +1

    So many collateral contexts to any of these...

  • @johndt1013
    @johndt1013 8 месяцев назад +1

    Hi Zach, thanks for this vid, it was such a great breakdown! I had 1 question if you have time:

    At 17:40 you mention how temp tables have incremental gains, I'm struggling to understand how that is, seeing as how similar to materialized views it appears.
    Is this because if you're adding new data 'A' to existing data 'B', temp tables can just focus on adding 'A', whereas materialized views require you to handle 'B' as well due to the refresh?
    This might be me not understanding the concept of incremental gains, thanks!

    • @EcZachly_
      @EcZachly_  8 месяцев назад +1

      Your understanding is right!