How to use Common Table Expressions vs Subqueries vs Views vs Temp Tables like a data engineer!

Поделиться
HTML-код
  • Опубликовано: 3 мар 2024
  • The data for today's video is available here: github.com/EcZachly/video-gam...
    Common Table Expression and subquery is a very common debate! The temporary table crowd is pretty loud too! Don't forget the materialized view crowd !
    Check out dataexpert.io for my DataExpert community!

Комментарии • 57

  • @stanislawcronberg3271
    @stanislawcronberg3271 3 месяца назад +9

    When a DE at my work showed me how to rewrite some of my queries with indexed temp tables to join all the model features into a final table I was so happy

  • @anandahs6078
    @anandahs6078 3 месяца назад +20

    fist time learning that materialized can be run concurrently with unique index. thanks for great content.

    • @EcZachly_
      @EcZachly_  3 месяца назад +4

      Thanks! Anything else you’d want to learn from me?

    • @pavelk.1558
      @pavelk.1558 3 месяца назад +1

      Yep, it's really helpful function.

    • @ssjdonkey4489
      @ssjdonkey4489 3 месяца назад +1

      Great video! How do you go about learning complex data engineer concepts? What's your learning algorithm? Thanks

    • @chrishabgood8900
      @chrishabgood8900 3 месяца назад

      If you want them to be current then yep

  • @Kyriaeus
    @Kyriaeus Месяц назад +4

    The CROSS APPLY or LATERAL JOIN clause is also nice if you’re doing multiple steps of transformations on the same relation and don’t want to break out a separate CTE for each layer.

  • @mikefenn00
    @mikefenn00 3 месяца назад +4

    Subscribed on TikTok when I came across your videos a couple of months ago. Now RUclips has recommended you as well. Seems like a good sign for you!

  • @karlnorberg7768
    @karlnorberg7768 2 месяца назад +3

    Huge thanks for this informative video! Picking up some handy stuff even after +10 years of data engineering or whatever we call it now 🙂

  • @peekknuf
    @peekknuf 3 месяца назад +8

    Man you're on fire, producing content at the pace I can't consume it :D
    Nice job!

  • @user-ud2zd7nx4c
    @user-ud2zd7nx4c 3 месяца назад +2

    Zach…sir…this video is effin awesome!!! Like the speed of explanation and the examples!!! 👊

  • @chandanjha3205
    @chandanjha3205 2 месяца назад +2

    Important thing to note here is that at 6:30, if you are creating a permanent view in your DB, you need to do it only once as its a DDL statement adding a new object to your schema. It does not have to be repeated everytime else your code will start throwing error.
    Modern DBs may let you create temporary views whose scope is only during the execution and is not global. You may choose to use that.

  • @apibarra
    @apibarra 3 месяца назад +2

    Love the video! I am a huge proponent of CTEs. Nothing worst when you open a table’s source code and see on massive block of join to multiple tables and sub queries all with their own unique filtering logic at the bottom of the joins or nested in the sub query. I feel like I can understand the logic and read the code much quicker with the CTEs at the top.

  • @subhasishsarkar5106
    @subhasishsarkar5106 3 месяца назад +4

    Great work, thank you so much Zach! Very informative and concise. Keep these coming!

    • @EcZachly_
      @EcZachly_  3 месяца назад

      I appreciate your support! Any topics in particular you'd want me to cover?

    • @subhasishsarkar5106
      @subhasishsarkar5106 3 месяца назад +1

      @@EcZachly_ I'd be interested to know your thoughts on the different architecture patterns like the lakehouse, vault, mesh, etc. Looking forward to more from you! :D

  • @SQLBucketHatGirl
    @SQLBucketHatGirl 3 месяца назад +1

    If only this was posted earlier id probably pass my technical interview on materialized transformation. 😢 thank you for the great explanation and use cases.

  • @jhonnafg
    @jhonnafg 3 месяца назад +1

    this is so helpful Zach, we get table apis and do a bunch of dataset out of it to build up our sass applications. This sheds light since my current role as a DE in this company is really heavy on SQL but not that part of the etl, they kind of separated that and is more on the backend de. Im trying to bridge that since this is also my first role as de

  • @ingenieroriquelmecagardomo4067
    @ingenieroriquelmecagardomo4067 3 месяца назад +4

    Awesome video! Would love to see you cover Change Data Capture - with all your breadth and experience, i'm sure it would result in an amazing video.

    • @EcZachly_
      @EcZachly_  3 месяца назад

      Thanks for the suggestion!

  • @madhanv160
    @madhanv160 3 месяца назад +1

    This is new to me. Great comparison.

  • @diaconescutiberiu7535
    @diaconescutiberiu7535 3 месяца назад +1

    Great content, as ever! I'd like to see a video on Data Vault (with hashes) or at least if you can recommend some resources for it... a practical example would be best. Keep up the good work

  • @sagarjoshi7097
    @sagarjoshi7097 3 месяца назад +3

    Hey Zach, thank for such informative videos.
    Next can you please explain something about dbt usecases. Points like how it tests sql, orchestration, pipline. Other alternativer if any.

  • @shiroyasha7178
    @shiroyasha7178 16 дней назад +1

    So good. Thanks!

  • @alialfaily7829
    @alialfaily7829 3 месяца назад +1

    thanks for great content.

    • @EcZachly_
      @EcZachly_  3 месяца назад

      Glad you enjoy it!

  • @md.jannatultasnim2661
    @md.jannatultasnim2661 3 месяца назад +2

    i used CTEs , views , materialized views, temp tables but never found any detail in depth explanations how they works internally and performance comparison. Now everything is clear to me Thank you for the contents you shared. Another thing, as you described here 15:14 does this happens only in the relational DBMS ? or in both relational and data warehouses like snowflake BigQuery ?

    • @EcZachly_
      @EcZachly_  3 месяца назад +1

      Great question! I haven’t tested that exact case. My guess is it’s still blocking based on how refreshes work but that might be something that varies with data warehouses

  • @pasqualesalomone8902
    @pasqualesalomone8902 3 месяца назад +3

    I was reviewing some SQL code today, and I came across an example of nested SQL query aka an infinite series of subqueries, and I was asking myself: WHY?

  • @hanpanBR
    @hanpanBR Месяц назад +1

    The spark "registerAsTempTable" goes into what type of those?
    It works like a temporary table, saving the data just for that session?
    I guess I've been using it wrong =p

  • @strawhatSEO
    @strawhatSEO 3 месяца назад +1

    Great video thank you Zach! Where do stored procedures come in with this conversation? Are sprocs materialized or non-materialized? Or are sprocs used for a purpose besides storing transformations?

    • @EcZachly_
      @EcZachly_  3 месяца назад

      They are mostly non-materialized

  • @FranLegon
    @FranLegon Месяц назад +1

    I don't really get the benefits of using a materialized view instead of a regular table other that it's a little easier to update (with regular table you would need tp update using keys and check for changed rows, probably using hash)

  • @tsveti_p
    @tsveti_p 3 месяца назад +1

    Great explanation Zach! This is one of the best comparisons I've heard between all these - with all their pros and cons! 🌟🌟🌟
    On this topic, I would be really thankful for an advice about performance between temporary tables and using staging tables, which are truncated and reloaded on every execution of the ETL.
    My use case is a complex transformation in which we have multiple stages of preparing the dataset moving it from one temporary table to another after applying some logic, joining a table or deleting some rows. It all happens within one script/session, so we don't need to keep the intermediary results, only the final dataset. We use Redshift and the performance is definitely not great.
    Do you think we can get any performance gains if we turn these temp tables into actual staging tables and we truncate and reload them? Also, do you know if defining DISTRIBUTION/SORT key on a temporary table in Redshift is as efficient as dist/sort key on an actual table?

    • @sf-spark129
      @sf-spark129 Месяц назад

      Personally I use staging tables over temp tables for debugging purposes. When your end/user-facing table is returning incorrect or duplicate data, you can always trace back to an upstream staging table layer by layer to identify the root cause of the issue. Also, it's a lot easier to fine-tune the performance with staging table because you can test and validate different optimization techniques on them.

  • @dn9416
    @dn9416 3 месяца назад +1

    Intresting

  • @muhammadraza3290
    @muhammadraza3290 3 месяца назад +1

    Hey Zach, been following you almost religiously now, cheers for the clear conceptual content. What I am not clear on is why the computation for the CTE 9:55 teammates and enemies took that long. Shouldn't the creation of a CTE and the creation of a materialized table take the same amount of time? Does it have anything to do with the complex JOIN condition?

    • @EcZachly_
      @EcZachly_  3 месяца назад

      No because the CTE also had an aggregation step afterward

    • @muhammadraza3290
      @muhammadraza3290 3 месяца назад

      @@EcZachly_ so temp table + aggregation was far lower time than CTE with aggregation. would it always be the case though?

    • @EcZachly_
      @EcZachly_  3 месяца назад

      @@muhammadraza3290 Temp Tables can be useful for improving performance though. It's very difficult to say something will always be the case

  • @fernandoangelrodriguezgonz1393
    @fernandoangelrodriguezgonz1393 3 месяца назад +2

    Good

  • @byte_easel
    @byte_easel 3 месяца назад +1

    This is really great. Can you make videos about SQL aggregate functions, triggers, and transactions? especially aggregate functions and transactions if you can

    • @EcZachly_
      @EcZachly_  3 месяца назад

      Thanks for the suggestion!

  • @NeumsFor9
    @NeumsFor9 3 месяца назад +1

    Materialized views are oftentimes like the fine print or the auction speak at the end of a commercial..... SO many limitations (depending on the vendor or edition or featureset).
    It's like dating the perfect 10...... as long as you pay the 5k dinner bill and have the person home by 8pm because that person has a data engineering conference to speak at in the morning.... 😂😂😂😂

  • @tangy_ra
    @tangy_ra 3 месяца назад +1

    I have a question Zach. In snowflake, you cannot use join in materialized view. So, does materialized views vary from software to software? Also, there are many restrictions on using materialized view on snowflake.

  • @JPY_Data
    @JPY_Data 3 месяца назад +1

    hi Zach, great video. What would be the difference between creating a materialized view and creating a table of the results? Would it be the same thing performance wise?

    • @EcZachly_
      @EcZachly_  3 месяца назад

      If the materialized view has the same indices as the table, they’ll be the same.
      The main difference is you can automatically refresh the materialized view. But you can incrementally build a table

  • @arulkumarankumaraswamipill2516
    @arulkumarankumaraswamipill2516 3 месяца назад

    Timeseries databases like TimescaleDB has continuous aggregates on materialiazed views , which is incremental.

  • @user-bh5ws7hx1t
    @user-bh5ws7hx1t 3 месяца назад +1

    we have a really long sql statement that uses a sequence of CTE. Each CTE is then used in the next CTE through a join condition. Do you think we would benefit from going to temp tables instead?

    • @EcZachly_
      @EcZachly_  3 месяца назад

      Probably. Try it out!

  • @NeumsFor9
    @NeumsFor9 3 месяца назад +1

    So many collateral contexts to any of these...

  • @johndt1013
    @johndt1013 3 месяца назад +1

    Hi Zach, thanks for this vid, it was such a great breakdown! I had 1 question if you have time:

    At 17:40 you mention how temp tables have incremental gains, I'm struggling to understand how that is, seeing as how similar to materialized views it appears.
    Is this because if you're adding new data 'A' to existing data 'B', temp tables can just focus on adding 'A', whereas materialized views require you to handle 'B' as well due to the refresh?
    This might be me not understanding the concept of incremental gains, thanks!

    • @EcZachly_
      @EcZachly_  3 месяца назад +1

      Your understanding is right!