How to find consecutive streaks in data using SQL window functions (and identify cheaters in Halo 5)

Поделиться
HTML-код
  • Опубликовано: 29 сен 2024
  • This Github repo for this video is here so you can follow along:
    github.com/EcZ...
    In this video we'll be using a data set I scraped from Halo 5 between 2015 and 2016. We'll use SQL to find players that have very long win streaks (who might be cheating).
    We'll use LAG, SUM, and ROW_NUMBER to accomplish this with a bunch of common table expressions (i.e. CTEs).

Комментарии • 43

  • @mateo_mojica_hernandez
    @mateo_mojica_hernandez 2 года назад +2

    Man i have been trying to understand sql for years but you make it so clear

  • @rembautimes8808
    @rembautimes8808 22 дня назад +1

    Wow it’s very useful and has many applications in business, such as fraud detection. Also nice to see the husky background 😂

  • @ST95953
    @ST95953 2 года назад +7

    Hey Zach, thank you for providing such insightful content. You’re helping Data engineers around the world, including myself

  • @vectoralphaSec
    @vectoralphaSec Год назад +2

    Just discovered your channel. You seem like a cool knowledgeable person on this topic. So im wondering if you can help me with a question i have. So I am a recent graduate in Computer Science and want to instead go into Data Science industry instead of the traditional Software Developer/ Software Engineer career. So as someone like me who eventually one day wants to work in data science industry, is there an entry level job role that one can apply as a starting point and just get promoted/ level up to a data science in time rather than just trying to apply for a data science position from the get go? I dont mind starting at the bottom and working my way up the ladder. If anyone else have any suggestions i would greatly appreciate it?

  • @FranLegon
    @FranLegon 4 месяца назад +1

    Very interesting!

  • @Videoorchard
    @Videoorchard 2 года назад +1

    @zack, can you make a video on SCD type 2? How does it apply for a large dataset. Its so commonly used in DW, but cant really get my head around it. If you can query it on bigquery, it would be great.
    Love your video. thank you for taking your time . love from india

  • @Davidkiania
    @Davidkiania 2 года назад +3

    Hi Zach this is awesome. Love how the query/CTE built up to the answer the question. Very informative awesome.

  • @neonnuggetchoco
    @neonnuggetchoco 2 месяца назад +1

    you'r a G

  • @renaniwayama1441
    @renaniwayama1441 7 месяцев назад +1

    Hi Zach! Thank you for the video!
    I have this following-up question:
    I would like to understand what can be done to product those metrics on an incremental basis. Meaning every day we get new data in the tables and it might be a waste of compute resources to apply the logic to the full data every day it will only grow larger.
    Is it possible to maintain a table with the players "completed" streaks and only update another table with the "on going streaks"?
    Tks in advance for your toughts on this.

  • @muhammadumaramanat738
    @muhammadumaramanat738 2 года назад +2

    Awesome demonstration of SQL windows Function. Finally, I know why CTE is a must tool for DS. 😀

  • @stevegonzales4411
    @stevegonzales4411 2 месяца назад +1

    Great. I'm surprised there isn't a more elegant way of doing this...

  • @hellenruthes2056
    @hellenruthes2056 2 года назад +2

    I enjoy your videos and posts. Thanks for sharing

  • @Spartan300-ch7ps
    @Spartan300-ch7ps 6 месяцев назад +1

    Awesome channel!! I used to self join on row_number - 1, lag is much better, thanks for sharing this!
    Just one suggestion, don’t need the order by completion_date on middle two queries since SUM is associative

  • @addysuryavanshi
    @addysuryavanshi Год назад +1

    I want to work under you and learn can we make this happen. would you be kind enough to mentor me sir..?
    Thank you for your time and consideration and I look forward to hearing from you soon.

  • @llllllllllllllllllllIll
    @llllllllllllllllllllIll Год назад +1

    Really helpful video, i could'nt wait to get my hands dirty on this one so i installed postgres (Am not familiar with postgres) and tried to use the .dump file but i was unable to load it into the PGadmin tool, can you please help me out on how to get the database on to the tool so that i can practice the awesome stuff which you showed

  • @Ibad_gamer937
    @Ibad_gamer937 Год назад +1

    once i used two CTEs in a query and i thought google should call me now. lol
    Thanks for the video.

  • @MuhammadTKhan-re6cz
    @MuhammadTKhan-re6cz 2 года назад +2

    Great video Zach! I like how you elegantly used CTEs throughout the query.

  • @mayureshkhanaj4003
    @mayureshkhanaj4003 2 года назад +1

    You are honest hard-working guy ..
    Quality and to the point content
    Kudos to you keep creating for
    People like me ..:)

  • @siyanasithole7362
    @siyanasithole7362 Год назад +1

    Thank you ,much this was very informative and helpful

  • @amjadbaba1216
    @amjadbaba1216 2 года назад +1

    Hi Zach, first thanks a lot for your content, can you make a video explaining window functions.

  • @camillasouza8993
    @camillasouza8993 4 месяца назад +1

    Pretty nice way to solve the problem!

  • @stefantodorovikj6165
    @stefantodorovikj6165 Год назад +1

    Thanks for the quality content Zach. Much appreciated.
    I would only recommend because this is a very good beginner project for someone who wants to build portfolio , would have been useful if you actually showed us how you created the schema and pulled data from the APIs :)
    You are a gem in the data world , much respect

  • @nadikim1740
    @nadikim1740 2 года назад +1

    Hey there, really enjoyed the video! would love to get these as audio only, are you on audea?

  • @caassimbah2485
    @caassimbah2485 2 года назад +1

    This was awesome Zach! You really explained this well thank you!

  • @GauravSharmagvs
    @GauravSharmagvs 2 года назад +1

    Thanks for the video. Great explanantion.

  • @CalmingRelaxingAmbiance
    @CalmingRelaxingAmbiance 2 года назад +1

    It’s very informative! Definitely need some more videos like this 🙌🏻

  • @andrewkimball219
    @andrewkimball219 2 года назад +1

    Zach what is your huskies name ?:)

  • @christiandgreat8127
    @christiandgreat8127 2 года назад +1

    Loved this!! thanks Zach! hope you made more of this.

  • @saptarshiroy4815
    @saptarshiroy4815 2 года назад +1

    Hi Zach , not sure if this is the right platform to ask .. just wondering if you happen to suggest some ideas on how to generate aggregate values based on all possible combinations of columns of a dataframe in spark. We have cube function but that doesn't seem to work with millions of rows and >10 columns

    • @Spartan300-ch7ps
      @Spartan300-ch7ps 6 месяцев назад

      Can’t handle all of them but there are techniques to find the important ones
      en.m.wikipedia.org/wiki/Curse_of_dimensionality

  • @jamesbarnard3659
    @jamesbarnard3659 Год назад +1

    An incredibly clear outline to obtaining streak data in SQL. As someone brand new to the language, this was incredible. Thanks you.

    • @EcZachly_
      @EcZachly_  Год назад

      Thanks for your kind words!

  • @BJTangerine
    @BJTangerine 2 года назад +1

    Can't wait for more content from ya

  • @pushpanthkumar9028
    @pushpanthkumar9028 2 года назад +1

    We need more data engineering videos.

  • @510z2
    @510z2 2 года назад +1

    Sooooo good. Keep going mijo

  • @sarv-shreshth
    @sarv-shreshth Год назад +1

    Need with python

  • @jasonrogers837
    @jasonrogers837 2 года назад

    Yo, this one is faster (I've taken out playlist from both yours and mine for brevity, but feel free to add it back in)...
    ---
    WITH
    T1 AS (
    SELECT
    *
    ,CASE
    WHEN DidWin = 0 THEN 1
    WHEN LAG(DidWin) OVER (PARTITION BY PlayerId ORDER BY CompletionDate ASC) = 1 THEN 0
    ELSE 1
    END AS IsStreakStart
    ,CASE
    WHEN DidWin = 0 THEN 1
    WHEN LEAD(DidWin) OVER (PARTITION BY PlayerId ORDER BY CompletionDate ASC) = 1 THEN 0
    ELSE 1
    END AS IsStreakEnd
    FROM [MatchDetail]
    )
    ,T2 AS (
    SELECT
    *
    ,SUM(IsStreakStart) OVER (PARTITION BY PlayerId ORDER BY CompletionDate ASC) AS StreakGroup
    FROM T1
    )
    ,T3 AS (
    SELECT
    *
    ,COUNT(1) OVER (PARTITION BY PlayerId, StreakGroup) AS StreakSize
    FROM T2
    )
    SELECT
    *
    FROM T3
    WHERE
    DidWin = 1
    AND IsStreakEnd = 1;

  •  2 года назад +1

    👏👏👏👏👏👏

  • @neilkod
    @neilkod 2 года назад +1

    Lulu!!

    • @EcZachly_
      @EcZachly_  2 года назад

      Thank you so much Neil! Lulu says hi!

  • @Tarantantalas
    @Tarantantalas 2 года назад +1

    Hey Zach, great video! A small constructive criticism: it would be great if you could briefly select * with a limit 10 on each table that you'll be using, so people that don't setup a local postgres can build a mental map of the existing dataset and kind of know what type of info we have available.

    • @EcZachly_
      @EcZachly_  2 года назад +1

      Thank you so much for the feedback!