Odds Ratios and Log(Odds Ratios), Clearly Explained!!!

  • Published: 10 Feb 2025

Comments • 463

  • @statquest
    @statquest 5 years ago +43

    Corrections:
    8:53 I meant to say "the remaining 198.7 people without the mutated gene".
    14:27 I correctly wrote "If the tests worked as expected, 5% should have p-values less than 0.05". That is correct. However, when recording the voiceover, I said "5% should have p-values less than 0.5", which is not correct.
    NOTE: In statistics, machine learning and computer programming, the default base for the log() function is 'e'. Thus, throughout this video I use the natural logarithm, or log base 'e', to do the calculations.
    Support StatQuest by buying my books The StatQuest Illustrated Guide to Machine Learning, The StatQuest Illustrated Guide to Neural Networks and AI, or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @jacksmith870
      @jacksmith870 5 years ago +2

      thanks for clarification. tiny bam!

    • @MM-jm1il
      @MM-jm1il 3 years ago +1

      Why do we take the natural log vs log base 10?

    • @statquest
      @statquest 3 years ago +3

      @MM-jm1il I believe it is because the derivative of ln(x) is super simple, just 1/x. Likewise, the derivative of the exponential function, e^x, is e^x.

    • @speakers159
      @speakers159 3 years ago

      Chi Squared test!!

    • @statquest
      @statquest 3 years ago

      @speakers159 Noted!

  • @mukhayyirkhujaabdurakhmono356
    @mukhayyirkhujaabdurakhmono356 4 years ago +49

    I've been excelling at math without understanding such fundamentals of statistics which I've been silently ashamed of. Your materials have significantly improved my understanding. Hats off to you sir. Thank you.

  • @larrythegreatest
    @larrythegreatest 9 months ago +14

    It is amazing how many math students who are advanced in math and algebra lack very basic foundation in statistics. Your materials are highly valuable. It’s like the Neanderthals discovering fire.

  • @oshtontsen5428
    @oshtontsen5428 4 years ago +49

    As a full-time machine learning engineer, I must say I love the animations and your clear concise explanations! Your videos really helped me understand the fundamentals to succeed in my field. Thank you

  • @joshuah2786
    @joshuah2786 5 years ago +26

    Graduate students across the globe thank you for these excellent videos. Top marks sir.

    • @amrdel2730
      @amrdel2730 1 month ago

      And graduate students and PhD ones damn sure need statistics understanding for machine learning

  • @chaojiang5251
    @chaojiang5251 6 years ago +39

    This is the best series of stats videos I have ever seen on YouTube. You, sir, are a legend

    • @statquest
      @statquest 6 years ago +6

      Thank you so much!!! I'm so happy to hear that you like the videos. :)

    • @PunmasterSTP
      @PunmasterSTP 10 months ago

      Five years later and I still feel this is absolutely true!

  • @PunmasterSTP
    @PunmasterSTP 10 months ago +6

    With videos like these, the odds of learning something are good, and it's easy to log a lot of time here!

    • @statquest
      @statquest 10 months ago +2

      TRIPLE BAM! :)

  • @Theviswanath57
    @Theviswanath57 4 years ago +17

    Actually StatQuest is on my mind all night long

  • @lifeisbeautiful882
    @lifeisbeautiful882 4 years ago +4

    First time continuously watching 10 videos in a series without any break. 10 x BAM.

    • @statquest
      @statquest 4 years ago +1

      OMG!!! Total binge! BAM! :)

  • @raygivler
    @raygivler 5 years ago +18

    This contained the first reference (that I've seen) to "small Bam". This is important StatQuest trivia!

    • @statquest
      @statquest 5 years ago +4

      I think this is a good piece of trivia! I'm not 100% sure it is correct - I don't know off the top of my head when the first small bam showed up, but this could be it. :)

    • @dvdmrn
      @dvdmrn 5 years ago +7

      historians will take note

    • @arenashawn772
      @arenashawn772 1 year ago +2

      @statquest This is the first appearance of “small bam” to the best of my knowledge - I watched the videos from the oldest forward in the past month (😂) and I don’t think I have seen it appear anywhere before this one.
      Now I feel like an 11 year old groupie girl 🤪

    • @statquest
      @statquest 1 year ago

      @arenashawn772 So it is confirmed! TRIPLE BAM! :)

  • @chenghaowang4541
    @chenghaowang4541 4 years ago +2

    This is so helpful. I have been struggling with log(odds ratio) for a while. The meaning of the log part confused me forever, and I just never got why the log is necessary for the calculation. Now I finally understand, thank you so much

  • @lisacalhoun306
    @lisacalhoun306 5 years ago +3

    Great video! I needed this refresher. Very clear and concise! A point of clarification: at approximately 14:18 the voice over says that "...5% should have p-values of less than 0.5." I believe you meant "less than 0.05" as the slide shows. I wanted to clarify for listeners.

    • @statquest
      @statquest 5 years ago

      Thanks for catching that. I just added a pinned comment with that note so that it will be easy to find by future viewers.

  • @hajarhajar8906
    @hajarhajar8906 3 years ago +2

    This is such an underrated channel! Thanks again for saving my life lol

  • @rrrprogram8667
    @rrrprogram8667 6 years ago +27

    Brilliant stuff Josh.... I keep coming back to your videos again and again... only to find that there is more and more information which I failed to grasp earlier ..
    Loving Statquesting.

    • @statquest
      @statquest 6 years ago +1

      Hooray!!! I'm glad you keep learning more and more. :)

  • @umdmrlbro
    @umdmrlbro 6 years ago +4

    You are the man Josh. Wish I had a teacher like you when I was in school.

    • @statquest
      @statquest 6 years ago +3

      Thank you so much!! I'm glad you like the videos. :)

  • @evanrushton1
    @evanrushton1 6 years ago +13

    I wrote an R script to generate the histogram alluded to by Josh at 10:00. For people following along in R this might be of use, so I figured I'd share it. Thank you for your work Dr. Starmer :bow:
    # StatQuest: Odds Ratios and Log(Odds Ratios), Clearly Explained!!! Josh Starmer
    # Steps to reproduce the histogram for Wald's Test in R
    # Author: Evan Rushton
    # Date: 09/22/18
    logOdds

    • @statquest
      @statquest 6 years ago +1

      This is awesome!!!!! Nice work. :)

    • @marcoventura9451
      @marcoventura9451 3 years ago

      Fantastic!!! I will try to grasp every passage, I am a novice to R. Thank You Evan.

  • @aylin7409
    @aylin7409 5 years ago +1

    Man I love the intro song to your video, so calming... my exam is on 15th of April pray for me guys...

  • @muffinman1
    @muffinman1 1 year ago +1

    Another confusion clarified, thanks Josh.

  • @AbheeBrahmnalkar
    @AbheeBrahmnalkar 10 months ago +1

    Thanks!

    • @statquest
      @statquest 10 months ago

      TRIPLE BAM!!! Thank you for supporting StatQuest!!! :)

  • @MiMi-zm2uc
    @MiMi-zm2uc 3 years ago +2

    OMG it's so... CLEARLY EXPLAINED!!!

  • @immunostatst3435
    @immunostatst3435 4 months ago +1

    Fantastic video, as always clearly explained! One minor point of confusion - for the Wald simulated histogram, in each iteration of the loop, a **single sample** of size N is selected. Then, as stated in the video and as commented in @evanrushton1's R code below, a random number between 0 and 1 is selected, and if the number is < 0.08, the **sample** is described as 'having cancer'; however, in the code, this command generates a vector, of size N, containing N random numbers between 0 and 1, which seem to instead represent the probability that each **person** in the sample has cancer. Same for the **sample** having the mutation vs **each person in the sample** having the mutation...as each iteration of the loop generates a 2x2 table containing integers representing the number of **persons** in each cell, from which to calculate log(OR).
    Apologies if this is semantic but this distinction confused me in the video and as such, it wasn't obvious to me how to turn those sample-based probabilities into a loop until I saw the code (and thanks to both Josh and Evan for sharing their R code). Does anyone have a simple explanation if this confused them as well?
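One way to read the distinction raised in this comment is that the two random numbers are drawn once per person, and the 2x2 table is just a tally of those people. The sketch below is a minimal illustration under that assumption, not the code from the video or from the R script; `simulate_table` and its parameters are hypothetical names, and 0.08 and 0.39 are the probabilities quoted elsewhere in this thread.

```python
import random

def simulate_table(n_people, p_cancer=0.08, p_gene=0.39, seed=None):
    """Build one 2x2 table by drawing two random numbers per person."""
    rng = random.Random(seed)
    # Keys are (has_gene, has_cancer); every cell starts at zero.
    table = {(gene, cancer): 0 for gene in (True, False) for cancer in (True, False)}
    for _ in range(n_people):
        has_cancer = rng.random() < p_cancer  # first random number: cancer or not
        has_gene = rng.random() < p_gene      # second random number: mutated gene or not
        table[(has_gene, has_cancer)] += 1
    return table

table = simulate_table(325, seed=1)
print(table)
```

Each call produces one table, so a loop that calls it many times yields one log(odds ratio) per table.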

  • @shivanshsingh5555
    @shivanshsingh5555 3 years ago +1

    U r a god in explaining difficult things so easy

  • @wolfisraging
    @wolfisraging 6 years ago +5

    My most awaited topic, thank u very much.
    Big fan😊

    • @statquest
      @statquest 6 years ago

      Hooray!!! You're welcome. I'm glad you like the video! :)

  • @preet111
    @preet111 2 years ago +1

    This channel is gold

  • @silviapetrova8562
    @silviapetrova8562 5 years ago +12

    (waiting for that statquest on chi-square test) but still thank you

  • @MrSpiritmonger
    @MrSpiritmonger 4 years ago +2

    my son or daughter will watch your videos when they grow up.

  • @andyn6053
    @andyn6053 4 years ago +2

    You deserve waaaay more subscribers!

    • @statquest
      @statquest 4 years ago

      Thank you very much! :)

  • @bardicayt
    @bardicayt 3 years ago +1

    Hey Josh--looking for a great explanation of odds ratios for my students, and bam ! there you are. Please say hi to Jack for me!

    • @statquest
      @statquest 3 years ago

      BAM!!! Wow! It's Al Bardi!!!! Cool. I'll definitely pass the word on to Jack.

  • @vanya.antonov
    @vanya.antonov 6 years ago +2

    Thanks for another great video! Looking forward to the StatQuest about the Chi-square test))

  • @manuelm962
    @manuelm962 1 year ago +1

    Came for the content, stayed for the intro

  • @kanagalkumar
    @kanagalkumar 4 years ago +1

    I have one word for your videos, Lucid!

  • @alonsoquijano6749
    @alonsoquijano6749 4 years ago +2

    Think about people with cancer as delicious m&ms... this made me laugh

  • @haroldfelipezuluagagrisale3875
    @haroldfelipezuluagagrisale3875 4 years ago +1

    Thanks for this rich content, best educational video about machine learning, you're the best!!!

  • @zakiamahmoudi4577
    @zakiamahmoudi4577 2 years ago +1

    Hooray,
    I've made it to the end of the stats playlist ...

  • @mostinho7
    @mostinho7 4 years ago

    The odds of something are themselves a ratio (probability of it happening / probability of it not happening).
    The odds are usually the odds of x given y (in this case, the odds of getting cancer given that you have the gene).
    The odds ratio is a ratio of ratios (a ratio of two odds) in which the given condition changes, and the log odds ratio is the log of that ratio. This tells us whether the “given” condition has an effect on the odds of x happening.
    Example: 3:40
    Log odds ratio = log[(odds of getting cancer | you have the gene) / (odds of getting cancer | you don’t have the gene)], which equals the difference between the two log odds.
    This can tell us the effect of the given condition (given you have the gene vs. given you don’t have the gene) on the odds of x happening (cancer vs. no cancer).
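As a small worked example of the ratio-of-odds idea in this comment, the sketch below uses the four counts quoted elsewhere in this thread (23, 117, 6 and 210). The variable names are illustrative only.

```python
import math

# Counts quoted elsewhere in this thread for the video's example.
gene_cancer, gene_no_cancer = 23, 117        # people with the mutated gene
no_gene_cancer, no_gene_no_cancer = 6, 210   # people without the mutated gene

# Odds of cancer within each group.
odds_with_gene = gene_cancer / gene_no_cancer
odds_without_gene = no_gene_cancer / no_gene_no_cancer

# The odds ratio is a ratio of two odds; the log odds ratio is its natural log.
odds_ratio = odds_with_gene / odds_without_gene
log_odds_ratio = math.log(odds_ratio)

print(round(odds_ratio, 2), round(log_odds_ratio, 2))  # 6.88 1.93
```

The same log(odds ratio) comes out of `math.log(odds_with_gene) - math.log(odds_without_gene)`, which is why it can be read as a difference of log odds.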

  • @bonob0123
    @bonob0123 9 months ago

    would be great to have a video differentiating odds ratios from relative risk from hazard ratios. in terms of interpretation and when one is used instead of the other

    • @statquest
      @statquest 9 months ago

      I'll keep that in mind.

  • @emirlanaliiarbekov8729
    @emirlanaliiarbekov8729 4 months ago +1

    First of all, thanks for this series! It's cool how Wald's test can show relationship or absence of it. But we got the distributions of cancer people and mutated people independently, meaning we said that 8% of samples have cancer and 39% of samples have mutated gene. As I look at names of features ("cancer", "mutated gene"), I am biased to believe that there is a relationship. But if the 2nd feature had the name "wearing green shoes", I would assume there is no relationship and the wald's test would still show p value

    • @statquest
      @statquest 4 months ago

      That's interesting. I guess you have to try to ignore the names. Maybe just label them "a" and "b".

  • @annamoskalew5465
    @annamoskalew5465 5 years ago

    Small mistake at 8 min 50 s: the expected values for Chi-square in the second row (no mutated gene) should be 17.8 for cancer and 198.2 for no cancer. It's easy to check without computing because they should add up to the integer totals under each column.
    But anyway, great video and thank you for your work

    • @statquest
      @statquest 5 years ago

      How did you get 17.8? When I multiply the number of people that do not have the mutated gene, 216, by the probability that someone will have cancer, 0.08, I get 17.3.
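The arithmetic in this reply can be checked with a few lines. This is only a sketch of the expected-count calculation as described in the thread: it assumes the rounded cancer probability of 0.08 and the row totals 140 (mutated gene) and 216 (no mutated gene) mentioned in the comments.

```python
p_cancer = 0.08                      # rounded overall probability of cancer used in the thread
with_gene, without_gene = 140, 216   # row totals quoted in the thread

# Expected counts if cancer is not associated with the gene:
# each row total is split using the same overall probability.
expected = {
    "gene, cancer": with_gene * p_cancer,
    "gene, no cancer": with_gene * (1 - p_cancer),
    "no gene, cancer": without_gene * p_cancer,
    "no gene, no cancer": without_gene * (1 - p_cancer),
}

for cell, count in expected.items():
    print(f"{cell}: {count:.1f}")
```

With these inputs the second row comes out as roughly 17.3 and 198.7, which matches the numbers in the reply and in the pinned correction.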

  • @taladiv3415
    @taladiv3415 4 years ago +1

    Good explainer!

    • @statquest
      @statquest 4 years ago +1

      Glad you think so!

  • @douglasespindola5185
    @douglasespindola5185 6 years ago +1

    You're the best, Josh!

  • @domillima
    @domillima 4 years ago +1

    brahhh we need a dedicated chi square lecture D: plssss

    • @statquest
      @statquest 4 years ago +2

      I'll keep that in mind.

  • @anakagung7613
    @anakagung7613 4 years ago +1

    Good job, man

  • @yulinliu850
    @yulinliu850 6 years ago +1

    Many thanks Josh! It would be more convenient if you add the link to your video about Fisher's exact test.

    • @statquest
      @statquest 6 years ago

      I tried adding a "card" link to the Fisher's Exact test (this is a pop-up link that shows up at the right time in the video), but maybe something went wrong with it. Regardless, I added a link to the StatQuest in the description, and here it is as well: ruclips.net/video/udyAvvaMjfM/видео.html
      Happy StatQuesting!!!

  • @georgie532
    @georgie532 1 year ago +1

    Amazing video. Could you please, one day, make a video explaining why the standard deviation is the inverse of the observed values? Thank you!!:)

  • @ArnabJoardar
    @ArnabJoardar 4 years ago

    Hi Josh.
    Just started watching your videos. Currently going through your 'Statistics Fundamentals' playlist.
    In this video, at the 7:50 mark, you mention 'So, if the gene is not associated with the 140 people with the mutated gene...', shouldn't the assumption be that 'So, if we assume that cancer is not associated with the 140 people with the mutated gene...'.
    That's the only reason you used the expected value using the 'estimated' population probability of having cancer (0.08) to calculate the 'expected value' of the number of people having both the mutated gene (140) and cancer.

    • @statquest
      @statquest 4 years ago

      Yes, that's a typo. It should be "If cancer is not associated with the 140 people with the mutated gene."

    • @HankGussman
      @HankGussman 3 years ago

      This clarification was desperately needed. My head was spinning trying to make sense out of typo text, where none could be made

  • @AC-rt2kr
    @AC-rt2kr 3 years ago +1

    this video is fecking amazin

  • @MrCk0212
    @MrCk0212 1 year ago

    Here is Python code to generate the histogram at 10:10. It follows the procedure described in the video: every person in a sample independently gets cancer with probability 0.08 and the mutated gene with probability 0.39. One thing I would like to raise is that a random sample size does not seem necessary: it cancels out of the odds ratio of the expected counts and only changes how widely the simulated values spread.
    import numpy as np
    import matplotlib.pyplot as plt
    rng = np.random.default_rng()
    n = 10000  # number of simulated samples
    log_odds_ratios = []
    for _ in range(n):
        size = rng.integers(300, 401)  # sample size ~ uniform between 300 and 400
        cancer = rng.random(size) < 0.08  # one random number per person: cancer or not
        gene = rng.random(size) < 0.39  # one random number per person: mutated gene or not
        a, b = np.sum(gene & cancer), np.sum(gene & ~cancer)
        c, d = np.sum(~gene & cancer), np.sum(~gene & ~cancer)
        if min(a, b, c, d) > 0:  # skip tables with an empty cell, where the log is undefined
            log_odds_ratios.append(np.log((a / b) / (c / d)))
    plt.hist(log_odds_ratios, bins=np.arange(-3, 3.1, 0.1))
    plt.show()

  • @donjr3270
    @donjr3270 5 years ago

    At 13:50 of the video, the p-value that mutated genes does not have relationship with cancer is 0.00005. So does it mean that mutated genes has a strong association with cancer then?
    P.S: It would be great that each time when you give an example, you will have a conclusion to it, so it wouldn't be confusing. Great video!

    • @donjr3270
      @donjr3270 5 years ago

      And how did you get 0.00005 p-value anyway? It's an estimate value you gave based on the normal distribution?

    • @statquest
      @statquest 5 years ago +1

      You are correct - I should have had a more obvious conclusion to this example. Often I do, but I forgot in this case. The small p-value says the association isn't just due to chance or noise. How strong that relationship is, however, is determined not by the p-value, but by the log(odds ratio). In other words, if we had a small log(odds ratio) and small p-value, we would have a significant, but weak association. If you have a large log(odds ratio) and a small p-value, you have a significant and strong association.

    • @statquest
      @statquest 5 years ago +1

      To calculate the p-value, we estimate the mean and standard deviation of a normal distribution and then we use the area under that curve to calculate the p-value.
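The calculation described in this reply can be sketched as follows. This is an assumption-laden illustration rather than the video's own code: it uses the counts quoted in the thread (23, 117, 6, 210), takes the mean under "no relationship" to be 0, and uses the standard large-sample estimate of the standard deviation of a log(odds ratio), the square root of the summed reciprocals of the four counts.

```python
import math

a, b, c, d = 23, 117, 6, 210  # counts quoted in this thread

log_odds_ratio = math.log((a / b) / (c / d))

# Estimated standard deviation of the log(odds ratio) from the observed counts.
std_dev = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# How many standard deviations the observed value is from 0 (no relationship).
z = log_odds_ratio / std_dev

# Two-sided area under the normal curve beyond |z|.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(std_dev, 2), round(z, 2), p_value)
```

The standard deviation rounds to the 0.47 mentioned elsewhere in the thread, and the p-value should land close to the 0.00005 quoted for this example.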

  • @ekbalhossain7485
    @ekbalhossain7485 2 years ago +1

    Excellent

  • @ivanguettler
    @ivanguettler 6 years ago +1

    Excellent video! Thnx for the work!

  • @colin-kun3611
    @colin-kun3611 3 years ago +1

    Great stuff.. thanks a ton!

  • @eceada421
    @eceada421 6 years ago +1

    Perfect! Thank you so much! really grateful for clear explanation best statistics series on youtube

    • @statquest
      @statquest 6 years ago

      Hooray! You're welcome! :)

  • @nikunj2554
    @nikunj2554 4 years ago

    Hello Josh, I had a confusion at 12:02 in the video. You said that the - "Wald's test typically uses the estimated standard deviation" ; but in reality we replaced the histogram with the normal curve having standard deviation of the observed values i.e 0.47. Hence, shouldn't it be - "Wald's test typically uses the observed standard deviation" instead of the expected and then we have a normal curve of STD=0.47 at 12:02. Based on my understanding the expected standard deviation will come from the earlier matrix of expected values which we calculated and here we have used the observed values matrix for getting a STD of 0.47. Let me know if I am missing something, but I am kind of lost in this section of the video and could use your help. Thanks again for an amazing video.

    • @statquest
      @statquest 4 years ago

      In this situation "estimated" means "estimated from the observed data", so the estimated standard deviation comes from the observed data.

  • @saheensultana9309
    @saheensultana9309 4 years ago +2

    Thank uuuuuuuuuuuuuu soooooo much 😊😊😊😊😊😊😊😊😊

  • @fardaddanaeefard8247
    @fardaddanaeefard8247 3 years ago +1

    You're amazing!!!!!!

  • @ashishpalsingh6404
    @ashishpalsingh6404 5 years ago +1

    Sir u are the best..!!

  • @vasilyovechko
    @vasilyovechko 6 years ago +3

    13:20 However, this is traditionally done using a standard normal curve (i.e. a normal curve with mean = 0 and standard deviation = 1). I am a little bit confused about the word "traditionally".
    Are you saying that the log(odds ratio) is not necessarily distributed according to standard normal distribution?

    • @statquest
      @statquest 6 years ago +2

      Ah. The log(odds ratio) is normally distributed, but the standard deviation is not always 1. Traditionally, you convert the distribution for the log(odds ratio) to a standard normal curve (mean = 0, sd = 1) by dividing by the standard deviation (which you calculate using the method of your choice). I say "traditionally", because that's how you had to solve for the p-value back in the day. You had a table of values for a standard normal curve in the back of some book, and whenever you needed a p-value for a normal curve, you converted your normal distribution to a standard one (by subtracting the mean and dividing by the sd) so you could reference the table in the back of the book. These days, however, a computer can calculate the p-value for any normal curve, so it's not really important to transform to a standard normal distribution any more (however, everyone still does it).
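To illustrate the point in this reply, the sketch below computes the same two-sided p-value in two ways: directly from a normal curve with the estimated standard deviation, and after standardizing to a z-score. The values 1.93 and 0.47 are the rounded log(odds ratio) and standard deviation quoted in this thread; `NormalDist` is from Python's standard `statistics` module.

```python
from statistics import NormalDist

log_odds_ratio, std_dev = 1.93, 0.47  # rounded values quoted in this thread

# Directly: a normal curve with mean 0 and the estimated standard deviation.
direct = 2 * (1 - NormalDist(mu=0, sigma=std_dev).cdf(log_odds_ratio))

# Traditionally: subtract the mean, divide by the sd, then use the standard normal curve.
z = (log_odds_ratio - 0) / std_dev
standardized = 2 * (1 - NormalDist().cdf(z))

print(direct, standardized)
```

Both routes give the same area, which is why the standardizing step is a convention rather than a requirement.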

    • @vasilyovechko
      @vasilyovechko 6 years ago +2

      Thank you so much for making it clear.

  • @anthonym9130
    @anthonym9130 4 years ago

    One other point: this applies to a case where the dependent variable is categorical, because when the dependent variable is continuous we would use an F value to determine whether the model is significant

  • @arenashawn772
    @arenashawn772 1 year ago +1

    Just to clarify for myself regarding Fisher’s exact test - to calculate the p-value we are calculating the sum of the probabilities of getting >= 23 cases of cancer if we are choosing 140 (= 23 + 117) people from the total 356 (= 23 + 117 + 6 + 210) people, is my understanding correct?
    I have only used Fisher’s exact test in school when the sample size is really small and all the cases can be laid out in a table by hand, and have never done it using a software package. I know scipy has a fisher_exact function but admit I haven’t read its docs or used it. Is there an R package that you would recommend using for doing it?
    Thanks as always 😊

    • @statquest
      @statquest 1 year ago

      Yes, that is the idea. And Fisher's exact test is built into R: fisher.test(). So you can get help with ?fisher.test
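The sum described in the question above can be written out with the hypergeometric formula. This is a minimal standard-library sketch of that one-sided sum only, using the counts from the comment; `prob_cancer_cases` is a hypothetical helper name. Library routines such as R's `fisher.test()` report a two-sided p-value by default, so their output need not match this number.

```python
from math import comb

total_people = 23 + 117 + 6 + 210   # 356
with_gene = 23 + 117                # 140 people "chosen"
cancer_total = 23 + 6               # 29 people with cancer overall
observed = 23                       # cancer cases among people with the gene

def prob_cancer_cases(k):
    """Probability that exactly k of the 140 chosen people have cancer."""
    return (comb(cancer_total, k)
            * comb(total_people - cancer_total, with_gene - k)
            / comb(total_people, with_gene))

# One-sided p-value: probability of 23 or more cancer cases among the 140.
p_one_sided = sum(prob_cancer_cases(k) for k in range(observed, cancer_total + 1))
print(p_one_sided)
```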

  • @daesoolee1083
    @daesoolee1083 3 years ago +1

    wow, I love you man

  • @mianzhusex
    @mianzhusex 5 years ago +1

    Hi Josh, I just wanna make sure that I'm correct. You assume there is no relationship between the gene and cancer, and a log(odds ratio) equal to 0 means no relationship at all. Then, based on what you observed from an experiment, the log(odds ratio) is 1.93, and the p-value is extremely small given that there is no relationship, which means this would be really rare if there were no relationship. But it does happen, so we need to reject that there is no relationship, and accept that there is a relationship between the gene and cancer. Am I correct? Really confused.

    • @statquest
      @statquest 5 years ago +1

      The small p-value tells us that it would be very rare for random chance to give us the observed log(odds ratio). Thus, we reject that the observed log(odds ratio) is due to random chance. Does that make sense?

  • @domillima
    @domillima 4 years ago

    Your music gives me Fitz and the Tantrums vibes, which I am enjoying.
    A question: In this example you used a 2 x 2 contingency table and calculated the odds ratio for the positive response (cancer with a mutated gene). In my case, let's say *hypothetically* I am tracking the rate at which surgeons perform 2 procedures, A and B, over the past 10 years.

    • @domillima
      @domillima 4 years ago

      So my table is Procedure A, Procedure B x 2010, 2011, 2012....2020.

    • @domillima
      @domillima 4 years ago

      I can calculate the odds of procedure A and procedure B and then find the odds ratio. In order to say the odds are changing, what statistic would I use?

    • @statquest
      @statquest 4 years ago

      Fisher's exact test and the Chi-square test could work well with this data (they would tell you that the proportions between procedures A and B change).

  • @vt2788
    @vt2788 6 years ago +5

    To infinity (and beyond!) nice

  • @adamjung2867
    @adamjung2867 1 year ago

    I was taught that when I see the equation log(x) to assume the base of the logarithm is 10. If I want to do logarithm with base e, I have always written ln(x). It took me a while to figure out why my math didn't work out the same as in the video.

    • @statquest
      @statquest 1 year ago

      It's unfortunate that the conventions for what log() means are not consistent. In statistics, machine learning and computer programming, the default base for the log() function is 'e'. Thus, throughout this video I use the natural logarithm, or log base 'e', to do the calculations.
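The mismatch described in this exchange is easy to reproduce. As a small sketch, Python's `math.log()` uses base e by default while `math.log10()` uses base 10; the odds ratio of 6.88 is the value quoted elsewhere in this thread.

```python
import math

odds_ratio = 6.88  # odds ratio quoted in this thread

natural = math.log(odds_ratio)    # base e, the convention used in the video
base_10 = math.log10(odds_ratio)  # base 10, what a calculator's "log" key often gives

print(round(natural, 2), round(base_10, 2))  # 1.93 0.84
```

Only the base-e result reproduces the 1.93 discussed in the comments, which is one way to tell which convention a tool is using.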

    • @adamjung2867
      @adamjung2867 1 year ago +1

      @statquest Thanks a lot. I noticed this when I used the log function in R. I am reviewing statistical principles as I write my prospectus for my Master's thesis and your videos are extremely useful.

  • @umanagaswamy6358
    @umanagaswamy6358 4 years ago +1

    Josh, great content, at 7.50 should it be "if cancer is not associated with the people with the mutated gene" ? Am confused 🤔!

    • @statquest
      @statquest 4 years ago +1

      Since we're looking at the data in terms of the rows, we are thinking about the gene's association with cancer.

  • @drkim2
    @drkim2 4 years ago +1

    thank you

  • @KishoreKumar-fv2cx
    @KishoreKumar-fv2cx 2 years ago

    Hi Josh - Many thanks for your videos. At 10:46 - This gives the matrix that did not depend on the relation between mutated gene and cancer. If there is no relation, matrix itself cannot be formed rite ? Because as per my understanding - in a matrix, the margin totals proportion cannot vary at any cost rite ? I have one sample where I'm stuck to form a matrix -
    1. Sample size - 366
    2. Random number for cancer b/w 0 to 1 - 0.19
    3. Random number for mutated gene b/w 0 to 1 - 0.14
    Also will log() result in negative number ? I could see only positive number output when I apply log(). Please clarify

    • @statquest
      @statquest 2 years ago

      When we make a matrix that is only dependent on the overall proportion of people with cancer and the overall proportion of people with the mutated gene, and not the known proportion of people with cancer AND the gene, then the matrix will not depend on the relation between the gene and cancer. Also, as you can see on the x-axis in the histogram, the log can give us negative numbers.

    • @KishoreKumar-fv2cx
      @KishoreKumar-fv2cx 2 years ago

      @statquest thanks Josh...understood.. so the matrix margin sums for the random numbers should add to 325, right, both row-wise and column-wise, instead of all 4 cells adding up to 325 or any number b/w 300 and 400?

    • @statquest
      @statquest 2 years ago

      @KishoreKumar-fv2cx Because we are randomly deciding how many samples have cancer, the column and row totals will always be different. However, the total for the entire matrix will be the number between 300 and 400 that you came up with at the start.

  • @dansolpa
    @dansolpa 1 year ago

    Thanks a lot for your wisdom!!!
    I have one question, as the odds ratio in the example was = 6.88 and the p-value was = 0.00005, we can say that there is a relationship between the mutated gene and cancer? or in other words, that having a mutated gene increases the odds of having cancer?

    • @statquest
      @statquest 1 year ago

      p-value wording is super awkward. Technically we would just say that we reject the hypothesis that there is no relationship. However, there is still a small probability that there could be no relationship.

  • @cargouvu
    @cargouvu 5 years ago

    I love your vids. I don’t know where to begin when I look at the number of quality videos on your channel. Is there an organization of the videos that would help?

    • @statquest
      @statquest 5 years ago +1

      I have organized all of the videos on my home page: statquest.org/video-index/

  • @zzooyoo
    @zzooyoo 5 years ago +3

    Hi Josh, thanks for uploading those lectures! I really appreciate that!
    I’m still confused on why we use z values in logistic regression. Is this the reason that we use log odds which has mean of 0 and symmetrically distributed?
    Or more fundamentally(?), because our response variables are ‘binomial’?
    Thanks so much!

    • @oliverkutis7956
      @oliverkutis7956 3 years ago

      I'd suggest that we use the z-value because the Wald test uses it to decide whether a predictor's coefficient in logistic regression is zero (null hypothesis) or not (alternative hypothesis). If the magnitude of the coefficient's z-value (according to the Wald test) is less than some cutoff (for example 1.96, which is roughly 2 standard deviations), we can drop that variable from the logistic regression because it's useless. Hope my answer helps you after 2 years :D

  • @Deshammanideep
    @Deshammanideep 6 years ago +1

    All the video, I was thinking about the families of people having cancer.

  • @anshulk302
    @anshulk302 1 year ago +2

    # Possible R code corresponding to video (there might be mistakes or better ways to do it!)
    # Load data
    cancerData

    • @statquest
      @statquest 11 months ago

      Did you write that?

    • @anshulk302
      @anshulk302 11 months ago +1

      @statquest - Yes, I did. The video was really good so I wanted to practice it in R. I couldn't see if anyone else had already posted something like this. I'm happy to delete/edit if needed.

  • @phonphaly_mdmscincardiolog4503
    @phonphaly_mdmscincardiolog4503 3 years ago +1

    Thanks

  • @danieldavieau1517
    @danieldavieau1517 6 years ago +1

    Thank you for existing!!! How do I buy one of your songs?

    • @statquest
      @statquest 6 years ago

      Hooray!!! You can get my music here: joshuastarmer.bandcamp.com/

    • @statquest
      @statquest 6 years ago

      Thank you so much!!!

  • @ladkikavpn2606
    @ladkikavpn2606 4 years ago

    Hi Josh, we need a clearly explained StatQuest on the Fisher test, the chi-square test and the Wald test. It will be triple BAM!!! if the hypergeometric distribution is explained.
    Can you also show a demo of doing these in a Python notebook?

    • @statquest
      @statquest 4 years ago

      ruclips.net/video/udyAvvaMjfM/видео.html

  • @shamshersingh9680
    @shamshersingh9680 11 months ago

    Hi Josh, at time stamp 13.59 you say that p-value that mutated gene does not have a relationship with cancer is 0.00005 which means that there is no relationship between mutated gene and cancer and the result of log(odds ratio) is not statistically significant. Now if I understand correctly, at time stamp 12.53, it is written that log(odd ratio) is statistically significant which means that mutated gene and cancer has significant relationship. Can you pse tell what am I missing here.

    • @statquest
      @statquest 11 months ago

      The p-value for the hypothesis that there is no relationship between the mutated gene and cancer is 0.00005. This means that we reject the hypothesis that there is no relationship between the mutated gene and cancer. To learn more about what p-values mean and how they are interpreted, see: ruclips.net/video/vemZtEM63GY/видео.html and ruclips.net/video/JQc3yx0-Q9E/видео.html

  • @surajitchakraborty1903
    @surajitchakraborty1903 5 years ago +1

    Thanks for the awesome video. Just had a question as to how are Odds Ratios different from Relative Risk and when to use Odds Ratios and when to use Relative Risk ?

  • @kakusniper
    @kakusniper 3 years ago

    Hi again, at 13:59 does it means: there is "no" relationship between mutated gene and cancer due to small effect size or small log(odds-ratio), which is although significant. Because I got confused at the result as odds ratio is positive in favor of mutated gene.

    • @statquest
      @statquest 3 years ago

      The small p-value tells us to reject the hypothesis that there is no difference. To learn more about how to interpret p-values in general, see: ruclips.net/video/vemZtEM63GY/видео.html and ruclips.net/video/0oc49DyA3hU/видео.html

  • @moneyman2200
    @moneyman2200 5 years ago

    Just a heads up, Josh: the Fisher's Exact Test video is not in any of the playlists for your videos. I found it by googling it, but I'm mentioning it in case folks don't think of that solution.

    • @statquest
      @statquest  5 years ago

      Thanks for the tip. I'll see what I can do about getting it on a playlist.

  • @rznirvana
    @rznirvana 5 years ago +3

    What test(s) do we use to establish whether a BAM!! is small or not?

  • @kevinshao9148
    @kevinshao9148 1 year ago

    Hi Josh, at 10:24 how did you draw the samples for the histogram? Do you iterate through the 325 people, and each time draw a random number to determine which of the 4 cells it should go in (yes yes, yes no, no yes, no no)? From your step-by-step instructions, it looks like I can only determine the totals for the cancer column and the mutated row, but I cannot derive all 4 cells of the matrix. Many thanks!

    • @statquest
      @statquest  1 year ago

      For each sample, if the first random number is less than 0.08, then the sample has cancer, otherwise it does not. If the second random number is less than 0.39, then it is mutated, otherwise it is not.
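
      The two-random-numbers rule in the reply above can be sketched roughly as follows. This is only an illustration, not the code used for the video: the function name `simulate_table`, the fixed seed, and the sample size of 356 (taken from the 140 + 216 people mentioned elsewhere in these comments) are all assumptions.

```python
import random


def simulate_table(n, p_cancer=0.08, p_mutated=0.39, rng=random):
    """Draw n samples with no built-in relationship and tally a 2x2 table."""
    table = {
        ("mutated", "cancer"): 0,
        ("mutated", "no cancer"): 0,
        ("not mutated", "cancer"): 0,
        ("not mutated", "no cancer"): 0,
    }
    for _ in range(n):
        # Two separate random numbers per sample, one for each trait,
        # so cancer status and gene status are independent of each other.
        cancer = "cancer" if rng.random() < p_cancer else "no cancer"
        gene = "mutated" if rng.random() < p_mutated else "not mutated"
        table[(gene, cancer)] += 1
    return table


rng = random.Random(0)  # assumed seed, only to make the sketch repeatable
table = simulate_table(356, rng=rng)  # 356 is an assumed sample size
for cell, count in table.items():
    print(cell, count)
```

      Because each sample gets its own pair of draws, all four cells fill in directly; there is no need to work backwards from the row and column totals.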

    • @kevinshao9148
      @kevinshao9148 1 year ago

      @@statquest Many thanks! I figured it out too and confirmed it with some Python code. One more question about 14:14: what do you mean by the content from this point on? You generated 10,000 log(odds ratio) data points based on the observed expected probabilities (0.08 and 0.39), and the distribution is not standard normal. How do you apply a test to it? Earlier in the video, Fisher's, chi-square and even Wald's tests all work on a 2x2 matrix, so how do you apply them to a normal distribution? How exactly did you calculate the p-values, for example for chi-square and Wald's, based on your generated 10K log(odds ratio) values?
      I'm quite confused here. Please advise at your convenience, I really appreciate it!

    • @statquest
      @statquest  1 year ago

      @@kevinshao9148 I generated a single matrix of values and then applied the Fisher, chi-square, and Wald tests to it to get the p-values for the three tests. I then repeated the process 10,000 times and calculated the percentage of p-values < 0.05 for each test.
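
      A minimal sketch of that repeat-and-count loop, using only the Wald test so that it runs with the Python standard library alone (Fisher's exact test and the chi-square test would need an extra library). The helper names, the seed, the sample size of 356, the smaller repeat count of 2,000, and the choice to skip tables with an empty cell are all assumptions made for illustration.

```python
import math
import random


def wald_p_value(a, b, c, d):
    """Two-sided Wald p-value for the log(odds ratio) of a 2x2 table.

    a = mutated & cancer,     b = mutated & no cancer,
    c = not mutated & cancer, d = not mutated & no cancer.
    """
    log_odds_ratio = math.log((a / b) / (c / d))
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = log_odds_ratio / standard_error
    # Two-sided tail area under the standard normal curve.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_counts(n, p_cancer, p_mutated, rng):
    """Simulate one 2x2 table with no relationship between the traits."""
    counts = [0, 0, 0, 0]  # order: a, b, c, d
    for _ in range(n):
        cancer = rng.random() < p_cancer
        mutated = rng.random() < p_mutated
        counts[(0 if mutated else 2) + (0 if cancer else 1)] += 1
    return counts


rng = random.Random(42)  # assumed seed, for repeatability
n_tables = 2000          # the reply above used 10,000
p_values = []
for _ in range(n_tables):
    a, b, c, d = simulate_counts(356, 0.08, 0.39, rng)
    if min(a, b, c, d) == 0:
        continue  # assumption: skip tables where the formula is undefined
    p_values.append(wald_p_value(a, b, c, d))

fraction = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"fraction of Wald p-values < 0.05: {fraction:.3f}")
```

      Each test is applied to a whole simulated 2x2 table, not to the histogram of log(odds ratio) values; the percentage is simply how often a table with no real relationship still produced p < 0.05.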

  • @yuanlu5657
    @yuanlu5657 5 years ago

    Hi Josh! I have a question. At 11:25, you mentioned that "it is more common to estimate the standard deviation from the observed values". Does that mean the histogram generated from 10:00 to 11:25 is only for illustration purposes, i.e., it is not part of the Wald test calculations? Am I right? =)

    • @statquest
      @statquest  5 years ago +1

      You are correct. The histogram just illustrates how it works.

  • @UsanaAdelaide
    @UsanaAdelaide 4 years ago

    Thank you for this great presentation. It seems to me that there is an error at 8:53. You say "198.7 people with mutated gene are..". It is supposed to be "198.7 people WITHOUT mutated gene ...". Please check this. Thanks

    • @statquest
      @statquest  4 years ago

      Oops!! That's a mistake. I've added it to the pinned comment.

  • @iamzeeshankhan
    @iamzeeshankhan 3 years ago

    For the matrix at 10:46, I am a bit confused as to how the count for "people who have cancer but NO mutated gene" came to be?
    Because, any value

    • @statquest
      @statquest  3 years ago

      We pick two numbers. We check to see if the first number is < 0.08. Then we check to see if the second number is < 0.39. So we don't check to see if the same number is < 0.08 and < 0.39.

    • @iamzeeshankhan
      @iamzeeshankhan 3 years ago +1

      @@statquest Ahh ok, thanks a lot for the explanation.

  • @aprasuna7136
    @aprasuna7136 5 years ago

    Dear Josh,
    Thanks for such awesome videos!! In this video, what exactly do you mean by getting a p-value 3% or 4% of the time?
    A p-value less than 0.05 actually means there exists a relationship between the independent and dependent variables.
    But I could not understand what it means that you got this p-value 3% of the time.
    Please clarify!!

    • @statquest
      @statquest  5 years ago +1

      The definition of a p-value is actually a little different from what you wrote. A p-value less than 0.05 means that if there is no relationship between the independent and dependent variables, then less than 5% of the time we do the exact same experiment we will get results as extreme or more extreme. In other words, a p-value < 0.05 tells us that if there is no relationship between the independent and dependent variables, what we observed would be rare, but not impossible. Typically we interpret the p-value by saying "since it would be very rare to see what we see if there is no relationship, we can reject the hypothesis that there is no relationship and conclude that there must be some sort of relationship".
      So, in my video I wanted to compare three different statistical tests. To do this, I generated a lot of random datasets that had no relationship between the independent and dependent variables. I then applied each method to each random dataset. If the tests worked well, then 5% of the time they should result in p-values < 0.05. Does that make sense?

    • @alessandro49946
      @alessandro49946 4 years ago

      @@statquest What I could understand is that p-value is the probability of the null hypothesis, so the boring hypothesis. In this case the boring hypothesis is "there is no relationship between cancer and the mutating gene". So if we get p

    • @alessandro49946
      @alessandro49946 4 years ago

      @@statquest you didn't discuss at the end of each test (even if they have a similar result) the "biological result". You gave us the p-value and then you disappeared :(

    • @statquest
      @statquest  4 years ago +1

      @@alessandro49946 You are correct.

    • @statquest
      @statquest  4 years ago +1

      @@alessandro49946 The biological result is that the mutated gene has a relationship with cancer, however, is that biologically relevant? That's the question. And since I'm not a cancer doctor or researcher, I can't answer that question. Experts in the field have to decide for me.

  • @monoarul_islam_3
    @monoarul_islam_3 3 years ago

    I just have a question (basically two) for you, Josh. I went through all of your videos, and they are literally clearly explained and quite easy for me to follow. But after completing all of the videos, it's hard for me to remember everything. I took notes on everything, but it seems like I tend to forget a lot of things. Is that natural? And if so, what should I do? I really want to be a machine learning engineer and researcher as well, and I'll apply for a Ph.D. in the USA for fall '22. So it looks overwhelming and at the same time enthralling to learn new stuff.

    • @statquest
      @statquest  3 years ago +4

      I have a terrible memory and forget stuff all the time. Focus on remembering the main ideas.

    • @MrCk0212
      @MrCk0212 1 year ago

      Jotting down your own notes does help. Even if you only remember the rough idea later, it is much easier to pick things back up by reading the notes. Using modern note-taking software like Notion gives you an extra BAM.

  • @VinVin21969
    @VinVin21969 4 years ago

    I'm confused: in your video at 11:57, why are you using theta 0 for the distribution? In another video I saw that they use theta hat for the distribution.

    • @statquest
      @statquest  4 years ago

      The null hypothesis is that there is no relationship among the different categories which implies that the mean of the log(odds ratios) should be 0.
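
      Written out, the reply above corresponds to a test statistic like the one below. The symbols are assumptions chosen to match the commenter's notation, not something taken from the video: \hat{\theta} is the observed log(odds ratio), \theta_0 is its value under the null hypothesis, and \widehat{\mathrm{SE}} is the estimated standard deviation.

```latex
z = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{SE}}(\hat{\theta})},
\qquad \theta_0 = 0,
\qquad \hat{\theta} = \log(\text{odds ratio})
```

      In this notation, theta hat is the value being tested, while theta 0 is where the null distribution is centered.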

  • @InfiniFactsDaily
    @InfiniFactsDaily 3 years ago

    Hi Josh, great video. I really appreciate your videos very much.
    I have a question regarding the estimation of the standard deviation in the Wald test. You said in the video that it is more common to estimate the standard deviation from the observed values. But I don't understand why the standard deviation is calculated as the square root of the sum of 1 over each observed value. May I ask for an explanation of that, as I am quite confused about this part of the video? Thank you very much, and sorry for the inconvenience.

    • @statquest
      @statquest  3 years ago

      That's just the equation for the standard deviation when you have this type of data. To explain how it is derived would take a whole video.

  • @gnsatishkumar1
    @gnsatishkumar1 6 years ago +2

    Your presentation is very good and interesting. May I know what software you use to prepare the videos?
    I like your voice..........

    • @statquest
      @statquest  6 years ago

      I used to use PowerPoint and iMovie. Now I use Keynote and Final Cut Pro. However, this video in particular was done using Keynote and iMovie.

  • @iara5946
    @iara5946 4 years ago +1

    Josh! Is it possible for you to write a small definition / explanation of what log odds are? I still don't get it and I'm desperate!!!!! help ):

    • @statquest
      @statquest  4 years ago

      This video talks about odds ratios and log(odds ratios), not log odds, so that might be confusing you right there. If you want to learn about log(odds), see this StatQuest instead: ruclips.net/video/ARfXDSkQf1Y/видео.html

  • @mid1chosen
    @mid1chosen 3 years ago

    Hello sir, please make a video on the chi-square test and the Wald test.

    • @statquest
      @statquest  3 years ago

      Do you want more than what I say at 9:27?

    • @mid1chosen
      @mid1chosen 3 years ago

      @@statquest Yes sir, I am having difficulty grasping it. If you can, then please make a video.

    • @statquest
      @statquest  3 years ago

      @@mid1chosen noted

  • @mohsenm9641
    @mohsenm9641 3 years ago

    Nice tutorial! Thank you. Can I ask how you create these slides?

    • @statquest
      @statquest  3 years ago +1

      Thank you! For details on how I make these videos, see: ruclips.net/video/crLXJG-EAhk/видео.html

  • @Salvador_Dali
    @Salvador_Dali 1 year ago

    2:43 - I'm struggling to understand why we divide 23 by 117 instead of 23 by 140, which is the sum of all people who have the gene (likewise 6/216 for no gene). Note: the rounded numbers are the same, but still... please help - what am I missing?

    • @statquest
      @statquest  1 year ago +1

      When we divide 23 by 117 we get the odds. If we divide 23 by 140, we get the probability. In this case, we are interested in the odds and not the probability, so we divide by 117. For more details on the difference between odds and probabilities, see: ruclips.net/video/ARfXDSkQf1Y/видео.html
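
      The two divisions in the reply above can be checked with a few lines of Python. This is just an arithmetic sketch using the counts quoted in this thread; the variable names are made up for illustration.

```python
import math

with_cancer = 23      # people with the mutated gene who have cancer
without_cancer = 117  # people with the mutated gene who do not have cancer

# Odds: those with cancer divided by those without cancer.
odds = with_cancer / without_cancer

# Probability: those with cancer divided by everyone with the mutated gene.
probability = with_cancer / (with_cancer + without_cancer)

print(f"odds = {odds:.3f}, probability = {probability:.3f}")

# The two are linked by odds = p / (1 - p).
assert math.isclose(odds, probability / (1 - probability))
```

      The denominators differ (117 versus 140), so the odds always come out a little larger than the probability.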

    • @Salvador_Dali
      @Salvador_Dali 1 year ago +1

      @@statquest That was fast! Thx a bunch!

  • @shamshersingh9680
    @shamshersingh9680 11 months ago

    At time stamp 7.55, the statement should be "so, if the is not associated with (23 + 117) = 140 people with mutated gene, then..."

    • @statquest
      @statquest  11 months ago

      Other than grammatical errors, I'm not sure how your version is different from mine. Can you clarify?

  • @tamzzzz1
    @tamzzzz1 6 years ago +1

    Great job!!!! I would like to ask for similar videos that explain the Metropolis-Hastings algorithm. Thank you :)

    • @statquest
      @statquest  6 years ago

      That's a great idea! I'll add it to my "To-Do" list.

  • @dr.hanawamin6294
    @dr.hanawamin6294 3 years ago

    Hello, it was somewhat difficult for me, especially the part about the tests. Thanks

    • @statquest
      @statquest  3 years ago +1

      I hope my video helped! :)

  • @fanzhang3746
    @fanzhang3746 2 years ago

    11:27 Why estimated SD from observed data = sqrt(1/count_00 + 1/count_01 + 1/count_10 + 1/count_11) ?

    • @statquest
      @statquest  2 years ago

      That would take a whole StatQuest to explain.
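
      While the derivation is out of scope here, applying the formula from the question is straightforward. The sketch below plugs in the counts quoted in these comments (23, 117, 6, and 216 - 6 = 210, where the 210 is inferred rather than stated) and assumes a two-sided test against a standard normal distribution; the variable names are illustrative only.

```python
import math

# Observed 2x2 counts (the last one is inferred from 216 - 6).
mutated_cancer = 23
mutated_no_cancer = 117
not_mutated_cancer = 6
not_mutated_no_cancer = 210

counts = (mutated_cancer, mutated_no_cancer,
          not_mutated_cancer, not_mutated_no_cancer)

# log(odds ratio): log of the odds with the gene over the odds without it.
odds_ratio = (mutated_cancer / mutated_no_cancer) / (
    not_mutated_cancer / not_mutated_no_cancer)
log_odds_ratio = math.log(odds_ratio)

# Estimated standard deviation: sqrt of the sum of 1/count over all 4 cells.
standard_error = math.sqrt(sum(1 / count for count in counts))

# Number of standard deviations the observed value is away from 0.
z = log_odds_ratio / standard_error

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"log(odds ratio) = {log_odds_ratio:.2f}")
print(f"standard error  = {standard_error:.2f}")
print(f"z = {z:.2f}, p-value = {p_value:.6f}")
```

      Note how the smallest cell (6 people) contributes the largest term, 1/6, so sparse cells dominate the uncertainty.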

  • @Fiona0725
    @Fiona0725 5 years ago

    this is awesome

  • @sierrasemko4505
    @sierrasemko4505 2 years ago +1

    Josh, you are the fucking man!

  • @shresthaditya2950
    @shresthaditya2950 2 years ago +1

    11:49-Wald Test Working