Fantastic video, as always clearly explained! One minor point of confusion - for the Wald simulated histogram, in each iteration of the loop, a **single sample** of size N is selected. Then, as stated in the video and as commented in @evanrushton1's R code below, a random number between 0 and 1 is selected, and if the number is < 0.08, the **sample** is described as 'having cancer'; however, in the code, this command generates a vector, of size N, containing N random numbers between 0 and 1, which seem to instead represent the probability that each **person** in the sample has cancer. Same for the **sample** having the mutation vs **each person in the sample** having the mutation...as each iteration of the loop generates a 2x2 table containing integers representing the number of **persons** in each cell, from which to calculate log(OR). Apologies if this is semantic but this distinction confused me in the video and as such, it wasn't obvious to me how to turn those sample-based probabilities into a loop until I saw the code (and thanks to both Josh and Evan for sharing their R code). Does anyone have a simple explanation if this confused them as well?
First of all, thanks for this series! It's cool how Wald's test can show a relationship or the absence of one. But we got the distributions of cancer people and mutated people independently, meaning we said that 8% of samples have cancer and 39% of samples have the mutated gene. As I look at the names of the features ("cancer", "mutated gene"), I am biased to believe that there is a relationship. But if the 2nd feature had the name "wearing green shoes", I would assume there is no relationship, and Wald's test would still show a p-value
Odds of something are themselves a ratio (probability of it happening / probability of it not happening). The odds are usually the odds of x given y (in this case, the odds of getting cancer given that you have the gene). The log(odds ratio) is the log of a ratio of odds, where the "given" condition changes - equivalently, the difference between the two log odds (not a ratio of log odds). This tells us whether the "given" condition has an effect on the odds of x happening. Example: 3:40 log(odds ratio) = log( (odds of getting cancer | you have the gene) / (odds of getting cancer | you don't have the gene) ). This can tell us the effect of the given condition (given you have the gene vs given you don't have the gene) on the odds of x happening (cancer vs no cancer)
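A quick numeric check of this, using the 2x2 counts quoted elsewhere in the thread (a sketch - the counts come from the comments, not from re-watching the video):

```python
import math

# counts from the thread's 2x2 table
cancer_gene, no_cancer_gene = 23, 117        # have the mutated gene
cancer_no_gene, no_cancer_no_gene = 6, 210   # do not have the gene

odds_with_gene = cancer_gene / no_cancer_gene
odds_without_gene = cancer_no_gene / no_cancer_no_gene

odds_ratio = odds_with_gene / odds_without_gene
log_odds_ratio = math.log(odds_ratio)  # natural log, as in the video

print(round(odds_ratio, 2), round(log_odds_ratio, 2))  # 6.88 1.93
```

Note that log(odds_with_gene / odds_without_gene) equals log(odds_with_gene) - log(odds_without_gene), which is why the log(odds ratio) sits at 0 when the two odds are the same.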
One other point: this applies to the case where the dependent variable is categorical, because when the dependent variable is continuous we would use an F-value to determine whether the model is significant
Small mistake at 8:50: the expected values for the Chi-square test in the second row (for no mutated gene) should be 17.8 for cancer and 198.2 for no cancer. It's easy to check without computing, because they should add up to the integer totals under each column. Anyway, great video and thank you for your work
How did you get 17.8? When I multiply the number of people that do not have the mutated gene, 216, by the probability that someone will have cancer, 0.08, I get 17.3.
Just to clarify for myself regarding Fisher's exact test - to calculate the p-value, we are calculating the sum of probabilities of getting >= 23 cases of cancer if we are choosing 140 (= 23 + 117) people from the total 356 (= 23 + 117 + 6 + 210) people. Is my understanding correct? I have only used Fisher's exact test in school when the sample size is really small and all the cases can be laid out in a table by hand, and have never done it using a software package. I know scipy has a fisher_exact function but admit I haven't read its docs or used it. Is there an R package that you would recommend for doing it? Thanks as always 😊
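For anyone else wondering, here's a minimal sketch with scipy's fisher_exact using the counts quoted above (in R, no extra package is needed - the base stats function fisher.test(matrix(c(23, 6, 117, 210), nrow = 2)) does the same thing):

```python
from scipy.stats import fisher_exact

# rows: (mutated gene, no mutated gene); columns: (cancer, no cancer)
table = [[23, 117],
         [6, 210]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio)  # (23/117) / (6/210) ≈ 6.88
print(p_value)     # very small, so we reject "no relationship"
```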
Hi Josh, thanks for uploading these lectures! I really appreciate it! I'm still confused about why we use z-values in logistic regression. Is the reason that we use log(odds), which has a mean of 0 and is symmetrically distributed? Or, more fundamentally(?), because our response variables are 'binomial'? Thanks so much!
I'd suggest that we use the z-value because the Wald test uses it to test whether a variable (predictor) in logistic regression is zero (null hypothesis) or not (alternative hypothesis). If the variable's z-value (according to the Wald test) falls within some range (for example ±1.96, which is about 2 standard deviations), we cannot reject the null hypothesis and can drop that variable from the logistic regression because it's useless. Hope my answer helps you after 2 years :D
Hello Josh, I was confused at 12:02 in the video. You said that "Wald's test typically uses the estimated standard deviation"; but in reality we replaced the histogram with the normal curve having the standard deviation of the observed values, i.e. 0.47. Hence, shouldn't it be "Wald's test typically uses the observed standard deviation" instead of the expected one? Based on my understanding, the expected standard deviation would come from the earlier matrix of expected values which we calculated, and here we have used the observed-values matrix to get a standard deviation of 0.47. Let me know if I am missing something, but I am kind of lost in this section of the video and could use your help. Thanks again for an amazing video.
13:20 However, this is traditionally done using a standard normal curve (i.e. a normal curve with mean = 0 and standard deviation = 1). I am a little bit confused about the word "traditionally". Are you saying that the log(odds ratio) is not necessarily distributed according to standard normal distribution?
Ah. The log(odds ratio) is normally distributed, but the standard deviation is not always 1. Traditionally, you convert the distribution for the log(odds ratio) to a standard normal curve (mean = 0, sd = 1) by dividing by the standard deviation (which you calculate using the method of your choice). I say "traditionally", because that's how you had to solve for the p-value back in the day. You had a table of values for a standard normal curve in the back of some book, and whenever you needed a p-value for a normal curve, you converted your normal distribution to a standard one (by subtracting the mean and dividing by the sd) so you could reference the table in the back of the book. These days, however, a computer can calculate the p-value for any normal curve, so it's not really important to transform to a standard normal distribution any more (however, everyone still does it).
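Both routes give the same answer, which is easy to verify with scipy (a sketch using the log(odds ratio) of 1.93 and sd of 0.47 quoted in this thread):

```python
from scipy.stats import norm

log_or = 1.93  # observed log(odds ratio)
sd = 0.47      # its estimated standard deviation

# traditional: standardize first, then use the standard normal curve
z = log_or / sd
p_standardized = 2 * norm.sf(z)

# modern: use a normal curve with the estimated sd directly
p_direct = 2 * norm.sf(log_or, loc=0, scale=sd)

print(p_standardized, p_direct)  # identical two-sided p-values
```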
I was taught that when I see the equation log(x) to assume the base of the logarithm is 10. If I want to do logarithm with base e, I have always written ln(x). It took me a while to figure out why my math didn't work out the same as in the video.
It's unfortunate that the conventions for what log() means are not consistent. In statistics, machine learning and computer programming, the default base for the log() function is 'e'. Thus, throughout this video I use the natural logarithm, or log base 'e', to do the calculations.
@@statquest Thanks a lot. I noticed this when I used the log function in R. I am reviewing statistical principles as I write my prospectus for my Master's thesis and your videos are extremely useful.
Your music gives me Fitz and the Tantrums vibes, which I am enjoying. A question: in this example you used a 2 x 2 contingency table and calculated the odds ratio for the positive response (cancer with a mutated gene). In my case, let's say *hypothetically* I am tracking the rate at which surgeons perform 2 procedures, A and B, over the past 10 years.
I can calculate the odds of procedure A and procedure B and then find the odds ratio. In order to say the odds are changing, what statistic would I use?
Fisher's exact test and the Chi-square test could work well with this data (they would tell you that the proportions between procedures A and B change).
Hi Josh, I just wanna make sure that I'm correct. You assume there is no relationship between the gene and cancer, and a log(odds ratio) equal to 0 means no relationship at all. Then, based on what you observed from an experiment, the log(odds ratio) is 1.93, and the p-value is something extremely small given that there is no relationship, which means it would be really rare for this to happen when there is no relationship. But it did happen, so we need to reject that there is no relationship and accept that there is a relationship between the gene and cancer. Am I correct? Really confused.
The small p-value tells us that it would be very rare for random chance to give us the observed log(odds ratio). Thus, we reject that the observed log(odds ratio) is due to random chance. Does that make sense?
Hi Josh, at time stamp 13:59 you say that the p-value that the mutated gene does not have a relationship with cancer is 0.00005, which means that there is no relationship between the mutated gene and cancer and the log(odds ratio) is not statistically significant. Now, if I understand correctly, at time stamp 12:53 it is written that the log(odds ratio) is statistically significant, which means that the mutated gene and cancer have a significant relationship. Can you please tell me what I am missing here?
The p-value for the hypothesis that there is no relationship between the mutated gene and cancer is 0.00005. This means that we reject the hypothesis that there is no relationship between the mutated gene and cancer. To learn more about what p-values mean and how they are interpreted, see: ruclips.net/video/vemZtEM63GY/видео.html and ruclips.net/video/JQc3yx0-Q9E/видео.html
Hi Josh. Just started watching your videos. Currently going through your 'Statistics Fundamentals' playlist. In this video, at the 7:50 mark, you mention 'So, if the gene is not associated with the 140 people with the mutated gene...', shouldn't the assumption be that 'So, if we assume that cancer is not associated with the 140 people with the mutated gene...'. That's the only reason you used the expected value using the 'estimated' population probability of having cancer (0.08) to calculate the 'expected value' of the number of people having both the mutated gene (140) and cancer.
I love your vids. I don’t know where to begin when I look at the number of quality videos on your channel. Is there an organization of the videos that would help?
At 13:50 of the video, the p-value that the mutated gene does not have a relationship with cancer is 0.00005. So does that mean the mutated gene has a strong association with cancer? P.S.: It would be great if, each time you give an example, you had a conclusion for it, so it wouldn't be confusing. Great video!
You are correct - I should have had a more obvious conclusion to this example. Often I do, but I forgot in this case. The small p-value says the association isn't just due to chance or noise. How strong that relationship is, however, is determined not by the p-value, but by the log(odds ratio). In other words, if we had a small log(odds ratio) and small p-value, we would have a significant, but weak association. If you have a large log(odds ratio) and a small p-value, you have a significant and strong association.
To calculate the p-value, we estimate the mean and standard deviation of a normal distribution and then we use the area under that curve to calculate the p-value.
Dear Josh, where can I see the StatQuest "Enrichment Analysis using Fisher's Exact Test and the Hypergeometric Distribution"? I am a little confused about it
Which base are you using for the logarithm? The traditional base for statistics and machine learning and most programming languages is 'e', or the natural logarithm. So log_base_e ((2/4) / (3/1)) = -1.79.
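A quick way to see the difference (math.log in Python, like log() in R, defaults to base e):

```python
import math

x = (2/4) / (3/1)                # the odds ratio in question

print(round(math.log(x), 2))     # natural log (base e): -1.79
print(round(math.log10(x), 2))   # base 10 gives a different value: -0.78
```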
Hi Josh, at 10:24 how did you draw the sample for the histogram? You iterate through 325, and each time you draw a random number to determine which of the 4 cells it should go in (yes/yes, yes/no, no/yes, no/no)? From your step instructions, it looks like I can only determine the whole cancer column and the whole mutated row, but I cannot derive all 4 cells of the matrix. Many Thanks!
For each sample, if the first random number is less than 0.08, then the sample has cancer, otherwise it does not. If the second random number is less than 0.39, then it is mutated, otherwise it is not.
@@statquest Many thanks! I figured it out too and confirmed with some Python code. One more question about 14:14 - what do you mean by the content from this point on? You generated 10,000 log(odds ratio) data points based on the observed probabilities (0.08 and 0.39), and it's not standard normal. How do you apply a test to it? Previously in your content, Fisher's, Chi-square and even Wald's tests all work on a 2x2 matrix - how do you manage to apply them to a normal distribution? How exactly did you calculate the p-values, for example for Chi-square and Wald's, based on your 10,000 generated log(odds ratios)? Quite confused here. Please advise at your convenience, really appreciate it!
@@kevinshao9148 I generated a single matrix of values and then I applied the Fisher, chi-square, and Wald's tests to it to get the p-values for the three tests. I then repeated the process 10,000 times and calculated the percentage of p-values < 0.05 for each test.
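That procedure can be sketched in a few lines of Python. This is a sketch, not the video's exact code: the sample size and probabilities are the ones quoted in this thread, and I use only Fisher's exact test and fewer iterations to keep it quick:

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(42)
n_people, n_runs = 356, 1000
p_cancer, p_mutated = 0.08, 0.39

significant = 0
for _ in range(n_runs):
    # assign cancer and the mutation independently, so any apparent
    # relationship within a run is pure random chance
    cancer = rng.random(n_people) < p_cancer
    mutated = rng.random(n_people) < p_mutated
    table = [[int(np.sum(cancer & mutated)), int(np.sum(~cancer & mutated))],
             [int(np.sum(cancer & ~mutated)), int(np.sum(~cancer & ~mutated))]]
    _, p = fisher_exact(table)
    significant += (p < 0.05)

false_positive_rate = significant / n_runs
print(false_positive_rate)  # should be near (or a bit below) 0.05
```

If the test works as expected, about 5% of the runs give p < 0.05 even though there is no real relationship (Fisher's exact test is conservative, so the rate often lands a little below 0.05).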
Just a heads up Josh, The Fisher's Exact Test video is not in any of the playlists for your videos. I found it by googling it, but just in case folks don't think about that solution.
I'm a bit confused - in another video (in Spanish) they use the Wald test as (OR minus its value under H0, i.e. 1) / se(OR), giving 12.44 and a very low p-value
It's in the long range plan, but that means it won't happen for a long time... Which is a bummer! I wish I could make the videos faster, but each one takes a long time.
Hi Josh, we need a clearly explained StatQuest on Fisher's exact test, the chi-square test and the Wald test. It will be a TRIPLE BAM!!! if the hypergeometric distribution is explained. Can you also show a demo of how to do these using a Python notebook?
Hi Josh - Many thanks for your videos. At 10:46 - this gives a matrix that does not depend on the relationship between the mutated gene and cancer. If there is no relationship, can the matrix even be formed? Because, as per my understanding, the margin total proportions in a matrix cannot vary at any cost, right? I have one sample where I'm stuck forming a matrix: 1. Sample size - 366; 2. Random number for cancer between 0 and 1 - 0.19; 3. Random number for mutated gene between 0 and 1 - 0.14. Also, will log() result in a negative number? I could only see positive output when I apply log(). Please clarify
When we make a matrix that only depends on the overall proportion of people with cancer and the overall proportion of people with the mutated gene, and not the known proportion of people with cancer AND the gene, then the matrix will not depend on the relationship between the gene and cancer. Also, as you can see on the x-axis in the histogram, the log can give us negative numbers.
@@statquest thanks Josh... understood. So the matrix margin sums for the random numbers should add up to 325 both row-wise and column-wise, right? Instead of all 4 cells adding up to 325 or any number between 300 and 400?
@@KishoreKumar-fv2cx Because we are randomly deciding how many samples have cancer, the column and row totals will always be different. However, the total for the entire matrix will be the number between 300 and 400 that you came up with at the start.
How do you add multiple odds ratios? Let's say a patient has multiple risk factors for developing schizophrenia, each risk factor with a 2-fold increase in odds (i.e. having an older father, smoking cannabis, birth complications). Would you add them all together or multiply them?
Unfortunately neither of those options are good (see stats.stackexchange.com/questions/12294/can-individual-odds-ratios-be-added-to-get-one-pooled-odds-ratio-to-compare-to-a ) If you have multiple factors, I think your best option is to fit a logistic regression model to your data. For details, see: ruclips.net/p/PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe
Hi Josh, great video - I really appreciate your videos very much. I have a question regarding the estimation of the standard deviation in the Wald test. You said in the video that it is more common to estimate the standard deviation from the observed values. But I don't understand why the standard deviation is calculated as the square root of the sum of 1 over each observed value. May I ask for an explanation of that, as I am quite confused by this part of the video? Thank you very much and sorry for any inconvenience caused.
I don't understand the standard deviation in the example. The log(odds ratio) stays the same regardless of the sample size, as the ratio stays the same, but the standard deviation does change with the sample size. Consider multiplying all the above numbers by 100: 2300 cancer&mutant, 600 cancer&no_mutant, 11700 no_cancer&mutant, 21000 no_cancer&no_mutant. The odds ratio stays the same: (2300/11700) / (600/21000) = 6.88. However, the standard deviation goes from 0.47 to 0.047, which would indicate that we are 41 standard deviations away from the mean (= log(6.88)/0.047)?
That is correct. We are calculating a standard error, which is a function of the sample size: the larger the sample size, the smaller the standard deviation. In other words, the larger the sample size, the more confidence we have that the sample ratio is close to the population ratio.
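The arithmetic in this thread is easy to verify (a sketch using the cell counts quoted in the comments):

```python
import math

def log_or_se(a, b, c, d):
    """Wald standard error of log(odds ratio): sqrt(1/a + 1/b + 1/c + 1/d)."""
    return math.sqrt(1/a + 1/b + 1/c + 1/d)

se = log_or_se(23, 117, 6, 210)
print(round(se, 2))      # 0.47, matching the video

# multiplying every count by 100 shrinks the standard error 10-fold
se_big = log_or_se(2300, 11700, 600, 21000)
print(round(se_big, 3))  # 0.047

log_or = math.log((23/117) / (6/210))
print(round(log_or / se_big))  # 41 standard errors away from 0
```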
@@statquest thank you for your response! Really amazing videos you have, you have a gift for explaining complicated topics! What would be the best way to calculate standard deviation of log(OR) distribution? SE * sqrt(sample size)?
We pick two numbers. We check to see if the first number is < 0.08. Then we check to see if the second number is < 0.39. So we don't check to see if the same number is < 0.08 and < 0.39.
Thank you for this great presentation. It seems to me that there is an error at 8:53. You say "198.7 people with the mutated gene are...". It is supposed to be "198.7 people WITHOUT the mutated gene...". Please check this. Thanks
@@statquest - Yes, I did. The video was really good so I wanted to practice it in R. I couldn't see if anyone else had already posted something like this. I'm happy to delete/edit if needed.
@@statquest Since the p-value we get from a chi-squared test only tells us whether there is an association between two variables, could you make a video on how to calculate the strength of this relationship. Thanks.
@@tribikramadhikari2470 That's what this video is all about. Odds and log(odds) ratios are like R-squared. They give you a sense of the magnitude of the association.
If we were to calculate log(odds ratio) of many random samples of same size from a population and plot their frequency distribution, we will get a sampling distribution. Though the odds ratio of each sample are different, won't the resulting sampling distribution (plotted using odds ratio of many such samples) have just one standard error irrespective of the outcomes of each sample? If that is true, why is the standard error of sampling distribution a function of sample outcomes?
You could ask the exact same question about calculating the mean and its standard error from a single collection of observations. The reason we can do this is that as the sample size increases, the difference between the standard error calculated from a single collection of observations and the standard error calculated from the distribution of millions of means (or log(odds ratios)) goes to 0. So we could do either one to get the estimate we want. However, calculating the standard error from a single collection of observations is much more practical.
@@statquest That helps a lot sir. It amazes me how you manage to get time to respond to all the questions in the comments. This channel has made learning stats so much easier. Thank you so much! Your dedication is very inspiring.
I actually chatgpted it. Here's what it says:

"The histogram of log(odds ratios) of two unrelated variables appearing as a normal distribution can be explained by the Central Limit Theorem (CLT). The Central Limit Theorem states that when independent random variables are summed or averaged, their distribution tends to approach a normal distribution, regardless of the shape of the original distribution, as long as certain conditions are met.

In the case of log(odds ratios) of two unrelated variables, assuming that the individual variables are independent and have finite means and variances, the CLT applies. The log(odds ratio) is a transformation applied to the individual variables, which may have their own distributions. However, when the log(odds ratios) are calculated, the values from the two variables are combined in a way that follows the principles of addition and subtraction. Since the variables are unrelated, their contributions to the log(odds ratios) are independent.

As a result, when the log(odds ratios) of two unrelated variables are calculated repeatedly and plotted in a histogram, the CLT suggests that the distribution of the log(odds ratios) will tend towards a normal distribution. This means that the histogram will exhibit the characteristic bell-shaped curve commonly associated with a normal distribution.

It's important to note that the applicability of the CLT assumes a sufficiently large sample size. For small sample sizes, the normality assumption may not hold, and other factors might influence the shape of the distribution."
I think I get it now: if two variables are not related, then the odds should be the same and hence the odds ratio equal to one. With 1) the central limit theorem, the ratio will center around 1; 2) the log transformation, the ratio will center around 0; 3) both the CLT and the log, the resulting distribution is a symmetrical normal curve. Hopefully this is the correct intuition.
I just have a question (basically two) for you, Josh. I went through all of your videos and they are literally clearly explained and quite easy for me to comprehend. But the thing is that after completing all of the videos, it's hard for me to remember everything. I noted everything, but it seems like I tend to forget a lot of things. Is that natural? And in that case, what should I do? I really want to be a machine learning engineer and researcher as well, and I'll apply for fall22 for a Ph.D. in the USA. So, it looks overwhelming and at the same time enthralling to learn new stuff.
Jotting your own notes does help. Even if you just remember the rough idea later, it is much easier to pick it up again by reading the notes. Using some modern note-taking software like Notion gives you an extra BAM.
When you take the log, will the data always end up normally distributed? In the case of a dependent variable whose data is not normally distributed, would transforming it always guarantee normality?
2:43 - I'm struggling to understand why we divide 23 by 117 instead of 23 by 140, which is the total number of people who have the gene (likewise 6/216 for no gene). Note: the rounded numbers are similar, but still... please help - what am I missing?
When we divide 23 by 117 we get the odds. If we divide 23 by 140, we get the probability. In this case, we are interested in the odds and not the probability, so we divide by 117. For more details on the difference between odds and probabilities, see: ruclips.net/video/ARfXDSkQf1Y/видео.html
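The difference is easy to see with the numbers from this example:

```python
# 140 people have the mutated gene: 23 with cancer, 117 without
cancer, no_cancer = 23, 117
total = cancer + no_cancer

odds = cancer / no_cancer      # 23/117 ≈ 0.197
probability = cancer / total   # 23/140 ≈ 0.164

# the two are related: odds = p / (1 - p)
print(round(odds, 3), round(probability, 3))
print(round(probability / (1 - probability), 3))  # equals the odds
```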
Because this p-value is < 0.05 we can reject the hypothesis that there is no relationship between the mutated genes and cancer. Thus, we can conclude that there is some (possibly indirect) relationship.
12:07 - I am sure I am missing something trivial here, but why do we calculate the standard deviation as sqrt(sum of 1/observation)? Aren't we supposed to calculate the mean and then the deviations from the mean?
Think about this data some more. It's not a bunch of individual measurements (like measuring how tall a bunch of people are). Instead, this data is more like a summary of counts. For example, we have 23 people who have cancer and the mutated gene and 117 people who do not have cancer but have the mutated gene. What would the mean of those two numbers represent? I have no idea, and that tells us that this data is different, and thus, we need a different way to calculate the standard error.
3:46 To check the relationship between mutated gene and cancer, why not just take the odds like this: p(mutated gene and cancer) : p(mutated gene and no cancer). We would like our odds ratio to give us a picture of correlation between the presence of the gene and cancer; and not the absence of the gene and cancer. So why do we take 6/210 in the denominator?
We are actually very interested in the relationship between the absence of the gene and cancer - this is the background noise and models the random chance of getting cancer without the mutation. If there is a relationship between the mutation and cancer, then it should be stronger than the background noise and the random chance of getting cancer without the mutation. So we are comparing the relationship relative to background. Does that make sense?
@@statquest The logic you gave makes sense. However, I would like to know why did we calculate the p value of only the sample in the first row at 6:51. If we were interested both in presence and absence of the mutated gene , we should have calculated the p value of the bottom row as well. After all, our odds ratio consists of both the presence and absence of mutated gene like I previously mentioned. Thank you for replying, as always! :)
@@doubletoned5772 Calculating the p-value for the top row requires that we use all of the data, including the bottom row. For more details on how this p-value was calculated, see the StatQuest: Enrichment Analysis using Fisher's Exact Test and the Hypergeometric Distribution: ruclips.net/video/udyAvvaMjfM/видео.html
Hi Josh, was that the odds ratio of the outcome and not the odds ratio of the exposure? The value may be the same, but I think the principle is wrong. Please help me understand.
Very useful video. I have a question: how do I calculate a p-value with the sd of the OR? I tried with your example using pnorm in R, but I got 0.00004.
With OR you can say Y was 6 times as likely for those who got treatment X compared to those that didn't. How can you write a sentence like this using log(OR)? Would it be incorrect to use the Log(OR) in a sentence like this?
Thanks for the video! Standard error is defined for sampling distribution, right? So it shouldn't vary with our sample. But if we are taking observed values from our samples in the formula to calculate standard error, wouldn't standard error vary with each sample?
I'm not sure I understand your question. As far as I can tell, the standard deviation of the log(odds ratio) changes depending on what values you get when you collect your sample.
@@statquest Sorry, I meant standard error of sampling distribution of log(odds ratio) not standard deviation. Does standard error vary with our sample?
@@farheenahmed4683 Ah, I think I see. Unfortunately the terminology is a little confusing in this situation. The "standard error" = the standard deviation of the mean. In this case, we are using the log(odds ratio) as the mean, thus, the standard deviation of the log(odds ratio) is also the standard error. In other words, both terms refer to the same variation in the log(odds ratio)'s.
Thank you so much for clarifying it so promptly! I went ahead and watched your video on standard deviation vs standard error. From what I understand, standard deviation is variance within "a set" of measurements (a sample) and standard error is the variance of samples statistics from the mean of sample statistics obtained from "multiple sets" of measurements (all the samples in our sampling distribution). But for log (odds ratio) , we cannot really calculate variance "within" a set of measurements or a sample (in case of binary/dichotomous variables) therefore standard deviation cannot be calculated for a sample. But when it comes to "a set" of measurements/sampling distribution, variance can be calculated in this step and standard deviation of a sampling distribution here is the same as its standard error. Is that right?
Ok, for anyone else wondering about this as well, it is because Josh is using base "e" for the log. It is usually denoted as ln, but since base e is commonly used in statistics and machine learning, he is using log base e.
In statistics, machine learning and computer programming, the default base for the log() function is 'e'. Thus, throughout this video I use the natural logarithm, or log base 'e', to do the calculations.
Corrections:
8:53 I meant to say "the remaining 198.7 people without the mutated gene".
14:27 I correctly wrote "If the tests worked as expected, 5% should have p-values less than 0.05". That is correct. However, when recording the voiceover, I said "5% should have p-values less than 0.5", which is not correct.
NOTE: In statistics, machine learning and computer programming, the default base for the log() function is 'e'. Thus, throughout this video I use the natural logarithm, or log base 'e', to do the calculations.
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
thanks for clarification. tiny bam!
Why do we take the natural log vs log base 10?
@@MM-jm1il I believe it is because the derivative of ln(x) is super simple, just 1/x. Likewise, the derivative of the exponential function, e^x, is e^x.
Chi Squared test!!
@@speakers159 Noted!
It is amazing how many math students who are advanced in math and algebra lack very basic foundation in statistics. Your materials are highly valuable. It’s like the Neanderthals discovering fire.
Thanks!
I've been excelling at math without understanding such fundamentals of statistics which I've been silently ashamed of. Your materials have significantly improved my understanding. Hats off to you sir. Thank you.
Thanks! :)
As a full-time machine learning engineer, I must say I love the animations and your clear concise explanations! Your videos really helped me understand the fundamentals to succeed in my field. Thank you
Wow, thanks!
Graduate students across the globe thank you for these excellent videos. Top marks sir.
This is the best series of stats videos I have ever seen on youtube. You, sir, are a legend
Thank you so much!!! I'm so happy to hear that you like the videos. :)
Five years later and I still feel this is absolutely true!
With videos like these, the odds of learning something are good, and it's easy to log a lot of time here!
TRIPLE BAM! :)
Actually StatQuest is on my mind all night long
BAM! :)
This is so helpful. I have been struggling with log(odds ratio) for a while; the meaning of the log part confused me, and I never got why the log is necessary for the calculation. Now I finally understand, thank you so much
Glad it helped!
You are the man Josh. Wish I had a teacher like you when I was in school.
Thank you so much!! I'm glad you like the videos. :)
Great video! I needed this refresher. Very clear and concise! A point of clarification: at approximately 14:18 the voice over says that "...5% should have p-values of less than 0.5." I believe you meant "less than 0.05" as the slide shows. I wanted to clarify for listeners.
Thanks for catching that. I just added a pinned comment with that note so that it will be easy to find by future viewers.
Fantastic video, as always clearly explained! One minor point of confusion - for the Wald simulated histogram, in each iteration of the loop, a **single sample** of size N is selected. Then, as stated in the video and as commented in @evanrushton1's R code below, a random number between 0 and 1 is selected, and if the number is < 0.08, the **sample** is described as 'having cancer'; however, in the code, this command generates a vector, of size N, containing N random numbers between 0 and 1, which seem to instead represent the probability that each **person** in the sample has cancer. Same for the **sample** having the mutation vs **each person in the sample** having the mutation...as each iteration of the loop generates a 2x2 table containing integers representing the number of **persons** in each cell, from which to calculate log(OR).
Apologies if this is semantic but this distinction confused me in the video and as such, it wasn't obvious to me how to turn those sample-based probabilities into a loop until I saw the code (and thanks to both Josh and Evan for sharing their R code). Does anyone have a simple explanation if this confused them as well?
Thanks!
Man I love the intro song to your video, so calming... my exam is on 15th of April pray for me guys...
My most awaited topic, thank u very much.
Big fan😊
Hooray!!! You're welcome. I'm glad you like the video! :)
Another confusion clarified, thanks Josh.
BAM! :)
First of all, thanks for this series! It's cool how Wald's test can show a relationship or the absence of one. But we got the distributions of cancer people and mutated-gene people independently, meaning we said that 8% of samples have cancer and 39% of samples have the mutated gene. As I look at the names of the features ("cancer", "mutated gene"), I am biased to believe that there is a relationship. But if the 2nd feature had the name "wearing green shoes", I would assume there is no relationship, and Wald's test would still show a p value
That's interesting. I guess you have to try to ignore the names. Maybe just label them "a" and "b".
You deserve waaaay more subscribers!
Thank you very much! :)
You're a god at explaining difficult things so simply
Thanks a lot 😊
Thanks for another great video! Looking forward to the StatQuest about the Chi-square test))
It's coming!
Odds of something is itself a ratio (probability of it happening/probability of it not happening)
The odds are usually odds of x given y (in this case the odds of getting cancer given that you do have a gene)
The log(odds ratio) is the log of a ratio of odds (equivalently, a difference of log odds), where the "given" condition changes between the numerator and the denominator. This tells us whether the "given" condition has an effect on the odds of x happening.
Example: 3:40
Log odds ratio = log( (odds of getting cancer | you have the gene) / (odds of getting cancer | you don't have the gene) )
This can tell us the effect of the given condition (given you have the gene vs given you don’t have the gene) on the odds of x happening (cancer vs no cancer)
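A quick numeric sketch of the summary above, using the 2x2 table from the video (23 and 117 with the gene, 6 and 210 without). Note that the log of a ratio of odds works out to a difference of log odds:

```python
import math

# 2x2 table from the video:
#                  cancer   no cancer
# mutated gene       23        117
# no mutated gene     6        210

odds_with_gene = 23 / 117        # odds of cancer given the mutated gene
odds_without_gene = 6 / 210      # odds of cancer given no mutated gene

odds_ratio = odds_with_gene / odds_without_gene   # ~6.88
log_odds_ratio = math.log(odds_ratio)             # natural log, ~1.93

# the log of a ratio of odds equals a difference of log odds:
same_thing = math.log(odds_with_gene) - math.log(odds_without_gene)
```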
This channel is gold
Thank you!
Hooray,
I've made it to the end of the stats playlist ...
triple bam! :)
Hey Josh--looking for a great explanation of odds ratios for my students, and bam ! there you are. Please say hi to Jack for me!
BAM!!! Wow! It's Al Bardi!!!! Cool. I'll definitely pass the word on to Jack.
Amazing video. Could you please, one day, make a video explaining why the standard deviation is calculated from the inverses of the observed values? Thank you!! :)
I'll keep that in mind.
Came for the content, stayed for the intro
bam!
One other point: this applies to the case where the dependent variable is categorical, because when the dependent variable is continuous we would use an F-value to determine whether the model is significant
(waiting for that statquest on chi-square test) but still thank you
small mistake at 8:50: the expected values for the chi-square should be, in the second row (for no mutated gene), 17.8 for cancer and 198.2 for no cancer. It's easy to check without computing, because they should add up to integer totals under each column
but anyway great video and thank you for your work
How did you get 17.8? When I multiply the number of people that do not have the mutated gene, 216, by the probability that someone will have cancer, 0.08, I get 17.3.
I have one word for your videos, Lucid!
Thank you! :)
Just to clarify for myself regarding Fisher's exact test - to calculate the p-value, we calculate the sum of probabilities of getting >= 23 cases of cancer if we are choosing 140 (= 23 + 117) people from the total 356 (= 23 + 117 + 6 + 210) people. Is my understanding correct?
I have only used Fisher's exact test in school, when the sample size is really small and all the cases can be laid out in a table by hand, and have never done it using a software package. I know scipy has a fisher_exact function, but I admit I haven't read its docs or used it. Is there an R package that you would recommend for doing it?
Thanks as always 😊
Yes, that is the idea. And Fisher's exact test is built into R: fisher.test(). So you can get help with ?fisher.test
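For anyone who wants to check the idea above without a stats package, here is a minimal sketch of the one-sided hypergeometric sum, using only the Python standard library (scipy.stats.fisher_exact with alternative='greater' should give a matching p-value):

```python
from math import comb

# Totals from the video's table: 23+117+6+210 = 356 people,
# 29 with cancer, 140 with the mutated gene.
N, K, n = 356, 29, 140   # population, total cancers, size of mutated-gene group

# One-sided Fisher p-value: probability of 23 or more cancers landing
# in the mutated-gene group purely by chance (hypergeometric tail).
p = sum(comb(K, k) * comb(N - K, n - k) for k in range(23, K + 1)) / comb(N, n)
```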
Hi Josh, thanks for uploading those lectures! I really appreciate that!
I'm still confused about why we use z-values in logistic regression. Is it because we use log(odds), which has a mean of 0 and is symmetrically distributed?
Or more fundamentally(?), because our response variables are ‘binomial’?
Thanks so much!
I'd suggest that we use the z-value because the Wald test uses it to determine whether a variable (predictor) in logistic regression is zero (null hypothesis) or not (alternative hypothesis). If the variable's z-value (according to the Wald test) is within some range (for example ±1.96, which corresponds to about 2 standard deviations), we can drop that variable from the logistic regression because it's useless. Hope my answer helps you after 2 years :D
What test(s) do we use to establish whether a BAM!! is small or not?
Great question!!!
11:49-Wald Test Working
yep
Good explainer!
Glad you think so!
Hello Josh, I had a confusion at 12:02 in the video. You said that the - "Wald's test typically uses the estimated standard deviation" ; but in reality we replaced the histogram with the normal curve having standard deviation of the observed values i.e 0.47. Hence, shouldn't it be - "Wald's test typically uses the observed standard deviation" instead of the expected and then we have a normal curve of STD=0.47 at 12:02. Based on my understanding the expected standard deviation will come from the earlier matrix of expected values which we calculated and here we have used the observed values matrix for getting a STD of 0.47. Let me know if I am missing something, but I am kind of lost in this section of the video and could use your help. Thanks again for an amazing video.
In this situation "estimated" means "estimated from the observed data", so the estimated standard deviation comes from the observed data.
To infinity (and beyond!) nice
Thanks!!!! :)
You're the best, Josh!
De nada! :)
13:20 However, this is traditionally done using a standard normal curve (i.e. a normal curve with mean = 0 and standard deviation = 1). I am a little bit confused about the word "traditionally".
Are you saying that the log(odds ratio) is not necessarily distributed according to standard normal distribution?
Ah. The log(odds ratio) is normally distributed, but the standard deviation is not always 1. Traditionally, you convert the distribution for the log(odds ratio) to a standard normal curve (mean = 0, sd = 1) by dividing by the standard deviation (which you calculate using the method of your choice). I say "traditionally", because that's how you had to solve for the p-value back in the day. You had a table of values for a standard normal curve in the back of some book, and whenever you needed a p-value for a normal curve, you converted your normal distribution to a standard one (by subtracting the mean and dividing by the sd) so you could reference the table in the back of the book. These days, however, a computer can calculate the p-value for any normal curve, so it's not really important to transform to a standard normal distribution any more (however, everyone still does it).
Thank you so much for making it clear.
Excellent video! Thnx for the work!
You're welcome! :)
Your presentation is very good and interesting. Can I ask what software you use to prepare the videos?
I like your voice!
I used to use powerpoint and iMovie. Now I use Keynote and Final Cut Pro. However, this video in particular was done using Keynote and iMovie.
I was taught that when I see the equation log(x) to assume the base of the logarithm is 10. If I want to do logarithm with base e, I have always written ln(x). It took me a while to figure out why my math didn't work out the same as in the video.
It's unfortunate that the conventions for what log() means are not consistent. In statistics, machine learning and computer programming, the default base for the log() function is 'e'. Thus, throughout this video I use the natural logarithm, or log base 'e', to do the calculations.
@@statquest Thanks a lot. I noticed this when I used the log function in R. I am reviewing statistical principles as I write my prospectus for my Master's thesis and your videos are extremely useful.
Your music gives me Fitz and the Tantrums vibes, which I am enjoying.
A question: in this example you used a 2x2 contingency table and calculated the odds ratio for the positive response (cancer with a mutated gene). In my case, let's say *hypothetically* I am tracking the rate at which surgeons perform 2 procedures, A and B, over the past 10 years.
So my table is Procedure A, Procedure B x 2010, 2011, 2012....2020.
I can calculate the odds of procedure A and procedure B and then find the odds ratio. In order to say the odds are changing, what statistic would I use?
Fisher's exact test and the Chi-square test could work well with this data (they would tell you that the proportions between procedures A and B change).
Josh, great content, at 7.50 should it be "if cancer is not associated with the people with the mutated gene" ? Am confused 🤔!
Since we're looking at the data in terms of the rows, we are thinking about the gene's association with cancer.
Thanks
any time!
Thanks!
TRIPLE BAM!!! Thank you for supporting StatQuest!!! :)
Hi Josh, I just want to make sure that I'm correct. We assume there is no relationship between the gene and cancer, and a log(odds ratio) equal to 0 means no relationship at all. Then, based on what we observed from an experiment, the log(odds ratio) is 1.93, and the p-value is extremely small given that there is no relationship, which means this result would be really rare if there were no relationship. But it did happen, so we reject that there is no relationship and accept that there is a relationship between the gene and cancer. Am I correct? Really confused.
The small p-value tells us that it would be very rare for random chance to give us the observed log(odds ratio). Thus, we reject that the observed log(odds ratio) is due to random chance. Does that make sense?
Hi Josh, at time stamp 13:59 you say that the p-value for the hypothesis that the mutated gene does not have a relationship with cancer is 0.00005, which (I thought) means that there is no relationship between the mutated gene and cancer and that the log(odds ratio) is not statistically significant. But if I understand correctly, at time stamp 12:53 it is written that the log(odds ratio) IS statistically significant, which means the mutated gene and cancer have a significant relationship. Can you please tell me what I am missing here?
The p-value for the hypothesis that there is no relationship between the mutated gene and cancer is 0.00005. This means that we reject the hypothesis that there is no relationship between the mutated gene and cancer. To learn more about what p-values mean and how they are interpreted, see: ruclips.net/video/vemZtEM63GY/видео.html and ruclips.net/video/JQc3yx0-Q9E/видео.html
Hi Josh.
Just started watching your videos. Currently going through your 'Statistics Fundamentals' playlist.
In this video, at the 7:50 mark, you mention 'So, if the gene is not associated with the 140 people with the mutated gene...', shouldn't the assumption be that 'So, if we assume that cancer is not associated with the 140 people with the mutated gene...'.
That's the only reason you used the expected value using the 'estimated' population probability of having cancer (0.08) to calculate the 'expected value' of the number of people having both the mutated gene (140) and cancer.
Yes, that's a typo. It should be "If cancer is not associated with the 140 people with the mutated gene."
This clarification was desperately needed. My head was spinning trying to make sense out of typo text, where none could be made
I love your vids. I don’t know where to begin when I look at the number of quality videos on your channel. Is there an organization of the videos that would help?
I have all of the videos organized on my home page: statquest.org/video-index/
At 13:50 of the video, the p-value for the hypothesis that the mutated gene does not have a relationship with cancer is 0.00005. So does that mean the mutated gene has a strong association with cancer?
P.S. It would be great if, each time you give an example, you had a conclusion for it, so it wouldn't be confusing. Great video!
And how did you get the 0.00005 p-value anyway? Is it an estimated value based on the normal distribution?
You are correct - I should have had a more obvious conclusion to this example. Often I do, but I forgot in this case. The small p-value says the association isn't just due to chance or noise. How strong that relationship is, however, is determined not by the p-value, but by the log(odds ratio). In other words, if we had a small log(odds ratio) and small p-value, we would have a significant, but weak association. If you have a large log(odds ratio) and a small p-value, you have a significant and strong association.
To calculate the p-value, we estimate the mean and standard deviation of a normal distribution and then we use the area under that curve to calculate the p-value.
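As a sketch of that calculation with the numbers from the video (assuming the Wald standard error computed from the observed counts):

```python
import math

a, b, c, d = 23, 117, 6, 210                 # observed counts from the video
log_or = math.log((a / b) / (c / d))         # ~1.93
se = math.sqrt(1/a + 1/b + 1/c + 1/d)        # Wald standard error, ~0.47

z = log_or / se                              # how many SEs the estimate is from 0
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))   # two-sided normal p-value
```

The p-value comes out around 0.00005, matching the video.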
Perfect! Thank you so much! really grateful for clear explanation best statistics series on youtube
Hooray! You're welcome! :)
Good job, man
Thanks!
You're amazing!!!!!!
Thank you!
Dear Josh
Thank you for the great videos. I had a question. Is there a particular formula to calculate standard error of log hazard ratio?
Dear Josh, where can I see this StatQuest, "Enrichment Analysis using Fisher's Exact Test and the Hypergeometric Distribution"? I am a little confused about it
See: ruclips.net/video/udyAvvaMjfM/видео.html Also, all of my videos are here: statquest.org/video-index/
What I feel is that 30% of machine learning concepts are enough to get 70% of the ML jobs done... what do you think, Josh?
I think that's pretty true. Once you have about 30% of the concepts, you can learn the remaining 70% pretty quickly when you need them.
wanted to see how you calculate the confidence interval :(
A 95% confidence interval for the log(odds ratio) is the estimate ± 1.96*SE, i.e. 1.93 ± 1.96*0.47, or roughly (1.0, 2.9). (The band 0 ± 1.96*0.47 is what we would expect under the null hypothesis.)
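As a small sketch, a 95% interval for the estimate itself is centered on the observed log(odds ratio), not on 0:

```python
import math

log_or = math.log((23 / 117) / (6 / 210))            # ~1.93
se = math.sqrt(1/23 + 1/117 + 1/6 + 1/210)           # ~0.47

lo, hi = log_or - 1.96 * se, log_or + 1.96 * se      # ~ (1.0, 2.9) on the log scale
or_lo, or_hi = math.exp(lo), math.exp(hi)            # back on the odds ratio scale
```

Since the interval excludes 0 on the log scale (i.e. excludes an odds ratio of 1), it agrees with the small p-value.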
I can't seem to get the figures "1.79" and "-1.79" using log... Help please?
Which base are you using for the logarithm? The traditional base for statistics and machine learning and most programming languages is 'e', or the natural logarithm. So log_base_e ((2/4) / (3/1)) = -1.79.
@@statquest ohhh i see
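For anyone else who hit this, the difference between the two bases is easy to see numerically:

```python
import math

x = (2 / 4) / (3 / 1)       # = 1/6, the odds ratio from the question above
ln_val = math.log(x)        # natural log (base e): ~ -1.79
log10_val = math.log10(x)   # base 10: ~ -0.78, hence the mismatch
```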
Hi Josh, at 10:24 how did you draw the samples for the histogram? Do you iterate through the 325 samples and, each time, draw a random number to determine which of the 4 cells it belongs in (yes/yes, yes/no, no/yes, no/no)? From your step-by-step instructions, it looks like I can only determine the cancer column totals and the mutated-gene row totals, but I cannot derive all 4 cells of the matrix. Many thanks!
For each sample, if the first random number is less than 0.08, then the sample has cancer, otherwise it does not. If the second random number is less than 0.39, then it is mutated, otherwise it is not.
@@statquest Many thanks! I figured it out too and confirmed with some Python code. One more question about 14:14: what do you mean by the content from this point on? You generated 10,000 log(odds ratio) data points based on the observed probabilities (0.08 and 0.39), and that distribution is not standard normal. How do you apply a test to it? Previously, Fisher's, chi-square and even Wald's tests all worked on a 2x2 matrix; how do you apply them to a normal distribution? How exactly did you calculate the p-values (e.g. for chi-square and Wald's) based on your 10,000 generated log(odds ratios)?
Quite confused here. Please advise at your convenience, really appreciate it!
@@kevinshao9148 I generated a single matrix of values and then I applied the Fisher's, chi-square, and Wald's tests to it to get the p-values for the three tests. I then repeated the process 10,000 times and calculated the percentage of p-values < 0.05 for each test.
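A rough Python sketch of that simulation loop. Two assumptions here: the video draws a random total between 300 and 400, while this sketch fixes it at 356 for simplicity, and a 0.5 "Haldane" correction is added to each cell (my addition, not from the video) so an empty cell doesn't break the log:

```python
import math, random

random.seed(42)
n_people, n_sims = 356, 1000
log_ors = []
for _ in range(n_sims):
    counts = [[0, 0], [0, 0]]              # counts[mutated][cancer]
    for _ in range(n_people):
        cancer = random.random() < 0.08    # overall cancer rate from the video
        mutated = random.random() < 0.39   # overall mutation rate from the video
        counts[mutated][cancer] += 1
    # Haldane correction: add 0.5 to each cell so empty cells don't break the log
    a, b = counts[1][1] + 0.5, counts[1][0] + 0.5   # mutated: cancer / no cancer
    c, d = counts[0][1] + 0.5, counts[0][0] + 0.5   # not mutated: cancer / no cancer
    log_ors.append(math.log((a / b) / (c / d)))

mean = sum(log_ors) / n_sims
sd = math.sqrt(sum((x - mean) ** 2 for x in log_ors) / (n_sims - 1))
```

Because the two traits are drawn independently, the histogram of `log_ors` centers near 0, which is the null distribution used in the video.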
Thank you for existing!!! How do I buy one of your songs?
Hooray!!! You can get my music here: joshuastarmer.bandcamp.com/
Thank you so much!!!
Just a heads up Josh, The Fisher's Exact Test video is not in any of the playlists for your videos. I found it by googling it, but just in case folks don't think about that solution.
Thanks for the tip. I'll see what I can do about getting it on a playlist.
Sir u are the best..!!
Thanks! :)
I'm a bit confused. In another video (in Spanish) they use the Wald test with the OR value under H0 (i.e. 1) and se(OR), giving 12.44 and a very low p-value
What time point, minutes and seconds, are you asking about?
Out of topic question : Any plans to make Deep learning series with R.... just curious
It's in the long range plan, but that means it won't happen for a long time... Which is a bummer! I wish I could make the videos faster, but each one takes a long time.
Hi Josh, we need a clearly explained StatQuest on Fisher's test, the chi-square test and the Wald test. It will be a triple BAM!!! if the hypergeometric distribution is explained
Can you also show a demo of how to do these in a Python notebook?
ruclips.net/video/udyAvvaMjfM/видео.html
wow, I love you man
:)
I'm confused. In your video at 11:57, why are you using theta_0 for the distribution? In another video I saw them using theta-hat for the distribution.
The null hypothesis is that there is no relationship among the different categories which implies that the mean of the log(odds ratios) should be 0.
Hi Josh - many thanks for your videos. At 10:46 - this gives a matrix that does not depend on the relationship between the mutated gene and cancer. If there is no relationship, can the matrix even be formed? Because, as per my understanding, the margin-total proportions of a matrix cannot vary at any cost, right? I have one sample where I'm stuck trying to form the matrix -
1. Sample size - 366
2. Random number for cancer b/w 0 to 1 - 0.19
3. Random number for mutated gene b/w 0 to 1 - 0.14
Also, will log() result in a negative number? I could see only positive number outputs when I apply log(). Please clarify
When we make a matrix that is only dependent on the overall proportion of people with cancer and the overall proportion of people with the mutated gene, and not the known proportion of people with cancer AND the gene, then the matrix will not depend on the relation between the gene and cancer. Also, as you can see on the x-axis in the histogram, the log can give us negative numbers.
@@statquest thanks Josh... understood. So the matrix margin sums for the random numbers should add up to 325, right - both row-wise and column-wise - instead of all 4 cells adding up to 325, or any number between 300 and 400?
@@KishoreKumar-fv2cx Because we are randomly deciding how many samples have cancer, the column and row totals will always be different. However, the total for the entire matrix will be the number between 300 and 400 that you came up with at the start.
How do you combine multiple odds ratios? Let's say a patient has multiple risk factors for developing schizophrenia, each risk factor with a 2-fold odds (i.e. having an older father, smoking cannabis, birth complications). Would you add them all together or multiply them?
Unfortunately neither of those options are good (see stats.stackexchange.com/questions/12294/can-individual-odds-ratios-be-added-to-get-one-pooled-odds-ratio-to-compare-to-a )
If you have multiple factors, I think your best option is to fit a logistic regression model to your data. For details, see: ruclips.net/p/PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe
Hi Josh, very great video, really appreciate your videos very much.
I have a question regarding the estimation of the standard deviation in the Wald test. You said in the video that it is more common to estimate the standard deviation from the observed values. But I don't understand why the standard deviation is calculated as the square root of the sum of 1 over each observed value. May I ask for an explanation, as I am quite confused about this part of the video? Thank you very much, and sorry for any inconvenience caused.
That's just the equation for the standard deviation when you have this type of data. To explain how it is derived would take a whole video.
I don't understand the standard deviation in the example. The log(odds ratio) stays the same regardless of the sample size, as the ratio stays the same, but the standard deviation does change with the sample size. Consider multiplying all the above numbers by 100: 2300 cancer&mutant, 600 cancer&no_mutant, 11700 no_cancer&mutant, 21000 no_cancer&no_mutant. The odds ratio stays the same: (2300/11700) / (600/21000) = 6.88. However, the standard deviation goes from 0.47 to 0.047, which would indicate that we are 41 standard deviations away from the mean (= log(6.88)/0.047)?
That is correct. We are calculating a standard error, which is a function of the sample size: the larger the sample size, the smaller the standard error. In other words, the larger the sample size, the more confidence we have that the observed ratio is close to the population ratio.
@@statquest thank you for your response! Really amazing videos you have, you have a gift for explaining complicated topics! What would be the best way to calculate standard deviation of log(OR) distribution? SE * sqrt(sample size)?
@@Anpa940 The most common way is described at 11:30
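To see the point about sample size numerically — the log(odds ratio) stays the same, but the standard error shrinks by sqrt(100) = 10 when every count is multiplied by 100:

```python
import math

def wald_se(a, b, c, d):
    # standard error of log(odds ratio): square root of the sum of reciprocals
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

se_small = wald_se(23, 117, 6, 210)           # ~0.47
se_big = wald_se(2300, 11700, 600, 21000)     # same ratios, 100x the people: ~0.047
```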
For the matrix at 10:46, I am a bit confused as to how the count for "people who have cancer but NO mutated gene" came to be?
Because, any value
We pick two numbers. We check to see if the first number is < 0.08. Then we check to see if the second number is < 0.39. So we don't check to see if the same number is < 0.08 and < 0.39.
@@statquest Ahh ok, thanks a lot for the explanation.
Josh, you are the fucking man!
Thanks! :)
hello sir... please make a video on the chi-square test and the Wald test
Do you want more than what I say at 9:27?
@@statquest Yes sir, I am having difficulty grasping it. If you can, then please make a video.
@@mid1chosen noted
Could you please explain CCA(canonical correlation analysis)? Ty!
I'll put it on my "to-do" list. :)
Thank you for this great presentation. It seems to me that there is an error at 8:53. You say "198.7 people with the mutated gene are...". It is supposed to be "198.7 people WITHOUT the mutated gene...". Please check this. Thanks
Oops!! That's a mistake. I've added it to the pinned comment.
# Possible R code corresponding to video (there might be mistakes or better ways to do it!)
# Load data
cancerData
Did you write that?
@@statquest - Yes, I did. The video was really good so I wanted to practice it in R. I couldn't see if anyone else had already posted something like this. I'm happy to delete/edit if needed.
hi Josh, please do a video on the chi-square test
OK. I've bumped the chi-square test closer to the top of my to-do list.
@@statquest Since the p-value we get from a chi-squared test only tells us whether there is an association between two variables, could you make a video on how to calculate the strength of this relationship. Thanks.
@@tribikramadhikari2470 That's what this video is all about. Odds and log(odds) ratios are like R-squared. They give you a sense of the magnitude of the association.
If we were to calculate the log(odds ratio) of many random samples of the same size from a population and plot their frequency distribution, we would get a sampling distribution. Though the odds ratio of each sample is different, won't the resulting sampling distribution (plotted using the odds ratios of many such samples) have just one standard error, irrespective of the outcomes of each sample? If that is true, why is the standard error of the sampling distribution a function of sample outcomes?
You could ask the exact same question about calculating the mean and its standard error from a single collection of observations. The reason we can do this is that as the sample size increases, the difference between the standard error calculated from a single collection of observations and the standard error calculated from the distribution of millions of means (or log(odds ratios)) goes to 0. So we could do either one to get the estimate we want. However, calculating the standard error from a single collection of observations is much more practical.
@@statquest That helps a lot sir. It amazes me how you manage to get time to respond to all the questions in the comments. This channel has made learning stats so much easier. Thank you so much! Your dedication is very inspiring.
Hello, it was somewhat difficult to me, especially in the case of tests. Thanks
I hope my video helped! :)
Hi Josh, Are you uploading any Video for Chi-Square Test?
One day I hope to do that.
At time stamp 7:55, the statement should be "so, if the gene is not associated with the (23 + 117) = 140 people with the mutated gene, then..."
Other than grammatical errors, I'm not sure how your version is different from mine. Can you clarify?
9:48 Why does the simulated data (i.e. keeping randomly generated values below the thresholds) have a normal distribution?
Unfortunately I can't derive this for you, but the normal distribution represents the null distribution.
@@statquest Thanks, Josh. I'll Google "null distribution".
I actually asked ChatGPT. Here's what it says:
"The histogram of log(odds ratios) of two unrelated variables appearing as a normal distribution can be explained by the Central Limit Theorem (CLT). The Central Limit Theorem states that when independent random variables are summed or averaged, their distribution tends to approach a normal distribution, regardless of the shape of the original distribution, as long as certain conditions are met.
In the case of log(odds ratios) of two unrelated variables, assuming that the individual variables are independent and have finite means and variances, the CLT applies. The log(odds ratio) is a transformation applied to the individual variables, which may have their own distributions. However, when the log(odds ratios) are calculated, the values from the two variables are combined in a way that follows the principles of addition and subtraction. Since the variables are unrelated, their contributions to the log(odds ratios) are independent.
As a result, when the log(odds ratios) of two unrelated variables are calculated repeatedly and plotted in a histogram, the CLT suggests that the distribution of the log(odds ratios) will tend towards a normal distribution. This means that the histogram will exhibit the characteristic bell-shaped curve commonly associated with a normal distribution.
It's important to note that the applicability of the CLT assumes a sufficiently large sample size. For small sample sizes, the normality assumption may not hold, and other factors might influence the shape of the distribution."
I think I get it now:
If two variables are not related, then the odds should be the same, and hence the odds ratio should equal 1.
With
1) the central limit theorem, the ratio will center around 1
2) the log transformation, the values will center around 0
3) both the CLT and the log, the resulting distribution is a symmetrical normal curve
Hopefully this is the correct intuition.
I just have a question (basically two) for you, Josh. I went through all of your videos, and they are literally clearly explained and quite easy for me to apprehend. But the thing is, after completing all of the videos, it's hard for me to remember everything. I noted everything down, but it seems like I tend to forget a lot of things. Is that natural? And in that case, what should I do? I really want to be a machine learning engineer and researcher, and I'll apply for Fall '22 Ph.D. programs in the USA. So it looks overwhelming and, at the same time, enthralling to learn new stuff.
I have a terrible memory and forget stuff all the time. Focus on remembering the main ideas.
Jotting your own notes does help. Even if you just remember the rough idea later, it is much easier to pick it up by reading the notes. Using some modern note-taking software like Notion gives you an extra BAM.
great job !!!! I would like to ask for similar videos that explain Metropolis Hasting algorithm. thank you :)
That's a great idea! I'll add it to my "To-Do" list.
Thank uuuuuuuuuuuuuu soooooo much 😊😊😊😊😊😊😊😊😊
You are so welcome!
Ha ha - i see what ya did there with the circa 1976 red "cancer" m&m. LOL Starmer..
:)
thank you for this, but how do you calculate the crude odds ratio and the adjusted odds ratio?
When you take the log, will the data always end up normally distributed? In the case of a dependent variable whose data is not normally distributed, would transforming it always guarantee normality?
When you have random log(odds ratios), like demonstrated here, you will always get a normal distribution.
2:43 - I'm struggling to understand why we divide 23 by 117 instead of 23 by 140, which is the sum of all people who have the gene (likewise 6/216 for no gene). Note: the rounded numbers are the same, but still... please help - what am I missing?
When we divide 23 by 117 we get the odds. If we divide 23 by 140, we get the probability. In this case, we are interested in the odds and not the probability, so we divide by 117. For more details on the difference between odds and probabilities, see: ruclips.net/video/ARfXDSkQf1Y/видео.html
that was fast! thx a bunch! @@statquest
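A tiny numeric illustration of the odds-vs-probability distinction, using the 23/117/140 numbers from the video:

```python
# among the 140 people with the mutated gene: 23 have cancer, 117 do not
odds = 23 / 117             # cancer vs no cancer: ~0.197
prob = 23 / 140             # cancer out of everyone with the gene: ~0.164

check = prob / (1 - prob)   # odds = p / (1 - p), so this recovers the odds
```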
hi, amazing content! I'm pretty sure I got it, but just to be clear, this p-value
Because this p-value is < 0.05 we can reject the hypothesis that there is no relationship between the mutated genes and cancer. Thus, we can conclude that there is some (possibly indirect) relationship.
12:07 I am sure I am missing something trivial here, but why do we calculate the standard deviation as the square root of (the sum of 1/observation)? Aren't we supposed to calculate a mean and then the deviations from the mean?
Think about this data some more. It's not a bunch of individual measurements (like measuring how tall a bunch of people are). Instead, this data is more like a summary of counts. For example, we have 23 people who have cancer and the mutated gene and 117 people who do not have cancer but have the mutated gene. What would the mean of those two numbers represent? I have no idea, and that tells us that this data is different, and thus, we need a different way to calculate the standard error.
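In code, the Wald approximation makes this concrete: the standard error of the log(odds ratio) is built from the four cell counts themselves, with no mean in sight (a quick Python sketch with the video's numbers):

```python
import math

# cells of the 2x2 table from the video
a, b, c, d = 23, 117, 6, 210

# Wald approximation: SE of log(OR) comes straight from the cell counts,
# not from deviations around a mean
se = math.sqrt(1/a + 1/b + 1/c + 1/d)
print(round(se, 2))  # 0.47
```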
The reply is still not clear. I would really appreciate it if you could explain the logic in a bit more detail. Thanks.
3:46 To check the relationship between mutated gene and cancer, why not just take the odds like this: p(mutated gene and cancer) : p(mutated gene and no cancer). We would like our odds ratio to give us a picture of correlation between the presence of the gene and cancer; and not the absence of the gene and cancer. So why do we take 6/210 in the denominator?
We are actually very interested in the relationship between the absence of the gene and cancer - this is the background noise and models the random chance of getting cancer without the mutation. If there is a relationship between the mutation and cancer, then it should be stronger than the background noise and the random chance of getting cancer without the mutation. So we are comparing the relationship relative to background. Does that make sense?
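A quick sketch of that comparison, using the counts from the video's table:

```python
import math

# 2x2 table from the video
a, b = 23, 117   # mutated gene: cancer / no cancer
c, d = 6, 210    # no mutated gene: cancer / no cancer

odds_with_mutation = a / b      # the "signal"
odds_without_mutation = c / d   # the background odds of cancer without the mutation

odds_ratio = odds_with_mutation / odds_without_mutation
print(round(odds_ratio, 2))            # 6.88
print(round(math.log(odds_ratio), 2))  # 1.93 (natural log)
```

The denominator is exactly the background the reply describes: the odds of cancer among people *without* the mutation.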
@@statquest The logic you gave makes sense. However, I would like to know why did we calculate the p value of only the sample in the first row at 6:51. If we were interested both in presence and absence of the mutated gene , we should have calculated the p value of the bottom row as well. After all, our odds ratio consists of both the presence and absence of mutated gene like I previously mentioned. Thank you for replying, as always! :)
@@doubletoned5772 Calculating the p-value for the top row requires using all of the data, including the bottom row. For more details on how this p-value was calculated, see the StatQuest: Enrichment Analysis using Fisher’s Exact Test and the Hypergeometric Distribution:
ruclips.net/video/udyAvvaMjfM/видео.html
Hi Josh, was this the odds ratio of the outcome and not the odds ratio of the exposure? The value may be the same, but I think the principle is wrong. Please help me understand.
Can you give me more details about your question? I don't fully understand it.
At 10:46, how did you come up with the Matrix ... ? In your example you got 17 with cancer (with
I explain how I created the matrix of random values starting at 10:00
Very useful video. I have a question: how do I calculate a p-value with the SD of the OR? I tried with your example using pnorm in R, but I got 0.00004.
The difference is just rounding error.
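For reference, here is a stdlib-only Python sketch of that calculation (using `math.erfc` in place of R's `pnorm`; the table counts are from the video):

```python
import math

# cells of the 2x2 table from the video
a, b = 23, 117   # mutated gene: cancer / no cancer
c, d = 6, 210    # no mutated gene: cancer / no cancer

log_or = math.log((a / b) / (c / d))   # ≈ 1.93
se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # ≈ 0.47 (Wald standard error)
z = log_or / se                        # ≈ 4.08

# two-sided p-value from the standard normal:
# erfc(z / sqrt(2)) equals 2 * (1 - Phi(z))
p = math.erfc(abs(z) / math.sqrt(2))
print(p)  # ~0.000045; pnorm gives 0.00004 and the video says 0.00005 -- rounding
```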
With OR you can say Y was 6 times as likely for those who got treatment X compared to those that didn't. How can you write a sentence like this using log(OR)? Would it be incorrect to use the Log(OR) in a sentence like this?
With OR, you can say "the odds for Y were 6 times greater than the odds for X". With log(OR), the equivalent statement is additive: "the log(odds) for Y were greater than the log(odds) for X by log(6) ≈ 1.79".
Thanks for the video! The standard error is defined for the sampling distribution, right? So it shouldn't vary with our sample. But if we plug observed values from our sample into the formula to calculate the standard error, wouldn't the standard error vary from sample to sample?
I'm not sure I understand your question. As far as I can tell, the standard deviation of the log(odds ratio) changes depending on what values you get when you collect your sample.
@@statquest Sorry, I meant standard error of sampling distribution of log(odds ratio) not standard deviation. Does standard error vary with our sample?
@@farheenahmed4683 Ah, I think I see. Unfortunately the terminology is a little confusing in this situation. The "standard error" = the standard deviation of the mean. In this case, we are using the log(odds ratio) as the mean, thus, the standard deviation of the log(odds ratio) is also the standard error. In other words, both terms refer to the same variation in the log(odds ratio)'s.
Thank you so much for clarifying it so promptly! I went ahead and watched your video on standard deviation vs standard error. From what I understand, standard deviation is variance within "a set" of measurements (a sample) and standard error is the variance of samples statistics from the mean of sample statistics obtained from "multiple sets" of measurements (all the samples in our sampling distribution).
But for the log(odds ratio), we cannot really calculate the variance "within" a set of measurements or a sample (in the case of binary/dichotomous variables), therefore a standard deviation cannot be calculated for a sample.
But when it comes to "a set" of measurements/sampling distribution, variance can be calculated in this step and standard deviation of a sampling distribution here is the same as its standard error. Is that right?
@@farheenahmed4683 I believe that is correct.
I think at 2:13 the log(odds) = log(1/6) = -0.77815, not -1.79
Ok, for anyone else wondering about this as well: it is because Josh is using base "e" for the log. It is usually denoted as ln, but since base e is commonly used in statistics and machine learning, he writes it as log base e.
In statistics, machine learning and computer programming, the default base for the log() function is 'e'. Thus, throughout this video I use the natural logarithm, or log base 'e', to do the calculations.
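A quick way to see the difference between the two bases:

```python
import math

# By default, log() in R, Python, NumPy, etc. is the natural log (base e)
print(round(math.log(1/6), 2))    # -1.79  (what the video uses)
print(round(math.log10(1/6), 2))  # -0.78  (base 10, the source of the confusion)
```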
Hi Josh, you are great! At 2:26, how are you calculating 1.79 or -1.79?
In statistics, machine learning and computer programming, the default base for the log() function is 'e'. Thus, throughout this video I use the natural logarithm, or log base 'e', to do the calculations.