I've spent a day looking for an intuitive and satisfactory explanation for n-1 and this is the only one that really did it for me. For some reason, no one else bothered to explain why exactly the numerator with x̄ would always yield the least possible variance. Thanks a lot!
@@ankurpriyadarshan My professor at IIT Bombay explained this: if you take the expectation of the sample variance (the one with n-1 in the denominator), it comes out equal to the population variance.
Thanks so much for this. It is mostly very clear. My only comment is that in the degrees of freedom section, it should have been made clear that the table on the left is just 3 observations from the wider population and not representative of the entire population. Otherwise, I (and at least a few people in the comments) assumed that that table *was* the population and could not understand why the third value could be equal to 50 while the Mu still remained 53. I had to look through the comments to get clarification that in fact that table on the left is just the *first three* observations from a *wider* population.
"Degrees of Freedom tend to be handwaved away by lecturers and tutors alike" => Amen to that ! I still remember how real satisfactory explanations to that were so lacking. Thanks
Dude, you're great. Your explanations of these concepts are terrific and very easy to follow. As an actuarial science major, this is one of the most helpful videos I've ever found.
hey good luck man! i have heard actuarial science is really tough, it was one of the majors i was considering for uni as i graduate from HS this year and i got a friend in SA who's also doing actuarial science how's it been going so far?
Answered my questions about absolute value! Just re-learning Mathematical statistics currently and these videos are really helpful for motivation and understanding.
I watched many such videos. They all said almost the same things you said, but I came away from all of them confused. You explained it so well that I finally understand the main concept. Thanks a lot.
I was very sceptical about this video at first since i watched about 100 videos to explain this same topic!! and boom this was the video that summarised and explained an entire lecture in 13 mins!! and i actually understand toooo .... you deserve all the subscribers ever !!
I am so grateful to you for such a crystal clear explanation of the concepts. I really appreciate your efforts in spending the time on such carefully thought out details. Thank you again. All your videos are great.
Thanks for explaining that. Especially the quick degrees of freedom at the end. I knew conceptually why I had to do n-1 with sample sets for getting closer to the real answer, but the degrees of freedom helped me know why that is a thing. Cheers
I always thought one of the reasons for dividing by n-1 could be that, since we're using the sample mean, which could be one of the possible values for the population mean, we subtract that one value from the total, thus n-1. That was my way of rationalizing it since, as you mentioned, lecturers tend to shy away from the conversation. But since you've explained so well that this is not the case, I wonder what the rationale behind it could be, beyond the fact that it gives the best possible estimate. Nevertheless, I really like your videos; they answer all of my doubts, big and small, which didn't always have a straightforward answer. Thank you and keep up the good work!
Sorry I may not have understood this fully at 11:40 - why can the 3rd observation be whatever it wants to be given the population average is 53? Shouldn't it be 53*3-41-59=59? Thank you!
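The arithmetic in this question can be sketched in Python. Assuming the values quoted in the comments (41 and 59 as the first two observations, 53 as the mean), the snippet shows that fixing the mean of a set of exactly three values pins down the third, whereas three observations drawn from a wider population with mean 53 are free to average something else. The function name `forced_last_value` is made up for illustration.

```python
def forced_last_value(known, mean, n):
    """Value the n-th observation must take so that all n observations average to `mean`."""
    return mean * n - sum(known)

first_two = [41, 59]

# If 53 is the mean of exactly these three values, the third has no freedom left.
third = forced_last_value(first_two, 53, 3)
print(third)  # 59

# If 53 is the mean of a wider population, the third observation can be anything,
# for example 50, and the three observations simply average something other than 53.
observed = first_two + [50]
observed_mean = sum(observed) / len(observed)
print(observed_mean)  # 50.0
```

This matches the clarification given elsewhere in the comments that the table shows only the first three observations from a wider population.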
@5:42, I searched up and kinda found an intuitive explanation about why we don't use absolute value: "Standard deviation is a statistical measurement of how spread out a data set is relative to its mean. When data points are further away from the mean, the data set has a higher deviation and a greater standard deviation. This is because the data points become more dissimilar and extreme values become more likely." And I assume this also has to do with the shape of the bell curve. If it were a piecewise linear curve, i.e., an angle shape, then absolute values would probably be enough. Let me know what you think.
At 12:35, is the reason the last row can be anything it wants to be, in the case where we know the population mean, that the population mean isn't an estimate like x bar?
I like the video. You mention that we shouldn't use the absolute value for describing the spread of the data. The reason this isn't done is not that it is incompatible with the "higher-order" statistics, but rather that most of statistics was developed with variance in mind. You could just as well develop the parts of statistics that are missing for the absolute value, which corresponds to the L1 norm. Netflix used an optimization algorithm which made use of this type of norm, which proves that it has practical application. You could also say that if the absolute value squared, cubed, etc. are important, then the absolute value itself must be important as well. They might have different uses, but you cannot say one is better than the other.
A lower order of integrability would be required for L1 norms, which with power laws of some choice of parameters might exist as a first moment while the second moments such as variance would not exist.
Thank you so much. I have been trying all morning to research this and you are the first person I have found who has directly and clearly said that the squaring method isn't better than the absolute value approach, it's just something that people often find useful when they want to do other things with the data later on. Every other resource that I have found on this topic seemed to be implying that there was some unexplained other reason why the squaring method was *better* than just taking the absolute value.
@@galenseilis5971 A lower order of integrability would be required for what exactly? I might be missing something, but taking a norm of data is just a kind of aggregation (summation). So taking one aggregation that is an L1 norm shouldn't prevent another aggregation that is an L2 norm (variance).
@@chasemcintyre3528 I'm glad I could provide some comfort 😄 Most often if someone can't give you an answer to the "why" it's likely that they are just parroting what they've been taught or heard. Independent thought is the only way to fill those gaps in knowledge. Good on you for searching all morning despite the resistance.
@@gustavstreicher4867 Reviewing your comments and the video, you are apparently missing the distinction between a sample and a(n infinite) population. I'll spare a few minutes to give you a more detailed explanation.
But before getting to your question, I want to point out something misleading in the video above. They present a hand-wavy explanation of why we use n-1 degrees of freedom instead of n degrees of freedom in the denominator of variance. Many people call the former the "sample variance" and the latter the "population variance", but this is misleading because they're both sample statistics that can be used to estimate the population variance when it exists. The reason we often prefer the variance estimator with n-1 degrees of freedom is that it is corrected for estimator bias, which matters most at small sample sizes, assuming the data are sampled independently from a distribution with finite variance (a normal distribution, for example). Both estimators are consistent estimators for the variance of a normal distribution, meaning that they both eventually converge to the population variance. You have not said anything that makes me believe you fell for this misunderstanding, but I am offering the caution just in case.
Now let's head in the direction of your question. As you describe, you can calculate either the (sample) mean absolute deviation (MAD) or a sample variance on a finite collection of real numbers. And as you mentioned, the L1 and L2 norms are closely related to these sample statistics. The L1 and L2 norms induce the Taxicab and Euclidean metrics respectively. The MAD is a rescaling of the Taxicab distance from the arithmetic mean. The variance is a rescaling of the square of the Euclidean distance from the mean. There is no particular issue with doing this on a sample, but that wasn't the substance of my comment, which concerns the population.
Let's go over some population statistics now. In mathematical statistics the population mean is the expected value of the random variable, often denoted as E[X] for a random variable X.
I don't mean that some value is to be expected in an intuitive sense per se, but rather that there is a mathematical operator called the "expected value" that can be applied to a random variable. A random variable is a measurable function (i.e. the preimage of every measurable set is itself measurable) on the outcome space of the probability space. Which is to say, you should think of random variables as mathematical tools rather than something that is intuitively "unpredictable". A random variable is a type of mathematical model of a part of your data. In special cases an expected value of a random variable is an arithmetic mean, but it is more general than that. The population variance is likewise defined as E[(X - E[X])^2], so the expected value is relevant to understanding both the population mean and the population variance. The population analog of MAD is the expected value of the absolute difference between the random variable and its expected value, denoted E[|X - E[X]|]. For continuous random variables, like a normal random variable, you'll see that the expected value is defined in terms of an integral, which is just a convenient notation for the limit of certain infinite series.
Okay, that's an overview of the definitions. But what's the problem then? The problem is that these population quantities do not always exist. Fortunately they do exist for many distributions, including the normal distribution. One example where none of the population statistics we have discussed so far exist is the Cauchy distribution. I invite you to try computing the MAD and variance (either flavor) on samples of increasing size from a standard Cauchy distribution. You'll find that neither of these statistics shows convergence behaviour in the long term. The sample quantities will exist, but they will not estimate any stable population quantity. Instead they will just jump around aimlessly.
The Wikipedia page on the Cauchy distribution currently has some information on this unstable behaviour for the mean.
Let's consider that "order of integrability" part of my earlier comment now. There is a statistic which generalizes both the MAD and variance. Instead of considering an L1 or an L2 norm, we can consider an Lp norm. It induces a metric which we can take to a pth power to obtain the generalization. In terms of population statistics we can consider E[|X - E[X]|^p] to be the formal generalization. There is a downward closure property: for two orders p > q, if E[|X - E[X]|^p] exists then so will E[|X - E[X]|^q]. The largest order p for which the functional (E[|X - E[X]|^p])^(1/p) exists is what I called the order of integrability. So the population MAD might exist even when the population variance doesn't, which was the point I was making in the first place.
Why doesn't the population variance exist for every distribution? Well, the quick hand-wavy answer is that some infinite series represented by these integrals don't converge. We already touched above on the fact that estimating something that doesn't exist isn't really meaningful or helpful. I mentioned power laws before, e.g. the Pareto distribution, which are interesting cases in this regard because these population statistics sometimes exist and sometimes do not, depending on the parameters. But I won't labor that as this comment is getting long.
If my explanation isn't clear, I suggest you go to a site more suited to discussions about math to get clarification. An example is Stack Exchange's Cross Validated community, which has support for mathematical notation and has members who are familiar with this topic.
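The experiment suggested in this thread (computing the MAD and the variance on ever larger samples from a standard Cauchy distribution) can be sketched in Python with only the standard library. This is a rough illustration, not anything from the video: the helper names, the seed, and the sample sizes are all assumptions, and a standard normal is included purely for contrast.

```python
import math
import random

def standard_cauchy(rng):
    """Draw from a standard Cauchy via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def standard_normal(rng):
    return rng.gauss(0.0, 1.0)

def mean_abs_dev(xs):
    """Sample mean absolute deviation about the sample mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def sample_variance(xs):
    """Sample variance with n-1 in the denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def spread_table(draw, sizes, seed=0):
    """(n, MAD, variance) for fresh samples of each size; the seed is arbitrary."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        xs = [draw(rng) for _ in range(n)]
        rows.append((n, mean_abs_dev(xs), sample_variance(xs)))
    return rows

sizes = [100, 1_000, 10_000, 100_000]
results = {
    "normal": spread_table(standard_normal, sizes),
    "cauchy": spread_table(standard_cauchy, sizes),
}
for label, rows in results.items():
    for n, mad, var in rows:
        print(f"{label:>6}  n={n:>7}  MAD={mad:12.3f}  var={var:16.3f}")
```

For the normal rows the two columns should settle near sqrt(2/pi) ≈ 0.798 and 1 as n grows. For the Cauchy rows there is no population value to settle on, so the numbers would be expected to wander and change with the seed.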
This kind educator should be a millionaire! If you read comments on his videos, he's clearly cleaning up after thousands of (unhelpful) Stats and Data Analytics professors around the globe!!!!
Yo! I'm really enjoying these videos so far. It's nice to be able to grasp something that seemed inaccessible for so long. One note on your spreadsheet, though. Two sentences have typos. "Note: this is now three alternate esimtations of the standard deviation for each sample"
First of all, you are already a diamond in statistics, so thank you for such extremely hard work. Can you please make a video on why the mean deviation is least when taken about the median?
I have literally scoured YouTube for months to understand a ridiculously poorly written textbook that I have no idea how it got published - (Statistics for Health Care Management and Administration by Kros and Rosenthal) - and I now feel that I am starting to conceptually understand the "why" and not just memorize formulas. Thank you for teaching these concepts!
The reason in both cases is mainly historical. There is no real reason not to use the more intuitive average deviation (AKA mean absolute deviation) when differentiability is not a requirement. In fact, the logical thing when one is looking for mean deviations would be to do just that, and the argument often given in textbooks is that stdev also works, which is true of course, but a logically flippant reason. There is also no reason to use n-1 specifically for most purposes when calculating population variance, which is kind of implied by the fact that the -1 makes a tiny difference for any significant number of samples.
Great video! I personally find the idea of "degrees of freedom" to be a confusing and overall nonsensical way of describing why we divide by n-1 for sample variance. Inherently, when you take a sample of size n from a population, each observation is independent and could be anything, so there are n degrees of freedom. It's not until you posit that "given the sample mean x̄ and these n-1 observations, you can determine what the last ungiven observation is" that a degree of freedom is lost. I think that using the term "degrees of freedom" here makes no sense, and seems to imply that only n-1 of the observations were truly random/independent, which is obviously not the case. Unless the idea of "degrees of freedom" has some other application that I'm not aware of, I think it should be thrown out entirely, as the way you explained why we divide by n-1 for samples makes far more sense and doesn't imply anything that isn't true.
Really appreciate the way you teach statistics in a simplified manner. Myself a real fan of you..... I have a doubt, if we derive population mean with all the data points of the population (not from some of the sample points), would the degree of freedom be N-1? Thanks in advance.
A more direct explanation of using n-1 in the calculation of sample variance is that the variance computed with n is a biased estimator of the population variance. Look up Bessel's correction for the derivation that proves that the correction is n-1 rather than some other choice such as n-2, n-3, etc.
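The bias can also be checked by brute force rather than by derivation. The sketch below is an illustration under assumptions of my own choosing: the population is the ten digits 0 to 9, each equally likely (the student-number example from the spreadsheet), and samples of size 3 are drawn with replacement. It enumerates every possible sample and averages the variance estimate for several denominators, using exact fractions so no rounding is involved. The function name `expected_estimate` is hypothetical.

```python
from fractions import Fraction
from itertools import product

population = range(10)   # digits 0..9, equally likely (assumed population)
n = 3                    # sample size, kept small so all 10**3 samples can be enumerated

def expected_estimate(divisor):
    """Average of sum((x - x_bar)^2) / divisor over every possible sample of size n."""
    total = Fraction(0)
    count = 0
    for sample in product(population, repeat=n):
        x_bar = Fraction(sum(sample), n)
        squared_devs = sum((x - x_bar) ** 2 for x in sample)
        total += squared_devs / divisor
        count += 1
    return total / count

mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

print("population variance:", sigma2)                    # 33/4
print("divide by n:        ", expected_estimate(n))      # 11/2
print("divide by n-1:      ", expected_estimate(n - 1))  # 33/4
print("divide by n-2:      ", expected_estimate(n - 2))  # 33/2
```

Only the n-1 denominator averages out to the population variance: dividing by n undershoots by the factor (n-1)/n, and dividing by n-2 overshoots.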
Your videos are so useful, thank you so much! One thing I can't get my head around here though. So, we divide by n-1 (as opposed to n) to account for the variance needing to be larger as our sample mean is just an approximation of the population mean and the variance of the population mean is as small as it can be. But, we don't know the population mean so our sample mean could be the same as the population mean and thus we would be over estimating the variance by dividing by n-1 and not n. Is this true?
Good explanation. Thanks. But I still do not get it. I downloaded your Excel sheet, and OK, it shows that for a sample of 10, the sample's stdev is closer to the population's stdev. But what about samples of other sizes? For large samples, it will tend to the divided-by-n value (which is fine), but as the sample gets smaller, the (n-1) stdev will tend to overshoot. Shouldn't there be a sweet spot for dividing by n-1? I still think the whole thing is very arbitrary: "We don't know where the population mean is, it may be anywhere, so we divide by n-1 instead of n to get a more precise stdev". I could propose subtracting anything from the denominator in order to broaden the variance. For instance: |Fib(n)/Fib(n-1) - phi|, where Fib(x) is the Fibonacci sequence number for x and phi is the golden ratio. This will tend to 0 for large enough values of n, although creating some havoc for small sample sizes. But why not?
I have a question why you just took any number for 3rd observation in population and you did not take any random number for 3rd observation of sample. If we know the population mean we can also use the same logic to calculate the third observation in population unless we are assuming sample is of size 3 and population size is not given.
I’m new to this whole subject so maybe I’m wrong in my understanding. But, assume you have been tracking your fuel economy in your car every week for several months. You have a large data set and have the mean of that data set. The next time you go get fuel and calculate your fuel economy, none of those previous data points will have any bearing on this new economy. Maybe this week you idled more, reducing your economy, or maybe you went on a road trip which tends to improve your economy. So, this week your economy can be any value from 0 to the theoretical best fuel economy for your car, not restricted to previously experienced fuel economy averages. In the sample set, you’re taking already experienced fuel economy numbers, which make up the previously calculated mean, and therefore any additional data point is restricted to that set. In other words, the first is forward looking, let’s see what happens next, and the second is looking back at what has happened, and let’s use that to guess what could happen next. If I’m wrong in this, someone please correct me.
I did not understand some things: 1. Why is Σ(x-x')=0, where x' is the sample mean? This comes up when calculating the degrees of freedom. 2. Why is the population mean fictional? Why can't we find it in reality? Can't we calculate it as the sum of the observations divided by the number of observations? 3. I did not clearly understand why we need to inflate the estimated value rather than decrease it. What if the sample mean was to the left of both the points? Do we still need to inflate the estimated variance?
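Questions 1 and 3 can both be checked numerically. The sketch below uses a made-up sample (41, 59, 59, chosen so that the sample mean is 53) and a hypothetical helper `sum_sq_about`. It shows that deviations from the sample mean always cancel to zero, and that the sum of squared deviations is smallest when measured from the sample mean, so measuring from any other centre, including the true population mean on either side, can only give a larger number. That is why the estimate needs inflating and never deflating.

```python
sample = [41, 59, 59]          # made-up sample whose mean is 53
n = len(sample)
x_bar = sum(sample) / n

def sum_sq_about(centre):
    """Sum of squared deviations of the sample from an arbitrary centre."""
    return sum((x - centre) ** 2 for x in sample)

# 1. Deviations from the sample mean cancel: sum(x) - n * x_bar = 0 by definition of x_bar.
print(sum(x - x_bar for x in sample))   # 0.0

# 3. Measuring from any other centre c adds n * (x_bar - c)^2, whichever side c is on.
for c in (50, 53, 56):
    print(c, sum_sq_about(c), sum_sq_about(x_bar) + n * (x_bar - c) ** 2)
```

Running it prints 243.0 for centres 50 and 56 and 216.0 for centre 53, matching the identity Σ(x-c)² = Σ(x-x̄)² + n(x̄-c)².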
Before starting the discussion of the sample and population mean, you could just explain the difference between a population and a sample: a sample is just a subset of all the population's observations.
1:33, your notation is a little off. It should include the last element when you have a finite list. For example, to list the numbers from one to ten we could write 1, 2, … ,10. Writing a1, a2, a3,… means that there are infinitely many a’s. Hope it helps!
Zed, hope you can answer this. In your excel file you write: "Imagine taking a sample of 10 students in your class and asking them to write down the final digit of their student number. NOTE: This is like a random selection between digits 0-9. Thus, a known population mean of 4.5. " Couldn't the student IDs all end in 1 or skew from 4.5? It seems like you're either taking the mean of the set of available values or assuming that this sample has this particular mean.
Good question. By "known population mean" I'm suggesting that this 10 person sample (which can skew, as you say) is nonetheless taken from a population that has a mean of 4.5. You can consider the population to be ALL the students in the university (or even the world, if you like). So you need to separate the notion of a population pool from which we are selecting AND the actual selection. The population average height might be 175 cm. But that doesn't mean a sample of 10 people will have this average.
@@zedstatistics Ah understood. Thanks for the reply. Great job on the spreadsheet btw, that and this video are the clearest explanations I've come across
Please help me with my confusion here. If you decrease the denominator, n-1, you increase or "adjust" the numerator. So, does it increase the variance? I don't even know what I'm asking? (so confused &^%(Q^#%#)
Thank you so much Zed for your teaching materials. For the attachment, is it possible if we have the password to unprotect the sheet? Because I would like to type something on the file to experiment. Thank you!
so I think this was great. It sounded like there was a bit of an implicit rule though, like how a population mean doesnt predetermine (or just say have a causal relationship) with the data set. so ".../n". But with sample mean, the mean does predetermine the data set. Thus the values can be anything they want (have complete freedom) where as the last term of the data set has no degree of freedom (this is in order to adjust the data set to match the already given sample mean). Now I'm wondering why pop mean doesnt have the same relationship as the sample mean has to its own data set.
Hi my friend, I have the same doubt as you, and it's killing me! Have you figured out why the hell the population has more degrees of freedom, if the sum of all its deviations is equal to 0 too?? I mean, you can also determine the last value, as you do using the sample data set! It's a conspiracy!
Very interesting observation. Here's my very interesting answer: Imagine you have a whole population at your disposal. The mean for that population (call it mu if you like) exists BEFORE you measure each of the individual members of the population. Measuring each observation is merely a way to REVEAL what already existed objectively (if you are religious you might say that God knew the population mean before I started taking each observation). The same cannot be said about a sample. Imagine I intend on sampling 10 people from a much larger population. By the time I have sampled and measured 9 people, not even God knows what the sample mean is going to be. This, ultimately, depends on the final person that I randomly take into my sample. (Atheists can simply say that there is no objective mean that exists for the sample until that final observation is taken). So, to summarise: for populations the mean exists logically BEFORE the measuring of the observations (i.e. there is actually no ESTIMATION going on). For samples, the observations PRECEDE the mean - in other words the mean depends on the random observations chosen. For that reason we only associate degrees of freedom with ESTIMATION. So only with samples. As there is no estimation going on in populations (only calculation), we don't bother ourselves with Degrees of Freedom. Told you it was interesting :)
@@18despues Yes! Well believe it or not this is actually an important principle in frequentist statistics (ie. the statistics you see in text books): Mu exists, and it is some exact single number whether you measure it or not. If you find that uncomfortable, then never fear! Bayesian statisticians are with you! They (roughly) treat mu as having some probabilistic range.
Good video. However, my only issue with N-1 is that, I think subtracting 1 from the sample size is not going to cause any significant effect or difference, especially when the sample size is large. N-1 as a denominator will bring about a very small correction to the variance. This concept of degrees of freedom, to a lay person like me, seems more academic than useful. Please correct me if I'm wrong.
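How much the correction matters at different sample sizes is easy to tabulate. Switching the denominator from n to n-1 multiplies the variance estimate by n/(n-1), so a minimal sketch (the function name `bessel_factor` is just for illustration) is:

```python
def bessel_factor(n):
    """Ratio between the n-1 and n versions of the variance estimate."""
    return n / (n - 1)

for n in (2, 5, 10, 30, 100, 1000):
    factor = bessel_factor(n)
    print(f"n={n:>5}  variance inflated by {100 * (factor - 1):6.2f}%")
```

The inflation is 100% at n=2, about 11% at n=10 and roughly 0.1% at n=1000, which supports the point that the correction mainly matters for small samples.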
Hey, how do the statisticians calculate the population mean when they collect samples and study them just because they can't deal with the whole population??
Thank you very much for your video, it was very very good at explaining. But I have one more question: if descriptive statistics do not try to generalize to a population (since there is no uncertainty in descriptive statistics), then why does the sample standard deviation try to best estimate the population standard deviation? Yet it is still considered a descriptive statistic.
i agree.....
True, still, it makes more sense once you meet the idea of biased/unbiased estimators.
Thanku
many things are waved away these days haha...they definitely assume we know the purpose of everything
You deserve waaaay more subscribers than you currently have. Really well-made videos and nice explanations. Thank you!
THREE FREAKING MONTHS OF CLASS!! 10:00 You ended my frustration in 5 minutes.. THANK YOU!!
I teach programming for a living. I need to learn stats for ML. This vid is AWESOME! SO clear and well presented! Amazing teaching!
You are doing a great job, really. Please continue doing it regardless of the number of subscribers or likes. You are just amazing.❤
I googled why need to divide by n-1, browsed several sites until I landed here.
Thanks for great explanation.
The best ever explanation I found after searching hundreds of sites and links...keep it up man!!!
you have won my heart... i couldnt find an explanation anywhere for this ... Thank you so much
Simply great. It even brings up and clarifies questions that I hadn't even thought about, but which are kind of important for understanding.
Thanks Zed, the way you laid out the first and second thoughts were quite literally exactly what was going through my head! you're a champ!
Yay ZedStatistics. These videos are so very valuable to help understand concepts. Great supplemental to classwork! Thanks Justin!
One of the best videos for understanding the actual meaning of the variance. Thanks a lot.
you are absolutely the goat in explaining stats
congrats on the views bro
Really superb explanation! It makes a huge difference to understanding when things are explained so clearly! Many thanks.
I'm gonna ship you a dozen packs of golden gaytime ice creams ! Thanks a bunch !
You're on... though please ship in winter lest it arrive as Golden Gaytime soup.
Note to self: Golden Gaytime Soup.
I prefer Weis Bars!! :D
melting...
This is the best explanation of these concepts. Thank you!
I have never seen a better explanation for degrees of freedom , it gave me chills . Thank you
Omg I've been pondering this for so long! I'm ever grateful
SIR YOU ARE THE BEST TEACHER EVER
The best teaching of statistics I ever found!
@12:55 is the plural of formula in Australia "formuli"? Love it. Or is it a diminutive? But great video. Thanks
Excellent explanation...crisp, precise and easily understandable. Thank you.
Thank you! Your videos are better than 9 units of statistics in uni!
This is the channel I have been looking for!
Well done! Very intuitive, good refresher when I had mostly forgotten my undergrad course...
An awesome explanation of the idea of degree of freedom. Thank you.
I like the video. You mention that we shouldn't use the absolute value for describing the spread of the data. The reason why this isn't done is not because it is incompatible with the "higher-order" statistics, but rather because most of statistics was developed with variance in mind. You could just as well develop the parts of statistics that lack looking at the absolute value, which is the L-1 norm. Netflix used an optimization algorithm which made use of this type of norm, which proves that it has practical application. You could also say that if the absolute value squared and cubed, etc. are important, then the absolute value itself must be important as well. They might have different uses, but you cannot say one is better than the other.
A lower order of integrability would be required for L1 norms, which with power laws of some choice of parameters might exist as a first moment while the second moments such as variance would not exist.
Thank you so much. I have been trying all morning to research this and you are the first person I have found who has directly and clearly said that the squaring method isn't better than the absolute value approach, it's just something that people often find useful when they want to do other things with the data later on. Every other resource that I have found on this topic seemed to be implying that there was some unexplained other reason why the squaring method was *better* than just taking the absolute value.
@@galenseilis5971A lower order of integrability would be required for what exactly?
I might be missing something, but taking a norm of data is just a kind of aggregation (summation). So, whether you take an aggregation of an L1 norm shouldn't prevent another aggregation that is an L2 norm (variance).
@@chasemcintyre3528 I'm glad I could provide some comfort 😄
Most often if someone can't give you an answer to the "why" it's likely that they are just parroting what they've been taught or heard.
Independent thought is the only way to fill those gaps in knowledge.
Good on you for searching all morning despite the resistance.
@@gustavstreicher4867 Reviewing your comments and the video, you are apparently missing the distinction between a sample and a(n infinite) population. I'll spare a few minutes to give you a more detailed explanation.
But before getting to your question, I want to point out something misleading in the video above. They present a hand-wavy explanation of why we use n-1 degrees of freedom instead of n degrees of freedom in the denominator of variance. Many people call the former the "sample variance" and the latter the "population variance", but this is misleading because they're both sample statistics that can be used to estimate the population variance when it exists. The reason we often prefer using the variance estimator with n-1 degrees of freedom is that it is corrected for estimator bias, which matters most at small sample sizes, assuming the data are independent draws from a distribution with finite variance (a normal distribution is one such case). Both estimators are consistent estimators of that population variance, meaning that they both eventually converge to it as the sample size grows. You have not said anything that makes me believe you fell for this misunderstanding, but I am offering the caution just in case.
Now let's head in the direction of your question. As you describe, you can calculate either of the (sample) mean absolute deviation (MAD) or a sample variance on a finite collection of real numbers. And as you mentioned, the L1 and L2 norms are closely related to these sample statistics. The L1 and L2 norms induce the Taxicab and Euclidean metrics respectively. The MAD is a rescaling of the Taxicab distance from the arithmetic mean. The variance is a rescaling of the square of the Euclidean distance from the mean. There is no particular issue with doing this on a sample, but that wasn't the substance of my comment, which concerns the population. Let's go over some population statistics now.
In mathematical statistics the population mean is the expected value of the random variable, often denoted as E[X] for a random variable X. I don't mean that some value is to be expected in an intuitive sense per se, but rather that there is a mathematical operator called the "expected value" that can be applied to a random variable. A random variable is a measurable function (i.e. the preimage of every measurable set is itself measurable) on the outcome space of the probability space. Which is to say, you should think of random variables as mathematical tools rather than something that is intuitively "unpredictable". A random variable is a type of mathematical model of a part of your data. In special cases an expected value of a random variable is an arithmetic mean, but it is more general than that. The population variance is likewise defined as E[(X - E[X])^2], so the expected value is relevant to understanding both the population mean and the population variance. The population analog of MAD is the expected value of the absolute difference between the random variable and its expected value, denoted E[|X - E[X]|]. For continuous random variables, like a normal random variable, you'll see that the expected value is defined in terms of an integral, which you can think of as the limit of certain sums. Okay, that's an overview of the definitions. But what's the problem then?
The problem is that these population quantities do not always exist. Fortunately they do exist for many distributions, including the normal distribution. One example where none of the population statistics we have discussed so far would exist is a Cauchy distribution. I invite you to try computing the MAD and variance (either flavor) on samples of increasing sample size from a standard Cauchy distribution. You'll find that neither of these statistics shows convergence behaviour in the long run. The sample quantities will exist, but they will not estimate any stable population quantity. Instead they will just jump around aimlessly. The Wikipedia page on the Cauchy distribution currently has some information on this unstable behaviour for the mean. Let's consider that "order of integrability" part of my earlier comment now.
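For anyone who wants to try the experiment suggested above, here is a minimal Python sketch. The function names, the sample sizes, and the seed are my own illustrative choices, and the standard Cauchy draws use the inverse-CDF trick tan(pi * (U - 0.5)). It is only meant to show the qualitative contrast with a normal sample, not to prove anything.

```python
import math
import random


def cauchy_sample(rng, n):
    # Inverse-CDF method: tan(pi * (U - 0.5)) is standard Cauchy for U ~ Uniform(0, 1).
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


def normal_sample(rng, n):
    # Standard normal draws, used here as the well-behaved comparison case.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def mad_and_variance(xs):
    # Sample mean absolute deviation about the sample mean, and the n-1 sample variance.
    n = len(xs)
    m = sum(xs) / n
    mad = sum(abs(x - m) for x in xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return mad, var


def running_stats(sampler, sizes, seed=0):
    # One fresh sample per size, all drawn from the same seeded generator.
    rng = random.Random(seed)
    return [mad_and_variance(sampler(rng, n)) for n in sizes]


if __name__ == "__main__":
    sizes = [100, 1_000, 10_000, 100_000]
    for name, sampler in [("normal", normal_sample), ("cauchy", cauchy_sample)]:
        for n, (mad, var) in zip(sizes, running_stats(sampler, sizes)):
            print(f"{name:7s} n={n:>7d}  MAD={mad:12.3f}  variance={var:16.3f}")
```

For the normal rows the two columns should settle near sqrt(2/pi) and 1. For the Cauchy rows there is no population value for them to settle on, so expect the numbers to change a lot from one sample size (or seed) to the next.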
There is a statistic which generalizes both the MAD and variance. Instead of considering an L1 or an L2 norm, we can consider an Lp norm. It induces a metric which we can take to a pth power to obtain the generalization. In terms of population statistics we can consider E[|X - E[X]|^p] to be the formal generalization. There is a downward closure property: for two orders p > q, if E[|X - E[X]|^p] exists then so does E[|X - E[X]|^q]. The largest order p (strictly, the supremum of such orders) for which the functional (E[|X - E[X]|^p])^(1/p) exists is what I called the order of integrability. So the population MAD might exist even when the population variance doesn't, which was the point I was making in the first place. Why doesn't the population variance exist for every distribution? Well, the quick hand-wavy answer is that some of the infinite series represented by these integrals don't converge. As we touched on above, estimating something that doesn't exist isn't really meaningful or helpful. I mentioned power laws before, e.g. the Pareto distribution, which are interesting cases in this regard because sometimes these population statistics exist and sometimes they do not, depending on the parameters. But I won't labor that as this comment is getting long.
If my explanation isn't clear, I suggest you go to a site more suited to discussions about math to get clarification. An example is Stack Exchange's Cross Validated community, which has support for mathematical notation and has members who are familiar with this topic.
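To make the Pareto remark concrete, here is a small sketch using the standard closed form for the raw moments of a Pareto distribution: E[X^p] = alpha * x_min^p / (alpha - p) when p < alpha, and infinite otherwise. The function name and the choice alpha = 1.5 are mine, purely for illustration. Raw moments are used as a proxy here: when the first raw moment is finite the population MAD is finite as well, and when the second raw moment is infinite the population variance does not exist.

```python
import math


def pareto_raw_moment(p, alpha, x_min=1.0):
    """E[X^p] for a Pareto(alpha, x_min) variable; infinite when p >= alpha."""
    if p >= alpha:
        return math.inf
    return alpha * x_min ** p / (alpha - p)


alpha = 1.5  # illustrative shape parameter between 1 and 2
first = pareto_raw_moment(1, alpha)   # finite: the mean (and hence the MAD) exists
second = pareto_raw_moment(2, alpha)  # infinite: the variance does not exist

print(f"alpha={alpha}: E[X]={first}, E[X^2]={second}")
```

With a shape parameter above 2 (say alpha = 3) both moments come out finite, which is the "depends on the parameters" behaviour mentioned above.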
This kind educator should be a millionaire! If you read comments on his videos, he's clearly cleaning up after thousands of (unhelpful) Stats and Data Analytics professors around the globe!!!!
Why squared deviation? Take it as the Euclidean distance between the 2 points (the mean and each point): the distance is always >= 0.
Amazingly explained this complex subject in every video, thank you sooooo much
Loved it. Been trying to understand this concept for some time now...
Thanks, finally after 35 years, I understand it.
Don't ever stop making videos.
Ah this was bothering me for the longest time! Thanks for the explanation!
1:40 you are expressing it in "square dollars" actually, to be precise
after watching 3 different videos, I understand this from yours, so u are the best.
my brain was "problem loading page" after watching this.. this isnt easy :(
Dead set legend. Pretty much replaced my unit's content with your videos. Cant thank you enough.
Yo! I'm really enjoying these videos so far. It's nice to be able to grasp something that seemed inaccessible for so long. One note on your spreadsheet, though. Two sentences have typos. "Note: this is now three alternate esimtations of the standard deviation for each sample"
The second question was answered, and answered most clearly.
Loved the explanation about degrees of freedom
First of all, you are already a diamond in statistics, so thank you for such extremely hard work. Can you please make a video showing that the mean absolute deviation is smallest when it is taken about the median?
I have literally scoured YouTube for months to understand a ridiculously poorly written textbook that I have no idea how it got published - (Statistics for Health Care Management and Administration by Kros and Rosenthal) - and I now feel that I am starting to conceptually understand the "why" and not just memorize formulas. Thank you for teaching these concepts!
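On the median request above: this is not from the video, just a quick numeric check in Python with a made-up data set and helper names of my own. It compares the total absolute deviation and the total squared deviation when measured from the median versus from the mean.

```python
from statistics import mean, median


def total_abs_dev(xs, centre):
    # Sum of absolute deviations of the data from a chosen centre.
    return sum(abs(x - centre) for x in xs)


def total_sq_dev(xs, centre):
    # Sum of squared deviations of the data from a chosen centre.
    return sum((x - centre) ** 2 for x in xs)


data = [1, 2, 3, 4, 100]  # hypothetical data with one extreme value
m, med = mean(data), median(data)

print("abs dev about median:", total_abs_dev(data, med))
print("abs dev about mean:  ", total_abs_dev(data, m))
print("sq dev about median: ", total_sq_dev(data, med))
print("sq dev about mean:   ", total_sq_dev(data, m))
```

On this data the absolute deviations are smaller about the median and the squared deviations are smaller about the mean, which matches the general result that the median minimises the first and the mean minimises the second.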
5:55-Why we don't use Mean Absolute deviation for describing set of data:
You're an excellent teacher
Absolutely amazing explanation. May Allah bless you and grant you guidance.
Amazing work, well done!
the reason in both cases is mainly historical
there is no real reason not to use the more intuitive average deviation (AKA mean absolute deviation) when differentiability is not a requirement - in fact the logical thing when one is looking for mean deviations would be to do just that, and the argument often given in text books is that stdev also works, which is true of course, but a logically flippant reason
there is also no reason to use n-1 specifically for most purposes when calculating population variance, which is kind of implied by the fact that the -1 makes a tiny difference for any significant amount of samples
Absolutely brilliant explanation 🔥
wowowwww thanks!!! never watched a better explanation of DoF
Great video! I personally find the idea of "degrees of freedom" to be a confusing and overall nonsense way of describing why we divide by n-1 for sample variance. Inherently, when you are taking a sample of size n from a population, each observation is independent and could be anything, so there are n degrees of freedom. It's not until you posit that "given the sample mean x_, and these n-1 observations, you can determine what the last ungiven observation is" that any constraint appears. I think that using the term "degrees of freedom" here makes no sense, and seems to imply that only n-1 of the observations were truly random/independent, which is obviously not the case. Unless the idea of "degrees of freedom" has some other application that I'm not aware of, I think it should be thrown out entirely, as the way you explained why we divide by n-1 for samples makes far more sense and doesn't imply anything that isn't true.
Please keep making videos its quite helpful
Fabulous explanation sir! Thank you very much!
Your videos are the goldest and the gayest of times. I thank you good sir for making stats not only more understandable but also more fun 😂
Makes me miss my days down under….that shit is so fire 🔥
Really appreciate the way you teach statistics in a simplified manner. Myself a real fan of you..... I have a doubt, if we derive population mean with all the data points of the population (not from some of the sample points), would the degree of freedom be N-1?
Thanks in advance.
A more direct explanation of using n-1 in the calculation of sample variance is that the variance computed with n is a biased estimator of the population variance. Look up Bessel's correction for the derivation that proves that the correction is n-1 rather than another choice such as n-2, n-3, ..., etc.
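If you would rather see the bias than derive it, here is a small exact check in Python. It assumes a toy population that is uniform on the digits 0 to 9 and samples of size 3 drawn with replacement; both choices and the function name are mine for illustration. Because every possible sample is enumerated, there is no simulation noise.

```python
from fractions import Fraction
from itertools import product


def expected_variances(population, n):
    """Average the divide-by-n and divide-by-(n-1) variances over every
    possible sample of size n drawn with replacement from the population."""
    pop = [Fraction(x) for x in population]
    mu = sum(pop) / len(pop)
    sigma2 = sum((x - mu) ** 2 for x in pop) / len(pop)  # true population variance

    total_n = total_n1 = Fraction(0)
    count = 0
    for sample in product(pop, repeat=n):
        xbar = sum(sample) / n
        ss = sum((x - xbar) ** 2 for x in sample)  # squared deviations about x-bar
        total_n += ss / n
        total_n1 += ss / (n - 1)
        count += 1
    return sigma2, total_n / count, total_n1 / count


sigma2, avg_div_n, avg_div_n1 = expected_variances(range(10), 3)
print("population variance:     ", sigma2)
print("average of /n version:   ", avg_div_n)
print("average of /(n-1) version:", avg_div_n1)
```

The divide-by-n average comes out at (n-1)/n times the population variance, while the divide-by-(n-1) average matches it exactly, which is what Bessel's correction promises for independent draws.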
Great videos even though i have yet to fully absorb all the interesting content,since i am a beginner. Very informative videos. Thank you
Another reason why we follow L2 Norm instead of L1 Norm is that L2 Norm is differentiable....
Your videos are so useful, thank you so much! One thing I can't get my head around here though. So, we divide by n-1 (as opposed to n) to account for the variance needing to be larger as our sample mean is just an approximation of the population mean and the variance of the population mean is as small as it can be. But, we don't know the population mean so our sample mean could be the same as the population mean and thus we would be over estimating the variance by dividing by n-1 and not n. Is this true?
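One way to look at this question, sketched under the assumption of n independent draws from a population with mean \mu and finite variance \sigma^2 (the notation is mine, not the video's):

```latex
\sum_{i=1}^{n}(x_i-\mu)^2
  \;=\; \sum_{i=1}^{n}(x_i-\bar{x})^2 \;+\; n(\bar{x}-\mu)^2
\qquad\Longrightarrow\qquad
\sum_{i=1}^{n}(x_i-\bar{x})^2 \;\le\; \sum_{i=1}^{n}(x_i-\mu)^2 .

\mathbb{E}\!\left[\sum_{i=1}^{n}(x_i-\mu)^2\right] = n\sigma^2,
\qquad
\mathbb{E}\!\left[n(\bar{x}-\mu)^2\right] = n\cdot\frac{\sigma^2}{n} = \sigma^2
\qquad\Longrightarrow\qquad
\mathbb{E}\!\left[\sum_{i=1}^{n}(x_i-\bar{x})^2\right] = (n-1)\,\sigma^2 .
```

So yes, in a particular sample where x-bar happens to land on mu, dividing by n-1 gives a larger number than dividing by n would. Any single sample can overshoot or undershoot anyway. The n-1 is chosen so that the estimate is right on average over all the samples you could have drawn, not in each individual one.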
Good explanation. Thanks. But I still do not get it. I downloaded your excel sheet, and OK, it shows that for a sample of 10, the sample's stdev is closer to the population's stdev. But what about samples of other sizes? For large samples, it will tend to the divided-by-n value (which is fine), but as the sample gets smaller, the (n-1) stdev will tend to overshoot.
Shouldn't there be a sweet spot for dividing by n-1 ?
I still think the whole thing is very arbitrary "We don't know where the population mean is, it may be anywhere, so we divide by n-1 instead of n to get a more precise stdev"
I could propose subtracting anything from the denominator in order to broaden the variance. For instance: (|Fib(n)/Fib(n-1) - phi| ) where Fib(x) is the Fibonacci sequence number for x. and phi is the golden number. This will tend to 0 for large enough values of n, although creating some havoc for small sample sizes. But why not?
In last 2 example what will be the values of N and n respectively
I have a question why you just took any number for 3rd observation in population and you did not take any random number for 3rd observation of sample. If we know the population mean we can also use the same logic to calculate the third observation in population unless we are assuming sample is of size 3 and population size is not given.
I’m new to this whole subject so maybe I’m wrong in my understanding. But, assume you have been tracking your fuel economy in your car every week for several months. You have a large data set and have the mean of that data set. The next time you go get fuel and calculate your fuel economy, none of those previous data points will have any bearing on this new economy. Maybe this week you idled more, reducing your economy, or maybe you went on a road trip which tends to improve your economy. So, this week your economy can be any value from 0 to the theoretical best fuel economy for your car, not restricted to previously experienced fuel economy averages.
In the sample set, you’re taking already experienced fuel economy numbers, which make up the previously calculated mean, and therefore any additional data point is restricted to that set.
In other words, the first is forward looking, let’s see what happens next, and the second is looking back at what has happened, and let’s use that to guess what could happen next.
If I’m wrong in this, someone please correct me.
I did not understand some things:
1. Why is Σ(x-x')=0? here x' is sample mean? this is during the calculating of degrees of freedom
2. Why is population mean fictional? Why can't we find it in reality? Can't we calculate it sum of observation divided by num of observation?
3. I did not clearly understand why did we need to inflate the estimated value and not decrease it? What if the sample mean was to the left of both the points. Still do we need to inflate the estimated variance?
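On question 1, a tiny Python sketch may help. The numbers are made up (two known observations and a stated sample mean) and the function name is mine. Because the sample mean is defined as the sum divided by n, the deviations from it must sum to zero, so once n-1 values and the mean are fixed the last value has no freedom left.

```python
from fractions import Fraction


def last_observation(known, sample_mean):
    # n * x_bar is the total of all n observations; subtract the known ones.
    n = len(known) + 1
    return n * Fraction(sample_mean) - sum(Fraction(x) for x in known)


known = [48, 55]  # hypothetical first two observations
x_bar = 51        # hypothetical sample mean of all three
last = last_observation(known, x_bar)

deviations = [x - x_bar for x in known + [last]]
print("forced last observation:", last)
print("deviations from x_bar:", deviations, "sum =", sum(deviations))
```

Swap in any other values for the known observations and the mean, and the deviations will still sum to zero; only the forced last value changes.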
before starting discussion about the sample and population mean
you could just explain what is difference between population and sample ?
sample is just subset of all population observations.
This video is so helpful! Thank you!!
VERY INFORMATIVE. EXCELLENT EXPLANATION.
4:40 why would you not just take the absolute value of this instead of squaring it?
1:33, your notation is a little off. It should include the last element when you have a finite list. For example, to list the numbers from one to ten we could write 1, 2, … ,10. Writing a1, a2, a3,… means that there are infinitely many a’s. Hope it helps!
Thank you for this..you did a great job at explaining this..
For degrees of freedom why does the sample spread have to equal 0?
Thanks for clear and well delivered explanations! ....... How on earth did I study before youtube?????
Zed, hope you can answer this. In your excel file you write:
"Imagine taking a sample of 10 students in your class and asking them to write down the final digit of their student number.
NOTE: This is like a random selection between digits 0-9. Thus, a known population mean of 4.5.
"
Couldn't the student IDs all end in 1 or skew from 4.5? It seems like you're either taking the mean of the set of available values or assuming that this sample has this particular mean.
Good question. By "known population mean" I'm suggesting that this 10 person sample (which can skew, as you say) is nonetheless taken from a population that has a mean of 4.5.
You can consider the population to be ALL The students in the university (or even the world, if you like). So you need to separate the notion of a population pool from which we are selecting AND the actual selection.
The population average height might be 175 cm . But that doesn't mean a sample of 10 people will have this average.
@@zedstatistics Ah understood. Thanks for the reply. Great job on the spreadsheet btw, that and this video are the clearest explanations I've come across
Please help me with my confusion here. If you decrease the denominator, n-1, you increase or "adjust" the numerator. So, does it increase the variance? I don't even know what I'm asking? (so confused &^%(Q^#%#)
Is there anything wrong with using a weighted average as the mean in a variance or standard deviation equations?
hello Justin.. love love your video! The spreadsheet can not be downloaded however..
I'd pay for tickets to the cinema if this video was on.
Thank you so much Zed for your teaching materials. For the attachment, is it possible if we have the password to unprotect the sheet? Because I would like to type something on the file to experiment. Thank you!
Finally I got my answer of the second question after searching a lot..thanks a lot
so I think this was great. It sounded like there was a bit of an implicit rule though, like how a population mean doesnt predetermine (or just say have a causal relationship) with the data set. so ".../n". But with sample mean, the mean does predetermine the data set. Thus the values can be anything they want (have complete freedom) where as the last term of the data set has no degree of freedom (this is in order to adjust the data set to match the already given sample mean). Now I'm wondering why pop mean doesnt have the same relationship as the sample mean has to its own data set.
Hi my friend, I have the same doubt as yours, and it's killing me! Have you figured out why the hell the population has more degrees of freedom, if the sum of all deviations is equal to 0 too?? I mean, you can also determine the last value, as you do using the sample data set!
It's a conspiracy!
Very interesting observation. Here's my very interesting answer:
Imagine you have a whole population at your disposal. The mean for that population (call it mu if you like) exists BEFORE you measure each of the individual members of the population. Measuring each observation is merely a way to REVEAL what already existed objectively (if you are religious you might say that God knew the population mean before I started taking each observation).
The same cannot be said about a sample. Imagine I intend on sampling 10 people from a much larger population. By the time I have sampled and measured 9 people, not even God knows what the sample mean is going to be. This, ultimately, depends on the final person that I randomly take into my sample. (Atheists can simply say that there is no objective mean that exists for the sample until that final observation is taken).
So, to summarise: for populations the mean exists logically BEFORE the measuring of the observations (ie. there is actually no ESTIMATION going on). For samples, the observations PRECEDE the mean - in other words the mean depends on the random observations chosen.
For that reason we only associate degrees of freedom with ESTIMATION. So only with samples. As there is no estimation going on in populations (only calculation) then we don't bother ourselves with Degrees of Freedom.
Told you it was interesting :)
@@zedstatistics that's interesting. I just need to imagine how I should have a mean before I've had a chance to measure the members of a population.
@@matheusmf4135 I think zero is mu for pop, and x bar for sample mean
@@18despues Yes! Well believe it or not this is actually an important principle in frequentist statistics (ie. the statistics you see in text books): Mu exists, and it is some exact single number whether you measure it or not.
If you find that uncomfortable, then never fear! Bayesian statisticians are with you! They (roughly) treat mu as having some probabilistic range.
what is the difference between population mean and sample mean? Is sample mean just the mean of FEWER elements of the same population?
Fantastic video! Glad I found your channel!
hello @zedstatistics, why do we need s.d when we have variance?
Brilliant explanation!
Hi Zed, the downloadable spreadsheet link isn't working. Where can I find the working link?
Good video. However, my only issue with N-1 is that, I think subtracting 1 from the sample size is not going to cause any significant effect or difference, especially when the sample size is large. N-1 as a denominator will bring about a very small correction to the variance.
This concept of degrees of freedom, to a lay person like me, seems more academic than useful. Please correct me if I'm wrong.
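To put rough numbers on how much the correction matters, here is a short Python sketch (the function name and the list of sample sizes are mine). The factor n/(n-1) is the ratio between the divide-by-(n-1) variance and the divide-by-n variance computed on the same sample.

```python
def bessel_factor(n):
    # Ratio of the (n-1)-denominator variance to the n-denominator variance.
    if n < 2:
        raise ValueError("need at least two observations")
    return n / (n - 1)


for n in (2, 5, 10, 100, 1000):
    print(f"n={n:>5d}  variance inflated by a factor of {bessel_factor(n):.4f}")
```

So the point about large samples holds, since the factor is almost exactly 1 by n = 1000, but with a handful of observations the two versions differ by a lot (a factor of 2 when n = 2).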
Thanks for this amazing explanation.
Hey, how do the statisticians calculate the population mean when they collect samples and study them just because they can't deal with the whole population??
Thank you very much for your video, it was very very good at explaining. But I have one more question: if descriptive statistics do not try to generalize to a population (since there is no uncertainty in descriptive statistics), then why does the sample standard deviation try to best estimate the population standard deviation? Yet it is still considered a descriptive statistic.