The weirdest paradox in statistics (and machine learning)

  • Published: 20 Jun 2024
  • 🌏 AD: Get Exclusive NordVPN deal here ➼ nordvpn.com/mathemaniac. It's risk-free with Nord's 30-day money-back guarantee! ✌
    Second channel video: • Why James-Stein estima...
    Stein's paradox is of fundamental importance in modern statistics: it introduces the concept of shrinkage to further reduce the mean squared error, especially in higher-dimensional statistics, which is particularly relevant nowadays, for example in machine learning. However, it is usually ignored, because it is mostly seen as a toy problem. Yet it matters precisely because it is such a simple problem that illustrates the problem with maximum likelihood estimation! This paradox is the subject of many blog posts (linked below), but not really here on YouTube, except in some lecture recordings, so I have to bring it to YouTube.
    This is not to say that the maximum likelihood estimator is not useful - in most situations, especially in lower-dimensional statistics, it is still good. But holding it in such high regard, as statisticians did before 1961? That is not a healthy attitude to this theory.
    One thing I did not say, but perhaps a lot of people will want me to, is that this is an empirical Bayes estimator, but again, more links below.
    Video chapters:
    00:00 Introduction
    04:38 Chapter 1: The "best" estimator
    09:48 Chapter 2: Why shrinkage works
    15:51 Chapter 3: Bias-variance tradeoff
    18:45 Chapter 4: Applications
    Further reading:
    The “baseball paper”: efron.ckirby.su.domains//othe...
    Wikipedia: en.wikipedia.org/wiki/Stein%2...
    Dominating the (positive-part) James-Stein estimator: projecteuclid.org/journals/an...
    Wikipedia (Empirical Bayes): en.wikipedia.org/wiki/Empiric...
    Other writeups:
    www.ime.unicamp.br/~veronica/M...
    joe-antognini.github.io/machi...
    www.jchau.org/2021/01/29/demy...
    www.naftaliharris.com/blog/st...
    austinrochford.com/posts/2013...
    duphan.wordpress.com/2016/07/...
    www.statslab.cam.ac.uk/~rjs57/...
    (Philosophical implications) philsci-archive.pitt.edu/13303...
    Other than commenting on the video, you are very welcome to fill in a Google form linked below, which helps me make better videos by catering for your math levels:
    forms.gle/QJ29hocF9uQAyZyH6
    If you want to know more interesting Mathematics, stay tuned for the next video!
    SUBSCRIBE and see you in the next video!
    If you are wondering how I made all these videos: even though they are stylistically similar to 3Blue1Brown's, I don't use his animation engine Manim, but I will probably reveal how I did it at a potential subscriber milestone, so do subscribe!
    Social media:
    Facebook: / mathemaniacyt
    Instagram: / _mathemaniac_
    Twitter: / mathemaniacyt
    Patreon: / mathemaniac (support if you want to and can afford to!)
    Merch: mathemaniac.myspreadshop.co.uk
    Ko-fi: ko-fi.com/mathemaniac [for one-time support]
    For my contact email, check my About page on a PC.
    See you next time!

Comments • 899

  • @mathemaniac
    @mathemaniac  1 year ago +68

    Go to nordvpn.com/mathemaniac to get the two year plan with an exclusive deal PLUS 4 months free. It’s risk free with NordVPN’s 30 day money back guarantee!
    Please sign up because it really helps the channel!
    [My pinned comment gets removed by YouTube AGAIN!!!]

    • @JCResDoc94
      @JCResDoc94 1 year ago +2

      bc everything is related, eventually. in the oneness of God. right? _JC

    • @andsalomoni
      @andsalomoni 1 year ago

      This paradox should mean that you can't have 3 or more independent distributions. The maximum is 2.

    • @qkktech
      @qkktech 1 year ago

      There is a better estimator when you do a Fourier transformation and go to a single dimension on the system

    • @terrywilder9
      @terrywilder9 10 months ago

      @@andsalomoni That doesn't work! Any three elements of a functional basis are independent. That's why when you are making a maximum likelihood estimate you are assuming a distribution also.

  • @ludomine7746
    @ludomine7746 1 year ago +1566

    This is insane. The demonstration with the points in 3D and 2D space not only made it clear why it works, but also made it clear why it doesn't work as well in 2D. Going from the paradox being magic to somewhat understandable is beautiful. I loved this video.

    • @mathemaniac
      @mathemaniac  1 year ago +48

      Thanks for the kind words!

    • @mrbutish
      @mrbutish 1 year ago +5

      Also, when I use MSE and lme with the ordinary estimator, I PCA the n dimensions into 2D so that this situation never arises and MSE is effective and dominates. Instead of PCA, LDA or SVM also works. If no PCA, go RMSprop + momentum; Adam does well/dominates

    • @arnoldsander4600
      @arnoldsander4600 1 year ago +2

      @@mrbutish I hoped for a similar moment but the accent really hurt my brain. Couldn't concentrate on anything but the pronunciation of estematourr.
      Darn my brain.

    • @user-jb8yv
      @user-jb8yv 10 months ago +1

      @@arnoldsander4600 not even a strong accent

    • @john-ic5pz
      @john-ic5pz 10 months ago

      @@arnoldsander4600 I like the way he says "sure". 😊

  • @marshallc6215
    @marshallc6215 1 year ago +778

    For a layman, I think the worry after first seeing this explained (given the *very* fast hand waving with the errors at the beginning) is that you might suddenly be able to estimate something better by adding your own random data to the question, which by definition, makes the three data points not independent. The thing is, and I'm surprised you never clarified this, we aren't talking about a better estimation for any given distribution. We're talking about the best estimator for *all three* distributions as a collective. We're no longer asking 3 questions about 3 independent data sets, but 1 question about 1 data set containing 3 independent series. There is no paradox here, because it is pure numbers gamesmanship and is no longer the intuitive problem we asked at the beginning.
    When we went to multiple data sets, the phrasing of the question is the same, but the semantic meaning changes.
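A minimal simulation sketch of the "one question about three series" point above. It assumes unit-variance normal observations, one observation per mean, and made-up true means; the names `james_stein` and `simulate` are illustrative only, not taken from the video. It reports the error per coordinate and the summed error, since it is the summed error that the James-Stein estimator is guaranteed to improve.

```python
import random

def james_stein(x):
    """Shrink the observation vector x toward the origin (unit variance assumed)."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (p - 2) / norm_sq
    return [factor * v for v in x]

def simulate(theta, trials=20000, seed=0):
    """Return per-coordinate MSE of the raw observation and of James-Stein."""
    rng = random.Random(seed)
    p = len(theta)
    raw_sq = [0.0] * p
    js_sq = [0.0] * p
    for _ in range(trials):
        # One noisy observation per unknown mean.
        x = [rng.gauss(t, 1.0) for t in theta]
        js = james_stein(x)
        for i in range(p):
            raw_sq[i] += (x[i] - theta[i]) ** 2
            js_sq[i] += (js[i] - theta[i]) ** 2
    return [s / trials for s in raw_sq], [s / trials for s in js_sq]

theta = [1.0, -0.5, 0.5]  # hypothetical true means, chosen for illustration
raw_mse, js_mse = simulate(theta)
print("per-coordinate MSE, raw:        ", [round(m, 3) for m in raw_mse])
print("per-coordinate MSE, James-Stein:", [round(m, 3) for m in js_mse])
print("total MSE, raw:        ", round(sum(raw_mse), 3))
print("total MSE, James-Stein:", round(sum(js_mse), 3))
```

The comparison that matters is the last two printed lines: the guarantee is about the sum over all three coordinates, not about any single coordinate on its own.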

    • @Achrononmaster
      @Achrononmaster 1 year ago +67

      That is a good summary. One should not geometrize independent variables. A similar "paradox" occurs in uncertainties for complex numbers, so even the 2D case. If you Monte Carlo sample the modulus of a random z ϵ ℂ with 2D Z-dist centered at 0 + 0i the average |z| is something like a random walk distance, sqrt(N). But if you sample real and imaginary parts, average them, then compute the mean z then take |·| of that , it'll converge to zero as N → ∞. The average |z| ∼ sqrt(N), but the average z = 0 + 0i.

    • @guillaumecharrier7269
      @guillaumecharrier7269 1 year ago +51

      Well put - I think this would have deserved at least a sentence or two in the video.

    • @sender1496
      @sender1496 1 year ago +21

      I think the only thing that might need clarifying is the definition of "better". Still though, I think the video made it clear that this estimator won't be better on average for the individual collections, but rather for this new cost function which adds the individual costs collectively. You're right however that it gets hard to phrase it as three independent questions, because they would be like: "Find the estimator f(x1, x2, x3) that minimizes the cost", when said "cost" would also involve the other collections.

    • @xyzbesixdouze
      @xyzbesixdouze 1 year ago +3

      If you include a random set of your own to get beyond 2 dimensions, then that fake data, with its influence on the mean error, will take over, so there is no meaningful conclusion about the original sets. On the other hand, if you just duplicate a set 3 times to go from 1D to 3D, then you haven't introduced other data and still get another mean, even though the original mean is proven to be the best?

    • @sender1496
      @sender1496 1 year ago +13

      @@xyzbesixdouze But duplicating the set wouldn't generate a new independent set, would it? There would be correlation. This changes the distribution completely (won't be circles/spheres/etc. around the mean point), meaning that the justification for the James-Stein estimator won't work.

  • @Achrononmaster
    @Achrononmaster 1 year ago +38

    Lesson: One should not geometrize independent variables. A similar "paradox" occurs in uncertainties for complex numbers, so even in the 2D case. If you Monte Carlo sample the modulus of a random z ϵ ℂ with 2D Z-dist centred at 0 + 0i, the average |z| is something like sqrt(π)/2 (Rayleigh distribution). But if you sample real and imaginary parts, average them, then compute the mean z, then take |·| of that, it'll converge to zero as N → ∞. The average |z| ∼ sqrt(π)/2, but the average z = 0 + 0i.
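A quick Monte Carlo sketch of the comparison described in this comment. One assumption is made to match the quoted sqrt(π)/2: each of the real and imaginary parts has variance 1/2, so that E|z|² = 1. With unit variance per component the Rayleigh mean would be sqrt(π/2) instead. The variable names are illustrative.

```python
import math
import random

rng = random.Random(0)
n = 100000
# Assumed convention: variance 1/2 per component, so that E|z|^2 = 1.
sigma = math.sqrt(0.5)

samples = [complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(n)]

# Average the moduli of the individual samples.
mean_of_modulus = sum(abs(z) for z in samples) / n
# Average the samples first, then take the modulus of that average.
modulus_of_mean = abs(sum(samples) / n)

print("average |z|    :", round(mean_of_modulus, 4))
print("sqrt(pi)/2     :", round(math.sqrt(math.pi) / 2, 4))
print("|average of z| :", round(modulus_of_mean, 4))
```

The two numbers answer different questions: the first averages distances from the origin, which are never negative and cannot cancel, while the second averages signed coordinates, which do cancel.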

    • @roromaniac8
      @roromaniac8 10 months ago

      What is this “paradox” called?

    • @cubing7276
      @cubing7276 10 months ago +1

      they don't feel the same tbh, i think a more similar comparison would be to compute the average distance traveled in the real and imaginary component and then add them up

  • @SirGisebert
    @SirGisebert 1 year ago +321

    The bias-variance decomposition is part of my PhD thesis and I just gotta say your visualizations and explanations are very clean and intuitive. Good job!

    • @mathemaniac
      @mathemaniac  1 year ago +13

      Wow, thank you!

    • @FirdausIsmail1
      @FirdausIsmail1 1 year ago +10

      This presentation is PhD level and beyond! So clear and easily digestible

    • @dukeingreen7980
      @dukeingreen7980 1 year ago

      I am glad it is still of relevance. It was one key element of my doctoral dissertation 30 years ago, even if I did not fully understand the relevance at that point. Best wishes for your career if you are young and thank you for sharing.

    • @maxwornowizki422
      @maxwornowizki422 1 year ago

      Another great real-life visualization of the concept is the following: Imagine two people playing darts. One of them hits all parts of the dartboard more or less symmetrically. They are on average in the middle, but each individual arrow might land close to the edges of the board. This is low or even zero bias but high variance. The other player's arrows always land very close to each other, but they don't center around the bullseye. The person is very focused and consistent, but can't get around the systematic misjudgement of the bullseye's position. Still, if they are close enough, they might win the majority of matches.
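The darts picture can be put into numbers with a small sketch. It simplifies the board to one dimension, and the spreads and offset of the two players are made-up values; `summarize` is an illustrative helper. It checks the decomposition MSE = bias² + variance on simulated throws.

```python
import random

def summarize(throws, bullseye=0.0):
    """Return (bias, variance, mse) of 1-D throw positions relative to the bullseye."""
    n = len(throws)
    mean = sum(throws) / n
    bias = mean - bullseye
    variance = sum((t - mean) ** 2 for t in throws) / n
    mse = sum((t - bullseye) ** 2 for t in throws) / n
    return bias, variance, mse

rng = random.Random(0)
n = 10000
# Player A: centred on the bullseye but scattered (hypothetical spread of 3.0).
player_a = [rng.gauss(0.0, 3.0) for _ in range(n)]
# Player B: tightly grouped but aimed off-centre (hypothetical offset 1.0, spread 0.5).
player_b = [rng.gauss(1.0, 0.5) for _ in range(n)]

for name, throws in [("A", player_a), ("B", player_b)]:
    bias, variance, mse = summarize(throws)
    print(name, "bias^2 + variance =", round(bias ** 2 + variance, 4), "mse =", round(mse, 4))
```

With these assumed numbers the consistent but biased player ends up with the smaller mean squared distance from the bullseye, which is the trade the comment describes.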

    • @brendawilliams8062
      @brendawilliams8062 1 year ago

      I am not a PhD. I would divide 7408 by 3. Then I would take 2469333…. And the square root is very close to pi. If you times it by two. That’s why the denominator I will do best with the largest no. You are not avoiding crystals.

  • @ChatSceptique
    @ChatSceptique 1 year ago +12

    I'm a PhD in statistics, never heard of that one before. It's really cool, thanks for sharing

  • @logician1234
    @logician1234 1 year ago +250

    Does this paradox have any connection to the fact that random walk in 1 or 2 dimensions almost always returns, while in 3 and more dimensions it has a finite probability that it may never return? Proof for this uses normal distribution but I may be terribly wrong lol

    • @mathemaniac
      @mathemaniac  1 year ago +121

      Have you seen my idea list? (I mean I did post it on Patreon)
      Yes, there is a connection! But the next video is just about the random walk itself (without using normal distribution / central limit theorem), because the connection is explored in a very involved paper by Brown:
      projecteuclid.org/journals/annals-of-mathematical-statistics/volume-42/issue-3/Admissible-Estimators-Recurrent-Diffusions-and-Insoluble-Boundary-Value-Problems/10.1214/aoms/1177693318.full

    • @logician1234
      @logician1234 1 year ago +19

      Cool, I haven't seen your list, I don't use Patreon. Can't wait for the next video

    • @leif1075
      @leif1075 1 year ago +3

      @@mathemaniac any tips on how to pay attention and stay interested and focused in statistics especially when it gets sso looonng and tedious??

    • @enbyarchmage
      @enbyarchmage 1 year ago +10

      @@leif1075 As someone with ADHD, I know very well how long and tedious lectures can make focusing literally impossible. Thus, I've given myself the liberty to give you a tip: try doing most of your research using resources that actually make the subject seem interesting to you. There surely are books that can teach even advanced college-level Statistics in simultaneously accessible and rigorous ways.

    • @leif1075
      @leif1075 1 year ago +2

      @@mathemaniac why is p there in p minus 2... you didn't mention that at all

  • @ej3281
    @ej3281 1 year ago +4

    this was really good, thank you! I used to work in a machine learning/DSP shop and did a lot of reading about estimators but I'm not sure I ever fully understood until I saw this video.

  • @dcterr1
    @dcterr1 1 year ago +3

    I'm not all that familiar with advanced statistics, but I was pretty blown away by this paradox when you first presented it! However, once you started explaining how we normally throw out outliers in any case, it began to make a lot more sense. Good video!

  • @CampingAvocado
    @CampingAvocado 1 year ago +140

    The fact that I'm not particularly interested in statistics and also on my only 3 weeks of holidays from my maths-centric studies, yet I still was really excited to watch this video speaks for its quality. Thank you again for the amazing free content you provide to everyone!!

    • @mathemaniac
      @mathemaniac  1 year ago +6

      Thanks for the kind words!

    • @peterlustig2048
      @peterlustig2048 1 year ago

      Eth-Student?

    • @CampingAvocado
      @CampingAvocado 1 year ago

      @@peterlustig2048 indeed

    • @peterlustig2048
      @peterlustig2048 1 year ago

      @@CampingAvocado Can't wait to finally complete my master's, I had so little free time the last few years...

    • @CampingAvocado
      @CampingAvocado 1 year ago +1

      @@peterlustig2048 Congrats to your soon to be acquired freedom then :)

  • @abdulmasaiev9024
    @abdulmasaiev9024 1 year ago +7

    This is very good. The only notes I have for how it might be improved are:
    1. Make it clearer that when we have the 3 data points early in the video, we know from which distribution each of them comes, rather than just having 3 numbers. So, we know that we have say 3 generated from X_1, 9 generated from X_2 and 4 generated from X_3 rather than knowing that there's X_1, X_2 and X_3 and each generated a number and the set of the numbers that were generated is 3, 9, 4 but have no idea which comes from which. It can be sort of inferred from them ending up in a vector, but still.
    2. "Near end" vs "far end", the near end being finite vs far end being infinite is a bit ehh as a point. It invites the thought of "well who cares how big the effect is in the finite area or how small it is in the infinitely large area, there will be more total shift in the latter anyway - it's infinite after all!". What matters is the probability mass for each of those areas (and its distribution and what happens to it), and that's finite either way.
    Other than that, excellent video. Nice and clear for some relatively high level concepts.

  • @djtwo2
    @djtwo2 1 year ago +18

    The relevance of "mean square error" here can be considered in the context of the probability distribution of the error. Here "error" means total squared error as in the video. Applying the shrinkage moves part of the range of errors towards zero, while moving a small part of the range away from zero. The "mean squared error" measure doesn't care enough about the few possibly extremely large errors resulting from being moved away from zero to counterbalance the apparent benefit from moving some errors towards zero. But any other measure of goodness of estimation has the same problem. There are approaches to ranking estimation methods (other ways of defining "dominance") that are based on the whole of the probability distribution of errors, not just a summary statistic. This is similar to the idea of "uniformly most powerful" for significance tests. The practical worry here is that a revised estimation formula can occasionally produce extremely poor estimates, as is illustrated in the slides in this video.

  • @jadegrace1312
    @jadegrace1312 1 year ago +37

    I don't think you did a very good job in the introduction of giving motivation for why it would even be possible to find a better estimator than our naive guess. As the video went on it made sense, but at the beginning when you were introducing the concept of multiple independent distributions, I wish you had included a line like "we are trying to find the best estimator overall for the system of three independent distributions, which may not be the same as the best estimator for each independent distribution".

    • @mathemaniac
      @mathemaniac  1 year ago +10

      Thanks for the feedback! I did initially want to include this into the script but eventually decided against it. This is because when I first read about Stein's paradox, and that it is because of reducing the overall error rather than individual errors, I just moved on, because I immediately felt the paradox is resolved. But when I read about James-Stein estimator again (because of the connection with the next video), I realised it was a much bigger deal than I thought it would be, like the idea of shrinkage and bias-variance tradeoff. In my opinion, this would be a much, much more important concept.
      In other words, if I said the line that you suggested, in the beginning of the video, my past self just would not continue to learn the much more important lessons later on in the video. So perhaps if given the second chance, I could have said it at the end of the video, but I would still not put this in the beginning.

    • @afterthesmash
      @afterthesmash 1 year ago

      @@mathemaniac Ah, but you must also know that burying the lead for tactical reasons is a very dangerous game.
      My formal math education predates Moses, but I think I still have good instincts, most of the time. In my own writing practice I often take wildly unconventional paths, to help break people out of established cognitive grooves. It's a useful posture, and sometimes it's not bad to inform the process from an introspective stance on _your own_ foibles and aversions.
      But you also have to be as honest as possible up front, and not go "hey, surprise, bias!" in the third act, when the gun was already smoking at the first rise of the curtain. Surely there's only one possible unbiased estimator for a symmetric distribution. You know, that first screen you introduced. Which way would you deviate? It's symmetric, you can't choose.
      Having but one unbiased estimator on the store shelf, if you have no bias tolerance, you are done, done, done in the first act. This was making me scream inside for the first ten minutes. And then if you go on to show that least squares estimation steers you into a biased estimator, what you _ought_ to conclude is that least squares (as applied here) is _totally inappropriate_ for use in regimes with zero bias tolerance. Which is an interesting result on its own terms.
      Furthermore, I had a lot of trouble with the starting point where you know the variance for certain, but you're scrabbling away with one data point to estimate the mean. Variance is the higher moment, which means we are operating in a moment inversion (like a temperature inversion over Los Archangeles), where our certitude in higher moments precedes our certitude in lower moments, which is pretty weird in real life. So I mentally filed this as follows: in an Escherian landscape where you know your higher order moments before your lower order moments (weird), then sometimes grabbing for least squares error estimation by knee-jerk habit will either A) lead you badly astray (zero bias tolerance); or B) lead you to a surprising glade in the promised land (you managed to pawn some bias tolerance for a dominating error estimator).
      I admire your thought process to take a motivated, pedagogical excursion. But failing to state that the naive estimator is the only possible unbiased estimator at first opportunity merely opened you up to a different scream from a different bridge. Because this whole thing was The Scream for me for the first ten minutes. So then your early segue is "but look at the surprising result you might obtain if you relax your knee-jerk fetish for zero bias" and _then_ I would have settled in to enjoy the ride, exactly as you steered it.

    • @afterthesmash
      @afterthesmash 1 year ago

      @@mathemaniac I had to get that first point out of my system, before I could gather my thoughts about the other aspect of this that was driving me nuts.
      It was pretty clear to me from early on that if your combined least squares estimator imposed a Euclidean metric, that you could win the battle on the kind of volumetric consideration we ended up with. I am _totally_ schooled on the volumetric paradox of high-dimensional spaces (e.g. all random pairs of points, on average, become equidistant in the limit; I usually visualize this as vertices of discrete hypercubes, with distance determined by bit vector difference counting - it's my view of continuous mathematics that has degraded greatly since the time of Moses).
      But then I had a minor additional scream: why should our combined estimator be allowed to impose a Euclidean metric on this problem space? When did this arranged marriage with Euclid first transpire, and why wasn't I notified? Did Gauss himself ever apply least squares with a Euclidean overlay informed by independent free parameters? It seems to me that if you just have many instances of the same thing with a _shared_ free parameter, and complete indifference about where your error falls, this amounts to an obvious heuristic, without much need for additional justification.
      But then when you have independent free parameters, the unexpected arrival of a Euclidean metric space needs to be thoroughly frisked at first contact, like Miracle Max, before entering Thunderdome, to possibly revive the losing contestant.
      Tina Turner: "True Love". You heard him? You could not ask for a more noble cause than that.
      Miracle Max: What's love got to do with it? But in any case that’s not what he said-he distinctly said “To blave”-
      Valerie: Liar! Liar! Liar!
      Miracle Max: And besides, my impetuous harridan, he was worked over by a chainsaw strung from a bungee cord, and now most of his body is scattered around like pink wedding confetti.
      Valerie: Ah, shucks.

    • @afterthesmash
      @afterthesmash 1 year ago

      @@mathemaniac Final comment, sorry for the many fragments.
      1) you're willing to sell bias up the river (but only for a good price)
      2) you're in an Escherian problem domain where a higher order moment is fixed in stone by some magic incantation (e.g. Excalibur) while a lower order moment is anybody's guess
      3) you don't find it odd that your aggregated error function imposes a Euclidean metric space
      then
      4) you arrive at this weird, counterintuitive, nay, positively _paradoxical_ result
      But, actually, for me, by the time I've swallowed all three numbered swords, any lingering whiff of paradox has left the building with all limbs normally attached.

    • @mathemaniac
      @mathemaniac  1 year ago +1

      @@afterthesmash Re: the variance point. If you use a lot of data points to estimate the mean for each distribution, then you will still be able to obtain an estimation of variance, and use that to construct the (modified) James-Stein estimator, and it will still dominate the ordinary estimator. More details on the Wikipedia page for James-Stein estimator.

  • @asdf56790
    @asdf56790 1 year ago +4

    What a great video! For me you perfectly hit the pace. I was never bored but still didn't need to rewatch sections because they were too fast.
    This is one of those beautiful paradoxes which you can't believe if you haven't seen the explanation.

  • @ssvis2
    @ssvis2 1 year ago +32

    This is a great explanation of estimators and non-intuitive relations. I like that you highlighted its importance in machine learning. It would be worth doing another video about how the variance/bias relation and subsequent weightings adjustments affect those models, especially in the context of overfitting.

    • @mathemaniac
      @mathemaniac  1 year ago +4

      Will have to think about how to do it though... thanks for the suggestion.

  • @tanvach
    @tanvach 1 year ago +11

    I think the reason shrinkage isn’t widely discussed is that choosing MSE as a metric for goodness of parameter estimation is an arbitrary choice. It makes sense that introducing this metric would couple the individual estimations together, so it’s not really a paradox (in hindsight). In some sense, you want to see how well the model works, not how accurate the parameters are, since a model is usually too simplistic. But I do see this used in econometrics.
    I think I’m seeing more L1 norm used in deep learning as the regularizer; I wonder what form of shrinkage factor that will have?
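On the L1 question, a small sketch for the simplest setting only: a single parameter with a squared-error fit term, which is an assumption and much simpler than a deep network. In that setting the L2 penalty gives multiplicative shrinkage, while the L1 penalty gives soft thresholding, which subtracts a constant and clips at zero rather than multiplying by a factor. The function names are illustrative.

```python
def ridge_shrink(x, lam):
    """Minimiser of 0.5*(x - t)**2 + 0.5*lam*t**2 over t: proportional shrinkage."""
    return x / (1.0 + lam)

def soft_threshold(x, lam):
    """Minimiser of 0.5*(x - t)**2 + lam*abs(t) over t: subtract lam, clip at zero."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

# Compare both rules on a few observations with a hypothetical penalty of 0.5.
for x in [-3.0, -0.4, 0.0, 0.4, 3.0]:
    print(x, ridge_shrink(x, 0.5), soft_threshold(x, 0.5))
```

Small observations are set exactly to zero by the L1 rule, whereas the L2 rule only scales them down, which is why L1 penalties are associated with sparse estimates.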

    • @eugeybear
      @eugeybear 1 year ago +3

      I was wondering the same thing. The paradox seems to arise from the fact that our error is calculated using an L2 metric, but the two coordinates are being treated independently.
      Aside from wondering how using an L1 norm would affect this, I was also thinking that rather than using two independent normal distributions whether this paradox would still exist if we used a 2-dimensional gaussian distribution. Because in this case, all points with the same distance from the center would now all have the same probability, which wouldn't be true using two independent normal distributions.

    • @nodrance
      @nodrance 10 months ago

      I was thinking the same thing. This isn't a better estimation, this is a trick that takes advantage of how we measure things.

  • @amphicorp4725
    @amphicorp4725 1 year ago +5

    I kept forgetting that the distributions were unrelated and every time I remembered, it blew my mind. Absolutely fantastic video

  • @frankjohnson123
    @frankjohnson123 1 year ago +80

    Statistics seems to shun elegance for practicality more than most branches of mathematics. The ordinary estimator is clean and intuitive while the James-Stein one is like a machine held together by duct tape, yet the latter works better in many cases.

    • @Wence42
      @Wence42 1 year ago +10

      I feel like you might be missing out on something if the James-Stein Estimator doesn't seem elegant by the end of this video.
      I would say this formula is more transparent in terms of what it does and why it works than most of the stuff we memorize in algebra.
      It is entirely possible I'm the weird one for looking at this and thinking "yeah, that looks like the right way." Different brains understand things in different ways.

    • @matthewliu1800
      @matthewliu1800 1 year ago +28

      No, the James-Stein estimator is biased and practically useless. Note that it doesn't matter which point you shrink towards, it will lower the error. That by itself should tell you how ridiculous this is.
      What we are truly looking for is the minimum-variance unbiased estimator. That is the definition of the "best" estimator.
      All this video shows is that MSE is insufficient to determine the best estimator. There are biased estimators with less MSE than unbiased ones.
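The remark that the shrinkage target does not matter can be checked with a simulation sketch. It assumes three unit-variance normal observations, made-up true means and three arbitrary fixed targets; `shrink_toward` and `total_mse` are illustrative names. The same simulated observations are reused for every target so that the comparison is paired.

```python
import random

def shrink_toward(x, target):
    """James-Stein-style shrinkage of x toward a fixed point (unit variance assumed)."""
    p = len(x)
    diff = [xi - ti for xi, ti in zip(x, target)]
    dist_sq = sum(d * d for d in diff)
    factor = 1.0 - (p - 2) / dist_sq
    return [ti + factor * d for ti, d in zip(target, diff)]

def total_mse(theta, target, trials=50000, seed=1):
    """Monte Carlo total squared error of the raw observation and the shrunk estimate."""
    rng = random.Random(seed)
    raw_total = 0.0
    shrunk_total = 0.0
    for _ in range(trials):
        x = [rng.gauss(t, 1.0) for t in theta]
        est = shrink_toward(x, target)
        raw_total += sum((xi - t) ** 2 for xi, t in zip(x, theta))
        shrunk_total += sum((ei - t) ** 2 for ei, t in zip(est, theta))
    return raw_total / trials, shrunk_total / trials

theta = [1.0, -0.5, 0.5]  # hypothetical true means
results = {}
for target in [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0), (-1.0, 4.0, 0.0)]:
    results[target] = total_mse(theta, target)
    raw, shrunk = results[target]
    print(target, "raw:", round(raw, 3), "shrunk:", round(shrunk, 3))
```

The size of the gain depends on how close the target happens to be to the true means: a far-away target makes the shrinkage factor close to 1, so the estimate barely differs from the raw observation.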

    • @extagram
      @extagram 1 year ago +13

      @@matthewliu1800 Really reminded me of Goodhart's law here: "When a measure becomes a target, it ceases to be a good measure." The James-Stein estimator chases the target of being the "best" estimator, which resulted in the failure of this "best" estimator.

    • @panner11
      @panner11 1 year ago +3

      @@matthewliu1800 Of course the James-Stein estimator is very rough and rudimentary, but the point of the video is how it served as inspiration for the idea of the bias-variance tradeoff. So back to the point of elegance vs practicality. The minimum-variance unbiased estimator might be what you are "looking" for, but in reality that is just a conceptual dream. The bias-variance tradeoff, and how widely it is used for regularization in real-world machine learning applications, is the practical part that can't be dismissed and is already applied everywhere.

  • @mingliangang8221
    @mingliangang8221 1 year ago +71

    It is pretty awesome that you are covering one of the most counterintuitive examples in statistics. This example motivates many exciting ideas in modern statistics like empirical Bayes. Keep up the good work.

    • @mathemaniac
      @mathemaniac  1 year ago +14

      Originally Stein's paradox was just a bit of a footnote in my class in statistics, but when I dived a little bit deeper into it, it is actually a much bigger deal than I first thought, so I decided to share it here!

    • @mingliangang8221
      @mingliangang8221 1 year ago +7

      @@mathemaniac Yup, it is. Maybe next time, you can cover something from Stein as well, like Stein's identity, which is a pretty powerful tool for proving the central limit theorem and its generalisations. Sadly, there aren't many videos explaining it to a wider audience except to other graduate students.

    • @randyzeitman1354
      @randyzeitman1354 1 year ago +3

      I’m a layman but this doesn’t seem counterintuitive because the distributions are the same. So what if they’re unrelated … they share the same reality. Are you surprised that mass is measured the same way for a rock or water? It’s simply recursive…the more data sets you have the more likely one of the points will be to center. It’s a weighted distribution of a normal distribution.

    • @mingliangang8221
      @mingliangang8221 1 year ago +11

      @@randyzeitman1354 I am not entirely sure what you mean by "sharing the same reality" and the "weighted distribution of a normal distribution". However, this estimator would work when x_1, x_2, x_3 come from different datasets. For example, X_1 can be from a dataset for the height of buildings, X_2 can be from a dataset for the average lifetime of a fly and X_3 can be from a dataset of the number of times a cat meows. If we want to find the average of each of these datasets, it turns out it is better to use the James-Stein estimator than to take the average of each of these things. That is what makes it counterintuitive for me. I would like to hear your intuition, though.

  • @GeorgeZoto
    @GeorgeZoto 10 months ago

    Excellent content, research, pace and presentation. Thank you for putting this together and explaining it in simpler terms than the paper :)

  • @ahmad_asep
    @ahmad_asep 1 year ago +2

    Nice video! I have studied machine learning since 2014, I have heard the term "bias-variance tradeoff" multiple times and only now I understand. Thank you so much for the explanation.

  • @JamesSCavenaugh
    @JamesSCavenaugh 1 year ago +9

    This was my first time to encounter Mathemaniac, and I was impressed with this video. Good job!

  • @stevepittman3770
    @stevepittman3770 1 year ago +25

    I have to admit that as someone not very familiar with statistics I was starting to get lost until you got to the 2D vs 3D visualization and I immediately grasped what was going on. That was an excellent way to explain it, and reminded me a lot of 3blue1brown's visual maths videos.

  • @nikolasscholz7983
    @nikolasscholz7983 1 year ago +43

    The paradox stopped feeling paradoxical to me as soon as I realised that it all comes from adding all the errors together with equal weights. That already assumes that the estimated values are all on the same scale and are worth the same. There are not many more steps from there to assuming all the samples estimate the same value.
    We could for example have had one estimated value being in the magnitude of 10^24 and the other around 10^-24 and one would clearly decide against just adding the estimation errors together like one does here.

    • @vishesh0512
      @vishesh0512 1 year ago +9

      The variance from the mean is the same for all (1). So even if one mean is 10^24, the samples you collect will most likely be within +/- 1. And similarly the 10^-24 guy will still give you samples in 10^-24 +/- 1

    • @vishesh0512
      @vishesh0512 1 year ago +7

      The reason the Stein guy performs better is that the error is the sum of 3 things. And there is a way to adjust your "estimator" so that it isn't the best for any one of the 3 variables, but the total is still less.

    • @nikolasscholz7983
      @nikolasscholz7983 1 year ago +3

      @@vishesh0512 oh yeah you're right, I forgot the fact that the variance of each is 1. Thank you, your explanation is better.
      That does make the JS estimator pretty powerful though. Even though one could think of other ways of combining the errors other than summing, summing seems to be the very obvious choice.

    • @vinny5004
      @vinny5004 1 year ago +16

      Yes. The OP kept saying “completely independent distributions,” but that is an inaccurate description of the problem. A vector in n-dims is a single object, not the same as n separate distributions on n axes. The latter has nothing to do with Stein’s paradox, and actually the way this video begins is incorrect and does have an answer of the naive estimates as presented.

    • @vinny5004
      @vinny5004 1 year ago +16

      In fact, one can even read on Wikipedia: “In practical terms, if the combined error is in fact of interest, then a combined estimator should be used, even if the underlying parameters are independent. If one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse.” For a 21+ min video, you would think the author would at least spend the effort to accurately present the problem at the beginning.
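
A minimal simulation sketch of the distinction quoted above, assuming unit-variance normal observations and hypothetical true means (3, 0, 0) chosen only for illustration. It estimates the per-coordinate and total mean squared error of the raw observation and of the plain James-Stein estimate. The names `james_stein` and `simulate` are illustrative and not taken from the video.

```python
import random


def james_stein(x):
    """Plain James-Stein estimate shrinking toward the origin (unit variance assumed)."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (p - 2) / norm_sq
    return [factor * v for v in x]


def simulate(mu, trials=100_000, seed=0):
    """Return per-coordinate MSE of the raw observation and of James-Stein."""
    rng = random.Random(seed)
    p = len(mu)
    raw_sq = [0.0] * p
    js_sq = [0.0] * p
    for _ in range(trials):
        # One observation per coordinate, each with standard deviation 1.
        x = [rng.gauss(m, 1.0) for m in mu]
        est = james_stein(x)
        for i in range(p):
            raw_sq[i] += (x[i] - mu[i]) ** 2
            js_sq[i] += (est[i] - mu[i]) ** 2
    return [s / trials for s in raw_sq], [s / trials for s in js_sq]


# Hypothetical true means: one far from the origin, two at the origin.
raw_mse, js_mse = simulate([3.0, 0.0, 0.0])
print("per-coordinate MSE, raw:", [round(v, 3) for v in raw_mse])
print("per-coordinate MSE, JS: ", [round(v, 3) for v in js_mse])
print("total MSE, raw vs JS:   ", round(sum(raw_mse), 3), round(sum(js_mse), 3))
```

With these means, the total error should come out lower for James-Stein while the error on the first coordinate alone comes out higher, which is the combined-versus-individual point made in the quote.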

  • @kel3747
    @kel3747 10 months ago

    Currently studying ML and went over Thompson Sampling recently. This is a great video as I immediately saw the similarities and was able to follow along even though I knew nothing about ML before I got started. Definitely subscribing.

  • @scraps7624
    @scraps7624 1 year ago +87

    This is a masterclass in how to teach statistics, absolutely incredible work. Scripting, visualization, pacing, everything was on point

  • @haritoshpatel4216
    @haritoshpatel4216 1 year ago +2

    This is a well-made video. Clear visualizations and amazing explanation. Keep it up

  • @amaarquadri
    @amaarquadri 1 year ago +8

    This is one of the most counterintuitive things I've ever seen! Statistics is crazy.

  • @Ewuilibrium
    @Ewuilibrium 1 year ago

    Thanks for the video, I learned something new. I thought it was really interesting seeing the generalized formula for the MSE being derived from the variance formulas I learned in school, and the visualizations helped make the bias-variance tradeoff make intuitive sense.

  • @robbielualhati1731
    @robbielualhati1731 1 year ago +4

    Incredible video! I never fully understood why regularisation works especially with penalised regression but this video explains it very well.

  • @gerrychen
    @gerrychen 1 year ago +1

    Amazing video - perfectly paced and exactly the right amount of background info!

  • @TRex-fu7bt
    @TRex-fu7bt 1 year ago +1

    Ooh I use a lot of smoothing/shrinkage stats models and have seen the JS estimator mentioned a few times in my reference books. Excited to see a cool video about it.

    • @TRex-fu7bt
      @TRex-fu7bt 1 year ago

      The original baseball example (that you link to in the description) is still really good. The players’ batting averages are independent and a player’s past performance should be the best predictor of their future performance but the shrinkage smooths some noise out.

  • @anibalismaelfermandois6943
    @anibalismaelfermandois6943 1 year ago +104

    Really great video, incredibly paced. The question that occurred to me is: Are we just abusing the definition of mean square error past its useful/intended use? Are we sure that lowering it is ALWAYS desirable?

    • @jsupim1
      @jsupim1 1 year ago +7

      Good point. I think it's pointless to minimize the mse if the estimator you are using is biased (the James-Stein estimator is).

    • @chrislankford7939
      @chrislankford7939 1 year ago +46

      @@jsupim1 This is a really naive thought that, sadly, pervades much of even professional science. While I can see your thinking on this in the context of a "broad-use" estimator like James-Stein--I disagree, but I see it--this thought simply falls apart when applied to a more nuanced scenario.
      Imagine a situation where you want to use relatively little data to infer something about a highly complex system. Say, data from an MRI to infer something about brain vasculature. There are dozens upon dozens of parameters that might affect even the simplest model of blood flow in the brain: vessel size distributions, arterial/venous blood pressure, blood viscosity, body temperature, and mental and physical activity levels. If you leave all of those as fitted, unbiased parameters, you do not have enough information to solve the inverse problem and retrieve your answer. (For the sake of argument, let's say average vessel size is what you're interested in.) So the unbiased estimator totally fails, as the mse is many times larger than the parameters.
      Now open up the idea of parametric constraint, a special case of the broader "regularization" described in this video. Let's say you measure blood pressure before someone enters the scanner, use 37C for temperature, go to literature to find the average blood viscosity, and assume all vessels are one unknown size in a small region. None of these will be _exactly accurate_ to the patient during the scan. What you've done is created a biased estimator that might just be able to work out the one thing you're interested in: average vessel size. Unless your guesses are very, very wrong, it will almost certainly have a lower vessel size mse than the unbiased estimator.

    • @phatrickmoore
      @phatrickmoore 1 year ago +16

      Thank you, this is exactly how I feel. If MSE leads us to use information from non-correlated, independent distributions to make deductions on the one under focus, that means MSE is wrong. That needs to be an axiom of statistics or something. Valid error systems cannot have dominant approximators that use info from outside, non-correlated systems.

    • @phatrickmoore
      @phatrickmoore 1 year ago +10

      @@chrislankford7939 all of those distributions will be correlated, so your example doesn’t apply.

    • @simongunkel7457
      @simongunkel7457 1 year ago +3

      @@phatrickmoore I think your intuition leads you astray, just consider genetic algorithms for optimization problems. These can often outperform any deterministic approach, even though they use stochasticity (hence random variables drawn from distributions that are independent from the optimization problem).

  • @xorenpetrosyan2879
    @xorenpetrosyan2879 1 year ago +26

    Such a cool video. I am a Machine Learning engineer and use regularisation techniques like shrinkage daily, yet I didn't know its origins were rooted in a paradox!

    • @mathemaniac
      @mathemaniac  1 year ago +4

      Great to hear!

    • @klausstock8020
      @klausstock8020 1 year ago +18

      Never did anything like "shrinkage", and didn't get how all of this connects with machine learning. Until 45 seconds before the end, when suddenly all the pieces connected and I realized that I had been using shrinkage. And that the five-dimensional data in the database (which gets aggregated into four-dimensional data, which is then fed into the ML algorithm as a two-dimensional field) actually consists of 50,000-dimensional vectors. Ah, yes, the happy blissfully unaware life of an engineer!
      Anecdotal evidence:
      A group of engineers and a group of mathematicians meet on a train, both travelling to a congress. The engineers are surprised to learn that the mathematicians only bought one ticket for the whole group of mathematicians, but the mathematicians won't explain.
      Suddenly, one mathematician yells "conductor!". All mathematicians run to the toilet and cram themselves into the tiny room before locking the door. The conductor appears, checks the tickets of the engineers and then goes to the toilet, knocks at the door and says "ticket, please!". The mathematicians slide their single ticket under the door to the conductor, and the conductor leaves, satisfied.
      When the mathematicians return to the group of engineers, the engineers compliment the mathematicians on their method and say that they will use it themselves on the return trip.
      On the return trip, the engineers arrive with their single ticket, but are surprised to learn that the mathematicians had bought no ticket at all this time.
      Suddenly, one mathematician yells "conductor!". All engineers run to the toilet and cram themselves into the tiny room before locking the door. One mathematician walks to the toilet, knocks at the door and says "ticket, please!".
      TL;DR version: the engineers use the methods of the mathematicians, but they don't understand them.

    • @newerstillimproved
      @newerstillimproved 1 year ago +3

      @@klausstock8020 This joke made the video all the more worthwhile.

    • @TUMENG-TSUNGF
      @TUMENG-TSUNGF 1 year ago +2

      @@klausstock8020 Good story! I had thought the mathematicians would cram into the same bathroom with the engineers, but the actual ending was even more brilliant!

  • @rserserserse
    @rserserserse 1 year ago +1

    I saw a talk on this at my uni about a year ago. This paradox is so fascinating imo

  • @dananskidolf
    @dananskidolf 1 year ago +2

    The way hypervolumes have such dense neighbourhoods seems to be very interesting and useful in many places - I suspected it'd be involved as soon as you mentioned 'in 3 or more dimensions'. And that stems from a little personal experience I had.
    I was working on a quality optimisation computation in 32 dimensions a while ago and opted to use a simulated annealing algorithm, on a hunch that stochastic algorithms would scale best in this higher number of dimensions.
    I had to laugh when trying to figure out a sensible distance function (used to govern how far the sample picker would jump in an iteration). We had felt overwhelmed by the size of the sample space since the start, but I began to realise that all these trillions of coordinates were in fact within only a few nearest neighbours of each other.

  • @johanneshendriks9602
    @johanneshendriks9602 1 year ago

    Really great video and some great intuition. I did feel that one extra concept could have been added. The concept of a "typical set" for probability distributions. For example, for a high dimensional Gaussian distribution the typical set ends up being a shell like volume some distance away from the mean. This could add to the explanation as to why taking just the point is not ideal, and also as to why it's more 'likely' that you will be in the 'far end' rather than the 'near end'
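
A rough sketch of the shell idea described above, assuming a standard normal vector in p dimensions (the function name `gaussian_norms` is hypothetical). It samples such vectors and prints how far they land from the mean; in high dimensions the distances should cluster near sqrt(p) instead of near zero.

```python
import math
import random


def gaussian_norms(p, samples=2000, seed=0):
    """Distances from the mean for samples of a p-dimensional standard normal."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p)))
        for _ in range(samples)
    ]


for p in (1, 3, 100):
    norms = gaussian_norms(p)
    mean = sum(norms) / len(norms)
    print(f"p={p:>3}: mean distance {mean:.2f}, "
          f"min {min(norms):.2f}, sqrt(p) {math.sqrt(p):.2f}")
```

For p = 100 essentially no sample lands near the mean itself, which is one way to see why the observed point tends to sit on the "far end" in high dimensions.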

  • @mrbeancanman
    @mrbeancanman 1 year ago +2

    never knew the link between shrinkage and regularisation... good stuff.

  • @miguelcampos867
    @miguelcampos867 1 year ago +1

    Amazing video. What comes next? Can't wait for it

  • @ostrodmit
    @ostrodmit 1 year ago +1

    I like to give deriving the James-Stein estimator as a homework problem when teaching Math 541b at USC. Cool stuff!

  • @fluffigverbimmelt
    @fluffigverbimmelt 1 year ago +41

    I found it a bit funny how recently statistics has become interesting (again), by referring to machine learning.
    But hands down: Great concept of two channels for "the engineer version" as well as the full details and your general style of teaching.
    Very understandable, good to grasp and intriguing. Subbed

    • @42isthemeaningoflife
      @42isthemeaningoflife 11 months ago

      It was always interesting to us scientists and people who are interested in making empirical deductions. Transformer models aren't the only reason to be interested in statistics.

  • @henriquemagalhaessoares8739
    @henriquemagalhaessoares8739 1 year ago +15

    I've been using regularization on a daily basis and this is the best explanation on why shrinkage might be desirable I've ever seen. Bravo.

    • @mathemaniac
      @mathemaniac  1 year ago

      Great to hear!

    • @switen
      @switen 6 months ago

      As a male who swims in cold water, I agree.

  • @russellsharpe288
    @russellsharpe288 1 year ago +35

    I haven't thought about this in detail at all, but is this counterintuitive result dependent on the use of the mean squared error? Would it be avoided if one used eg the mean absolute error instead? (If so, doesn't it amount to a reductio ad absurdum refutation of the use of mean squared error?)

    • @coreyyanofsky
      @coreyyanofsky 1 year ago +15

      It happens because MSE treats errors in each parameter as comparable. If you think about actually estimating quantities of interest you'll see that the MSE as expressed here isn't dimensionally consistent: there's an implicit conversion factor that says that whatever the variance in the individual components is, that sets the scale for how errors in different components are traded off against one another. It's the way this trading off of errors in the different components works that leads to the shrinkage estimator dominating the maximum likelihood estimator. I haven't checked, but using mean absolute error would require the same trading off of estimation errors, so I'd expect to have a James-Stein-style result with that loss function too.

    • @terdragontra8900
      @terdragontra8900 1 year ago +1

      @@coreyyanofsky If you had some data set where errors in dimensions aren't comparable because, say, you weigh error twice as heavily in x_1 than in x_2, then you can just scale x_1 by a factor of two and try to estimate 2mu_1, and the paradox still happens. I suppose instead you may be completely unwilling to compare the dimensions, but then "best estimator" for the set is meaningless. This is strange.

    • @coreyyanofsky
      @coreyyanofsky 1 year ago +3

      @@terdragontra8900 If you change the weighting so that you're no longer variance 1 in some component then the loss function is weighted MSE and the sphere in the video becomes an ellipsoid; this will make the math more complicated for no real gain because the JS phenomenon was supposed to be a counter-example of sorts and not applied statistics.

    • @SolomonUcko
      @SolomonUcko 1 year ago +2

      Wouldn't reweighting the MSE just lead to a weighted JS estimator?

    • @orangereplyer
      @orangereplyer 1 year ago +1

      I think the key insight is that, in higher dimensions, it's not like you're getting a better estimate *for each separate dimension* than you would've if you'd estimated each separately. But the, like, "length" of the error vector will be less.
      The problem might be how we ought to be interpreting that length.

  • @jan.kowalski
    @jan.kowalski 1 year ago +1

    One of the best teaching experiences. Amazing!

  • @raywang5619
    @raywang5619 1 year ago

    Fantastic intuition elaboration. Thank you so much

  • @PunmasterSTP
    @PunmasterSTP 1 year ago +11

    This just blew my mind. I kept expecting to see some disclaimer come up that would relegate this paradox to purely an academic context. But dang, this concept is incredible!

  • @cmilkau
    @cmilkau 1 year ago +4

    The fact that this method treats the origin as special should already be a red flag that something is off. The only thing that can be off is the way we measure how "good" an estimator is. There are several options that seem equally valid. Why do we take the square deviation? Why do we take the sum of the expected values? Why not the expected value of the Euclidean norm of the deviation? Or maybe we shouldn't take any squares at all?

    • @mathemaniac
      @mathemaniac  1 year ago +3

      It does not need to be the origin - you can equally shrink towards some other point (but pre-picked), James-Stein estimator still dominates the ordinary estimator.
      As to the mean squared error, I agree that this is somewhat arbitrary, but it is partly due to convenience - the calculations would be, normally, the easiest if we just take the squares; and without these calculations, we wouldn't be able to verify that James-Stein is indeed better. But if you adopt the view of Bayesian statistics, then mean squared error has a meaning there - by minimising it, you are taking the mean of the posterior distribution.

    • @djtwo2
      @djtwo2 1 year ago +3

      The relevance of "mean square error" here can be considered in the context of the probability distribution of the error. Applying the shrinkage moves part of the range of errors towards zero, while moving a small part of the range away from zero. The "mean squared error" measure doesn't care enough about the few possibly extremely large errors resulting from being moved away from zero to counterbalance the apparent benefit from moving some errors towards zero. But any other measure of goodness of estimation has the same problem. There are approaches to ranking estimation methods (other ways of defining "dominance") that are based on the whole of the probability distribution of errors, not just a summary statistic. The practical worry here is that a revised estimation formula can occasionally produce extremely poor estimates, as is illustrated in the slides in this video.

    • @cmilkau
      @cmilkau 1 year ago

      @@djtwo2 That's what the video itself says.
      But there is no explanation given for that awkward quality metric over several dimensions. It's just a sum over each dimension without any further justification. Honestly, I would expect a norm on the higher-dimensional space at the bottom of the formula, then taking the expectation of the squares like in 1D. But that's not what's happening. I mean, expectation is a linear operator, so it may boil down to the Euclidean norm.

  • @alangivre2474
    @alangivre2474 1 year ago +1

    You are exceptionally clear!!!!! I hope this channel grows!!!

    • @mathemaniac
      @mathemaniac  1 year ago +1

      Thank you so much!

    • @alangivre2474
      @alangivre2474 1 year ago

      @@mathemaniac I am doing my PhD in Information Theory in Biophysics and I have never heard about this estimator!! Very enriching.

  • @MDMAx
    @MDMAx 1 year ago +6

    Idk what I expected by watching it or why I watched it having a nonexistent education of statistics.
    At least now I know that I don't understand yet another semi-complicated concept in this universe.
    Judging by the comments you did a decent job of explaining and visualizing this topic.
    Keep up with the good effort!

  • @4dtoaster819
    @4dtoaster819 1 year ago +1

    There is something satisfying about an idea going from ridiculous to obvious in a short span of time.

  • @kylewilson6425
    @kylewilson6425 1 year ago +1

    Great demonstration! You've earned a subscriber! Thank you very much! 👍

  • @Icenri
    @Icenri 1 year ago +5

    It made sense to me that the variance was the cause of the paradox but the real reason is mind boggling.

  • @nvs3221
    @nvs3221 1 year ago +6

    Awesome video, would love some more statistics content. Pure maths people don't pay it enough respect :)

  • @damonjalali8669
    @damonjalali8669 1 year ago

    Ohh fantastic!! This video tutorial is really interesting and amazing. Thanks a lot.

  • @dima_math
    @dima_math 1 year ago +1

    Congratulations on 100K! You are the best!

  • @spillfish4327
    @spillfish4327 1 year ago

    I’m studying MAS-I right now and this was super helpful!

  • @Fred-yq3fs
    @Fred-yq3fs 1 year ago +5

    very unintuitive. Outstanding content. Thought provoking. Love it! Keep it up.

  • @kasuha
    @kasuha 1 year ago +4

    What disturbs me about this method is that it is not scale invariant. Let's say we have three random measurements of distance, 1 m, 2 m, and 3 m. Then the estimates would be 0.92, 1.85, and 2.78. But if we express the same measurements in feet, calculate the estimates and then convert them back to meters, they will be 0.99, 1.98, and 2.98. That does not sound right. Or did I miss something?

    • @coreyyanofsky
      @coreyyanofsky 1 year ago +4

      The MSE as expressed in the video is dimensionally inconsistent for measurements with units. Implicitly the variance is setting the scale here -- you measure in units such that the standard deviation is 1, and this scaling eats the units.

    • @sternmg
      @sternmg 1 year ago +3

      The estimator requires that all component quantities be normalized, i.e., to be dimensionless and have variance 1. This means real-world input components must all be scaled as x_i := x_i/σ_i, which means that all component _variances must be known beforehand_ . That is not exactly practical and also makes the estimator less miraculous.
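
A small sketch of the units issue raised in this thread, assuming a known common standard deviation of 1 m for all three measurements (that value, the `sigma` parameter and the function name are assumptions made for illustration). It reproduces the metre and feet numbers from the question and shows that putting sigma squared in the numerator removes the dependence on units.

```python
FT_PER_M = 3.28084  # feet per metre


def james_stein(x, sigma=1.0):
    """James-Stein estimate for observations with a known common standard deviation."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (p - 2) * sigma ** 2 / norm_sq
    return [factor * v for v in x]


metres = [1.0, 2.0, 3.0]
feet = [v * FT_PER_M for v in metres]

# Working in metres with sigma = 1 m (assumed).
in_metres = james_stein(metres)
# Working in feet but leaving sigma at 1, i.e. wrongly treating it as 1 ft.
naive_feet = [v / FT_PER_M for v in james_stein(feet)]
# Working in feet with sigma converted to feet as well.
scaled_feet = [v / FT_PER_M for v in james_stein(feet, sigma=FT_PER_M)]

print("metres:               ", [round(v, 3) for v in in_metres])
print("feet, sigma left at 1:", [round(v, 3) for v in naive_feet])
print("feet, sigma rescaled: ", [round(v, 3) for v in scaled_feet])
```

The first and third lines agree, so the mismatch in the question comes from changing the units of the data without changing the units of the standard deviation.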

    • @mathemaniac
      @mathemaniac  1 year ago +2

      You can use the usual estimate for the variances (if you have more data points, in which case the means still follow a normal distribution, just with different variances), and the James-Stein estimator still dominates the ordinary estimate, so you don't actually have to know the variances.

  • @michaelhiggins9188
    @michaelhiggins9188 1 year ago +4

    Congratulations on reaching 100 K subscribers! I think this channel will continue to grow because the content is very high quality and there aren't many like this.

  • @johnchessant3012
    @johnchessant3012 1 year ago +31

    That's a really cool paradox, great video!
    Question about the "best estimator": Would this definition mean always guessing 7 is also an admissible estimator because no other estimator can have mean squared error = 0 in the case that the actual mean is 7?

    • @mathemaniac
      @mathemaniac  1 year ago +21

      Yes! I originally wanted to say this in the video but decided against it to make it a bit more concise. Indeed, your observation adds fire to the anger of those statisticians who really believed in Fisher - admissibility (what I called "best" estimator) is a weak criterion for estimators, but our ordinary estimate fails this!

    • @leif1075
      @leif1075 1 year ago +3

      @@mathemaniac around 14:30 you just mean a larger distance results in smaller shrinkage because, since the denominator is getting larger, the entire term p minus 2 over that distance will shrink since the numerator stays the same... that's all you meant, right?

    • @mathemaniac
      @mathemaniac  1 year ago +4

      @@leif1075 Yes - if the original distance is large, then the absolute reduction in distance will be small, because the original distance is in the denominator.

    • @viliml2763
      @viliml2763 1 year ago +2

      @@mathemaniac I read somewhere that the James-Stein estimator is itself also inadmissible. Is there any "good" admissible estimator?

  • @hellohey8088
    @hellohey8088 1 year ago +1

    Nice video. I guess the graphical explanation for how the JS estimator "might" work does not apply if the shrinkage factor is negative. I wonder if there is an intuitive explanation for the case when the shrinkage factor is negative too?

  • @adrienadrien5940
    @adrienadrien5940 1 year ago +1

    All this paradox comes from trying to minimize the squared errors.
    The squared errors are used mostly because they are easy to compute for most classical statistical laws and they fit pretty well with most minimization algorithms. But in the real world, in many cases, one will be more interested in the average absolute error instead of the squared error.
    I think the "paradox" is there: we are using an arbitrary metric, and we never question it.
    When I used to be a quantitative analyst I often used the absolute value instead of the square for error minimization. I found the results way more relevant despite some slight difficulty in running some algorithms.

  • @NewtonianT
    @NewtonianT 1 year ago +2

    Nice video. Could you recommend a book to begin to understand statistics and probability?

  • @nathanoupresque4017
    @nathanoupresque4017 1 year ago +1

    Since the problem seems to me invariant by change of origin, one could also pull the estimated point towards another point than (0,0,0)? What would be the formula in this case?
    Should we replace the naive estimate by λ*naive_estimate+(1-λ)*shrinkage_target ; with λ being the shrinkage coefficient : (1 - 1/||naive_estimate - shrinkage_target||²)?
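
A minimal sketch of the formula proposed above, generalised from 3 to p dimensions by using p - 2 in the numerator as in the video's shrinkage factor. It assumes unit variance and a shrinkage target chosen before looking at the data; the function name `james_stein_toward` and the example numbers are hypothetical.

```python
def james_stein_toward(x, target):
    """Shrink the observation x toward a pre-chosen target (unit variance assumed)."""
    p = len(x)
    dist_sq = sum((xi - ti) ** 2 for xi, ti in zip(x, target))
    lam = 1.0 - (p - 2) / dist_sq  # shrinkage coefficient
    return [lam * xi + (1.0 - lam) * ti for xi, ti in zip(x, target)]


# Hypothetical observation and target; x - target is (1, 2, 3).
x = [11.0, 12.0, 13.0]
target = [10.0, 10.0, 10.0]
print([round(v, 3) for v in james_stein_toward(x, target)])
```

Setting the target to the origin recovers the estimator discussed in the video, so this is the same shrinkage applied to x - target and then shifted back.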

  • @ziyangxie8607
    @ziyangxie8607 1 year ago +7

    A fantastic demonstration of the Stein's paradox. Literally one of the best math videos I've watched

  • @KpxUrz5745
    @KpxUrz5745 1 year ago +2

    Well-made video. Smartly written script. Interesting stuff.

  • @iliya-malecki
    @iliya-malecki 1 year ago +1

    please keep making these videos, you are great!

  • @112BALAGE112
    @112BALAGE112 1 year ago +3

    This is another great example of how higher dimensional space defies intuition.

  • @hwangsaessi2335
    @hwangsaessi2335 1 year ago +1

    Great video! Paradoxes like this are why I like the Bayesian formulation of estimation theory a lot; you can essentially also get regularization effects by choosing appropriate priors and estimators, but without many of the same conceptual pitfalls. (I am no math/statistics expert, but I do work with applied estimation so not a total layman either.)

  • @SapereAude625
    @SapereAude625 1 year ago

    I have actually enjoyed this video so much. Thank you!

  • @charliethomas6317
    @charliethomas6317 1 year ago

    In 1982 I contacted Dr. Efron at Stanford University and, with his help, used the JS estimates for stands of bottomland forest in Arkansas, Louisiana and? Mississippi. These stands were residual acres of valuable cypress and oaks?

  • @sternmg
    @sternmg 1 year ago +12

    To my physics-trained eyes, the formula at 3:00 looks incorrect or at least incomplete for general variables having units. Are all _x_ components expected to be dimensionless and normalized to σ_i = 1? But where would one get the σ_i from?

    • @frankjohnson123
      @frankjohnson123 1 year ago

      I believe all that's required is the inputs are dimensionless, so you can do the naïve thing and divide by the unit or be more precise by using some physical scale for that dimension if it's known.

    • @sternmg
      @sternmg 1 year ago +7

      Aha, on Wikipedia the James-Stein estimator is shown with σ² in the numerator, which would indeed take care of units and scale. Alas, this makes the estimator _dramatically less useful_ in real-world situations because it can only be applied if σ² is known _a priori_ .

    • @Pystro
      @Pystro 1 year ago +3

      I was thinking the same thing. If you wanted to define a shrinkage factor that works for data sets with variances that aren't normalized to 1, you'd need to explicitly write that into the equation. I.e. every time there's an x_i in the shrinkage factor, you'd replace it with x_i/sigma_i.
      One consequence is that the James Stein estimator can only be used if you know (or have an estimate for) the variance. And if you have only an estimate for the variance (which is the best you can hope for if you don't know the true distribution already), then that can deteriorate the quality of the estimator.

    • @mathemaniac
      @mathemaniac  1 year ago +10

      No, that's not true. Also on Wikipedia, you can apply the James-Stein estimator if the variance is unknown - you just replace it with the standard estimator of variance.

    • @coreyyanofsky
      @coreyyanofsky 1 year ago +7

      @@sternmg the JS phenomenon was only ever meant to be a counter-example of sorts, not applied statistics -- that's why they didn't bother defining an obvious improvement that dominates the JS estimator (to wit, the "positive-part JS estimator" that sets the estimate to zero when the shrinkage factor goes negative). If you want practical shrinkage methods use penalized maximum likelihood with L1 ("lasso") or L2 ("ridge") penalties (or both, "elastic net") or Bayes.
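
A short sketch of the positive-part variant described above, assuming unit variance and shrinkage toward the origin (the function name is hypothetical). The only change from the plain estimator is that a negative shrinkage factor is clipped to zero.

```python
def positive_part_james_stein(x):
    """James-Stein estimate with the shrinkage factor clipped at zero."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    factor = max(0.0, 1.0 - (p - 2) / norm_sq)
    return [factor * v for v in x]


# Far from the origin the factor is positive, so nothing changes.
print(positive_part_james_stein([1.0, 2.0, 3.0]))
# Close to the origin the plain factor would be negative, so the estimate is set to zero.
print(positive_part_james_stein([0.1, 0.1, 0.1]))
```

Without the clipping, an observation very close to the origin would be flipped and pushed far out on the other side, which is the failure case the positive part removes.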

  • @Anis_Hdd
    @Anis_Hdd 1 year ago +6

    I did my PhD on shrinkage estimators of a covariance matrix. This is the best vulgarization of Stein's paradox I have ever seen! Thanks

    • @toniokettner4821
      @toniokettner4821 1 year ago +3

      people might read the word "vulgar" and assume you're negatively criticizing the video

  • @inothernews
    @inothernews 1 year ago +5

    As a graduate student who has pored through countless math explanation youtube videos in the past years, this has to be one of the most beautiful! The writing, the story, the visuals, and the PACE --- all skillfully designed and executed. Definitely recommending this to my peers. Great fun to learn something new in this way. I appreciate your work greatly!

    • @mathemaniac
      @mathemaniac  1 year ago

      Thank you so much for the compliment! Really encouraging!

  • @justinlowenthal3208
    @justinlowenthal3208 1 year ago +6

    I am wondering…
    If I had a single measurement to estimate in one dimension, could I use a random number generator to create data sets in two more dimensions, then use the James-Stein estimator to get a more accurate result? Basically shoehorn the estimator into a one-dimensional problem?

    • @Smo1k
      @Smo1k 1 year ago +2

      Heh. Good thought, but nope: This is about the "best" overall guess for the whole set of variables with the same variance; there's no saying which mean you will have the biggest error guessing. If you think of the p-2 over the division line as your degrees of freedom, and you do the J-S equation for 4 numbers, then run a second number on each variable and remove the worst fit to get down to 3, chances are equal that it's the variable you wanted to shoehorn which gets tossed.

  • @ronalddobos8390
    @ronalddobos8390 1 year ago

    Amazing video! But I have one nitpicky comment:
    at 15:00 your arrows are misleading, the shrinkage factor is actually the same for the bottom left arrow and for the "near end" arrow

  • @miguelcampos867
    @miguelcampos867 1 year ago

    I would love the explanation of density estimation with normalizing flow

  • @chrislankford7939
    @chrislankford7939 1 year ago

    As much as I'd like to say my own work involving the bias-variance tradeoff is a must-read on the topic, the absolute MVP paper on this subject is:
    Kay, S and Eldar, YC. Rethinking Biased Estimation. IEEE Signal Processing Magazine. 2008.
    It's rooted in Steven Kay's excellent "Fundamentals of Statistical Signal Processing" textbook series and does some quick and dirty proofs of multiple biased estimators that are actually superior to their unbiased counterparts.

  • @charlesshaw223
    @charlesshaw223 1 year ago +1

    A very nice explanation. Subscribed.

  • @gowrissshanker9109
    @gowrissshanker9109 1 year ago +1

    Hello Sir, how is complex analysis useful in the special theory of relativity? (as you mentioned in your complex analysis intro video)
    Thank you

  • @rangjungyeshe
    @rangjungyeshe 1 year ago +1

    Interesting tutorial, but what on earth does the statement at 20:28 mean? Thanks.

  • @kylebowles9820
    @kylebowles9820 1 year ago

    I like how you can see in higher dimensions the volume of the error sphere becomes less relevant

  • @GerardSans
    @GerardSans 11 months ago

    It seems that the reasoning behind this is where the gains happen for errors (closer to the mean). For 2, each unknown mean can fall either on the right or the left, but when we introduce a third, it will fall to the right or the left, making it closer after applying the inverted proportion. For n=4, the new mean again falls either right or left, making the new value closer to all in the group where the mean is positioned right/left of the value, and so on. P-2 corrects the initial 2 best, and the squares allow for a conservative approach vs a straight sum or ^3.

  • @matteogirelli1023
    @matteogirelli1023 1 year ago

    For some very important statistical applications, though, we would never accept a biased estimator in exchange for more precision, for example when we want to make a causal inference.

  • @porglezomp7235
    @porglezomp7235 1 year ago +4

    As soon as you started talking about bias-variance tradeoff I started thinking about biased sampling in Monte Carlo methods (and in rendering in particular). Sometimes it's worth losing the eventual convergence guarantees of the unbiased estimators if it also kills the sampling noise that high variance introduces.

  • @ChrisContin
    @ChrisContin 1 year ago

    Statistics is all related, is the conclusion! Given more data, more conclusions can be made, is always true in statistics. Nice video!

  • @noplan113
    @noplan113 1 year ago +2

    I have a naive question about why this works: given the original setup, you basically draw numbers (mu) in the range from [-infinity,+infinity]. If all numbers are equally likely, the expected value for this drawing should be zero? Then we get a second piece of information, that is, the single confirmed value that we know for each distribution.
    Given that the expected value of all mus should be zero, can we just assume that it is more likely that the actual mu is slightly closer to zero than the number we know? However if you shrink too much you will also lose out on accuracy. Therefore there could be an optimal "amount" of shrinkage?
    Does this make sense?

    • @Temari_Virus
      @Temari_Virus 1 year ago

      I think the expected error will always be the same no matter what the shrinkage factor is? A uniform distribution is basically a straight line, so it'll look the same no matter how you stretch or shrink it.
      The variance of the distributions is (infinity - infinity) / 2 = ...dammit.
      Ok let's draw numbers from the range [-x, x] instead. So now the variance of the distributions is (x - x) / 2 = 0, which approaches 0 as x approaches infinity. The shrinkage factor basically multiplies this variance, and 0 multiplied by anything is still 0.
      (Don't quote me on this, I don't know much about statistics, but this just made sense to me)
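On the question of an optimal amount of shrinkage: for the family of estimators that shrink toward the origin by c/‖X‖², the mean squared error can be written down using Stein's lemma. This is the standard textbook calculation rather than anything taken from the video, and it assumes unit-variance normal observations in p ≥ 3 dimensions; the symbols δ_c and c are introduced here for illustration.

```latex
X \sim N_p(\theta, I), \qquad
\delta_c(X) = \left(1 - \frac{c}{\|X\|^2}\right) X

E\,\|\delta_c(X) - \theta\|^2
  = p - \bigl(2c(p-2) - c^2\bigr)\, E\!\left[\frac{1}{\|X\|^2}\right]
```

The bracket 2c(p-2) - c² is positive for 0 < c < 2(p-2) and largest at c = p-2, which is the James-Stein choice; c = 0 gives back the ordinary estimator with error p. No distribution on the means is assumed here: θ is treated as fixed but unknown.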

  • @praveenb9048
    @praveenb9048 1 year ago +2

    Has this principle been absorbed into other algorithms like the Kalman filter etc?

  • @Euler108
    @Euler108 1 year ago

    Is the same paradox still there if, in the definition of the Mean squared error, instead of taking the sum of the squares of errors you would take the maximum of those squares? Is my intuition right, and in the case of maximum the best estimator is the ordinary one?

  • @broccoloodle
    @broccoloodle 1 year ago +1

    Hi mathemaniac, I’ve just graduated with a bachelor's in computer science. Could you recommend some common textbooks for modern statistics?

  • @cmyk8964
    @cmyk8964 1 year ago +6

    It reminds me of the Curse of Dimensionality. Some stuff works well in 2D but not in higher dimensions.
    It’s like the “sphere between 1-unit spheres packing a 2-unit cube”. If you draw a circle that touches the inside of 4 unit circles forming a square, it would have a radius of √2-1 ≈ 0.414 units; if you draw a sphere that touches the inside of 8 unit spheres forming a cube, it would have a radius of √3-1 ≈ 0.732 units. But in 4D, the center hypersphere is the same size as the corner hyperspheres (√4-1=1), and in 5D and above, the center hypersphere is bigger, and eventually becomes uncontainable in the hypercube.
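The radii quoted above follow from one line of arithmetic, sketched here. The setup is an assumption about what the comment describes: unit-radius spheres centred at every point (±1, ..., ±1), all sitting inside the cube [-2, 2]^n; the function name is made up for this example.

```python
import math

def inner_radius(dim):
    """Radius of the central sphere tangent to unit spheres centred at (±1, ..., ±1)."""
    # The distance from the origin to any corner-sphere centre is sqrt(dim);
    # subtracting that sphere's radius (1) leaves the central sphere's radius.
    return math.sqrt(dim) - 1.0

for dim in range(2, 11):
    r = inner_radius(dim)
    # The bounding cube [-2, 2]^dim has half-width 2.
    note = "(extends past the cube's faces)" if r > 2.0 else ""
    print(f"{dim}D: radius {r:.3f} {note}")
```

Under these assumptions the central sphere matches the corner spheres in 4D, touches the cube's faces in 9D, and extends past them from 10D on.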

  • @fergalmdaly
    @fergalmdaly 1 year ago +1

    Also, don't forget that the mean squared error is an arbitrary definition of error, used mostly because squaring something makes it positive without making a huge mess of the algebra. It arguably has nothing to do with intuition, it puts far more weight on large errors than our intuition might. I feel like my intuition is closer to mean-absolute than mean squared.
    Would the JS-estimator or anything else be better if we used mean-absolute error?

    • @mathemaniac
      @mathemaniac 1 year ago +1

      The intuitive explanation given in this video does not really have anything to do with the exact form of error that we consider. It might not be the JS estimator, but some other shrinkage estimator might dominate the ordinary estimator, e.g. www.jstor.org/stable/2670307#metadata_info_tab_contents
      But as you noted, the algebra is going to be messy, and it will be very difficult to obtain a definitive answer, just empirical evidence.

    • @fergalmdaly
      @fergalmdaly 1 year ago

      @@mathemaniac Thanks. I could be missing it (there's a lot in there I cannot parse) but it's a bit unclear to me what they have found there, it doesn't seem to claim that it dominates in LAD error. They say "Finally, using stock return data, we present some empirical evidence that the combination estimators have the potential to improve out-of-sample prediction in terms of both mean squared error and mean absolute error." which seems like a much weaker claim.
      Anyway, thanks for your video, it was very interesting and well presented. Just LS-error has always bugged me, it was chosen for convenience, we should expect unintuitive results sometimes.
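For anyone who wants to poke at the mean-absolute-error question empirically, here is a minimal simulation sketch. It is only an experiment at a few hand-picked mean vectors, so it says nothing about dominance; it assumes unit-variance normal noise, p = 10, and the plain James-Stein estimator shrinking toward the origin. The function names and the chosen means are made up for this example.

```python
import random

def james_stein(x):
    """James-Stein estimate for unit-variance data, shrinking toward the origin."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (p - 2) / norm_sq
    return [factor * v for v in x]

def mean_abs_errors(true_means, trials=20000, seed=0):
    """Average sum of absolute errors for the ordinary and James-Stein estimates."""
    rng = random.Random(seed)
    ordinary = js = 0.0
    for _ in range(trials):
        x = [rng.gauss(mu, 1.0) for mu in true_means]
        est = james_stein(x)
        ordinary += sum(abs(a - t) for a, t in zip(x, true_means))
        js += sum(abs(a - t) for a, t in zip(est, true_means))
    return ordinary / trials, js / trials

for scale in (0.0, 1.0, 5.0):
    ordinary, js = mean_abs_errors([scale] * 10)
    print(f"all means = {scale}: ordinary {ordinary:.3f}, James-Stein {js:.3f}")
```

A simulation like this can only suggest what happens at the points tried; a claim for every possible set of means would need a proof.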

  • @chaitanyalodha3948
    @chaitanyalodha3948 1 year ago

    I somehow feel this is really connected to the concept of higher dimensional spheres, which 3b1b had made a video on, about their volumes and shapes.

  • @troyfrei2962
    @troyfrei2962 1 year ago

    WOW when you look at your Image at time 17:58 it looks like "Sommerfeld’s Atom" electrons shell. WOW

  • @melody3741
    @melody3741 1 year ago

    My first thought was that the multiple sets and points could be anywhere on the mean, and the more you have, the more likely they are a good distribution within that mean, so you find the means with them all together to take advantage of their randomness. Then split them apart again with multipliers to put them back to their real mu.

  • @billkowalsky
    @billkowalsky 1 year ago +1

    Really fantastic video, I'm glad the YT algorithm sent it my way. Thanks so much!

  • @nathangamble125
    @nathangamble125 1 year ago +1

    I'm curious as to if and how this "paradox" changes if you change the magnitude distribution of the set of normal distributions.
    i.e. are the magnitudes themselves normally distributed (e.g. a few where the centre of the distribution is about 0.1, a lot where the centre of the distribution is about 10, and a few where the centre of the distribution is about 10,000), or does any magnitude have an equal probability (so a distribution centred around 0.1 is equally as likely as one with a centre of 10, or 10000, or 10 quadrillion)?

    • @mathemaniac
      @mathemaniac 1 year ago

      The order of magnitude has nothing to do with this. James-Stein estimator still dominates the ordinary estimator. That's what "dominate" means - for **every** possible set of "true means", James-Stein estimator has a lower mean squared error.
      By the way, there is no "distribution" per se of the magnitudes, it is fixed, just unknown - think of it as "I know it, but you don't".

  • @joooooooooooe
    @joooooooooooe 1 year ago

    your accent is so unique! had me super focused on the concepts.

  • @rossjennings4755
    @rossjennings4755 1 year ago +2

    A lot of people say that they find the Banach-Tarski theorem to be upsetting, but this result is so much worse than that. You can make the Banach-Tarski phenomenon go away with some pretty weak continuity assumptions, but this is a really strong result that applies in real-world situations and isn't going to go away no matter what you throw at it. In fact I suspect you can make some pretty sweeping generalizations of it. I think the main reason I find it so hard to accept is that I have a really strong intuitive sense that there should be a unique "best" estimator -- i.e., you shouldn't be able to get a better estimator by biasing it in an arbitrary direction, which is exactly what happens with the James-Stein estimator. I suspect that, based on similar reasoning to what's presented in this video, you can show that, in these kinds of situations, there can be no unique "best" estimator. (Edit: I originally had "admissible" where I now have "best", but I've since realized that's not really what I meant.)