This is how you teach machine learning! Respectfully, the prof at my university needs to take notes!
This was life-saving. Thank you so much, Sebastian. Especially for explaining why 2ab = 0 while deriving the decomposition.
Hi Professor, thank you so much for the excellent explanation!! I learned the bias-variance decomposition a long time ago but never fully understood it until I watched this video! The detailed explanation of each definition helps a lot. Also, the code implementation helps me not only understand the concepts but also apply them in real applications, which is the part I always struggle with! I'll definitely find time to watch the other videos to make my ML foundation more solid.
Wohoo, glad this was so useful! 😊
This was wonderful, Sebastian. After searching, I found no other video on YouTube with such an explanation.
Wohoo, thanks so much for the kind comment!
The best explanation of bias & variance I've encountered so far.
It would be helpful if you could include the "noise" term too.
Thanks! Haha, I would defer the noise term to my statistics class but yeah, maybe I should do a bonus video on that. A director's cut. :)
Do you know that you are doing truly good work! Clear down to every single detail.
Thanks, this is very nice to hear!
Thank you so much for the bias variance videos. Though I intuitively understood it, these equations never made sense to me before I watched the videos. Truly appreciated!!
Awesome, I am really glad to hear that I was able to explain it well :)
Thanks for this! Provides one of the best explanations👏
Thanks! Glad to hear!
@@SebastianRaschka Hi Sebastian, visited your awesome website resource for ML/DL. Thanks again. Can't wait for the Bayesian part to be completed.
Thank you so much for the intuitive explanation! The notations are clear to understand and it just instantly clicked.
Thank you so much. This helped me understand the bias-variance decomposition perfectly from the mathematical side.
Awesome! Glad to hear!
Thank you for this great lecture series!
Glad to hear that you're liking it!
Hi, thanks for teaching, really helpful 😊
At 10:20, the bias comes out backward because the error should be y_hat - y, not y - y_hat. The "true value" in an error is subtracted from the estimate, not the other way around. This is easily remembered by thinking of a simple random variable with mean mu and error e: y = mu + e. Thus, e = y - mu.
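For what it's worth, the decomposition itself is unaffected by this convention, since only the squared bias enters it. A tiny sketch with made-up numbers:

```python
# Toy demo: the squared bias is identical under either sign convention.
# Suppose the true target is y = 3.0 and three models (trained on
# different datasets) predict the hypothetical values below.
y_true = 3.0
predictions = [2.5, 2.8, 3.1]

mean_pred = sum(predictions) / len(predictions)

bias_a = mean_pred - y_true  # estimate minus truth (the convention above)
bias_b = y_true - mean_pred  # truth minus estimate (as in the video)

# The decomposition only uses bias^2, so the sign convention is immaterial:
assert bias_a ** 2 == bias_b ** 2
print(bias_a ** 2)  # ~0.04, up to floating-point rounding
```

So both conventions give the same bias^2 term; the sign only matters if you report the (unsquared) bias itself.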
This is an absolutely brilliant video Sebastian - thank you.
I have no problem deriving the Bias-Variance Decomposition mathematically, but no one seems to explain what the variance or expectation is with respect to - is it just on one value? over multiple training sets? different values within one training set? You explained it excellently.
Thanks for the kind words! Glad it was useful!
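To make the "over multiple training sets" interpretation concrete, here is a minimal toy simulation (all numbers hypothetical; the "model" is simply the sample mean of the training set, so the expectation is taken over repeated independent draws of that training set):

```python
# Minimal sketch: estimate bias^2 and variance at a single query point
# by redrawing the training set many times.
import random

random.seed(0)
y_true = 5.0       # fixed true target at one query point
n_train = 20       # size of each training set
n_datasets = 2000  # number of independent training sets

predictions = []
for _ in range(n_datasets):
    # draw a fresh training set and "fit" the model (here: take the mean)
    train = [y_true + random.gauss(0.0, 1.0) for _ in range(n_train)]
    predictions.append(sum(train) / n_train)

mean_pred = sum(predictions) / n_datasets
bias_sq = (mean_pred - y_true) ** 2
variance = sum((p - mean_pred) ** 2 for p in predictions) / n_datasets

# The sample mean is unbiased, so bias_sq should be near 0, and its
# variance over training sets should be near sigma^2 / n = 1/20 = 0.05.
print(bias_sq, variance)
```

The key point: both statistics are computed across the 2000 refitted models, i.e., the expectation is over training sets, not over values within one training set.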
Thanks for the great video! One question: at 8:42, why is y constant? y = f(x) here also has a distribution, i.e., it is a random variable, is that correct? And when you say "apply the expectation on both sides", is this expectation over y or over x?
Good point. For simplicity, I assumed that y is not a random variable but a fixed target value instead.
@@SebastianRaschka Thank you so much for the reply! Yeah, that's where my confusion lies. So what do you take the expectation over? If you take the expectation over all the x values, then you cannot make this assumption, right?
So helpful😭😭😭
I have a couple of questions: Regarding the variance, is this calculated across different parameter estimates given the same functional form of the model? Also, these parameter estimates depend on the optimization algorithm used, right? I.e., implying the model predictions are "empirically derived models" vs. some sort of theoretically optimal parameter combination, given a particular functional form? If so, would this mean that, _technically speaking_, there is an additional source of error in the loss calculation, which could be something like "implementation variance" due to our model likely not having the most optimal parameters, compared to some theoretical optimum? Hope this makes sense; I'm not a mathematician. Thanks!
When you say bias^2 + variance, is that for a single model? In the beginning you said bias and variance are for different models trained on different datasets. Which one is it? If we consider a single model, then is the bias nothing but the mean error, and the variance the mean squared error?
any good sources or hints on dataset stratification for regression problems ?
Not sure if this is the best way, but personally I approached that by manually specifying bins for the target variable and then proceeding with stratification as for classification. There may be more sophisticated techniques out there, though, e.g., based on KL divergence.
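A rough sketch of that binning idea in plain Python (bin count and split fraction are arbitrary choices for illustration; with scikit-learn you could instead pass the bin labels to `train_test_split`'s `stratify` argument):

```python
# Stratified train/test split for a regression target via manual binning:
# assign each target value to an equal-width bin, then sample the same
# fraction from every bin so the target distributions roughly match.
import random

random.seed(42)
y = [random.uniform(0.0, 100.0) for _ in range(1000)]  # toy targets

n_bins = 5
lo, hi = min(y), max(y)
width = (hi - lo) / n_bins

def bin_of(value):
    # map a target value to a bin index in [0, n_bins - 1]
    return min(int((value - lo) / width), n_bins - 1)

# group sample indices by bin
by_bin = {}
for i, v in enumerate(y):
    by_bin.setdefault(bin_of(v), []).append(i)

# take the same fraction from every bin for the test set
test_frac = 0.2
test_idx = []
for idx in by_bin.values():
    random.shuffle(idx)
    test_idx.extend(idx[: int(len(idx) * test_frac)])

test_set = set(test_idx)
train_idx = [i for i in range(len(y)) if i not in test_set]
# each bin contributes proportionally, so the test-set target
# distribution roughly matches that of the full dataset
```

With many bins this approaches a distribution-matched split; with too few bins it only matches coarsely.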
@@SebastianRaschka Hm, given a sufficiently large number of bins this should be a sensible approach, and easy to implement. I will play around with that. I am trying some of the things taught in this course on the Walmart Store Sales dataset (available from Kaggle); a naive training of LightGBM already returns marginally better results than what the instructor on Udemy had (he used XGBoost with hyperparameters returned by the AWS SageMaker auto-tuner).
Professor, does your bias_variance_decomp work in Google Colab? It did not work for me. It worked just fine in Jupyter. But the problem with Jupyter is that bagging is way slower (that's my computer) than what I could get in Colab.
I think Google Colab has a very old version of MLxtend as the default. I recommend the following:
!pip install mlxtend --upgrade
@@SebastianRaschka It works now. Thanks for the prompt response
I don’t understand why you can’t multiply ‘E’ the expectation by ‘y’ the constant