Full course is now available on my private website. Become a member and get full access:
meerkatstatistics.com/courses...
* 🎉 Special YouTube 60% Discount on Yearly Plan - valid for the 1st 100 subscribers; Voucher code: First100 🎉 *
“GLM in R” Course Outline:
Administration
* Administration
Up to Scratch
* Notebook - Introduction
* Notebook - Linear Models
* Notebook - Intro to R
Intro to GLMs
* Linear Models vs. Generalized Linear Models
* Least Squares vs. Maximum Likelihood
* Saturated vs. Constrained Model
* Link Functions
Exponential Family
* Definition and Examples
* More Examples
* Notebook - Exponential Family
* Mean and Variance
* Notebook - Mean-Variance Relationship
Deviance
* Deviance
* Notebook - Deviance
Likelihood Analysis
* Likelihood Analysis
* Numerical Solution
* Notebook - GLMs in R
* Notebook - Fitting the GLM
* Inference
Code Examples:
* Notebook - Binary/Binomial Regression
* Notebook - Poisson & Negative Binomial Regression
* Notebook - Gamma & Inverse Gaussian Regression
Advanced Topics:
* Quasi-Likelihood
* Generalized Estimating Equations (GEE)
* Mixed Models (GLMM)
Why become a member?
* All video content
* Extra material (notebooks)
* Access to code and notes
* Community Discussion
* No Ads
* Support the Creator ❤
The mode is the most frequent value, and it can be seen in smaller samples of data. But for a larger panel of data, where we constrain ourselves, as you said, to a line or some given model, the mean may be the best choice relative to the mode for getting more convergent results.
I really appreciate the question at the end; I find the best teachers are open about what they don't fully understand because it fosters a desire in the students to really understand what's going on.
In the last video, MLE seemed like the "obvious" thing to do, but I'm not so sure. Let's say our model is linear with a skewed distribution of the error. And let's say our data is, magically, entirely collinear.
Then drawing "distributions" with the mean centered at each point, the MLE approach would actually say to pick the line that passes through the modes, not the line you see right in front of your eyes! And maybe that's okay, but it's something I never would have even thought to question.
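(A quick way to make this concrete, as a sketch of my own rather than anything from the video: put perfectly collinear data under a mean-zero but skewed Gamma error model and maximize the likelihood numerically. The fitted "mean line" comes out shifted off the data, so that the mode of each error density, not its mean, sits on the points.)

```r
# Hypothetical illustration: collinear data + skewed (shifted-Gamma) errors.
set.seed(1)
x <- 1:20
y <- 1 + 2 * x                     # data lie exactly on the line y = 1 + 2x

shape <- 2; rate <- 1              # Gamma(2, 1) error, shifted to have mean 0
negloglik <- function(par) {
  r <- y - par[1] - par[2] * x     # residuals from the candidate line
  g <- r + shape / rate            # undo the mean-zero shift
  if (any(g <= 0)) return(1e10)    # outside the support of the Gamma
  -sum(dgamma(g, shape = shape, rate = rate, log = TRUE))
}
optim(c(0, 0), negloglik)$par
# ~ c(2, 2): the MLE intercept is 1 + 1/rate, so the mean line sits above the
# data, and the error density's mode (not its mean) passes through the points.
```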
IMHO, symmetrical or not, the mean is still your expected value. If you want to minimise the (squared) error, you can only work with the expectation, since that's what you can expect. If you optimise for something else, you may use the mode or whatever other statistic fits that optimisation.
I think the answer to the question would be: if you are using a GLM to analyze the contribution (correlation) of x to y, fitting through the mode can make them look positively correlated when they are actually uncorrelated, or even negatively correlated.
Regarding the mode vs. the mean, it might be that I do not understand it fully; I'm just trying to refresh my knowledge of GLMs by watching your videos. But I would think that if a model is optimized using the mode, it would try to predict the most likely value, so it would get the prediction exactly right more often. By using the mean, the model tries to minimize the error: it would do better on average at being close to points drawn from a population with that distribution.
Yeah, indeed; the points will be closer (in the case of the mean) to the μ curve, even though they are less likely (compared to the mode) to fall right on it. Just a guess though.
In general, for a continuous distribution the probability of hitting either the mean or the mode exactly is zero. In other words, you're never "exactly right".
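(This thread checks out in a quick simulation of my own, not from the course: for a skewed Gamma(2, 1), the mean gives the lower mean squared error, while the mode is the value you land *near* most often.)

```r
# Hypothetical check: mean vs. mode as point predictions for Gamma(2, 1).
set.seed(1)
y <- rgamma(1e6, shape = 2, rate = 1)
m_mean <- 2; m_mode <- 1   # mean = shape/rate, mode = (shape - 1)/rate
c(mean((y - m_mean)^2), mean((y - m_mode)^2))  # ~2 vs ~3: the mean wins on MSE
eps <- 0.05                # "exactly right" up to a small tolerance
c(mean(abs(y - m_mean) < eps), mean(abs(y - m_mode) < eps))  # mode "hits" more
```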
I think you're right.
I think though, that **for the saturated model** we could get a higher likelihood if we positioned the points on the mode. But if we restrict ourselves to placing them on the means, the saturated model still has the highest likelihood we can hope for compared to a model with any type of (additional) structural constraints.
I totally came up with the same question regarding mean vs. mode :-)
Great minds... ;-)
May I attempt to answer the question?
The reason to use the mean is an implicit assumption that whatever distribution the data are drawn from is symmetrical. If you are willing to accept this, then your loss function can be, for example, squared loss, which is nicely differentiable at least twice. So the mathematics is simple.
Now, let us assume that the distribution is asymmetrical, as you propose. Then indeed the better central measure is the median. We do not use the mode, because it may be more problematic to find. So when you do this, your model estimates the conditional median, not the conditional mean. This boils down to a loss function that takes the absolute value of the error instead of its square. But is the absolute value function differentiable? Not at zero, and since your errors take both positive and negative values, the loss will cross that point. So you cannot simply use differentiation for your optimization. What I have described here is a form of quantile regression.
I hope this helps.
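(A minimal sketch of the two losses, with made-up skewed errors of my own rather than the author's code: squared loss recovers the conditional-mean line, absolute loss the conditional-median line, and a derivative-free optimizer like Nelder-Mead sidesteps the kink at zero.)

```r
# Hypothetical data: linear signal + skewed, mean-zero exponential errors.
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rexp(200) - 1   # error mean = 0, error median = log(2) - 1

sq_loss  <- function(b) sum((y - b[1] - b[2] * x)^2)
abs_loss <- function(b) sum(abs(y - b[1] - b[2] * x))

optim(c(0, 0), sq_loss)$par   # ~ coef(lm(y ~ x)): the conditional-mean line
optim(c(0, 0), abs_loss)$par  # conditional-median line: intercept ~ log(2),
                              # lower, since the error median is below its mean
```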
For a skewed distribution of Yi, wouldn't it be better to use the median instead of the mode?
yes
I always thought the reason we use the mean is because it's the expected value of the distribution. I think this reasoning is equivalent to what Kees commented.
The constrained model is linear and the saturated model is quadratic or otherwise non-linear.
If you have as many coefficients as observations, you have the saturated model, but what about models in between the constrained and the saturated? Are they linear as well?
I think any model you choose (linear or not) will always be considered constrained as long as it's not the same as the saturated one. You can try to fit a cubic polynomial to the data, which is more "relaxed" than the linear case, but it will still have a lower likelihood than the saturated one.
If there exists more than one observation for a given x value, say our data has observations {(1,1), (1,2)}, how does the polynomial go about fitting the data exactly? For any polynomial, indeed, for any function there exists only one value in the codomain for any value in the domain. Is it a case of the distribution taking the sum of normal distributions weighted inversely to the number of overlapping data, like a Kernel Density Estimation of sorts?
In that case the line will pass somewhere between the y values. Not sure I understand the latter half of your comment.
@MeerkatStatistics Like, instead of being a single normal distribution around one y value, would it be the weighted sum of normal distributions, one with a mean at each y value?
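(A small R check of both points in this thread, on a toy example of my own rather than anything from the course: with a duplicated x value the saturated model, one free mean per distinct x, passes between the two y's at their group mean, and the constrained model's likelihood can only be lower.)

```r
# Hypothetical toy data with a duplicated x value.
x <- c(1, 1, 2, 3, 4); y <- c(1, 2, 2, 4, 3)
constrained <- lm(y ~ x)          # linear constraint on the means
saturated   <- lm(y ~ factor(x))  # one free mean per distinct x value
c(logLik(constrained), logLik(saturated))  # saturated >= constrained
fitted(saturated)                 # at x = 1 it fits 1.5, between y = 1 and y = 2
```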
very nice
Linear in the sense that it is a constraint on what Beta can be.
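(To illustrate that last point with a sketch of my own: even a cubic fit is "linear" because the mean is linear in the coefficients β; only the columns of the design matrix are non-linear in x.)

```r
# Hypothetical example: a cubic polynomial model is still linear in beta.
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)
fit <- lm(y ~ poly(x, 3))   # mean = X %*% beta; X's columns are non-linear in x
head(model.matrix(fit))     # columns: intercept, p1(x), p2(x), p3(x)
```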