“Babe quick, Very Normal just uploaded” - genuinely me to my statistician girlfriend
"- Are you in love with a statistician nerd?"
"- Probably."
Just wanted to say thanks for all the work you do!
I had a job interview a couple of months ago and knew they'd ask about stats tests as part of it. Having all your videos there in one place made it really easy to just skim through and give myself a refresher on a bunch of topics.
I ended up passing that interview too!
That’s great! I’m glad I was helpful in that process! Hopefully it helps with my own interviews as well lol
@very-normal I'll keep my fingers crossed for you! Given your excellent teaching style and attention to detail I'm sure you'd do great!
My intuition for this problem would be that, in the real world, we would not know how many nuggets there are, if any. So I would set up a mixture model with two components and compare it with a model with just one to see if there are any nuggets. Mixture models can of course be viewed as hierarchical models.
Yeah that’s exactly what I had in mind! This is known as the EX-NEX model in the clinical trial space
Huge thanks for this amazing masterpiece of a video. I know it's very time-consuming to make these, but this channel will grow a lot. One of the best.
12:22 makes me think of a mixture model
You're my motivation to keep deep diving into statistics
Statistics is so hard... This video definitely went over my head. Still, thank you for making the effort of trying to explain it.
It’ll come with time! Just keep at it and you’ll be surprised how far you get
Looks like you were going to try a Mixture Model at the end there.
👌 you got the idea of it
@@very-normal can't wait for the video about it. Hope you get better soon regardles!
I really appreciate your videos and the way you present topics!
My data-analyst go-to would have been leave-one-out comparisons: compare each group mean to the mean of all the other groups. The variance of the left-out group of course doesn't change, but the variance of the pooled mean drops by a factor of roughly 35 (the number of remaining groups, ≈ 6²), so the standard error drops by about 6 and the CIs no longer overlap as easily. They may still overlap, though, and I'm doing 36 tests instead of one, which demands more separation for the same p-value. So I'm curious how much overlap reduction would be required; if my theory doesn't scale up, I'll just do a Monte Carlo to estimate p. How would this approach compare to the hierarchical one? Which has the best odds of securing the nugget?
That's definitely an interesting approach. One reason I chose to do the analysis in a Bayesian way was so that I could sidestep any talk of the multiple testing problem, since many frequentist approaches would have to contend with it. 36 tests is a lot, so it'd be interesting to figure out the best way to do it for getting that nugget
My guess was something like: "assume each cancer type is in one of two classes, 'non-nugget' and 'nugget', each with its own mean, where the nugget mean is greater than the non-nugget mean. Assume each cancer type has some probability p of being in the nugget group; I don't know what prior to use for p.
Then, given an assignment of types to classes, the data follow a binomial distribution."
though that sounds like it could be pretty hard to compute the posterior for?
I guess you would need to have both a confidence interval for the probability of good outcomes in the nugget group, along with a probability of “there is at least one type in the nugget group”?
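For what it's worth, here's a rough sketch of the membership part of that idea. To keep it computable I fix the two class rates and the prior probability p instead of putting priors on them and integrating them out, and the counts are entirely made up:

```python
import numpy as np
from scipy.stats import binom

# Hypothetical data: successes and trials for five groups (made up for illustration).
successes = np.array([2, 3, 9, 2, 3])
trials = np.array([20, 20, 20, 20, 20])

# Assumed (fixed) response rates for the two classes, and a prior probability p
# that any given group is a "nugget". A full analysis would put priors on these.
rate_null, rate_nugget, p = 0.10, 0.40, 0.2

# Posterior probability that each group is a nugget, by Bayes' rule over the two classes.
lik_nugget = binom.pmf(successes, trials, rate_nugget) * p
lik_null = binom.pmf(successes, trials, rate_null) * (1 - p)
post_nugget = lik_nugget / (lik_nugget + lik_null)

# Probability that at least one group is a nugget (groups independent given the rates).
p_any_nugget = 1 - np.prod(1 - post_nugget)
print(post_nugget.round(3), round(p_any_nugget, 3))
```

With these numbers the 9/20 group gets nearly all the membership probability, which is the kind of "at least one type in the nugget group" summary being asked about.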
The scale of the axes at 6:08 and 6:28 should be the same for the comparison.
Without having a similar scale on the x-axis, it takes a lot of computational brain power to actually compare the two distributions. My brain goes, oh those are similar size, but wait the axis is different, ok 25-20=5 and 7-2=5, so they are the same, wait decimals, ok 0.05 and 0.5 ish.
My question is: where do you learn this kind of applied stuff? Any book recommendations with case studies for carrying out these types of analyses yourself in R?
Could you also use a Bayesian approach to create subsamples of your total sample, updating the variance of your estimates using the best candidate from each round, to improve how you allocate your sample sizes? E.g., start with two groups using half the total samples, then group the most distinct averages into new groups as you add more data, in binary cuts of the data (like binary search).
Why Stan over a Bayesian package in R? (I’ve used Bolstad for Bayesian work)
just for speed, I know Stan the best so it was the fastest to implement. It also helped me point out differences in models
Would you ever consider to make video about factor analysis? It would be dope to have it explained by you!
Yeah! I only know vaguely about it, but it'd definitely be worth a video!
Bro deserves more subscribers
Great video! Peace out
Thank you so much 🙏
Go over Stein's example and the JS estimator next !
Great video! If you‘re looking for new video ideas, I‘d love to see a visualization of the different types of convergence and their connections :) Couldn’t find any good videos on YT yet…
Couldn't you use a leave-one-out approach where you do the pooling each time with one group not included? You then know the nugget is the one with the biggest distance and least variance overlap with the pooled groups. Does that make sense?
Yeah it makes sense! What you’ve described is actually very similar to one of the more sophisticated approaches to this problem, sort of testing out different combinations of “sameness” and trying to find the most probable one
That was my thought as well. I.e. perform a jackknife analysis.
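A quick sketch of that jackknife idea (made-up counts, plain two-sample z-tests on proportions with a Bonferroni correction, which is one simple way to handle the many-tests issue):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical counts for six groups (made up; the video's data has 36 cancer types).
successes = np.array([2, 3, 9, 2, 3, 1])
trials = np.array([20, 20, 20, 20, 20, 20])
alpha = 0.05 / len(trials)  # Bonferroni: one test per left-out group

pvals = []
for i in range(len(trials)):
    # Pool everything except group i, then run a two-sample z-test on proportions.
    s_rest = successes.sum() - successes[i]
    n_rest = trials.sum() - trials[i]
    p_i, p_rest = successes[i] / trials[i], s_rest / n_rest
    p_pool = (successes[i] + s_rest) / (trials[i] + n_rest)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / trials[i] + 1 / n_rest))
    z = (p_i - p_rest) / se
    pvals.append(2 * norm.sf(abs(z)))

flagged = [i for i, p in enumerate(pvals) if p < alpha]
print(flagged)  # index of the group(s) that stand out after correction
```

With this fake data only the 9/20 group survives the correction, which is the "biggest distance from the pool" intuition in test form.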
Bias-variance tradeoff is one of the most important concepts in data science, ML/AI, stats. Respect for bringing it up!
I just read the tipping point book on the plane lol
A video about hierarchical models without a single Simpson's (paradox) gif is criminal! Jk, loving the series
Pleased to see Stan mentioned
What's really funny is that just yesterday I had my final exam on Bayesian statistics, and it was on hierarchical models.
Gaussian mixture model?
Not sure what you'd apply that to, since that's a method for fitting a histogram with a sum of Gaussians (if that's what you mean). But the responses here are binary, giving probabilities rather than the general numeric data Gaussians work on; and you know which group each sample belongs to, which Gaussian mixtures don't seem to use.
An off-topic problem that intrigues me:
Say we're counting the number of cattle observed in one hour. In the countryside there will be many cattle, so our data might be 30, 31, 28, 29, 30, ... In an urban area there will be fewer cattle, but occasionally a truck of cattle may appear, so the data might be 2, 3, 30, 4, 25, 2, ... If we model both with a Poisson, the first case will have the higher variance since lambda is higher, but in real life we can sense that the second process has higher variance. How do we interpret that?
One way to model this hierarchically is to place a distribution on the lambda parameter, similar to how a distribution was assumed for the binomial parameters here. City and countryside would have different lambdas, which lets them have different variances. This encodes a belief that the distributions of the number of cattle observed are different in the city and the countryside, while acknowledging that common factors should make them similar.
Unrelated to this video, but you can also try negative binomial regression to account for the overdispersion and add city as a regressor.
In the end, the model you choose should hopefully be motivated by your knowledge about the data
@very-normal Thanks!
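A small simulation of that idea, with made-up rates: when lambda itself is random (truck vs no truck), the marginal counts are overdispersed (variance > mean) even though each conditional distribution is plain Poisson:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Countryside: lambda is essentially constant, so counts are plain Poisson
# and the variance is close to the mean.
country = rng.poisson(lam=30, size=n)

# Urban: most hours have a small lambda, but occasionally a truck arrives,
# i.e. lambda is itself random. Marginally this is a Poisson mixture.
# (The 10% truck rate and the lambdas are hypothetical.)
truck = rng.random(n) < 0.1
lam_urban = np.where(truck, 28.0, 3.0)
urban = rng.poisson(lam=lam_urban)

print(country.mean(), country.var())  # roughly equal
print(urban.mean(), urban.var())      # variance several times the mean
```

The urban variance exceeds its mean by roughly Var(lambda), which is exactly the overdispersion the negative binomial (a Gamma mixture of Poissons) is built to absorb.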
There is no way a real statistician looks like the one at 6:42
Maybe use a Student-t distribution instead of a normal distribution? I am not sure, but I would assume that, due to the normality parameter nu of the t-distribution, the shrinkage would be smaller than with a normal. The "outlier" would be more compatible, so the mean wouldn't need to shift as much. Likewise, the sigma parameter could be smaller, which could narrow the credible intervals. I would love to see the results from it.
Thanks for the videos! "brms" would be great to show ;-) I know, Stan is much more flexible...
yeah, a t posterior would probably be a better choice. I'm not super well versed in BHMs, but I've seen t posteriors far, far more often than normal posteriors
@@jeffreychandler8418 You mean a t-dist. as a prior right? The posterior does not need to follow any specific distribution.
I have not seen t-distributions as priors for the random effect (as far as I remember), but for typical fixed effect models, mimicking frequentist ANOVAS and t-tests, as proposed by Kruschke or McElreath.
@@Inexorablehorror oh yeah I did. brain crosses wires when I'm tired haha.
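For anyone curious, here's a tiny grid-approximation sketch of the effect being discussed (toy numbers, Normal likelihood with known unit noise): a Student-t prior shrinks an outlying group estimate less than a Normal prior does:

```python
import numpy as np
from scipy.stats import norm, t

# One "outlier" group with observed effect x = 4, far from the prior mean of 0.
x = 4.0
grid = np.linspace(-10, 10, 4001)           # uniform grid over theta
lik = norm.pdf(x, loc=grid, scale=1.0)      # Normal likelihood in theta

def posterior_mean(prior_pdf):
    post = lik * prior_pdf
    post = post / post.sum()                # normalize on the uniform grid
    return float((grid * post).sum())

# Normal(0,1) prior: conjugate case, posterior mean is exactly x/2 = 2.
mean_normal = posterior_mean(norm.pdf(grid, loc=0, scale=1))
# Student-t prior with nu = 3: heavy tails accommodate the outlier.
mean_t = posterior_mean(t.pdf(grid, df=3))

print(mean_normal, mean_t)  # the t prior shrinks the outlier much less
```

The heavy tails mean the t prior is less "surprised" by theta near 4, so the posterior sits closer to the data, which is the reduced-shrinkage behavior described above.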
I'd like a comment on traditional statistical methods (assuming a distribution function, estimating central and spread measures, adjusting for kurtosis, heteroscedasticity, etc.) compared to machine learning ("AI") that uses all the data with brute force to "find patterns". Overfitting is the traditional objection, but ChatGPT works darn well, doesn't it?
Over a decade ago I had a look at Support Vector Machines (SVMs), which are a mathematically analytical approach, not black-box pattern fitting by (guided) brute force. The math of this stuff is a bit above my pay grade and I have no professional use for it, so I ask purely out of personal curiosity. I hear SVMs only have specialized applications nowadays since neural networks and the like outcompete them, so if you haven't familiarized yourself with SVMs, it might not be worth the effort. I just wonder if anyone reading this has, and wants to compare them to traditional statistical methods. They seemed pretty nifty as far as I could tell, although I never applied one.
learning how to code an SVM: 😃
learning the theory of an SVM: 💀
From a certain perspective, the Bayesian approach in the video is a machine learning method: a model is fit to some data. The first model has one parameter, the "overall response rate". The second model, approach #2, has 36 parameters, one for each group. The final model, approach #3, has 3 parameters. Given the size of the data in the video, i.e. small, the 3-parameter approach makes a lot of sense. The 36-parameter approach just has too many knobs to turn, too many degrees of freedom.
ChatGPT 4 probably has around 1.8 trillion parameters.
@@colin4349 Good point there!
I really have to get a grip on machine learning, somehow. But those node layers turn me off, it's so stupid simple in detail. Not exactly Euler math. And that emulates human conversation? It is humiliating, but it is what it is.
@@bjorntorlarsson Don't be afraid of just ignoring some details. For example, I know how to do simple calculations by hand and using a handheld calculator. However, I do not understand how the electrical signals make a calculator work.
ChatGPT works at a high level by learning and then sampling from a distribution. The distribution is learned from the data (how? don't care, ignored), conditioned on a "context window". A context window is the recent text conversation. ChatGPT samples the next word from the distribution conditioned on the context window, updates the context window, samples the next word, and so on.
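A toy sketch of that loop, with a one-word context window and a made-up corpus standing in for the learned distribution:

```python
import random
from collections import defaultdict

# Tiny made-up corpus; the "distribution" is just observed next-word counts.
corpus = "the cat sat on the mat and the cat ran".split()

nexts = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    nexts[prev].append(nxt)               # empirical next-word distribution

random.seed(0)
out = ["the"]                             # starting context window
for _ in range(6):
    word = random.choice(nexts[out[-1]])  # sample conditioned on the window
    out.append(word)                      # slide the window forward
    if word not in nexts:                 # no observed continuation: stop
        break
print(" ".join(out))
```

Real models use a much longer window and a learned neural distribution instead of raw counts, but the sample-then-slide loop is the same shape.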
@@colin4349 ChatGPT is bad at math! But still, it surprisingly often generates the correct answer. If one then asks it what process it used, it just generates a new answer relating to that new question. A new generation that is of course completely unrelated to whatever it did to generate the previous correct math answer. It doesn't "know" anything about that, and doesn't "understand" my question the way I meant it.
It takes some getting used to it. It never uses any logic! I've asked it, and it says so. I suppose it could be used as a convenient user interface that in turn actually uses math software like Wolfram or Matlab. But it doesn't yet, the standard version that I pay $20 a month for. Perhaps I could "prompt" it to do so?
For real:
"- But, without using any logic, how come you can generate code that actually work?"
"- Oh, coding is just heuristics!"
This is why in ML they are called 'models'...
I think that's the name for them in statistics too
Can't a chi-square test do this?
It's one way to do this. It'll tell you that one of the groups has a different response rate from the rest, but you'd also have to do secondary pairwise tests to identify the nugget itself.
love it! but that background is awful for reading :(
oops sorry I’ll keep that in mind for future videos
I learn best by implementing the examples myself and playing with the data that comes with them. So for me it would be far more educational if you used either Python or R Bayesian stats libraries and explained/justified the data set used. Alternatively, could you add a video on how to run Stan from Python (or R)?
First.
congratulations