We use a chi-square test to check whether T:C = 1:1.
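A minimal sketch of that check (my own illustration, not from the video), assuming the designed split is 1:1 and using made-up assignment counts:

```python
# Chi-square goodness-of-fit check for sample ratio mismatch (SRM).
# Counts are hypothetical; expected counts follow the intended 1:1 design.
from scipy.stats import chisquare

control_n, treatment_n = 50_321, 49_102        # observed users per group (made up)
total = control_n + treatment_n
expected = [total / 2, total / 2]              # expected counts under a 1:1 split

stat, p_value = chisquare(f_obs=[control_n, treatment_n], f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.4g}")
# A very small p-value (commonly below 0.001) flags a likely sample ratio mismatch.
```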
Thank you! Somehow I missed this video; it has a lot of info and content. I've written it all down, and may come back and watch it again.
Hi Emma, thanks for the great A/B testing series. Can you elaborate on why a sample ratio mismatch invalidates the test results, from a statistical perspective? Also, can we design a split other than 1:1 in practice?
Great knowledge sharing.
Thank you 👍
Thank you for sharing. Super helpful.
This is a really great video, especially for people new to AB testing
Great and useful videos. While you have explained a few ways to identify the causes of a sample ratio mismatch, what are the ways to deal with one? Is it required to re-run the experiment after resolving the bugs/issues? Or can we randomly subsample to make both groups equal?
Thanks for the great video. I have a question w.r.t. the sample size that you mentioned. With a 50:50 split on a website, there will be numerous sessions coming in. So, is the sample size the minimum number of sessions we need on each side to run a test? Or do we randomly sample X samples from all the incoming sessions, X being the sample size?
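In case it helps, here is a rough sketch (my own, with assumed baseline and effect numbers, not Emma's answer) of how the per-group sample size is typically derived from a power analysis; the split stays 50:50 and sessions keep accumulating until each side reaches that count, rather than being subsampled from incoming traffic:

```python
# Hypothetical power analysis: the "sample size" is the minimum number of
# sessions (or users) needed per group, given a baseline rate, the smallest
# effect worth detecting, the significance level, and the desired power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10                                   # current conversion rate (assumed)
target = 0.11                                     # smallest lift worth detecting (assumed)
effect = proportion_effectsize(target, baseline)  # Cohen's h for the two proportions

n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"minimum sessions per group ≈ {n_per_group:,.0f}")
```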
Hi Emma, thank you for your video! Can you help explain why a segment of the population (iOS, Android) would cause multiple testing?
Hey Emma, is reading Trustworthy Online Controlled Experiments book enough for an entry-level data scientist interviews? If not what else should I pair the book with for interview preparation?
Amazing content as always!
I have the same question. From my interview experience so far, I feel it is very important to learn how to tie the theory to the context. The book only gives you all the framework and potential caveats, but you have to think about how to "tailor" them according to the case in the interview. (Looking for Emma's answer as well!)
Hi Emma, I have a question about covariate imbalance for A/B test. If covariate imbalance was observed after the experiment ended, how would you tackle this issue? Thanks in advance!
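One common post-hoc remedy (a sketch of my own, not necessarily Emma's answer) is regression adjustment: estimate the treatment effect while controlling for the imbalanced covariate. The data and column names below are hypothetical:

```python
# Toy example: one row per user, with the outcome, the assignment flag, and the
# covariate that turned out to be imbalanced (e.g., pre-experiment activity).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "converted":  [0, 1, 0, 1, 1, 0, 1, 0],
    "treatment":  [0, 0, 0, 0, 1, 1, 1, 1],
    "pre_visits": [2, 5, 1, 4, 6, 2, 7, 3],
})

# Including the covariate adjusts the estimated treatment effect for the imbalance.
model = smf.ols("converted ~ treatment + pre_visits", data=df).fit()
print(model.params["treatment"], model.pvalues["treatment"])
```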
Hey Emma, great video! Quick question: for tiered significance levels, is it safe to have a lower significance level for a guardrail metric than for your primary metric? Based on your video, if my primary metric is CTR and I expect it to increase, I would use a significance level of 0.05, and if I track a guardrail metric like bounce rate that I don't think will be affected, I would use a significance level of 0.001. That doesn't seem safe to me, because I could get a significant p=0.04 for CTR and a non-significant p=0.003 for bounce rate, and the conclusion would be that the experiment should be implemented. I guess what I'm asking is: how confident should I be in how a metric might change before putting it in a tier with a smaller significance level?
How do we use a t-test for SRM? I thought we could only use chi-squared.
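For a two-group split, the same SRM check can also be framed as a one-sample proportion z-test of the treatment share against 0.5; with the null variance it is equivalent to the 1-degree-of-freedom chi-square (z² = χ²). A small sketch of my own with made-up counts:

```python
# Hypothetical counts: test whether the treatment share differs from the designed 50%.
from statsmodels.stats.proportion import proportions_ztest

treatment_n, control_n = 49_102, 50_321
z, p_value = proportions_ztest(
    count=treatment_n,
    nobs=treatment_n + control_n,
    value=0.5,      # null hypothesis: treatment share is exactly 50%
    prop_var=0.5,   # use the null variance so z**2 matches the chi-square statistic
)
print(f"z = {z:.2f}, p = {p_value:.4g}")
```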
"Trustworthy Online Controlled Experiments" by Ron Kohavi, Diane Tang, Ya Xu - Thanks for your recommendation
Hello Emma, Thank you very much for this insightful video! I have follow-up questions for geo-based randomization to make control and treatment groups more independent.
1. For example, in the case of the Uber app, suppose we put all the SF users in the control group and all the Dallas users in the treatment group, and the feature being tested wins. How can we roll out this feature to all the markets, since the test was only done within those two specific markets? Or do we first roll out to the markets that are comparable to these two?
2. Do you mind doing a video explaining the common observational causal studies for cases where the firm cannot use A/B tests to establish causality?
Thanks a lotttt!!
This is individual heterogeneity estimation. Causal inference methods might be useful. Alternatively, time-based randomization can be used within each location, with the control/treatment groups split based on date.
Good question, did you find the answer?
Thank you