There are so many YouTube videos on these subjects that barely scratch the surface. They absolutely serve a purpose, but this is incredibly thorough for a 14-minute video. Thank you!
Thanks for the great video! A little question about your final conclusion: it looks like Treatment 1 is neither statistically nor practically significant, so we should definitely not launch it; Treatment 2 is not practically significant but is statistically significant, so we may consider launching it if the cost is low.
Thanks, Emma, for the A/B testing video series. A while ago I was asked about some basic concepts in an interview, but because I hadn't worked on real projects, I didn't answer well. Starting to learn again from the beginning.
Great job! For the sample size, I would seriously advise using a more precise approach like G*Power (or R/Python). It would tremendously increase the precision.
Emma, thank you for your channel and your videos! I love how practical and clear your videos are. They are very easy to understand and very useful for understanding the topic and getting ready for an interview.
BTW, I wanted to share that when your videos are example-oriented and precise, they are more understandable than purely theoretical ones.
Thanks for your feedback, Edward! I will keep this in mind. 😊
You are AMAZING Emma. This is a WONDERFUL example.
Hi Emma, great video! I'd like to ask whether we need to worry about the multiple testing problem here (i.e., apply a Bonferroni correction), since there are now >2 variants in this context?
Great video! I didn't understand why Treatment 2 was practically insignificant. The p-value was less than 0.05, and the point estimate was greater than $2. What am I missing?
The lower bound of the CI was less than 2. For practical significance, in this case, the lower bound of the CI would have to be at least 2.
@@ramavishwanathan7214 Then what's the difference between statistical and practical significance? If the lower bound were at least 2, it would be statistically significant too.
Thank you for sharing your valuable knowledge on this topic. Very helpful.
Hi Emma,
Amazing content!! This cleared a lot of doubts.
I had a question though: how can we estimate the standard deviation of the population? By looking at the historical data?
Hi Sandeep! Apologies for the delayed response but yes, we can get that from the historical data.
This video is super helpful! One question: why is it recommended to be overpowered? If the test gets too overpowered, wouldn't the effect size you are detecting be too small/trivial?
Being overpowered would significantly reduce the variance of the sampling distribution of whatever you’re measuring. It’s very possible you can detect trivial effects like you mentioned, but that’s where practical significance plays a role.
If my final sampling distribution has an interval [0.04, 0.07], that’s a completely trivial effect which is “statistically significant” but way lower than what we’d consider practical.
On the other hand, it could be [2.22, 2.25] where we have much more certainty and we can see in this case it’s comfortable enough to say that it’s a practical change to make.
It's only recommended to be overpowered if you can manage all of the costs associated with the design and, of course, if you have the users for it.
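To make the power trade-off concrete, here's a minimal sketch using statsmodels; the effect size and group sizes are made-up illustration values, not numbers from the video.

```python
# Minimal sketch: how statistical power grows with sample size.
# The effect size and group sizes below are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d = 0.1  # assumed standardized effect size (Cohen's d)

for n in [500, 1000, 2000, 5000]:
    p = analysis.power(effect_size=d, nobs1=n, alpha=0.05)
    print(f"n per group = {n:>4}: power = {p:.2f}")

# Inverse question: sample size per group needed for 80% power.
n_needed = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
print(f"n per group for 80% power: {n_needed:.0f}")
```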
@@zachmanifold Such a good example☝
Ahh...AB testing. Depending on your career path, you may not have experienced these kinds of challenges since you tackled them in grad school. Thanks for the videos.
You are welcome Ajit! I am glad you are finding them useful!
Hi Emma, I love your videos: short, clear, and to the point. However, I tend not to agree with the conclusion you made. A point-estimate comparison alone cannot validate either statistical or practical significance; in both cases the CI needs to be used. Generally, if the CI overlaps a value, we cannot rule that value out as the true effect. For TG1, the CI includes both the control value of 0 and the practical significance boundary of 2, so it is NOT significant in either sense. For TG2, the CI doesn't cover 0 but covers 2, so it is statistically significant but NOT practically significant.
Hi Emma, thanks for the video. Why are you determining statistical significance based on the confidence interval and not the t-statistic?
Hi Emma, I recently started watching your videos. Good explanation. Keep going!
Hi Emma, you're the best!!!! I have a question about a smaller significance level though. When α is smaller, that means my margin of error is higher, which means I need a smaller sample size. Please let me know if I understand wrong. Keep going!
Thanks for the great video. I really liked your transparent approach of attributing the source that you have built upon.
Hi Emma, I might've missed this in your video: which hypothesis test did you choose for this example, t-test or z-test, and why?
Thank you, I like your format!
Thanks, it became much clearer.
When you mentioned "follow-up tests" in the end, would you just keep the current experiment running for more power? Or are you referring to something else? Thanks! (love the video!!!)
Good point! I have the same question. For "follow-up tests", do we let the experiment run longer so more samples can be collected?
Great question! It depends. Generally speaking, when the experiment hasn't stopped, the simplest way is to keep it running to get more users. If it has stopped or there are some major changes to the experiments or assumptions, we need to rerun the experiment, which will introduce some overhead.
@@emma_ding Hey Emma, thanks for sharing! Just to follow up: I was wondering why people always talk about "power", like "underpowered" or "adding more power". I'm thinking the only way to increase power is to increase the sample size, so why not just say "the sample size is too small" or "add more samples"? Is there any other way to increase power? Thanks!
@@Crtg17 A diluted experiment can also be underpowered.
Thorough and precise explanation! Thank you
Thanks for the great video! One question on delta (the business requirement): what if there's no minimum required delta value? E.g., maybe the cost of the test is extremely low and we just want to see if any more revenue can be generated; then how should we estimate the sample size?
For this case where you don't have a specific delta in mind, I would probably target a specific margin of error for both groups. In the case where both groups have the same size, this reduces to one calculation.
For example, maybe you want to target a margin of error of 0.3 (i.e., point estimate ± 0.3); then you can use the formula for the variance of the sampling distribution to determine the appropriate sample size to reach that margin of error.
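For what it's worth, here's a minimal sketch of that calculation; sigma and the target margin of error are assumed values.

```python
# Sketch: sample size per group needed to hit a target margin of error for
# the difference in means (equal group sizes; sigma is an assumed estimate).
from math import ceil
from scipy.stats import norm

sigma = 5.0          # assumed std of revenue per user, e.g. from historical data
target_moe = 0.3     # want point estimate +/- 0.3
z = norm.ppf(0.975)  # 1.96 for a two-sided 95% interval

# MOE = z * sqrt(2 * sigma^2 / n)  =>  n = 2 * (z * sigma / MOE)^2
n_per_group = ceil(2 * (z * sigma / target_moe) ** 2)
print(n_per_group)  # 2135 under these assumptions
```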
The slide at 8'48'' is wrong: the chart title says "% of users in each group," while it shows the total number of users across the three groups. Here's why: if the total number of users who check out is 2000, they need to be split into 3 variant groups, meaning each group has about 666 daily check-out users. Then, for the variant 1 group, the first-day users are 666 × 5%, not 2000 × 5% (which would be 5% of all users across all 3 groups).
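The corrected arithmetic as a tiny sketch (same numbers as the comment above):

```python
# Ramp-up math from the comment above: split total traffic across variants
# first, then apply the day-1 ramp percentage within each variant.
total_checkout_users = 2000
users_per_variant = total_checkout_users / 3      # ~666 per group

day1_ramp = 0.05
day1_users_variant1 = users_per_variant * day1_ramp
print(round(day1_users_variant1))  # ~33, not 2000 * 5% = 100
```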
Excellent Video!
Great video. Exactly what I was looking for and well explained.
Thank you for your clear tutorials. Do you know any resources to go into detail about practical significance? Thanks and keep it up 😉.
Hi Simone, I would recommend checking the blog posts on www.datainterviewpro.com/blogpost.
Hi Emma, thanks for your inspiring and helpful explanations of A/B testing. I'm confused about the result of Treatment II vs. the control group. It seems like Treatment II is both practically and statistically significant. Why would we still want to run more tests before making the final decision? Is the follow-up test like fine-tuning, giving us better design details and results?
The CI of Treatment II overlaps the practical significance boundary, meaning it's likely NOT practically significant.
The follow-up test is to get more users to increase the power: either keep the ongoing experiment running or rerun it with more users.
@@emma_ding Then why is Treatment I likely to be practically significant? The CI of Treatment I also overlaps the practical significance boundary.
From my understanding, ideally we want to launch a change if it is (1) statistically significant and (2) practically significant.
For Treatment I, the point estimate of 2.45 > 2 indicates the change may be practically significant, but we are not sure, since the confidence interval [-0.1, 5] overlaps the practical significance boundary [-2, 2]. Most importantly, the p-value of 0.055 indicates that it's not statistically significant. Since the change may be practically significant but is not statistically significant, we are not sure about the launch, so we want to increase the power of the test to see whether the change becomes statistically significant, and also observe whether the CI moves outside the practical significance boundary.
For Treatment II, unlike Treatment I, the p-value of 0.0001 indicates the change is statistically significant. However, the CI still overlaps the practical significance boundary. We know there's a positive change, but we're not sure whether that change meets the business goal.
Is my understanding correct? Comments and feedback welcome!
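To make the decision rule in this thread concrete, here's a minimal sketch; Treatment I uses the [-0.1, 5] interval quoted above, while the Treatment II interval is a made-up stand-in since its exact CI isn't quoted here.

```python
# Sketch of the decision rule discussed above: classify a result by where its
# confidence interval sits relative to 0 and the practical boundary.
def classify(ci_low, ci_high, boundary=2.0):
    stat_sig = ci_low > 0 or ci_high < 0   # CI excludes "no effect"
    prac_sig = ci_low >= boundary          # whole CI clears the boundary
    return stat_sig, prac_sig

print(classify(-0.1, 5.0))  # Treatment I: (False, False)
print(classify(0.5, 4.0))   # Treatment II-like (illustrative): (True, False)
```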
@@qietang2701 I was also confused by this. I felt Treatment I is NOT practically significant if we use the criteria mentioned later. I tried to convince myself that maybe Emma meant to bring in the CI as a condition only for Treatment II.
Hey Emma, you mentioned sanity checks in the video. Could you please elaborate on how we can do these checks?
Thanks for the great video! Very informative!
10:23 Can customers belong to more than one group?
What I mean is: in Cycle 1 there are 100 users per group, so a total of 300 users will be tested.
In Cycle 2: 200 × 3 = 600 users.
Now, can the 100 users that belonged to control or Treatment 1 in Cycle 1 (day 1) belong to Treatment 2 in another cycle?
Or do we fix them into these categories only when starting the experiment?
Hello Emma, thank you so much for this amazing A/B testing series. I have a quick question about the ramp-up plan: on the first day, we only use 5% of traffic for each variant. So do we use 5% of traffic for variant 1, another 5% for variant 2, and the remaining 90% goes to the control group? I'm really confused about this part. Much appreciated if you can give me some hints! :)
At the end of the video, the p-value for Treatment I vs. Control is 0.055, which is larger than 5%. So statistically it should not be significant, while practically it may not be. In the video, you mentioned otherwise. Can you please advise?
So well explained! Great example, thank you!
IMO it's more natural to choose the randomization unit (randomize by user) before the metric (avg revenue per user).
Thank you Emma!
This was a great video! Thank you for such a great explanation, love the passion you show.
Need your advice on this: should we instead pick customers for this experiment via customer profiling, to make sure we cover a wide array of customer behavior and represent the change with the new features? I have personally had to do a lot of explaining when I run experiments with randomly picked user groups.
Thanks, Emma, for taking the time to create this great video!! I have a quick question regarding defining metrics: do we also need to mention the timeframe associated with revenue per user? In your example, it seems that average daily revenue per user was used, or is it average weekly revenue? In a following video, it would be super helpful if you could do another example like this for a referral or promo program in a two-sided market like Uber (many of us find it difficult to answer A/B test questions in that setting)!! Thanks a lot for your help.
It really depends on the setting of the question. If your question is about a feature that is used by users every day, then you can use a daily timeframe. If you think there's a weekly pattern, say users tend to use the feature more on weekends, then it makes sense to use weekly.
For this problem, I think it makes sense to just use the average revenue per user over the experiment period, because I'm assuming shopping is not a frequent user behavior.
Thanks ktyt. Adding to the answer above, generally, we want to keep experiments short. You can get more "per day" measurements than "per week" measurements over the same period. In our example, "per day" measurement is preferred.
@@emma_ding Thanks, Emma, for the explanation! Another question I have after rewatching this video: the duration of one week mentioned here is for doing the random assignment and collecting samples for each group, right? But for the metrics to analyze, if it's a weekly metric, it will take additional time for the metric to materialize for each cohort. So I guess that also resonates with your point that we prefer daily metrics?
Excellent! very helpful, ty.
Hi Emma, thank you for the great video! I have some questions after watching and hope you can help. In the example you mentioned, there are 2 treatment groups: one displays similar products below the check-out cart, the other uses a pop-up window. So shouldn't the number of variants be 2? Should we use a Bonferroni correction to divide the significance level by 2? Thanks!
It's a great point to consider the multiple testing problem. However, the Bonferroni correction is often criticized for being too conservative. A more recommended way to deal with the problem is to control the false discovery rate. More details in this video: ruclips.net/video/X8u6kr4fxXc/видео.html. Also, the number of variants is 3: control, Treatment I, and Treatment II.
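As a minimal sketch of the FDR approach (the two p-values are the ones quoted in this thread):

```python
# Benjamini-Hochberg FDR control instead of Bonferroni.
from statsmodels.stats.multitest import multipletests

p_values = [0.055, 0.0001]  # Treatment I vs control, Treatment II vs control
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(reject)  # [False  True]: only Treatment II survives the correction
print(p_adj)   # BH-adjusted p-values
```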
@@emma_ding Hey Emma, so would it be correct to consider this example a multiple testing problem?
Great video, and really clear. Thank you so much! Question: one of the big assumptions is knowing the population standard deviation, when in reality we don't. Can you give a real-life example of how to estimate that too? Thank you!
The standard deviation has to be estimated from historical data. You could assume it's the same in both the control and treatment groups. For more info, you can refer to ruclips.net/video/JEAsoUrX6KQ/видео.html
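A minimal sketch of that estimation step; the file and column names are hypothetical.

```python
# Estimate sigma from historical per-user revenue (hypothetical file/column).
import pandas as pd

hist = pd.read_csv("historical_revenue.csv")
sigma_hat = hist["revenue_per_user"].std(ddof=1)  # sample standard deviation
print(sigma_hat)
```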
@@emma_ding The video doesn't really explain how to get the variance from previous data. Check ruclips.net/video/OM5Lbb2gZgA/видео.html
Can you disambiguate expected effect size and practical significance boundary? Thank you!
Hey Emma, can I ask what the rationale is for running another test with "more power"?
super useful! thank you!
Thanks, Emma! I have one question: your formula for sample size only contains the difference between the control and treatment groups and the standard deviation. You said if the significance level drops to 2.5%, the sample size will be larger. How much larger? Double or triple? You didn't say clearly. Thanks.
Hey, you would need to find the corresponding z-score to calculate it. I hope this answers your question. Thanks!
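To put a number on it, here's a sketch of the standard two-sided sample size formula; sigma and delta are illustrative. Tightening alpha from 5% to 2.5% increases the required sample size by roughly 20%, not double or triple.

```python
# n per group = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
from scipy.stats import norm

def n_per_group(alpha, power=0.8, sigma=5.0, delta=2.0):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

n_05 = n_per_group(0.05)     # ~98
n_025 = n_per_group(0.025)   # ~119
print(n_025 / n_05)          # ~1.21
```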
Hi Emma,
For this particular example, when choosing which population to target, I've always had one question: how do we control for self-selection bias among users who visit the check-out page? I would assume that those who visit the check-out page have a higher propensity to purchase and might not be a representative subset of the population. One way I can think of is to correct this bias during post-experiment analysis (e.g., adjusting the distributions of both groups to make inferences about the overall population). Do you think my concern is valid? Would love to hear your thoughts.
Great primer
I always wondered where the delta comes from when computing the sample size. It looked like a chicken-and-egg problem to me. Now I realize that delta is actually tied to the practical significance level, which comes from the business requirement.
Thank you! This wasn't clicking for me
Hi. Can somebody explain why Treatment 2 is not practically significant? 2.25 > 2
Hi, this subject triggered my interest in studying A/B testing a bit more. But to practice, how do we proceed with learning A/B testing if we don't have a digital product? Is there mock data to begin testing with, just so we can learn the A/B test?
Hi Emma, I have an urgent question 😅 I decided to apply an A/B test to Average Order Value, where AOV = Total Payment / #Transactions. First I applied A/B testing to Total Payment and the number of transactions, and there was no significant difference in either. Then I applied the A/B test to AOV, and surprisingly there was a meaningful difference. How can that be possible?
Hi Emma, for both treatments the point estimates are larger than the practical significance boundary, so why is Treatment 1 likely to be practically significant and Treatment 2 not? I'm not quite clear on that part. Hope you can clarify. Thank you.
For Treatment 1: the difference between treatment and control is $2.45, which is more than the practical boundary of $2, but this is only an estimate based on the sample and doesn't incorporate the margin of error, so it's more reliable to look at the CI, which in this case is [-0.1, 5]. This CI suggests we could plausibly see anything from an average decrease of $0.10 (a loss of revenue) up to an average increase of $5, so it is likely not practically significant. However, if the minimum of the CI were above 2 (e.g., [3, 6]), we would say it is practically significant. Thus, Treatment 1 is neither statistically nor practically significant.
For Treatment 2: using the same logic as above, Treatment 2 should be considered statistically significant but not practically significant.
I'm a new learner as well, so feel free to correct me.
Need some help understanding the sample size calculation. It seems to me the MDE ($2/user) is missing info about the time period. Is it $2/user per day, per month, or per testing period? If it's per testing period, then how can we use this MDE to calculate how many days we need to run the test?
Just a question on the final formula, sample size = 16σ²/δ². I get that for the delta part we can determine the minimal effect we want to measure, and it can be given by the business. But without knowing the sample size, how can we get the sample SD (σ)?
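On the sigma question: it comes from historical data (see the reply about estimating the standard deviation above), not from the experiment sample, so there's no circularity. A tiny worked example of the rule of thumb, with illustrative numbers (the 16 is just 2·(z₀.₉₇₅ + z₀.₈)² ≈ 15.7 rounded up):

```python
# Rule-of-thumb sample size: n = 16 * sigma^2 / delta^2 (per group).
sigma = 10.0  # std of the metric, estimated from historical data (assumed)
delta = 2.0   # minimum detectable effect from the business requirement

n = 16 * sigma ** 2 / delta ** 2
print(n)  # 400 users per group
```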
Why the gradual ramp-up?
your videos are the best
Shouldn't the selection of the segment of users fall under step 2 (design experiment)? In the video, it still falls under the prerequisites.
Hi Emma, do you still provide mock interview services? I checked your website and they are marked sold out.
Hi Ningxin, currently all my services are sold out. But I will soon launch my masterclass which will introduce you to my full course, and that includes mock interviews in the cohort!
What is practical significance?
Hi, how can we ensure that the assignment of users to each group is carried out randomly? Thanks
You can use hypothesis testing: t-test or chi-squared test.
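For instance, here's a minimal sketch of a sample-ratio-mismatch check using a chi-squared goodness-of-fit test; the group counts are made up.

```python
# Sanity check on random assignment: compare observed group counts to the
# intended even split (counts below are illustrative).
from scipy.stats import chisquare

observed = [6650, 6710, 6640]        # control, treatment I, treatment II
expected = [sum(observed) / 3] * 3   # equal split expected by design
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(p)  # a very small p-value would flag a sample-ratio mismatch
```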
@@emma_ding For an A/B test, do web pages A and B run on the same day, or do we run A today and B tomorrow, and so on? Do you have a video showing the entire process, plus who the key parties involved in an A/B test are? Thanks
Eventually we reach the point where you've listed all of your t-shirts! 🤣 This one looks great on you! (Wait, am I watching a tech video??)
Have you ever made some A/B Testing for youtube thumbnails?
The more I watch you... the more I fall in love with you 😓😭
Why are your videos so bad? Do you use a teleprompter? Please explain stuff, don't just read.