A/B Testing Made Easy: Real-Life Example and Step-by-Step Walkthrough for Data Scientists!

  • Published: 29 Jun 2024
  • This is a comprehensive walkthrough of an A/B testing example. I will go through the details of designing A/B testing experiments, running experiments, and interpreting results. The idea is inspired by the book Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. I'm sure it will be extremely helpful for data science interview preparation.
    Part 1 of the tutorial:
    • A/B Testing Fundamenta...
    Derivation of sample size equation:
    • Sample Size Estimation...
    🟢Get all my free data science interview resources
    www.emmading.com/resources
    🟡 Product Case Interview Cheatsheet www.emmading.com/product-case...
    🟠 Statistics Interview Cheatsheet www.emmading.com/statistics-i...
    🟣 Behavioral Interview Cheatsheet www.emmading.com/behavioral-i...
    🔵 Data Science Resume Checklist www.emmading.com/data-science...
    ✅ We work with Experienced Data Scientists to help them land their next dream jobs. Apply now: www.emmading.com/coaching
    // Comment
    Got any questions? Something to add?
    Write a comment below to chat.
    // Let's connect on LinkedIn:
    / emmading001
    ====================
    Contents of this video:
    ====================
    0:00 Intro
    1:00 Background
    2:02 Prerequisites
    5:50 Experiment design
    11:08 Result to decision

Comments • 96

  • @alexlindroos828 · 2 years ago +28

    There are so many YouTube videos on these subjects that barely scratch the surface. They absolutely serve a purpose, but this is incredibly thorough for a 14-minute video. Thank you!

  • @tangled55 · 3 years ago

    You are AMAZING Emma. This is a WONDERFUL example.

  • @prithviprakash1110 · 2 years ago +1

    This was a great video! Thank you for such a great explanation, love the passion you show.

  • @kelvinscomps · 3 years ago +1

    Great video. Exactly what I was looking for and well explained.

  • @amadyba6761 · 2 years ago

    So well explained! Great example, thank you!

  • @aoihana1042 · 2 years ago

    Thorough and precise explanation! Thank you

  • @ramavishwanathan7214 · 2 years ago

    Thanks for the great video. I really liked your transparent approach of attributing the source that you have built upon.

  • @praveensebastian2107 · 2 years ago

    Hi Emma, I recently started watching your videos. Good explanation. Keep going!

  • @Kkidnappedd · 10 months ago

    Thank you, I like your format!

  • @biniyampauloschamiso1411 · 10 months ago

    Thank you for sharing your valuable knowledge on this topic. Very helpful.

  • @Edward_Adams · 11 months ago +1

    Emma, thank you for your channel and your videos! I love how practical and clear your videos are. They are very easy to understand and very useful for understanding the topic and getting ready for an interview.
    BTW, I wanted to share that when your videos are example-oriented and precise, they are more understandable than theoretical ones.

    • @emma_ding · 11 months ago +1

      Thanks for your feedback, Edward! I will keep this in mind. 😊

  • @mingzhouzhu4668 · 2 years ago

    Super useful! Thank you!

  • @Jonwisniewski04 · 1 year ago

    Excellent! Very helpful, thank you.

  • @Vikash_Kumar8090 · 1 year ago

    Excellent Video!

  • @victoradewoyinva · 1 year ago

    Thank you Emma!

  • @vincentyin4992 · 2 years ago +27

    Thanks for the great video! A little question about your final conclusion: it looks like Treatment 1 is neither statistically nor practically significant, so we should definitely not launch it; Treatment 2 is not practically significant but it is statistically significant, so we may consider launching it if the cost is low.

  • @vonderklaas · 1 year ago

    Thanks, it became much clearer.

  • @kennethleung4487 · 2 years ago +6

    Hi Emma, great video! I'd like to ask whether we need to worry about the multiple testing problem here (i.e., apply a Bonferroni correction), since there are more than 2 variants in this context?

  • @gbting1988 · 2 years ago +1

    Hi. Can somebody explain why Treatment 2 is not practically significant? 2.25 > 2

  • @Alex-tv8cf · 3 years ago +2

    Thanks, Emma, for the A/B testing video series. I was asked about some basic concepts in an interview a while ago, but because I hadn't worked on a real project, I didn't answer them well. Starting to learn again from the beginning.

  • @sharanupst · 1 month ago

    Great primer

  • @xinxinli8779 · 2 years ago +1

    Hi Emma, I might've missed this in your video: which hypothesis test did you choose for this example, a t-test or a z-test, and why?

  • @sitongchen6688 · 3 years ago +2

    Thanks Emma for taking the time to create this great video!! I have a quick question regarding defining metrics. Do we also need to mention the timeframe associated with the revenue per user? In your example, it seems that average daily revenue per user was used, or is it average weekly revenue? In the following series, it would be super helpful if you could do another video like this for a referral or promo program in a two-sided market like Uber (many of us find it difficult to answer A/B test questions in that case)!! Thanks a lot for your help.

    • @ktyt684 · 3 years ago +3

      It really depends on the setting of the question. If your question is about a feature that is used by users every day, then you can use the daily time frame. If you think there's a weekly pattern, say users tend to use this feature more on the weekend, then it makes sense to use weekly.
      For this problem, I think it makes sense to just use the average revenue/user over the experiment period, because I am assuming shopping is not a frequent user behavior.

    • @emma_ding · 3 years ago +2

      Thanks ktyt. Adding to the answer above, generally, we want to keep experiments short. You can get more "per day" measurements than "per week" measurements over the same period. In our example, "per day" measurement is preferred.

    • @sitongchen6688 · 3 years ago

      @@emma_ding Thanks Emma for the explanation! Another question I have after rewatching this video: the duration of one week mentioned here is for doing random assignment and collecting samples for each group, right? But if the metric to analyze is a weekly metric, then it will take additional time for the metric to be realized for each cohort. So I guess that also resonates with your point that daily metrics are preferable?

  • @sophiezheng4850 · 2 years ago

    Hi Emma,
    For this particular example, when choosing which population to target, I have always had one question: how do we control for self-selection bias if we target users who visit the checkout page? I would assume that those who visit the checkout page might have a higher propensity to purchase and might not be a representative subset of the population. One way, I think, is to correct this bias during post-experiment analysis (e.g., adjusting the distribution of both groups to make inferences for the overall population). Do you think my concern is valid? Would love to hear your thoughts.

  • @nianyiwang8993 · 2 years ago

    Hi Emma, you're the best!!!! I have a question about a smaller significance level though. When α is smaller, that means my margin of error is higher, which means I need a smaller sample size. Please let me know if I've understood this wrong. Keep going!

  • @mindfuel-ness · 2 years ago

    Need your advice on this: should we instead pick customers for this experiment by customer profiling, to make sure we cover a wide array of customer behavior and represent the change with the new features? I have personally had to do a lot of explaining when I run experiments with randomly picked user groups.

  • @ajitkirpekar4251 · 2 years ago +2

    Ahh... A/B testing. Depending on your career path, you may not have experienced these kinds of challenges since you tackled them in grad school. Thanks for the videos.

    • @emma_ding · 2 years ago +1

      You are welcome Ajit! I am glad you are finding them useful!

  • @suhascrazy805 · 2 years ago

    Hey Emma, you mentioned sanity checks in the video. Could you please elaborate on how we can do these checks?

  • @goryglory729 · 3 years ago +1

    Can you disambiguate expected effect size and practical significance boundary? Thank you!

  • @SimoneIovane · 2 years ago

    Thank you for your clear tutorials. Do you know any resources to go into detail about practical significance? Thanks and keep it up 😉.

    • @emma_ding · 2 years ago

      Hi Simone, I would recommend checking the blog posts on www.datainterviewpro.com/blogpost.

  • @quentingallea166 · 5 months ago

    Great job! For the sample size, I would seriously advise using a more precise approach like G*Power (or R/Python). It would tremendously increase the precision.
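
    For readers who want to follow that advice in Python, here is a minimal power-analysis sketch using statsmodels; the effect size, alpha, and power values below are placeholder assumptions, not the video's exact numbers.

```python
# Minimal power-analysis sketch (an alternative to G*Power).
# Effect size is Cohen's d = delta / sigma; values are placeholders.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.2,          # standardized effect: delta / sigma
    alpha=0.05,               # significance level
    power=0.8,                # 1 - beta
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:.0f}")
```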

  • @sandeepgupta2 · 2 years ago +3

    Hi Emma,
    Amazing content!! Cleared a lot of doubts.
    I had a question though: how can we estimate the standard deviation of the population? By looking at the historical data?

    • @emma_ding · 2 years ago

      Hi Sandeep! Apologies for the delayed response but yes, we can get that from the historical data.

  • @viviandoggy07 · 1 year ago +1

    Why the gradual ramp-up?

  • @kellykuang6122 · 3 years ago +4

    Hi Emma, thank you for the great video! I have some questions after watching and hope you can help. In the example you mentioned, there are 2 treatment groups: one displays similar products below the checkout cart, the other is a pop-up window. Therefore shouldn't the number of variants be 2? Should we use a Bonferroni correction to divide the significance level by 2? Thanks!

    • @emma_ding · 3 years ago +1

      It's a great point to consider the multiple testing problem. However, the Bonferroni correction is often criticized for being too conservative. A more recommended way to deal with the problem is to control the false discovery rate. More details in this video: ruclips.net/video/X8u6kr4fxXc/видео.html. Also, the number of variants is 3: control, Treatment I, and Treatment II.
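
      A minimal sketch of the FDR-control approach Emma describes, using the Benjamini-Hochberg procedure from statsmodels; the two p-values are the ones quoted elsewhere in this thread for Treatment I and Treatment II vs. control.

```python
# Controlling the false discovery rate with Benjamini-Hochberg
# instead of a Bonferroni correction.
from statsmodels.stats.multitest import multipletests

p_values = [0.055, 0.0001]  # Treatment I vs control, Treatment II vs control
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(reject)      # [False  True]: only Treatment II survives the correction
print(p_adjusted)  # BH-adjusted p-values: [0.055, 0.0002]
```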

    • @suhascrazy805 · 2 years ago

      @@emma_ding Hey Emma, so would it be correct to consider this example a multiple testing problem?

  • @nikhilsahay895 · 2 years ago +5

    Great video! I didn't understand why Treatment 2 was practically insignificant. The p-value was less than 0.05 and the point estimate was also greater than $2. What am I missing?

    • @ramavishwanathan7214 · 2 years ago

      The lower bound of the CI was less than 2. For practical significance, in this case, the lower bound of the CI had to be at least 2.

    • @nikhilsahay895 · 2 years ago

      @@ramavishwanathan7214 Then what's the difference between statistical and practical significance? If the lower bound were at least 2, then it would be statistically significant too.

  • @jessehe9286 · 3 years ago +9

    When you mentioned "follow-up tests" at the end, would you just keep the current experiment running for more power? Or are you referring to something else? Thanks! (Love the video!!!)

    • @nanfengbb · 3 years ago

      Good point! I have the same question. For "follow-up tests", do we let the experiment run longer so more samples can be collected?

    • @emma_ding · 3 years ago +3

      Great question! It depends. Generally speaking, when the experiment hasn't stopped, the simplest way is to keep it running to get more users. If it has stopped or there are some major changes to the experiments or assumptions, we need to rerun the experiment, which will introduce some overhead.

    • @Crtg17 · 3 years ago +1

      @@emma_ding Hey Emma. Thanks for sharing! Just to follow up: I was wondering why people always talk about "power", like "underpowered" or "adding more power". I am thinking the only way to increase power is to increase the sample size, so why not just say "the sample size is too small" or "add more samples"? Is there any other way to increase power? Thanks!

    • @goryglory729 · 3 years ago

      @@Crtg17 A diluted experiment can also be underpowered.

  • @yiminglee4372 · 19 days ago

    Hello Emma, thank you so much for this amazing A/B testing series. I have a quick question about the ramp-up plan: on the first day we only use 5% of traffic for each variant, so it's like we use 5% of traffic for variant 1, another 5% for variant 2, and the remaining 90% goes to the control group? I am just really confused about this part. Much appreciated if you can give me some hints! :)

  • @sinaamininiaki3995 · 3 years ago

    Your videos are the best!

  • @ashwinmanickam · 2 years ago

    Thanks for the great video! Very informative!
    10:23 Can customers belong to more than one group?
    What I mean is: in Cycle 1 there are 100 users per group, so a total of 300 users will be tested.
    In Cycle 2: 200 × 3 = 600 users.
    Now, can the 100 users that belonged to control or Treatment 1 in Cycle 1 (day 1) belong to Treatment 2 in another cycle?
    Or do we fix them into these categories only when starting the experiment?

  • @Funnylukn · 3 years ago

    Great video, and really clear. Thank you so much! Question: one of the big assumptions is knowing the population standard deviation, when in reality we don't. Can you give a real-life example of how to estimate that too? Thank you!

    • @emma_ding · 3 years ago +9

      The standard deviation has to be estimated from historical data. You could assume it's the same in both the control and treatment groups. For more info, you can refer to ruclips.net/video/JEAsoUrX6KQ/видео.html
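
      A minimal sketch of that suggestion: estimate sigma from historical data, then plug it into the rule-of-thumb formula n ≈ 16σ²/δ² mentioned in the video. The file and column names below are hypothetical.

```python
# Estimate sigma from historical data, then apply the rule of thumb
# n ≈ 16 * sigma^2 / delta^2 (per group; alpha = 0.05, power = 0.8).
import numpy as np
import pandas as pd

historical = pd.read_csv("historical_revenue.csv")  # hypothetical file
sigma = historical["revenue_per_user"].std(ddof=1)  # sample std dev

delta = 2.0  # minimum detectable effect ($2/user, the practical boundary)
n_per_group = 16 * sigma**2 / delta**2
print(f"sigma ≈ {sigma:.2f}, n per group ≈ {np.ceil(n_per_group):.0f}")
```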

    • @TheCsePower · 1 year ago

      @@emma_ding The video doesn't really explain how to get the variance from previous data. Check ruclips.net/video/OM5Lbb2gZgA/видео.html

  • @karthicthangarasu6766 · 2 years ago

    Hey Emma, one question on segmenting results: does each segment require us to run an SRM check? E.g., the SRM check passes at the overall level but fails at the segment level.

  • @yuanliu1290 · 2 years ago +1

    Thanks for the great video! One question on delta (the business requirement): what if there's no minimum required delta value? E.g., maybe the cost of the test is extremely low and we just want to see if any additional revenue can be generated. Then how should we estimate the sample size?

    • @zachmanifold · 1 year ago

      For this case where you don't have a specific delta in mind, I would probably target a specific margin of error for both groups. In the case where both groups have the same size, this reduces to one calculation.
      For example, maybe you want to target a margin of error of 0.3 (i.e., point estimate ± 0.3); then you can use the formula for the variance of the sampling distribution to determine the appropriate sample size to reach that margin of error.
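
      A small sketch of the margin-of-error approach described above, assuming sigma is estimated from historical data; the numbers are placeholders.

```python
# Choose n so the 95% CI half-width for each group's mean is at most
# the target margin of error: n = (z * sigma / MOE)^2 per group.
from scipy.stats import norm

sigma = 5.0          # assumed std dev of revenue per user (placeholder)
moe_target = 0.3     # desired half-width of the CI
z = norm.ppf(0.975)  # two-sided 95%

n_per_group = (z * sigma / moe_target) ** 2
print(f"n per group ≈ {n_per_group:.0f}")  # ≈ 1067 with these numbers
```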

  • @charlottelee5831 · 2 years ago +4

    This video is super helpful! One question: why is it recommended to be overpowered? If the test gets too overpowered, wouldn't the effect size you are detecting be too small/trivial?

    • @zachmanifold · 1 year ago +2

      Being overpowered would significantly reduce the variance of the sampling distribution of whatever you're measuring. It's very possible you can detect trivial effects like you mentioned, but that's where practical significance plays a role.
      If my final sampling distribution has an interval [0.04, 0.07], that's a completely trivial effect which is "statistically significant" but way lower than what we'd consider practical.
      On the other hand, it could be [2.22, 2.25], where we have much more certainty, and in this case it's comfortable enough to say that it's a practical change to make.
      It's only recommended to be overpowered if you can manage all of the costs associated with the design and, of course, if you have the users for it.

    • @anuragnegi9636 · 5 months ago

      ​@@zachmanifold Such a good example☝

  • @Leon_1218 · 2 years ago +1

    Hey Emma, can I ask what the rationale is for running another test with "more power"?

  • @DePhpBug · 2 years ago

    Hi, this subject triggered my interest in studying A/B testing a bit more. But to practice it, how do we proceed in learning A/B testing if we do not have a digital product? Is there mock data to begin testing with, just to be able to learn A/B testing?

  • @AlirezaAminian · 1 year ago

    At the end of the video, the p-value for Treatment I vs Control is 0.055, which is larger than 5%. So statistically it should be significant, while practically it may not be. In the video, you mentioned otherwise. Can you please advise?

  • @kandulareddy5394 · 2 years ago

    Just a question on the final formula, sample size = 16σ² / δ². I get that for the delta part we can determine the minimal effect we want to measure, and it can be given by the business. But without knowing the sample size, how can we get the sample SD (σ)?

  • @moashtari7619 · 2 years ago +1

    Hi Emma, I love your videos: short, clear, and to the point. However, I tend not to agree with the conclusion you made. No point-estimate comparison can validate either statistical or practical significance; in both cases, the CI needs to be used. Generally, if the CI overlaps a value, chances are high that the data are consistent with that value. For TG1, the CI includes both the control value of 0 and the practical significance boundary of 2, so it is NOT significant in either sense. For TG2, the CI doesn't cover 0 but covers 2, so it is statistically significant but practically NOT significant.

  • @hongyuweng7786 · 3 years ago +4

    Hi Emma, thanks for your inspiring and helpful explanations of A/B testing. I'm confused about the result of Treatment II vs. the control group. It seems like Treatment II is both practically and statistically significant. Why would we want to run more tests before making the final decision? Is the follow-up test like fine-tuning, giving us better design details and results?

    • @emma_ding · 3 years ago +2

      The C.I. of Treatment II overlaps with the practical significance boundary, meaning it's likely NOT practically significant.
      The follow-up test is to get more users to increase the power: either keep running the ongoing experiment or rerun it with more users.

    • @qietang2701 · 3 years ago +8

      @@emma_ding Then why is Treatment I likely to be practically significant? The C.I. of Treatment I also overlaps with the practical significance boundary.

    • @kellykuang6122 · 3 years ago +8

      From my understanding, ideally we want to launch a change if it is (1) statistically significant and (2) practically significant.
      For Treatment I, the point estimate 2.45 > 2, indicating the change may be practically significant, but we are not sure since the confidence interval [-0.1, 5] overlaps the practical significance boundary [-2, 2]. Most importantly, the p-value = 0.055 indicates that it's not statistically significant. Since the change may be practically significant but not statistically significant, we are not sure about the launch, so we want to increase the power of the test to see if the change becomes statistically significant, and also observe whether the CI moves outside the practical significance boundary.
      For Treatment II, unlike Treatment I, the p-value = 0.0001, indicating the change is statistically significant. However, the CI still overlaps the practical significance boundary. We know there's a positive change, but we're not sure if it meets the business goal.
      Is my understanding correct? Comments and feedback welcome!
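
      A minimal sketch of the launch-decision logic laid out in the reply above, assuming a two-sided test at α = 0.05 and a $2/user practical boundary. The Treatment I values are the ones quoted in this thread; the Treatment II interval is a placeholder consistent with the discussion, not the video's exact numbers.

```python
# Launch decision from p-value, confidence interval, and practical boundary.
def decide(p_value, ci_low, ci_high, alpha=0.05, practical_boundary=2.0):
    stat_sig = p_value < alpha                # equivalently, the CI excludes 0
    pract_sig = ci_low >= practical_boundary  # whole CI clears the boundary
    if stat_sig and pract_sig:
        return "launch"
    if not stat_sig and ci_high < practical_boundary:
        return "do not launch"
    return "run a follow-up test with more power"

print(decide(0.055, -0.1, 5.0))   # Treatment I: follow-up test
print(decide(0.0001, 0.5, 4.0))   # Treatment II: stat. sig., CI overlaps $2
```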

    • @scottqian7687 · 3 years ago +6

      @@qietang2701 I was also confused by this. I felt Treatment I is NOT practically significant if we use the criteria mentioned later. I tried to convince myself that maybe Emma meant to bring in the CI as a condition only for Treatment II.

  • @ruthrugezhao862 · 2 years ago

    IMO it's more natural to choose the randomization unit (randomize by user) before the metric (avg revenue per user).

  • @tinawang1291 · 1 year ago

    Need some help understanding the sample size calculation. It seems to me the MDE ($2/user) is missing info about the time period. Is it $2/user per day, per month, or per testing period? If it's per testing period, then how can we use this MDE to calculate how many days we need to run a test?

  • @jushkunjuret4386 · 1 year ago

    Shouldn't the selection of the segment of users fall under step 2 (experiment design)? In the video, it falls under the prerequisites.

  • @haiyentruong1510 · 3 years ago

    Hi Emma, for both treatments the point estimates are larger than the practical significance boundary, so why is Treatment 1 likely to be practically significant and Treatment 2 not? I am not quite clear on that part. Hope you can clarify. Thank you.

    • @codylian9009 · 2 years ago +2

      For Treatment 1: the difference between treatment and control is $2.45, which is more than the practical boundary of $2, but this is only an estimate based on the sample and doesn't account for the margin of error, so it's more reliable to look at the CI, which in this case is [-0.1, 5]. This CI suggests we could observe anything from an average change of -$0.1 (a loss of revenue) up to an average increase of $5, so it is likely not practically significant. However, if the minimum of the CI were above 2 (e.g., [3, 6]), we would say it is practically significant. Thus, Treatment 1 is neither statistically nor practically significant.
      For Treatment 2: using the same logic, Treatment 2 should be considered statistically significant but not practically significant.
      I'm a new learner as well, so feel free to correct me.

  • @adachen5319 · 2 years ago +1

    Thanks Emma! I have one question: your formula for sample size only contains the difference between the control and treatment groups and the standard deviation. You said if the significance level is set to 2.5%, the sample size will be larger. How much larger? Double or triple? You didn't say clearly. Thanks.

    • @emma_ding · 2 years ago

      Hey, you would need to find the corresponding z-score to calculate it. I hope this answers your question. Thanks!
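
      To make that concrete: the general per-group sample size is n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2, which reduces to the ~16σ²/δ² rule at alpha = 0.05 and power = 0.8. A quick sketch of the scaling:

```python
# How the sample size multiplier grows as alpha shrinks.
from scipy.stats import norm

def n_factor(alpha, power=0.8):
    # Coefficient of sigma^2 / delta^2 in the per-group sample size formula.
    return 2 * (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2

print(n_factor(0.05))                    # ≈ 15.7 (the "16" in the rule)
print(n_factor(0.025))                   # ≈ 19.0
print(n_factor(0.025) / n_factor(0.05))  # ≈ 1.21, i.e. ~21% more, not double
```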

  • @user-hw8gx9vh5v · 11 months ago

    Hi @emma_ding, thanks for the video. I have an A/B test with two groups, treatment and control, but I analyze the statistical significance of multiple subgroups within these groups, such as different shift segments throughout the day: a morning shift from 9 am to 12 pm, a lunchtime shift from 12 pm to 3 pm, a snack-time shift from 3 pm to 6 pm, and a dinner-time shift after 6 pm. If I perform an A/B test by dividing the results of control and variant within each of these subgroups, should I conduct an ANOVA, or can I still use a t-test to compare means with a correction, similar to multiple A/B testing (A/B/n tests)?

    • @jasminbogatinovski8391 · 10 months ago +1

      ANOVA will do the trick. You are using one independent categorical variable to analyze the dependent variable. In case the normality and homoscedasticity assumptions are violated, consider using the Kruskal-Wallis test.
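
      A minimal sketch of that suggestion with scipy, using hypothetical per-shift data:

```python
# One-way ANOVA across shift subgroups, with Kruskal-Wallis as the
# fallback when normality/homoscedasticity assumptions fail.
from scipy.stats import f_oneway, kruskal

morning = [12.1, 9.8, 11.4, 10.2]   # placeholder per-user metrics per shift
lunch   = [14.0, 13.2, 15.1, 12.8]
snack   = [9.5, 10.1, 8.7, 9.9]
dinner  = [16.2, 15.4, 17.0, 14.9]

f_stat, p_anova = f_oneway(morning, lunch, snack, dinner)
h_stat, p_kw = kruskal(morning, lunch, snack, dinner)
print(f"ANOVA p = {p_anova:.4f}, Kruskal-Wallis p = {p_kw:.4f}")
```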

    • @user-hw8gx9vh5v · 10 months ago

      @@jasminbogatinovski8391 thank you!

  • @liuauto · 2 years ago +1

    I always wondered where the delta comes from when computing the sample size. It looked like a chicken-and-egg problem to me. Now I realize that delta is actually tied to the practical significance level, which comes from the business requirements.

  • @PriyankaSingh · 2 years ago

    What is practical significance?

  • @frankchen5093 · 3 years ago

    Eventually we reach a point where you've shown all of your t-shirts! 🤣 This looks great on you! (Wait, am I watching a tech video??)

  • @modhua4497 · 3 years ago

    Hi, how can we ensure that users are assigned to each group randomly? Thanks

    • @emma_ding · 3 years ago +1

      You can use hypothesis testing: a t-test or a chi-squared test.
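
      For the group-count side of that check, a common approach is a sample ratio mismatch (SRM) test using a chi-squared goodness-of-fit. A minimal sketch with placeholder counts:

```python
# SRM check: compare observed group counts to the intended 1:1:1 split.
from scipy.stats import chisquare

observed = [10125, 9875, 10010]      # control, treatment I, treatment II
expected = [sum(observed) / 3] * 3   # intended equal allocation

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:                  # a common strict threshold for SRM
    print("Possible sample ratio mismatch: investigate the assignment.")
else:
    print("No evidence of SRM.")
```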

    • @modhua4497 · 3 years ago

      @@emma_ding For an A/B test, do web pages A and B run on the same day, or do we run A today and B tomorrow, and so on? Do you have a video showing the entire process, plus who the key parties involved in an A/B test are? Thanks

  • @ningxinhuang1924 · 2 years ago

    Hi Emma, do you still provide mock interview services? I checked your website and they are marked sold out.

    • @emma_ding · 2 years ago

      Hi Ningxin, currently all my services are sold out. But I will soon launch my masterclass which will introduce you to my full course, and that includes mock interviews in the cohort!

  • @Rapha_Carpio · 2 years ago

    Have you ever done A/B testing on YouTube thumbnails?

  • @dydx3741 · 3 years ago

    The more I watch you... the more I fall in love with you 😓😭