A masterpiece of clarity.
Nice death star easter egg!
wonderful lecture bravo!
31:13 that split of a second when you see the Death Star 😂
Was looking for this comment 😂
In unconfoundedness, does conditioning on X mean the following: if we fill the group "went to sleep with shoes on" with only drunk people, and also fill the group "went to sleep without shoes" with only drunk people, is that a workaround for filling both groups with random people selected by a coin flip? Is the downside that some data are lost, because we only care about a subset of the dataset (e.g. DRUNK = 1, ignoring all data with DRUNK = 0)?
I SAW THE DEATH STAR!
How is the two groups (shoe sleepers and non-shoe sleepers) not being comparable considered a separate reason for association not being causation? Isn't it indirectly a confounder as well?
Thanks for the great lecture again! I learnt a lot and I have a few questions:
1. The fundamental problem of causal inference refers to the fact that, for each individual, we only get to observe one potential outcome. The way around this is to make assumptions, thereby converting a causal estimand into a statistical estimand. So far the course seems to deal with average treatment effects. To estimate individual treatment effects, do we need more assumptions? Will we cover that in the course?
2. For the positivity assumption: suppose that for some covariate values, P(T = 1 | X = x) is very close to 0 or 1. Estimation would be fine if we had access to the full distribution, but with finite samples it leads to large variance. So to get a good estimate of the treatment effect, we would want P(T = 1 | X = x) not to go to extremes; is this correct? This also reminds me of the bias-variance tradeoff: including more covariates reduces confounding (bias), but may lead to an estimate with higher variance. Does this make sense?
3. This is more of a comment: I think the lecture mentions that including more covariates is better (correct me if I am wrong). It may be worth mentioning that this is not always the case, for example X -> C
1. Awesome question. Makes me think you already know the answer haha ;). To move from ATEs to ITEs, we do need to make stronger assumptions. The stronger assumptions we need to make have to do with the specific functional form and noise distribution (in addition to the causal graph). This corresponds to moving from Level 2 to Level 3 of Pearl's ladder. We will see this later in the course when we get to counterfactuals.
2. You are exactly right on both counts. When we get to estimation in week 5, we will actually see that people sometimes just drop specific examples where P(T = 1 | X = x) is too close to 0 or 1. Your bit about the bias-variance tradeoff is also right (usually).
3. Right again. I mention this in sidenote 8 of Chapter 2 in the book (www.bradyneal.com/Introduction_to_Causal_Inference-Sep1_2020-Neal.pdf). I think I meant to use weak language in the lecture (e.g. "there is a general perception that this is the case"). If I used strong language (e.g. "this is the case"), would you mind linking me to it, as I should probably correct that with an annotation.
4. I do everything with PowerPoint and TikZ (since I use TikZ for the book, might as well just reuse those figures in the slides). I sometimes use Inkscape when I need more flexibility than both of those can easily provide.
@@BradyNealCausalInference Thanks for the detailed explanation! For 3, it could be just my perceptual bias :) You did mention that it doesn't hold in general. But just for reference, at 34:32: "for unconfoundedness, the general idea (which is not always true) is that the more covariates you condition on, the more likely you are to have satisfied unconfoundedness." For 4, may I know how you integrate the LaTeX with PowerPoint?
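The trimming heuristic Brady mentions in this thread (dropping examples where P(T = 1 | X = x) is too close to 0 or 1) can be sketched in a few lines. The data-generating process and the 0.05 cutoff below are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy covariate with a propensity score that gets extreme for large |x|
# (the functional form and the trimming threshold are made up for illustration).
x = rng.normal(size=1000)
propensity = 1 / (1 + np.exp(-4 * x))   # near 0 or 1 when |x| is large
t = rng.binomial(1, propensity)

# Common heuristic: trim units whose propensity is too close to 0 or 1.
eps = 0.05
keep = (propensity > eps) & (propensity < 1 - eps)
print(f"kept {keep.sum()} of {len(x)} units after trimming")
```

Trimming trades a little bias (the estimand now refers to the trimmed subpopulation) for a large reduction in variance, which connects directly to the bias-variance point in the comment above.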
Hello Brady. I have a silly doubt, what is the difference between Y(0) and Y | T= 0 ?
Y(0) corresponds to "take a random person in the whole population and force them to take treatment 0." Y | T = 0 corresponds to "take a random person from the subpopulation that happened to take treatment 0." Some of the comments in the threads on this video might also be helpful: ruclips.net/video/eg-bFhNKbnY/видео.html
@@BradyNealCausalInference This is a very helpful formulation that I recommend be included in the course (unless it's already there and I missed it).
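Brady's formulation of Y(0) versus Y | T = 0 can be checked in a quick simulation. Everything below (the drunk/shoes numbers, the outcome model) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounded data: X ("drunk") raises both T and Y(0).
x = rng.binomial(1, 0.5, n)
t = rng.binomial(1, 0.1 + 0.8 * x)   # drunk people mostly sleep with shoes on
y0 = 2.0 * x + rng.normal(size=n)    # potential outcome under no treatment

# E[Y(0)]: force the *whole* population to T = 0 and average.
ey0 = y0.mean()                      # about 1.0 here

# E[Y | T=0]: average only over people who *happened* to have T = 0
# (mostly sober people, so the average is much lower).
ey_given_t0 = y0[t == 0].mean()      # about 0.2 here
print(ey0, ey_given_t0)
```

The two quantities differ exactly because the T = 0 subpopulation is not a random sample of the whole population.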
@Brady: In Jason A. Roy's Coursera course, both the no-interference assumption and the "only one way of getting treatment" assumption are clubbed under SUTVA, whereas in your example of "golden retriever or other dog," which I guess violates the "only one way of getting treatment" assumption, you're putting it under the consistency assumption.
@@Theviswanath57 Not entirely sure I understand your comment, but are you saying this:
"SUTVA is satisfied if unit (individual) i's outcome is simply a function of unit i's treatment. Therefore, SUTVA is a combination of consistency and no interference (and also deterministic potential outcomes)."
If so, that sounds right to me. That's taken from Section 2.3.5 of the course book (not everything makes it into the lecture)
@@BradyNealCausalInference makes sense, thanks
In Slide #40, with regards to estimation: I feel it should be a sum over i rather than a sum over x.
Currently it's (1/n) * Σ_x ( E[Y | T=1, X=x] - E[Y | T=0, X=x] ).
I feel it should be (1/n) * Σ_i ( E[Y | T=1, X=x_i] - E[Y | T=0, X=x_i] ), which can be rewritten as Σ_x P(X=x) * ( E[Y | T=1, X=x] - E[Y | T=0, X=x] ).
You are absolutely right. Unfortunately, some typos might stay in the videos, even if they have been fixed in the book.
Reason:
Let's say there are four sub-groups with the following conditional average treatment effects: 1, 0.5, 1.5, 2.5.
Let's say P(X=x) = [0.5, 0.2, 0.2, 0.1].
Let's say there are 100 subjects in total.
With the first equation, the ATE will be (1/100) * (1 + 0.5 + 1.5 + 2.5) = (1/100) * 5.5 = 0.055.
With the second equation, the ATE will be (0.5*1 + 0.2*0.5 + 0.2*1.5 + 0.1*2.5) = 1.15.
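The commenter's arithmetic checks out. Here is the same calculation as a tiny script (the CATEs, weights, and n = 100 are the commenter's hypothetical numbers):

```python
# Four subgroups with CATEs and P(X = x) as given above, n = 100 subjects.
cate = [1.0, 0.5, 1.5, 2.5]
p_x = [0.5, 0.2, 0.2, 0.1]
n = 100

# First (incorrect) equation: sum over x but divide by n.
ate_wrong = (1 / n) * sum(cate)                    # 0.055
# Second (correct) equation: weight each CATE by P(X = x).
ate_right = sum(p * c for p, c in zip(p_x, cate))  # 1.15
print(ate_wrong, ate_right)
```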
Hi Brady, thanks for your awesome lecture. But I have a question about ignorability and exchangeability. In Causal Inference: What If, the joint independence of the potential outcomes from treatment under randomization is referred to as full exchangeability. Randomization makes the potential outcomes jointly independent of the treatment T, which implies, but is not implied by, exchangeability. So why does randomization/ignorability mean joint independence rather than marginal independence?
Great lecture, but starting at 20:02 I become lost: How is E[Y(1) | T=0] not a contradiction? If you do(T=1) then doesn’t that force T=1?
Yes, but T=0 is *conditioning* on T=0, not doing T=0. So conditioning on T=0 means "look at the people who happened to not take the treatment." Then for those people, Y(1) means "what would have happened had they taken the treatment?"
@@BradyNealCausalInference Thanks so much for taking the time to respond! This clarification helped me be able to move forward.
@@scotth.hawley1560 Glad to hear it! Thanks for bearing with me on the slow response time haha.
On Final Estimation example:
Question 1: By controlling for age, our estimated ATE matches the actual ATE; whereas by controlling for both age and 'protein excreted in urine', our estimated ATE is just 0.85.
Question 2: What's the causal graph with both age & protein excreted in the urine
age blood_pressure } where age is confounding variable
Actual ATE: 1.05 & estimated_ate: 1.05 ( Both from the "mean of differences" & from regression coefficient )
I'm not sure I see a question in there haha. It sounds like you are describing the code. Note: some of that code is for Chapter 4, where we actually write down the causal graph, so it might not all make sense without Chapters 3 and 4.
@@BradyNealCausalInference Cool, will wait for chapter 3 & 4 to be covered
Hi Brady, on page 18, I understand your point here, but I have a question about the definition of E[Y(1)|T=0]. If we observe T=0, then what is the meaning of Y(1) here?
Y(1) given that you observe T = 0 is the outcome you would have observed if you had taken T = 1. It isn't something that we can observe (usually)! I think I give the intuition for this on the potential outcomes intuition slide.
@@BradyNealCausalInference So if the observed T=0 is independent of the potential outcome Y(1), then we can also get E[Y(1)] - E[Y(0)] = E[Y(1)|T=0] - E[Y(0)|T=1], right? But we cannot use the consistency law there; therefore, in ICI, Eq. (2.3), it's E[Y(1)] - E[Y(0)] = E[Y(1)|T=1] - E[Y(0)|T=0]. Is my understanding correct?
Hey Brady, thanks for the great course!! On slide 17: why does E[Y(1)|T=1] become E[Y|T=1]? And similarly E[Y(0)|T=0] = E[Y|T=0]?
My understanding: because the condition is T=1, Y(T) = Y(1) = Y. That's my own way of explaining this. If T could be either 1 or 0, it couldn't be simplified like this.
It's after applying the consistency assumption: because we are guaranteed that for T = t we will observe Y(t), Y | T = t is sufficient.
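The consistency step in this thread (going from E[Y(t)|T=t] to E[Y|T=t]) can be made concrete in a simulation, where, unlike in real data, we get to generate both potential outcomes. The distributions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# In a simulation we can create *both* potential outcomes for every unit.
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0                      # treatment adds 1 for everyone (arbitrary choice)
t = rng.binomial(1, 0.5, n)

# Consistency (the "switching equation"): the observed Y equals Y(T).
y = t * y1 + (1 - t) * y0

# Hence, within the T = 1 group, Y and Y(1) are literally the same numbers,
# so E[Y | T=1] = E[Y(1) | T=1] (and likewise for T = 0).
assert np.array_equal(y[t == 1], y1[t == 1])
assert np.array_equal(y[t == 0], y0[t == 0])
```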
Hi Brady! Thank you so much for those lovely pedagogical videos! There is something I am struggling to wrap my head around, though, and I was wondering if somebody (you or some other kind soul) could help me with it. You presented ignorability as resulting from an assumption of independence between the potential outcomes Y(1), Y(0) and the treatment, leading to E[Y(1)|T=0] = E[Y(1)|T=1]. Doesn't this independence mean that the treatment has basically no causal effect on Y? Instead of removing the arrow from X to T, aren't we removing all arrows leading to T?
To put my confusion another way: if the expectation of the outcome Y(1) does not change whether we give T or not, doesn't that mean T is not causal for Y? I am obviously having a logic flaw here somewhere, so I would be glad if someone could help me see it :)
I think I am confusing Y(1) with Y=1 here, while in fact it is Y|do(T=1). It takes some getting used to...
Reporting a mistake: around 5:03, Brady says T=0 for taking the pill. It should be T=1.
Brady, does the causal inference literature say anything about "knowing the presence of confounding variables, but not being able to know or measure what they are"? That would hint to the domain expert that there's something else influencing the decision. Also, in terms of the shoe example: since we know being drunk is contributing to the outcome, it wouldn't really be a confounder if we know it, right?
Hey, you say that the approach at the end, where you train a regression of the form y=at + bx only works because the treatment effect is the same for all individuals (ATE=CATE). I don't think this is correct. In fact, the paper which introduced the Double Machine Learning approach starts off by showing that for the case of y = at + g(x), standard approaches which predict y well will give biased estimators for a (although granted, the Double Machine Learning approach really starts to shine when y=f(x)t + g(x)). Do you have any intuition on why the linear regression approach works so well here? Is it because the outcome variable depends linearly on both the treatment and the feature? Will it always work well in such cases? My intuition says no, that confoundedness can still mess you up. Maybe it's just a quirk of this exact dataset?
Slide #40: the naive estimate might have been estimated through the following regression equation: Y_i = alpha + beta * T_i.
alpha_hat is 5.33?
Not quite. That simple regression and taking the coefficient from the regression is actually what I describe for slide *41*. And in your comment, *beta* hat is actually the ATE estimate (5.33), not alpha hat. In the notation I use in slide 41 (different from yours), it is alpha hat that is 5.33.
@@BradyNealCausalInference yeah that's right, little confused; thanks
Where can I get the data?
@@Theviswanath57 See the GitHub link in Section 2.5 of the book for the data generation and estimation code.
Independently & identically distributed = ignorability / exchangeability.
Agree?
Hi Brady, thanks for the great lectures! I read The Book of Why by Judea Pearl. Is there any difference between the potential outcomes framework and the counterfactual calculations in Pearl's book? I saw some comments in the book where Judea thought the missing-value interpretation was wrong. What methodology do you recommend in practical applications? Or are they just the same?
I think the two languages share a lot more than a lot of people seem to think. To me, they are simply different notations and different ways to formulate the assumptions. You should be able to understand both, so I include them both in the first month of the course. I use both, depending on the setting or who I'm talking to.
Hi Brady, thanks for this lecture. It is super great. I have one question about the fourth assumption for identification, i.e. consistency. To illustrate the concept, you mentioned an example with two different types of dogs as multiple versions of the treatment. I am wondering, is it really a problem? I guess one can always define a specific version of the treatment as T, right? Thank you!
Yes, that just means being sufficiently specific about how you define the treatment.
In the consistency example, I got the point that we can't have multiple versions of the treatment (like different types of drugs as one treatment). But does it have to produce the same outcome always? I mean, is it possible to have a case where I take a pill one day and I get better, but I take the pill another day and the headache does not get better?
Violating consistency is like needing to add more nodes to the causal graph; for example, the dog type in the given example, along with whether the person got a dog. Similarly, if the pill's effect is different each day, a day node needs to be added to the causal graph.
Great course!
Any particular books or review papers that you could recommend to read in more detail?
Have you found any?
Hello Brady, thank you for the awesome video :) I came here to get an intuitive understanding of causality. I have a question on lecture slide 14. If the groups with T=1 and T=0 are comparable, shouldn't it be drunk on the right if it is sober on the left? Based on my understanding, say I am the topmost guy in both groups (T=1, T=0). How can I be included in the group 'go to sleep with shoes on' and in the other group 'without shoes on' under the same condition 'drunk'? Please correct me if I am wrong. Thanks!
The same person cannot be included in both groups; it's just that the number of drunk people in both groups is almost the same, due to randomization.
Can we have a non-linear cause and effect relationship? In that case, how do we estimate the exact effect ?
Yes! You'd use the same estimator that is used in slide 40, but with a nonlinear model instead of linear regression. You can also use any of the other estimators that we discuss in week 6 of the course.
@@BradyNealCausalInference Thanks, I will definitely check week 6 of the course. I asked because if there is non-linearity with respect to T, then Y_hat = alpha * T + alpha' * T^2 + alpha'' * T^3 + ... + beta * X. Which coefficient would then give us the causal effect of T on Y?
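A hedged sketch of one standard answer to the question above (not from the lecture): once the outcome model is nonlinear in T or includes interactions, no single coefficient equals the ATE. Instead, you plug T = 1 and T = 0 into the fitted model and average the predicted difference over all units, in the spirit of the slide 40 estimator. The data-generating process and feature set below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Toy data: treatment effect (1 + x) varies with x; outcome is nonlinear in x.
x = rng.uniform(-2, 2, n)
t = rng.binomial(1, 0.5, n)
y = (1.0 + x) * t + np.sin(x) + rng.normal(0, 0.1, n)   # true ATE = E[1 + x] = 1

def mu_features(t, x):
    # A flexible (illustrative) outcome model: nonlinear in x, interaction t*x.
    return np.column_stack([np.ones_like(x), t, x, x**2, x**3, t * x])

beta, *_ = np.linalg.lstsq(mu_features(t, x), y, rcond=None)

# No single coefficient is the ATE here; average the model-implied contrast
# mu(1, x_i) - mu(0, x_i) over all units instead.
ate_hat = (mu_features(np.ones(n), x) @ beta
           - mu_features(np.zeros(n), x) @ beta).mean()
print(ate_hat)   # close to the true ATE of 1
```

The same plug-in recipe works with any model you can evaluate at T = 1 and T = 0, linear or not.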
Thanks for the great lecture again. I have a few questions about the textbook.
On page 8: "A natural quantity that comes to mind is the associational difference: ~~~~ Then, maybe E[Y(1)]-E[Y(0)] equals E[Y|T=1]-E[Y|T=0]."
From these sentences, I got a little confused about what "maybe ~ equals" means.
In addition, I have a question about the description section of "Consistency" on page 14. I understand Y(t) intuitively, but I don't understand "whereas Y(T) is the potential outcome for the actual value of treatment that we observe" intuitively. Do you have an example?
Basically, it's just like a train of thought that is common to go down. "maybe E[Y(1)]-E[Y(0)] equals E[Y|T=1]-E[Y|T=0]" is the more formal way of writing "maybe causation equals association (correlation equals causation)." Of course, this thinking is often incorrect :)
@@chadpark9248 For a given individual, they will observe a specific value, say t', for the random variable T. That means that they will observe the potential outcome Y(t'). So the realized value, t', of T gets connected to the observed outcome Y in that way (assuming consistency). Similarly, Y(T) corresponds to the potential outcome that we observe when we know the realized value of the treatment random variable T. It is distinct from Y(1), Y(0), or Y(t) which is meant to denote a specific potential outcome, that isn't related to the random variable T at all (even though, we use the same letter, but in lower case, for Y(t)).
@@BradyNealCausalInference Thank you for your detailed explanation.
Is there a textbook or course website?
Website: causalcourse.com
Book: www.bradyneal.com/causal-inference-course#course-textbook
@Brady: In Slide #41, I am wondering estimation should be sigma_x ( P(X=x) * ( E[ Y/T=1, X=x] - E(Y/T=0, X=x) )
In your variant essentially we are saying that P(X=x) is same for all x; please correct me if I am wrong;
@@Theviswanath57 In slide 40, it is that equation that you write, assuming that you meant "E[Y | T=1, X=x] - E[Y | T=0, X=x]" when you wrote "E[ Y/T=1, X=x] - P(Y/T=0, X=x)." However, in slide 41, we use a completely different way to estimate the ATE: linear regression, then taking the coefficient of the regression. In general, it is not equal to the correct equation from slide 40. It is only equal when E[Y | T=1, X=x] - E[Y | T=0, X=x] is the same for all x (i.e. the treatment effect is the same for all individuals). I don't actually include the specific equation for the estimate in slide 41, but you can get it using the closed-form solution to linear regression. You can see the exact code that I used for this in Section 2.5 of the course book.
@@BradyNealCausalInference regarding P(Y/T=0, X=x), yes I mean E(Y/T=0, X=x).
Understood on "It is only equal when E[Y | T=1, X=x] - E[Y | T=0, X=x] is the same for all x (i.e. the treatment effect is the same for all individuals)."
@brady If we have P(X=x) as part of the equation, is the ATE an unbiased estimate even if E[Y | T=1, X=x] - E[Y | T=0, X=x] is not the same for all x?
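The distinction discussed in this thread can be illustrated numerically. In this hypothetical simulation the CATE differs across x and treatment is confounded by x, so the slide 40 adjustment estimator recovers the true ATE while the coefficient from a linear regression of Y on T and X does not (all numbers below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical DGP: x confounds t, and the treatment effect depends on x.
x = rng.binomial(1, 0.5, n)
t = rng.binomial(1, 0.1 + 0.6 * x)    # P(T=1|x=0)=0.1, P(T=1|x=1)=0.7
tau = np.where(x == 1, 3.0, 1.0)      # CATEs 1 and 3, so true ATE = 2
y = tau * t + x + rng.normal(0, 0.1, n)

# Slide 40 estimator: weight group-wise mean differences by P(X = x).
ate_adj = sum(
    (y[(t == 1) & (x == v)].mean() - y[(t == 0) & (x == v)].mean()) * np.mean(x == v)
    for v in (0, 1)
)                                     # close to 2.0

# Slide 41 estimator: coefficient on t from regressing y on (1, t, x).
design = np.column_stack([np.ones(n), t, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
ate_ols = coef[1]                     # close to 2.4: a variance-weighted CATE average
print(ate_adj, ate_ols)
```

When the CATE is the same for all x, the two estimators coincide, which is exactly the condition Brady states in his reply.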
Amazing video. One question. The example at the end of the lecture seems like a simple linear regression. Does it mean that when we run linear regression, we are doing causal inference? What is the difference between regression and causal inference, here?
Thanks for the lecture! I have a question around ruclips.net/video/5x_pPemAVxs/видео.html: Is E[Y(1) - Y(0)] (here the individual subscript i is implicit) properly defined since some data are missing?
@mingmingchen7154 As you have pointed out it is a biased estimate. And Brady explains this clearly afterwards.