- 4 videos
- 72,325 views
Anastasios Nikolas Angelopoulos
USA
Joined Dec 16, 2019
I am Anastasios Nikolas Angelopoulos, a fourth-year Ph.D. student at the University of California, Berkeley.
I work on theoretical machine learning with applications in vision and healthcare. My goal is to apply modern statistical ideas to increase robustness of black-box models like deep neural networks. I am motivated by medical diagnostics: statistical reliability will become paramount as computer vision and machine learning become ubiquitous in such high-risk settings. My other applied interests include computational imaging and ophthalmology.
I am privileged to be advised by Michael I. Jordan and Jitendra Malik. From 2016 to 2019, I was an electrical engineering student at Stanford University advised by Gordon Wetzstein and Stephen P. Boyd. See my website below.
people.eecs.berkeley.edu/~angelopoulos/
Tutorial on Conformal Prediction Part 3: Beyond Conformal Prediction
Here we discuss our recent works on controlling error rates other than coverage --- such as the false-discovery rate, intersection-over-union, and OOD detection metrics. For this, we need tools beyond conformal prediction. (Re-upload, the old one was missing the Detectron2 video.)
Views: 6,920
Videos
A Tutorial on Conformal Prediction Part 2: Conditional Coverage and Diagnostics
13K views · 2 years ago
This video tutorial on conformal prediction follows a document (linked below) that we wrote to teach people conformal prediction and distribution-free uncertainty quantification. Here we focus on the practical aspects of conformal prediction, such as marginal vs. conditional coverage, as well as diagnostics to make sure your conformal procedure is correct and effective. You do not need to be a...
Event-based, 10KHz eye tracking
1.6K views · 3 years ago
Presentation for IEEEVR2021. Project website: www.computationalimaging.org/publications/event-based-eye-tracking/ Dataset on GitHub: github.com/aangelopoulos/event_based_gaze_tracking
A Tutorial on Conformal Prediction
51K views · 3 years ago
This video tutorial on conformal prediction follows a document (pdf link below) we wrote that is meant to teach people conformal prediction and distribution-free uncertainty quantification. The document is a hands-on introduction for a reader interested in the practical implementation of distribution-free UQ, who is not necessarily a statistician. We included many explanatory illustrations, exa...
Nicely organized and neat introduction. Thanks!
Hi, my question is why it makes sense to assume that X and Y come from a joint distribution P, which to me at least implies that X is random. In the frequentist settings I am familiar with, like regression, one assumes that the predictors X are deterministic and only y is random, i.e., the conditional expectation of y given x is modelled. Can anyone explain the intuition behind this assumption? Is it just a more general assumption that doesn't contradict the frequentist view?
Hey! Good question. There are certainly some settings where you model X as fixed, or even analyze worst-case behavior over X, but that’s a harder problem setup. Random X is also a standard frequentist setup, although it’s easier than the one you brought up. (Actually most of the standard frequentist theory on, e.g., M-estimation is done using random X.) This assumption might be suitable when the inputs to your algorithm can be thought of as coming from a consistent population.
Thank you
Your videos are such great resources, thank you! I am just wondering, how do we know the theoretical distribution that the coverage should follow? Is there any resource that explains this in more detail? Thanks!
One good resource is this paper: arxiv.org/abs/1209.2673 Informally, you can think of the conformal scores as being uniformly distributed on the coverage scale, and then apply the fact that order statistics of uniformly distributed random variables are beta distributed: en.wikipedia.org/wiki/Order_statistic.
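If it helps to see it concretely, here is a small simulation sketch (my own, with arbitrary choices of n and alpha): the coverage attained by the split-conformal quantile, viewed as a random variable over the draw of the calibration set, traces out exactly that Beta distribution.

```python
import numpy as np
from scipy import stats

n, alpha, n_trials = 100, 0.1, 20_000
rng = np.random.default_rng(0)
k = int(np.ceil((n + 1) * (1 - alpha)))          # index of the conformal quantile

coverages = []
for _ in range(n_trials):
    cal_scores = rng.standard_normal(n)          # stand-in conformal scores ~ N(0, 1)
    q_hat = np.sort(cal_scores)[k - 1]           # k-th smallest calibration score
    coverages.append(stats.norm.cdf(q_hat))      # true coverage of {score <= q_hat}

# The k-th order statistic of n uniforms is Beta(k, n + 1 - k) distributed.
print("empirical mean coverage:", np.mean(coverages))
print("Beta(k, n+1-k) mean    :", k / (n + 1))
```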
@@anastasiosangelopoulos Thanks a lot!
The hierarchical clustering and its relation to the Learn Then Test framework (LTT) was slightly unclear to me. Am I right in thinking the general procedure looks something like adaptive prediction sets for classification, but instead of moving through individual labels ordered by probability mass to build your quantile, you consider a sequence of larger and larger subtrees? Or do you consider a threshold such that if a subtree has total mass greater than the threshold value you accept it, and then employ the LTT framework directly to select a risk-controlled threshold value?
Well, both of these would be possible! The first one works if your risk function is coverage (i.e., if you're looking to make "high-accuracy" hierarchical predictions). When your risk function is not coverage, you can't reframe it in terms of conformal scores, so you need to do this "threshold sweeping" thing. Conformal risk control tells you the right way to do it for a monotone risk, and LTT gives a high-probability version for a non-monotone risk.
@@anastasiosangelopoulos Great thanks for such an informative reply and an amazing series!
Is the small diagram Anastasios draws at ~30:00 slightly incorrect? Shouldn't the quantile line be horizontal?
Good point, the diagram is a little weird. If we were on the scale of the CDF, it would be a horizontal line. Here, I'm trying to get across the idea that you should stop after the cumulative mass of the bars hits \hat{q}. (So it's more like a horizontal line if you take the _integral_ of the X axis on that plot.) :) thanks for the question!
@@anastasiosangelopoulos Thanks for the response! The point you were conveying was super clear in any case - great tutorial!
Hi! At minute 13:00, with the first algorithm, in cases where no clear class is detected it could happen that no class has a score greater than q_hat, so the output is an empty set. But in cases of uncertainty I would expect a bigger set; is this the expected behavior?
Yep, for this particular score function, that's the expected behavior! It's a little bit weird, but when you output a size-zero set, that decreases your _average_ set size! 😅 If you read the gentle introduction document, we also talk about the (R)APS score, which is usually a better solution for classification. That score has the more intuitive behavior of growing the set whenever there is no clear class.
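A quick illustration of the contrast (the softmax vector and both thresholds below are made up for the example, not calibrated values):

```python
import numpy as np

softmax = np.array([0.30, 0.28, 0.22, 0.20])    # an uncertain prediction, no clear class

# Score 1: s(x, y) = 1 - softmax_y, so the set is {y : softmax_y >= 1 - q_hat}.
q_hat = 0.65                                    # hypothetical calibrated threshold
simple_set = np.where(softmax >= 1 - q_hat)[0]
print("1 - softmax set:", simple_set)           # empty: no class reaches 0.35

# (R)APS-style score: accumulate sorted softmax mass until it reaches q_hat,
# so the set grows exactly when the model is uncertain.
q_hat_aps = 0.90                                # hypothetical calibrated threshold
order = np.argsort(-softmax)
cumulative = np.cumsum(softmax[order])
k = int(np.searchsorted(cumulative, q_hat_aps)) + 1
aps_set = order[:k]
print("APS set:", aps_set)                      # all four classes are included
```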
Truly a video of education. Thank you for taking the time to explain this concept clearly.
Hey there! Nice video. I recently ran some experiments on conformal prediction distillation and I was curious if you had any ideas or suggestions. In a nutshell, the goal was to see if problem-specific conformal prediction guarantees (e.g. coverage in classification) could be instilled into a model without the calibration step. I tested some knowledge distillation techniques (e.g. KL divergence for a "soft" distillation loss), training a smaller model on a dataset of (X, Y) pairs, where X is the initial input and Y is a multiclass binary vector representing the prediction set from a calibrated prediction-set generator f on X. This approach seems to somewhat (?) work and is better than a naive model without any calibration, but it does not come close to preserving the coverage guarantees of actually performing the calibration step. Any suggestions or thoughts on how to expand on this?
Hmm, interesting thought! And sorry for the late response! In short, I don't think this is possible to provide a guarantee for without the calibration step. If you want to try something, I would try running a quantile regression trained to predict the residuals of your model. Conformal prediction can be thought about as an adjusted form of quantile regression anyway, so this is likely to work better if done in the right way.
@@anastasiosangelopoulos Thanks for the response! I'll try that
@@anastasiosangelopoulos Hi again, hope all is well. I was curious if you were familiar with any literature on learning a conformal score function? I am looking at classification. From my understanding, if the calibration set and X_test are IID, then for any arbitrary conformal score function S, if we construct C(X_test) = { y | S(X_test, y) <= q }, then C(X_test) will have coverage. Then, following similar logic to the "Learning Optimal Conformal Classifiers" (2021) paper, it seems possible to set the score function S as a learnable function of X and Y, and use that paper's method of splitting mini-batches into calibration / prediction sets and defining some fancy loss function to learn the score function. Do you know if there are any examples of other people doing this?
I'm a little confused: at 18:11, step 3 takes the scores < q_hat. Shouldn't it be > q_hat, according to 13:35?
It's correct as-is. There's a sign flip happening: the "conformal score" s(x,y) is 1 minus the softmax score from 13:35. It's very unfortunate that "score" is used to refer to both of these concepts, but they have a different sign. Sorry for the super-late response!
You can explain difficult concepts more clearly than most of my professors from undergrad!
Wonderful and concise tutorial! Could you please elaborate on what 'finite sample validity' means? Is it something like this: 'Given that the training data is finite, and a portion of it is calibration data, which is also finite but smaller, one can create a conformal framework around that 'small' sample and still achieve coverage of (1-alpha)?'
Sorry for the very late response! Finite-sample validity means: "Given a finite number of calibration data points, one can use that calibration data to run conformal prediction and still achieve a coverage guarantee of 1-alpha." Nothing about the training data is assumed. It can be finite or infinite.
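To make that concrete, here is a minimal sketch of the calibration step on synthetic data (random softmax vectors and labels standing in for a real model's calibration outputs; everything below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

# Synthetic stand-ins: 500 calibration softmax vectors over 10 classes, plus labels.
n, n_classes = 500, 10
logits = rng.standard_normal((n, n_classes))
smx_cal = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y_cal = rng.integers(0, n_classes, size=n)

# 1. Conformal scores: one minus the softmax assigned to the true class.
cal_scores = 1 - smx_cal[np.arange(n), y_cal]

# 2. The ceil((n+1)(1-alpha))/n empirical quantile of the calibration scores.
q_level = np.ceil((n + 1) * (1 - alpha)) / n
q_hat = np.quantile(cal_scores, q_level, method="higher")

# 3. Prediction set for a new softmax vector: all classes with mass >= 1 - q_hat.
smx_test = smx_cal[0]                       # placeholder "test" softmax vector
prediction_set = np.where(smx_test >= 1 - q_hat)[0]
print(prediction_set)
```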
Thanks for the video; it is very accessible despite the complexity of the topic. I do have a question: what about applying conformal prediction to binary classification? My doubt regards the size of the prediction set: in binary classification we only have two labels, and I guess a prediction set of size greater than one would be useless.
Thank you for this lecture.
Thanks a lot for this super-simple, elegant expansion on a topic that appears daunting. Hats off to you guys.
Hey, great video! One portion I'm stuck on: now that I have a prediction set, what next? How do I get this down to a single prediction? Also, I am using an ensemble approach to my modeling, using multiple classification techniques and then averaging the softmax scores across the models to get my "final" prediction. Is this something I would do for each model, or just once the softmax scores are combined/averaged? Thanks!
Absolutely brilliant lecture series! Subscribed.
Beautiful indeed.
haha technically you guys are doctors, just not the kind cutting people open 😂
Thanks for the video and great work, Anastasios and Stephen. This video really helped me to understand the concept behind conformal prediction quickly. I do have one question regarding CP for regression models. In the video, you mentioned training quantile regression models for (alpha/2) and (1-alpha/2). While this is possible by re-training the model with the pinball loss (I assume the process remains the same for non-NN models as well), is there a way to get this quantile regression without re-training the model? I am particularly concerned with cases where (1) re-training is very costly, or (2) the model training is performed in an automated fashion (where changing the loss function is not possible).
You can train a new quantile regression on the errors of the pre-trained model!
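One way this could look in code (a rough sketch of my own, with a made-up frozen point predictor and synthetic data): fit a quantile-loss regressor to the absolute residuals of the frozen model, then conformalize that residual quantile on held-out data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
f = lambda X: X[:, 0]                                   # stand-in pre-trained point predictor

# Synthetic data, split into a residual-fitting part and a calibration part.
X = rng.normal(size=(2000, 3))
y = X[:, 0] + (1 + np.abs(X[:, 1])) * rng.normal(size=2000)
X_fit, y_fit, X_cal, y_cal = X[:1000], y[:1000], X[1000:], y[1000:]

alpha = 0.1
# Quantile regression for the (1 - alpha) quantile of |y - f(x)|, with f held fixed.
res_model = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha)
res_model.fit(X_fit, np.abs(y_fit - f(X_fit)))

# Conformalize: score = |y - f(x)| - predicted residual quantile.
n = len(y_cal)
scores = np.abs(y_cal - f(X_cal)) - res_model.predict(X_cal)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
q_hat = np.quantile(scores, q_level, method="higher")

# Interval for a new x: f(x) +/- (res_model.predict(x) + q_hat), which inherits
# the usual marginal coverage guarantee since both models are fixed at calibration time.
x_new = X_cal[:1]
half_width = res_model.predict(x_new) + q_hat
print(f(x_new) - half_width, f(x_new) + half_width)
```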
@@anastasiosangelopoulos thanks for the tip, that would also work !
Very well done! I read your "Gentle Introduction.." report as well, equally well done. Unfortunately I am stuck behind the academic paywall when it comes to your ref [41], V. Vovk, "Cross-conformal predictors". Hence I figured I'd post my question here: I am considering using cross-validation (or nested CV) to obtain out-of-fold predictions for a train set, then using these as calibration data to calculate nonconformity_scores. I am withholding a test set outside the CV, on which I aim to evaluate the final model and coverage. If all looks good, I retrain the model on all the training data, but I am tempted to keep my nonconformity_scores. Furthermore, I could then get additional scores from the "unseen" test set using the final model. Combining these test scores with the previous out-of-fold scores, I obtain scores for all the data, betting that my outer-fold models produce errors comparable to the final version. Does this make sense? If not, how do you recommend making use of CV procedures in relation to conformal prediction?
I would read this paper for a comprehensive solution to CV-type procedures with conformal guarantees: www.stat.cmu.edu/~ryantibs/papers/jackknife.pdf
what do you guys use for generating these slides? I really like the font and layout
Goodnotes, and we write by hand!
Very interesting and informative!
Excellent lecture!
Thanks!
Awesome! I have been working on clinical trial outcome prediction recently. Do you think it is sensible or necessary to apply conformal prediction to a binary classification problem? I ask because sometimes I doubt the value of doing this. Well, a binary classification has only two classes and, intuitively, it's kind of "small". And if it is worth doing, is the method in the video suitable, or do you think there are other ways of quantifying the uncertainty for a two-class classification?
For binary problems, I generally recommend conformal selective classification. See the relevant section in the Gentle Intro on arXiv!
Fantastic video, thank you!
Great tutorials on conformal prediction, thanks a lot! I have a question regarding the algorithm to evaluate the performance. There, in line 4, you compute 1 - D_cal.max(axis=1). Shouldn't the score be defined as in your first tutorial, as 1 - D_cal[np.arange(len(y_cal)), y_cal], i.e., the estimated probabilities of the true labels? Let's say we have a classifier which assigns the wrong class a probability of almost 1. Then the scores are close to zero, and hence so is our quantile. In line 6, the rhs would be close to one, and hence only false predictions would be added to the set. So 'covered' computed in line 7 will mostly be false. Or am I missing something here?
up
how is it gentle introduction?
It is gentle 🐑 as a bedtime story 😪😴📖
First, thank you for the clear explanation. Second, I have a remark/question. I have been reading about statistical analysis recently, and how there is a recurring problem surrounding the confidence interval interpretation. One should not say: 'There is a 90% chance that our interval holds the true value of our statistic of interest', but instead: '90% of the intervals built with our method will hold the true value'. The mathematical reason for this is detailed in "The fallacy of placing confidence in confidence intervals". In practice, this mistake can lead to underestimating uncertainty for important decisions (e.g. cancer diagnosis). So here is my question: have you considered whether there is the same problem for conformal prediction intervals/sets (e.g. is it truly okay to say that our prediction interval has a 90% chance of containing the true value)? Have a nice day 😁
Good question! Here, it's a bit more complicated: the intervals are random, but actually, (X_{test}, Y_{test}) is ALSO random. So the standard interpretation of a confidence interval isn't exactly right here. In reality, you have to say that with probability at least 1-alpha, the new (random) test label lands in the (random) prediction set. The probability is over BOTH the test point and the calibration dataset. Usually we abbreviate this long story and just say "there's a 90% chance the ground truth lands in the interval."
Great presentation! I'm curious, when applied to a multi-label problem, is the quantile function just setting a new decision threshold? The quantile is basically a decision threshold for which labels are predicted. And given an alpha, the quantile is a constant for all new predictions regardless of their matrix of dependent variables. So isn't this process basically just updating your decision threshold to account for a desired coverage probability?
Yes, that’s right! If you want to incorporate dependencies, you can build them into the score function too, but at the end of the day, no matter what you do it’s just thresholding :)
@@anastasiosangelopoulos Thanks for the timely response!
So I know that to get the upper and lower quantiles for regression, you use the pinball loss function. But does the loss for the point/mean prediction in regression have to be a specific kind of loss? Or can we use MSE, MAE, or CE?
If you’re making a point prediction, you can use whatever you want! The pinball loss is only for the purpose of quantile prediction.
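For reference, the pinball loss itself is only a couple of lines; a minimal numpy sketch (names are my own):

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Average pinball loss; minimized when y_pred is the tau-quantile of y_true."""
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

# Sanity check: at tau = 0.9, the empirical 0.9-quantile scores better than the median.
y = np.random.default_rng(0).standard_normal(10_000)
print(pinball_loss(y, np.quantile(y, 0.9), tau=0.9))   # smaller
print(pinball_loss(y, np.quantile(y, 0.5), tau=0.9))   # larger
```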
@@anastasiosangelopoulos great thank you, anastasios!
@@jeffreyalidochair my pleasure!
Extremely helpful, thanks so much!
Thank you so much for this informative and nice presentation. I am looking for resources on how the end user benefits from a prediction set and makes a decision based on this set in a typical classification problem.
Very informative video. Thanks
Great video! Anyone recognize the app they are using?
We use GoodNotes!
does this conformal prediction method for regression take covariance into account?
Yes, the covariance between X and Y is normally factored in by the quantile regression model.
is there conformal prediction for regression problems?
oops, didn't watch long enough before commenting
😂 nice! 😊
Excuse me for being naive, but how do you get the prediction intervals/sets on unknown data with no labels? I understand that you calibrate the probabilities with the calibration set, but how do I know the prediction interval for model prediction yhat(X_i) where the ground truth y_i is not known?
In classification, the model gives you a softmax vector, and the prediction set is all classes with a high enough softmax value. More generally, at prediction time, the model gives you some heuristic notion of uncertainty that you use to build a set. Hope this helps!
Hi, really amazing explanation. I had a question: how do you do it for binary classification?
See the "selective classification" setting of the gentle intro: arxiv.org/abs/2107.07511
@@anastasiosangelopoulos Thank you
@@anastasiosangelopoulos I want to use it for my predictions from GCNs but I am confused with the code given. Is there a resource you can point me towards? That'd be helpful
@@mughairamir9200 Have you seen this notebook? github.com/aangelopoulos/conformal-prediction/blob/main/notebooks/imagenet-selective-classification.ipynb
@@anastasiosangelopoulos Yes, that's the one. I want to use it for the outputs of a GCN model for node prediction task
Very good video. I did enjoy it, thanks. With regard to adaptive prediction sets, in the case of binary classification problems, E_i will either be <= 1 (the value of the first score in the list of sorted scores) or = 1. Is my intuition correct? I am also trying to figure out what the p-values would be for a test example from a Mondrian conformal predictor. An explanation of that would be very appreciated.
Yes, the intuition is correct - but for binary classification you might instead look at selective classification (check out the GitHub repo) instead of APS. Regarding the Mondrian question, the p value will have the same form, but it will depend only on the subgroup across which you are stratifying.
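A small sketch of the group-stratified (Mondrian) calibration, with synthetic scores and group labels purely for illustration: compute a separate conformal quantile within each group, and threshold a test point using the quantile of its own group.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

cal_scores = rng.uniform(size=1000)        # stand-in conformal scores
groups = rng.integers(0, 3, size=1000)     # stand-in group memberships

q_hat_by_group = {}
for g in np.unique(groups):
    s_g = cal_scores[groups == g]
    n_g = len(s_g)
    q_level = np.ceil((n_g + 1) * (1 - alpha)) / n_g
    q_hat_by_group[g] = np.quantile(s_g, q_level, method="higher")

print(q_hat_by_group)
# For a test point in group g, include label y iff s(x, y) <= q_hat_by_group[g];
# this targets coverage within each group rather than just marginally.
```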
@@anastasiosangelopoulos Thanks for your answer.
Please what writing tool is he using?
We're using GoodNotes on iPad!
📣 Code for conformal prediction on real data! github.com/aangelopoulos/conformal-prediction The new codebase is part of a huge update to the gentle intro document: arxiv.org/abs/2107.07511. It includes Imagenet classification, MS-COCO multilabel classification, time-series regression, conformalized quantile regression on medical data, and much more! Leave a ⭐ if you enjoy it :)
I'm ill right now so I'm missing Volodymyr Vovk's lecture on this and I'm watching this video instead, thanks for making it!
Haha! Don't tell Vovk ;) Hope you feel better. Are you a RHUL student?
@@anastasiosangelopoulos haha, I'll keep it quiet. I am from RHUL though!
These videos are a fabulous addition to the tutorial document - thank you! Will you add an example Colab notebook to your git repo for the object detection case?
It's possible eventually! Right now it's not on the front-burner to put in the Gentle Intro. However, we have a Colab for it on the LTT codebase here: github.com/aangelopoulos/ltt
@@anastasiosangelopoulos that's great, thanks - I'd somehow missed that repo.
Hi @Anastasios, I am going through the video and am quite eager to try it out. First question: does this method handle scenarios where a CNN predicts incorrectly but with high softmax scores?
Conformal will always work, for any model. If your model is often confidently incorrect, then the sets will account for that by growing such that they obtain the marginal coverage guarantee. However, in the subgroup with confident misclassifications, it may not obtain correct coverage. That subgroup is close to adversarial and will probably have poor coverage, even if the marginal guarantee is satisfied.
Just what I needed for my master's degree proposal! Full of ideas thanks to you guys! :)
Happy to help! 🎓
First of all, thank you for the great presentation! I have some questions concerning your examples for the classification tasks. 1. Did I understand correctly that it is necessary for the heuristic output (in your case the softmax output) to add up to the same value (1 in your case) for all data points when all classes of the label space are considered? 2. I tried to adapt your method with the adaptive prediction sets to my binary classification problem at hand (more precisely, it is an anomaly detection problem). I observed that the q threshold value becomes very high, with the result that in almost all cases the prediction set contains both classes. Obviously this doesn't have much added value for me. I'd like to outline my approach. I used a Variational Autoencoder and trained it with normal data only. I have a mixed test set and computed the reconstruction probability (RecProb) for every data point. I was able to find an optimal decision threshold for anomaly detection by maximizing the Matthews Correlation Coefficient, but unfortunately the RecProbs of anomalous data points are only marginally greater than those of normal data points, so there is a significant amount of overlap. That's why I am looking for an (ideally not very complicated) approach to indicate the uncertainty of my model in some way. Now I thought about doing some kind of comparison of the relative frequencies for normal and anomalous data for a given RecProb by looking at the corresponding histogram bin. It would be very helpful if you could give me some advice. Many thanks in advance and keep up the great work! Regards, Tobias.
1. That's actually not needed! They sum to one in this example, but that's not actually required for conformal to work. (See arxiv.org/abs/2009.14193 for an example score where the regularized probabilities do not sum to 1.) 2. In binary classification, prediction sets are not usually useful. Instead, you might try selective classification (i.e., the model learns to say "I don't know", in such a way that it achieves an accuracy of, say, 95% when it chooses to speak, even if its marginal accuracy is lower). We'll soon release a V3 of the gentle intro that describes how to do this; for now, you can see how to do distribution-free selective classification here: arxiv.org/abs/2110.01052
Thank you very much Anastasios!
Thanks for this video. Very informational and to the point.
Fantastic presentation! I was captivated from start to finish
Really a wonderful complement to your paper "Gentle Intro to Conformal Prediction...". One question: for each of the example algorithms, you create a three-box summary of what the algorithm does. In the first of these three boxes (for each of the algorithms), you have "score" on the y-axis of the histogram. However, I feel as though this should be named "heuristic output" or, for example, "softmax output" (for multi-class classification). To me, "score" means the "E" in that first box, i.e., it is defined differently for each algorithm, and encodes the properties we want the prediction set function to have. Correct me if I'm wrong! I may be misunderstanding. Thanks again!
Yeah, that's right. If I could go back and edit the slides, I'd put "softmax output" on the Y axis of the first box. It's unfortunate that the "softmax score" language clashes with the "conformal score" language. In this case we meant the first.
@@anastasiosangelopoulos No worries at all, I just wanted to clarify so that I wouldn't be under the wrong impression. Seriously though, I admire this type of extra length taken to spread your knowledge on the subject. I haven't seen a better presentation in a long time.
@@paulscemama6517 Thank you so much😊