- 4 videos
- 72,325 views
Anastasios Nikolas Angelopoulos
USA
Joined Dec 16, 2019
I am Anastasios Nikolas Angelopoulos, a fourth-year Ph.D. student at the University of California, Berkeley.
I work on theoretical machine learning with applications in vision and healthcare. My goal is to apply modern statistical ideas to increase robustness of black-box models like deep neural networks. I am motivated by medical diagnostics: statistical reliability will become paramount as computer vision and machine learning become ubiquitous in such high-risk settings. My other applied interests include computational imaging and ophthalmology.
I am privileged to be advised by Michael I. Jordan and Jitendra Malik. From 2016 to 2019, I was an electrical engineering student at Stanford University advised by Gordon Wetzstein and Stephen P. Boyd. See my website below.
people.eecs.berkeley.edu/~angelopoulos/
Tutorial on Conformal Prediction Part 3: Beyond Conformal Prediction
Here we discuss our recent works on controlling error rates other than coverage --- such as the false-discovery rate, intersection-over-union, and OOD detection metrics. For this, we need tools beyond conformal prediction. (Re-upload, the old one was missing the Detectron2 video.)
Views: 6,920
Videos
A Tutorial on Conformal Prediction Part 2: Conditional Coverage and Diagnostics
13K views · 2 years ago
This video tutorial on conformal prediction follows a document (linked below) that we wrote to teach people conformal prediction and distribution-free uncertainty quantification. Here we focus on the practical aspects of conformal prediction, such as marginal vs. conditional coverage, as well as diagnostics to make sure your conformal procedure is correct and effective. You do not need to be a...
Event-based, 10KHz eye tracking
1.6K views · 3 years ago
Presentation for IEEEVR2021. Project website: www.computationalimaging.org/publications/event-based-eye-tracking/ Dataset on GitHub: github.com/aangelopoulos/event_based_gaze_tracking
A Tutorial on Conformal Prediction
51K views · 3 years ago
This video tutorial on conformal prediction follows a document (pdf link below) we wrote that is meant to teach people conformal prediction and distribution-free uncertainty quantification. The document is a hands-on introduction for a reader interested in the practical implementation of distribution-free UQ, who is not necessarily a statistician. We included many explanatory illustrations, exa...
Nicely organized and neat introduction. Thanks!
Hi, my question is why it makes sense to assume that X and Y come from a joint distribution P, which to me at least implies that X is random. In the frequentist settings I am familiar with, like regression, one assumes that the predictors X are deterministic and only y is random, i.e., the conditional expectation of y given x is modelled. Can anyone explain the intuition behind this assumption? Is it just a more general assumption that doesn't contradict the frequentist view?
Hey! Good question. There are certainly some settings where you model X as fixed, or even analyze worst-case behavior over X, but that’s a harder problem setup. Random X is also a standard frequentist setup, although it’s easier than the one you brought up. (Actually most of the standard frequentist theory on, e.g., M-estimation is done using random X.) This assumption might be suitable when the inputs to your algorithm can be thought of as coming from a consistent population.
Thank you
Your videos are such great resources, thank you! I am just wondering, how do we know the theoretical distribution that the coverage should follow? Is there any resource that explains this in more detail? Thanks!
One good resource is this paper: arxiv.org/abs/1209.2673 Informally, you can think of the conformal scores as being uniformly distributed on the coverage scale, and then apply the fact that order statistics of uniformly distributed random variables are beta distributed: en.wikipedia.org/wiki/Order_statistic.
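If it helps to see it concretely, here is a small simulation sketch (my own, with arbitrary choices of n and alpha): the coverage attained by the split-conformal quantile, viewed as a random variable over the draw of the calibration set, traces out exactly that Beta distribution.

```python
import numpy as np
from scipy import stats

n, alpha, n_trials = 100, 0.1, 20_000
rng = np.random.default_rng(0)
k = int(np.ceil((n + 1) * (1 - alpha)))          # index of the conformal quantile

coverages = []
for _ in range(n_trials):
    cal_scores = rng.standard_normal(n)          # stand-in conformal scores ~ N(0, 1)
    q_hat = np.sort(cal_scores)[k - 1]           # k-th smallest calibration score
    coverages.append(stats.norm.cdf(q_hat))      # true coverage of {score <= q_hat}

# The k-th order statistic of n uniforms is Beta(k, n + 1 - k) distributed.
print("empirical mean coverage:", np.mean(coverages))
print("Beta(k, n+1-k) mean    :", k / (n + 1))
```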
@@anastasiosangelopoulos Thanks a lot!
The hierarchical clustering and its relation to the Learn Then Test framework (LTT) was slightly unclear to me. Am I right in thinking the general procedure looks something like adaptive prediction sets for classification, but instead of moving through individual labels ordered by probability mass to build your quantile, you consider a sequence of larger and larger subtrees? Or do you consider a threshold such that if a subtree has total mass greater than the threshold value you accept it, and then employ the LTT framework directly to select a risk-controlled threshold value?
Well, both of these would be possible! The first one works if your risk function is coverage (i.e., if you're looking to make "high-accuracy" hierarchical predictions). When your risk function is not coverage, you can't reframe it in terms of conformal scores, so you need to do this "threshold sweeping" thing. Conformal risk control tells you the right way to do it for a monotone risk, and LTT gives a high-probability version for a non-monotone risk.
@@anastasiosangelopoulos Great thanks for such an informative reply and an amazing series!
Is the small diagram Anastasios draws at ~30:00 slightly incorrect? Shouldn't the quantile line be horizontal?
Good point, the diagram is a little weird. If we were on the scale of the CDF, it would be a horizontal line. Here, I'm trying to get across the idea that you should stop after the cumulative mass of the bars hits \hat{q}. (So it's more like a horizontal line if you take the _integral_ of the X axis on that plot.) :) thanks for the question!
@@anastasiosangelopoulos Thanks for the response! The point you were conveying was super clear in any case - great tutorial!
Hi! At minute 13:00, with the first algorithm, in cases where no clear class is detected it could happen that no class has a score greater than q_hat, so the output is an empty set. But in cases of uncertainty I would expect a bigger set; is this the expected behavior?
Yep, for this particular score function, that's the expected behavior! It's a little bit weird, but when you output a size-zero set, that decreases your _average_ set size! 😅 If you read the gentle introduction document, we also talk about the (R)APS score, which is usually a better solution for classification. That score has the more intuitive behavior of growing the set whenever there is no clear class.
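A quick illustration of the contrast (the softmax vector and both thresholds below are made up for the example, not calibrated values):

```python
import numpy as np

softmax = np.array([0.30, 0.28, 0.22, 0.20])    # an uncertain prediction, no clear class

# Score 1: s(x, y) = 1 - softmax_y, so the set is {y : softmax_y >= 1 - q_hat}.
q_hat = 0.65                                    # hypothetical calibrated threshold
simple_set = np.where(softmax >= 1 - q_hat)[0]
print("1 - softmax set:", simple_set)           # empty: no class reaches 0.35

# (R)APS-style score: accumulate sorted softmax mass until it reaches q_hat,
# so the set grows exactly when the model is uncertain.
q_hat_aps = 0.90                                # hypothetical calibrated threshold
order = np.argsort(-softmax)
cumulative = np.cumsum(softmax[order])
k = int(np.searchsorted(cumulative, q_hat_aps)) + 1
aps_set = order[:k]
print("APS set:", aps_set)                      # all four classes are included
```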
Truly a video of education. Thank you for taking the time to explain this concept clearly.
Hey there! Nice video. I recently ran some experiments on conformal prediction distillation and I was curious if you had any ideas or suggestions. In a nutshell, the goal was to see if problem-specific conformal prediction guarantees (e.g. coverage in classification) could be instilled into a model without the calibration step. I tested some knowledge distillation techniques (e.g. KL divergence for a "soft" distillation loss), training a smaller model on a dataset of (X, Y) pairs, where X is the initial input and Y is a multiclass binary vector representing the prediction set from a calibrated prediction-set generator f on X. This approach seems to somewhat (?) work and is better than a naive model without any calibration, but it does not come close to preserving the coverage guarantees of actually performing the calibration step. Any suggestions or thoughts on how to expand on this?
Hmm, interesting thought! And sorry for the late response! In short, I don't think this is possible to provide a guarantee for without the calibration step. If you want to try something, I would try running a quantile regression trained to predict the residuals of your model. Conformal prediction can be thought about as an adjusted form of quantile regression anyway, so this is likely to work better if done in the right way.
@@anastasiosangelopoulos Thanks for the response! I'll try that
@@anastasiosangelopoulos Hi again, hope all is well. I was curious if you were familiar with any literature on learning a conformal score function? I am looking at classification. From my understanding, if the calibration set and X_test are IID, then for any arbitrary conformal score function S, if we construct C(X_test) = { y | S(X_test, y) <= q }, then C(X_test) will have coverage. Then, following similar logic to the "Learning Optimal Conformal Classifiers" (2021) paper, it seems possible to set the score function S as a learnable function of X and Y, and use that paper's method of splitting mini-batches into calibration / prediction sets and defining some fancy loss function to learn the score function. Do you know if there are any examples of other people doing this?
I'm a little confused: at 18:11, step 3 takes the scores < q_hat. Shouldn't it be > q_hat, according to 13:35?
It's correct as-is. There's a sign flip happening: the "conformal score" s(x,y) is 1 minus the softmax score from 13:35. It's very unfortunate that "score" is used to refer to both of these concepts, but they have a different sign. Sorry for the super-late response!
You can explain difficult concepts more clearly than most of my professors from undergrad!
Wonderful and concise tutorial! Could you please elaborate on what 'finite sample validity' means? Is it something like this: 'Given that the training data is finite, and a portion of it is calibration data, which is also finite but smaller, one can create a conformal framework around that 'small' sample and still achieve coverage of (1-alpha)?'
Sorry for the very late response! Finite-sample validity means: "Given a finite number of calibration data points, one can use that calibration data to run conformal prediction and still achieve a coverage guarantee of 1-alpha." Nothing about the training data is assumed. It can be finite or infinite.
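To make that concrete, here is a minimal sketch of the calibration step on synthetic data (random softmax vectors and labels standing in for a real model's calibration outputs; everything below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

# Synthetic stand-ins: 500 calibration softmax vectors over 10 classes, plus labels.
n, n_classes = 500, 10
logits = rng.standard_normal((n, n_classes))
smx_cal = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y_cal = rng.integers(0, n_classes, size=n)

# 1. Conformal scores: one minus the softmax assigned to the true class.
cal_scores = 1 - smx_cal[np.arange(n), y_cal]

# 2. The ceil((n+1)(1-alpha))/n empirical quantile of the calibration scores.
q_level = np.ceil((n + 1) * (1 - alpha)) / n
q_hat = np.quantile(cal_scores, q_level, method="higher")

# 3. Prediction set for a new softmax vector: all classes with mass >= 1 - q_hat.
smx_test = smx_cal[0]                       # placeholder "test" softmax vector
prediction_set = np.where(smx_test >= 1 - q_hat)[0]
print(prediction_set)
```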
Thanks for the video; it is very accessible despite the complexity of the topic. I do have a question: what about applying conformal prediction to binary classification? My doubt regards the size of the prediction set: in binary classification we only have two labels, and I guess a prediction set of size greater than one would be useless.
Thank you for this lecture.
Thanks a lot for this super-simple, elegant expansion on a topic that appears daunting. Hats off to you guys.
Hey, great video! One portion I'm stuck on: now that I have a prediction set, what next? How do I get this down to a single prediction? Also, I am using an ensemble approach to my modeling, using multiple classification techniques and then averaging the softmax scores across the models to get my "final" prediction. Is this something I would do for each model, or just once the softmax scores are combined/averaged? Thanks!
Absolutely brilliant lecture series! Subscribed.
Beautiful indeed.
haha technically you guys are doctors, just not the kind cutting people open 😂
Thanks for the video and great work, Anastasios and Stephen. This video really helped me to understand the concept behind conformal prediction quickly. I do have one question regarding CP for regression models. In the video, you mentioned training quantile regression models for (alpha/2) and (1-alpha/2). While this is possible by re-training the model with the pinball loss (I assume the process remains the same for non-NN models as well), is there a way to get this quantile regression without re-training the model? I am particularly concerned with cases where (1) re-training is very costly, or (2) the model training is performed in an automated fashion (where changing the loss function is not possible).
You can train a new quantile regression on the errors of the pre-trained model!
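One way this could look in code (a rough sketch of my own, with a made-up frozen point predictor and synthetic data): fit a quantile-loss regressor to the absolute residuals of the frozen model, then conformalize that residual quantile on held-out data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
f = lambda X: X[:, 0]                                   # stand-in pre-trained point predictor

# Synthetic data, split into a residual-fitting part and a calibration part.
X = rng.normal(size=(2000, 3))
y = X[:, 0] + (1 + np.abs(X[:, 1])) * rng.normal(size=2000)
X_fit, y_fit, X_cal, y_cal = X[:1000], y[:1000], X[1000:], y[1000:]

alpha = 0.1
# Quantile regression for the (1 - alpha) quantile of |y - f(x)|, with f held fixed.
res_model = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha)
res_model.fit(X_fit, np.abs(y_fit - f(X_fit)))

# Conformalize: score = |y - f(x)| - predicted residual quantile.
n = len(y_cal)
scores = np.abs(y_cal - f(X_cal)) - res_model.predict(X_cal)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
q_hat = np.quantile(scores, q_level, method="higher")

# Interval for a new x: f(x) +/- (res_model.predict(x) + q_hat), which inherits
# the usual marginal coverage guarantee since both models are fixed at calibration time.
x_new = X_cal[:1]
half_width = res_model.predict(x_new) + q_hat
print(f(x_new) - half_width, f(x_new) + half_width)
```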
@@anastasiosangelopoulos thanks for the tip, that would also work !
Very well done! I read your "Gentle Introduction.." report as well, equally well done. Unfortunately I am stuck behind the academic paywall when it comes to your ref [41], V. Vovk, "Cross-conformal predictors". Hence I figured I'd post my question here: I am considering using cross-validation (or nested CV) to obtain out-of-fold predictions for a train set, then using these as calibration data to calculate nonconformity_scores. I am withholding a test set outside the CV, on which I aim to evaluate the final model and coverage. If all looks good, I retrain the model on all the training data, but I am tempted to keep my nonconformity_scores. Furthermore, I could then get additional scores from the "unseen" test set using the final model. Combining these test scores with the previous out-of-fold scores, I obtain scores for all the data, betting that my outer-fold models produce errors comparable to the final version. Does this make sense? If not, how do you recommend making use of CV procedures in relation to conformal prediction?
I would read this paper for a comprehensive solution to CV-type procedures with conformal guarantees: www.stat.cmu.edu/~ryantibs/papers/jackknife.pdf
what do you guys use for generating these slides? I really like the font and layout
Goodnotes, and we write by hand!
Very interesting and informative!
Excellent lecture!
Thanks!
Awesome! I have been working on clinical trial outcome prediction recently. Do you think it is sensible or necessary to apply conformal prediction to a binary classification problem? I ask because sometimes I doubt the value of doing this. Well, a binary classification has only two classes and, intuitively, it's kind of "small". And if it is worth doing, is the method in the video suitable, or do you think there are other ways of quantifying the uncertainty for a two-class classification?
For binary problems, I generally recommend conformal selective classification. See the relevant section in the Gentle Intro on arXiv!
Fantastic video, thank you!
Great tutorials on conformal prediction, thanks a lot! I have a question regarding the algorithm to evaluate the performance. There, in line 4, you compute 1 - D_cal.max(axis=1). Shouldn't the score be defined as in your first tutorial, as 1 - D_cal[np.arange(len(y_cal)), y_cal], i.e., the estimated probabilities of the true labels? Let's say we have a classifier which assigns the wrong class a probability of almost 1. Then the scores are close to zero, and hence so is our quantile. In line 6, the rhs would be close to one, and hence only false predictions would be added to the set. So 'covered' computed in line 7 will mostly be false. Or am I missing something here?
up
how is it gentle introduction?
It is gentle 🐑 as a bedtime story 😪😴📖
First, thank you for the clear explanation. Second, I have a remark/question. I have been reading about statistical analysis recently, and how there is a recurring problem surrounding the confidence interval interpretation. One should not say: 'There is a 90% chance that our interval holds the true value of our statistic of interest', but instead: '90% of the intervals built with our method will hold the true value'. The mathematical reason for this is detailed in "The fallacy of placing confidence in confidence intervals". In practice, this mistake can lead to underestimating uncertainty for important decisions (e.g. cancer diagnosis). So here is my question: have you considered whether there is the same problem for conformal prediction intervals/sets (e.g. is it truly okay to say that our prediction interval has a 90% chance of containing the true value)? Have a nice day 😁
Good question! Here, it's a bit more complicated: the intervals are random, but actually, (X_{test}, Y_{test}) is ALSO random. So the standard interpretation of a confidence interval isn't exactly right here. In reality, you have to say that with probability at least 1-alpha, the new (random) test label lands in the (random) prediction set. The probability is over BOTH the test point and the calibration dataset. Usually we abbreviate this long story and just say "there's a 90% chance the ground truth lands in the interval."
Great presentation! I'm curious, when applied to a multi-label problem, is the quantile function just setting a new decision threshold? The quantile is basically a decision threshold for which labels are predicted. And given an alpha, the quantile is a constant for all new predictions regardless of their matrix of dependent variables. So isn't this process basically just updating your decision threshold to account for a desired coverage probability?
Yes, that’s right! If you want to incorporate dependencies, you can build them into the score function too, but at the end of the day, no matter what you do it’s just thresholding :)
@@anastasiosangelopoulos Thanks for the timely response!
So I know that to get the upper and lower quantiles for regression, you use the pinball loss function. But does the loss for the point/mean prediction in regression have to be a specific kind of loss? Or can we use MSE, MAE, or CE?
If you’re making a point prediction, you can use whatever you want! The pinball loss is only for the purpose of quantile prediction.
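For reference, the pinball loss itself is only a couple of lines; a minimal numpy sketch (names are my own):

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Average pinball loss; minimized when y_pred is the tau-quantile of y_true."""
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

# Sanity check: at tau = 0.9, the empirical 0.9-quantile scores better than the median.
y = np.random.default_rng(0).standard_normal(10_000)
print(pinball_loss(y, np.quantile(y, 0.9), tau=0.9))   # smaller
print(pinball_loss(y, np.quantile(y, 0.5), tau=0.9))   # larger
```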
@@anastasiosangelopoulos great thank you, anastasios!
@@jeffreyalidochair my pleasure!
Extremely helpful, thanks so much!
Thank you so much for this informative and nice presentation. I am looking for resources on how the end user benefits from a prediction set and makes a decision based on this set in a typical classification problem.
Very informative video. Thanks
Great video! Anyone recognize the app they are using?
We use GoodNotes!
does this conformal prediction method for regression take covariance into account?
Yes, the covariance between X and Y is normally factored in by the quantile regression model.
is there conformal prediction for regression problems?
oops, didn't watch long enough before commenting
😂 nice! 😊
Excuse me for being naive, but how do you get the prediction intervals/sets on unknown data with no labels? I understand that you calibrate the probabilities with the calibration set, but how do I know the prediction interval for model prediction yhat(X_i) where the ground truth y_i is not known?
In classification, the model gives you a softmax vector, and the prediction set is all classes with a high enough softmax value. More generally, at prediction time, the model gives you some heuristic notion of uncertainty that you use to build a set. Hope this helps!
Hi, really amazing explanation. I had a question: how do you do it for binary classification?
See the "selective classification" setting of the gentle intro: arxiv.org/abs/2107.07511
@@anastasiosangelopoulos Thank you
@@anastasiosangelopoulos I want to use it for my predictions from GCNs but I am confused with the code given. Is there a resource you can point me towards? That'd be helpful
@@mughairamir9200 Have you seen this notebook? github.com/aangelopoulos/conformal-prediction/blob/main/notebooks/imagenet-selective-classification.ipynb
@@anastasiosangelopoulos Yes, that's the one. I want to use it for the outputs of a GCN model for node prediction task
Very good video. I did enjoy it, thanks. With regard to adaptive prediction sets, in the case of binary classification problems, E_i will either be <= 1 (the value of the first score in the list of sorted scores) or = 1. Is my intuition correct? I am also trying to figure out what the p-values would be for a test example from a Mondrian conformal predictor. An explanation of that would be very appreciated.
Yes, the intuition is correct - but for binary classification you might instead look at selective classification (check out the GitHub repo) instead of APS. Regarding the Mondrian question, the p value will have the same form, but it will depend only on the subgroup across which you are stratifying.
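A small sketch of the group-stratified (Mondrian) calibration, with synthetic scores and group labels purely for illustration: compute a separate conformal quantile within each group, and threshold a test point using the quantile of its own group.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

cal_scores = rng.uniform(size=1000)        # stand-in conformal scores
groups = rng.integers(0, 3, size=1000)     # stand-in group memberships

q_hat_by_group = {}
for g in np.unique(groups):
    s_g = cal_scores[groups == g]
    n_g = len(s_g)
    q_level = np.ceil((n_g + 1) * (1 - alpha)) / n_g
    q_hat_by_group[g] = np.quantile(s_g, q_level, method="higher")

print(q_hat_by_group)
# For a test point in group g, include label y iff s(x, y) <= q_hat_by_group[g];
# this targets coverage within each group rather than just marginally.
```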
@@anastasiosangelopoulos Thanks for your answer.
Please what writing tool is he using?
We're using GoodNotes on iPad!
📣 Code for conformal prediction on real data! github.com/aangelopoulos/conformal-prediction The new codebase is part of a huge update to the gentle intro document: arxiv.org/abs/2107.07511. It includes Imagenet classification, MS-COCO multilabel classification, time-series regression, conformalized quantile regression on medical data, and much more! Leave a ⭐ if you enjoy it :)
I'm ill right now so I'm missing Volodymyr Vovk's lecture on this and I'm watching this video instead, thanks for making it!
Haha! Don't tell Vovk ;) Hope you feel better. Are you a RHUL student?
@@anastasiosangelopoulos haha, I'll keep it quiet. I am from RHUL though!
These videos are a fabulous addition to the tutorial document - thank you! Will you add an example Colab notebook to your git repo for the object detection case?
It's possible eventually! Right now it's not on the front-burner to put in the Gentle Intro. However, we have a Colab for it on the LTT codebase here: github.com/aangelopoulos/ltt
@@anastasiosangelopoulos that's great, thanks - I'd somehow missed that repo.
Hi @Anastasios, I am going through the video and am quite eager to try it out. First question: does this method handle scenarios where a CNN predicts incorrectly but with high softmax scores?
Conformal will always work, for any model. If your model is often confidently incorrect, then the sets will account for that by growing such that they obtain the marginal coverage guarantee. However, in the subgroup with confident misclassifications, it may not obtain correct coverage. That subgroup is close to adversarial and will probably have poor coverage, even if the marginal guarantee is satisfied.
Just what I needed for my master's degree proposal! Full of ideas thanks to you guys! :)
Happy to help! 🎓
First of all, thank you for the great presentation! I have some questions concerning your examples for the classification tasks. 1. Did I understand correctly that it is necessary for the heuristic output (in your case the softmax output) to add up to the same value (1 in your case) for all data points when all classes of the label space are considered? 2. I tried to adapt your method with the adaptive prediction sets to my binary classification problem at hand (more precisely, it is an anomaly detection problem). I observed that the q threshold value becomes very high, with the result that in almost all cases the prediction set contains both classes. Obviously this doesn't have much added value for me. I'd like to outline my approach. I used a Variational Autoencoder and trained it with normal data only. I have a mixed test set and computed the reconstruction probability (RecProb) for every data point. I was able to find an optimal decision threshold for anomaly detection by maximizing the Matthews Correlation Coefficient, but unfortunately the RecProbs of anomalous data points are only marginally greater than those of normal data points, so there is a significant amount of overlap. That's why I am looking for an (ideally not very complicated) approach to indicate the uncertainty of my model in some way. Now I thought about doing some kind of comparison of the relative frequencies for normal and anomalous data for a given RecProb by looking at the corresponding histogram bin. It would be very helpful if you could give me some advice. Many thanks in advance and keep up the great work! Regards, Tobias.
1. That's actually not needed! They sum to one in this example, but that's not actually required for conformal to work. (See arxiv.org/abs/2009.14193 for an example score where the regularized probabilities do not sum to 1.) 2. In binary classification, prediction sets are not usually useful. Instead, you might try selective classification (i.e., the model learns to say "I don't know", in such a way that it achieves an accuracy of, say, 95% when it chooses to speak, even if its marginal accuracy is lower). We'll soon release a V3 of the gentle intro that describes how to do this; for now, you can see how to do distribution-free selective classification here: arxiv.org/abs/2110.01052
Thank you very much Anastasios!
Thanks for this video. Very informational and to the point.
Fantastic presentation! I was captivated from start to finish
Really a wonderful complement to your paper "Gentle Intro to Conformal Prediction...". One question: for each of the example algorithms, you create a three-box summary of what the algorithm does. In the first of these three boxes (for each of the algorithms), you have "score" on the y-axis of the histogram. However, I feel as though this should be named "heuristic output" or, for example, "softmax output" (for multi-class classification). To me, "score" means the "E" in that first box, i.e., it is defined differently for each algorithm, and encodes the properties we want the prediction set function to have. Correct me if I'm wrong! I may be misunderstanding. Thanks again!
Yeah, that's right. If I could go back and edit the slides, I'd put "softmax output" on the Y axis of the first box. It's unfortunate that the "softmax score" language clashes with the "conformal score" language. In this case we meant the first.
@@anastasiosangelopoulos No worries at all, I just wanted to clarify so that I wouldn't be under the wrong impression. Seriously though, I admire this type of extra length taken to spread your knowledge on the subject. I haven't seen a better presentation in a long time.
@@paulscemama6517 Thank you so much😊