A Tutorial on Conformal Prediction

  • Published on Jan 5, 2025

Comments • 124

  • @anastasiosangelopoulos
    @anastasiosangelopoulos 2 years ago +19

    📣Code for conformal prediction on real data! github.com/aangelopoulos/conformal-prediction
    The new codebase is part of a huge update to the gentle intro document: arxiv.org/abs/2107.07511 . It includes Imagenet classification, MS-COCO multilabel classification, time-series regression, conformalized quantile regression on medical data, and much more! Leave a ⭐if you enjoy it :)

  • @nintishia
    @nintishia 11 months ago +4

    Thanks a lot for this super-simple, elegant expansion on a topic that appears daunting. Hats off to you guys.

  • @ak90clb
    @ak90clb 11 months ago +2

    You can explain difficult concepts more clearly than most of my professors from undergrad!

  • @bradhatch8302
    @bradhatch8302 8 months ago +1

    Truly a video of education. Thank you for taking the time to explain this concept clearly.

  • @srishtigureja6534
    @srishtigureja6534 2 years ago +7

    Thanks so much for this video! Conformal Prediction is really intuitive, and the way it's taught in the video made it even easier to grasp. Now on to the other two parts of the video. Excited to explore the current research directions in CP.

  • @psic-protosysintegratedcyb2422
    @psic-protosysintegratedcyb2422 2 years ago +3

    Anastasios, you are a great teacher! Well done!

  • @rahulvishwakarma4413
    @rahulvishwakarma4413 3 years ago +11

    Nice to see CP being explained in a simple, easy-to-understand way.
    Thanks for the presentation.

  • @anastasiosangelopoulos
    @anastasiosangelopoulos 3 years ago +8

    Hi everyone! I'm happy to see so many people are watching this video :)
    Please feel free to comment with any questions or remarks you may have. I get notified when you do, and will respond as soon as I can.
    We will be posting more videos soon --- subscribe to my channel if you want to see those when they come out.

    • @2011moser
      @2011moser 3 years ago +1

      Which tools are you guys using for the prezo?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 3 years ago +3

      @2011moser We did this presentation with an iPad, Apple Pencil 2, and an app called GoodNotes. We screen-recorded GoodNotes, and separately, we recorded ourselves using our computer webcams. Then we aligned the videos in Adobe Premiere.

  • @NoNTr1v1aL
    @NoNTr1v1aL 1 year ago +2

    Absolutely brilliant lecture series! Subscribed.

  • @Samkb92
    @Samkb92 1 year ago +2

    First, thank you for the clear explanation.
    Second, I have a remark/question. I have been reading about statistical analysis recently, and about a recurring problem surrounding the interpretation of confidence intervals.
    One should not say: 'There is a 90% chance that our interval holds the true value of our statistic of interest', but instead: '90% of the intervals built with our method will hold the true value'. The mathematical reason for this is detailed in "The fallacy of placing confidence in confidence intervals". In practice, this mistake can lead to underestimating uncertainty for important decisions (e.g. cancer diagnosis).
    So here is my question: have you considered whether the same problem exists for conformal prediction intervals/sets (e.g., is it truly okay to say that our prediction interval has a 90% chance of containing the true value)?
    Have a nice day 😁

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 1 year ago +1

      Good question! Here, it's a bit more complicated: the intervals are random, but actually, (X_{test}, Y_{test}) is ALSO random. So the standard interpretation of a confidence interval isn't exactly right here.
      In reality, you have to say that with probability at least 1-alpha, the new (random) test label lands in the (random) prediction set. The probability is over BOTH the test point and the calibration dataset.
      Usually we abbreviate this long story and just say "there's a 90% chance the ground truth lands in the interval."
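
The statement in this reply can be written out as a formula. This is a sketch in assumed notation: n calibration points (X_1, Y_1), ..., (X_n, Y_n), a prediction set written C(X_test), and a miscoverage level alpha.

```latex
\mathbb{P}\bigl( Y_{\text{test}} \in \mathcal{C}(X_{\text{test}}) \bigr) \;\ge\; 1 - \alpha,
\qquad \text{with the probability taken jointly over }
(X_1, Y_1), \dots, (X_n, Y_n) \text{ and } (X_{\text{test}}, Y_{\text{test}}).
```

Read this way, the guarantee is marginal: it averages over calibration sets and test points, and it does not by itself promise 1 - alpha coverage for one fixed calibration set or one fixed input.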

  • @dangerousdansg
    @dangerousdansg 2 years ago +1

    Fantastic presentation! I was captivated from start to finish

  • @zhenghaopeng6633
    @zhenghaopeng6633 10 months ago +1

    I'm a little confused: at 18:11, step 3 takes the scores < q_hat. Shouldn't it be > q_hat, according to 13:35?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 8 months ago

      It's correct as-is. There's a sign-flip happening: the "conformal score" s(x,y) is 1-softmax score at 13:35.
      It's very unfortunate that "score" is used to refer to both of these concepts, but they have a different sign.
      Sorry for the super-late response!

  • @lazy1peasant
    @lazy1peasant 2 years ago +7

    there was once an OKCupid question that says "does it bother you when someone says ATM Machine and PIN Number?" I was like "yeah, sure I guess it makes sense to be upset about that" Then this guy says "MR Image" and I'm like "dude, just say MRI image, no one knows what the hell an MR is"

  • @Madsott
    @Madsott 1 year ago +1

    Very well done! I read your "Gentle Introduction.." report as well, equally well done. Unfortunately I am stuck behind the academic paywall when it comes to your ref [41], V. Vovk, “Cross-conformal predictors”. Hence I figured I'd post my question here:
    I am considering using cross-validation (or nested CV) to obtain out-of-fold predictions for a train set, then using these as calibration data to calculate nonconformity_scores.
    I am withholding a test set outside the CV, on which I aim to evaluate the final model and coverage.
    If all is good, I retrain the model on all train data, but I am tempted to keep my nonconformity_scores.
    Furthermore, I could then get additional scores from the "unseen" test set using the final model.
    Combining these test scores with the previous out-of-fold scores, I obtain scores for all data, betting that my outer-fold models produce errors comparable to those of the final version.
    Does this make sense? If not, how do you recommend making use of CV procedures in relation to conformal predictions?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 1 year ago +1

      I would read this paper for a comprehensive solution to CV-type procedures with conformal guarantees:
      www.stat.cmu.edu/~ryantibs/papers/jackknife.pdf

  • @jorgecelis8459
    @jorgecelis8459 8 months ago +1

    Hi! At 13:00, with the first algorithm, when no class is clearly detected, it could be that no class has a score greater than q_hat, so the output is an empty set. But under uncertainty I would expect a bigger set. Is this the expected behavior?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 8 months ago

      Yep, for this particular score function, that's the expected behavior! It's a little bit weird, but when you output a size-zero set, that decreases your _average_ set size! 😅
      If you read the gentle introduction document, we talk also about the (R)APS score, which is usually a better solution for classification. That score has the more intuitive behavior of growing the set whenever there is no clear class.
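
The contrast described in this reply can be illustrated with a small sketch. Everything here is an assumption for illustration: the two softmax vectors and both `qhat` values are made up rather than calibrated, and `aps_set` is a simplified, non-randomized variant that keeps classes in order of decreasing softmax until the cumulative mass first reaches `qhat`.

```python
import numpy as np

def threshold_set(probs, qhat):
    """Classes whose conformal score (1 - softmax) is at most qhat."""
    return [int(k) for k in np.flatnonzero(1.0 - probs <= qhat)]

def aps_set(probs, qhat):
    """Non-randomized adaptive set: add classes from most to least likely,
    stopping at the class where the cumulative softmax mass first reaches qhat."""
    order = np.argsort(-probs)                     # most likely class first
    cumulative = np.cumsum(probs[order])
    stop = int(np.searchsorted(cumulative, qhat))  # first index with mass >= qhat
    return sorted(int(k) for k in order[: stop + 1])

# Hypothetical softmax outputs: one with no clear class, one confident.
ambiguous = np.array([0.30, 0.25, 0.25, 0.20])
confident = np.array([0.95, 0.03, 0.01, 0.01])

# Illustrative thresholds only; in practice each comes from calibration.
qhat_threshold, qhat_aps = 0.6, 0.9

print(threshold_set(ambiguous, qhat_threshold), aps_set(ambiguous, qhat_aps))
print(threshold_set(confident, qhat_threshold), aps_set(confident, qhat_aps))
```

With these made-up numbers, the thresholding rule returns an empty set for the ambiguous input, while the adaptive rule returns all four classes; both return only class 0 for the confident input.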

  • @vladimiriurcovschi1657
    @vladimiriurcovschi1657 2 years ago +1

    Very useful tutorial! Congratulations on the wonderful work!

  • @RajeevVerma-su6hs
    @RajeevVerma-su6hs 2 years ago +2

    Thanks for the presentation. I am slightly confused about the two classification methods/scores outlined here. For the method at 09:17 (the first method), we take the softmax value for the true label as the score. The softmax value captures the notion of conformity between the input and the label. So, essentially, with this measure of conformity, you are doing the exact opposite of the procedure Stephen explains. The general conformal prediction procedure that Stephen explains uses a non-conformity score. With the non-conformity score, you take the 1-\alpha quantile and construct your prediction set from the labels for which the non-conformity score is less than this 1-\alpha quantile. So in the first classification method, since the softmax score is a conformity score, you take the \alpha quantile and include in your prediction set all the labels for which this conformity score is greater than the \alpha quantile. I get this.
    But I am confused about the second classification method (at 25:00). Here, the score (the sum of the sorted softmax values up to the true label) is a non-conformity score. This fits the next step, where you find the 1-\alpha quantile value. So, to be consistent with what Stephen explains, in the third step you should include in your prediction set all the labels for which the non-conformity score is less than the 1-\alpha quantile. But you do the exact opposite and include the labels for which the non-conformity score is greater than this 1-\alpha quantile.
    Did I understand it correctly? Am I saying something wrong? I would really appreciate it if you can explain where I am wrong.

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago +2

      Hey Rajeev. That's a great question, and you got the concept perfectly. Everything you said is correct. The first method we presented at 09:17 doesn't exactly follow the general strategy Stephen outlined---it uses a "conformity score" instead of a "nonconformity score". You can convert between the two with a "1-".
      We made this decision for pedagogical reasons, so that the first version of conformal prediction we introduced would be maximally intuitive. But now that you know about the general strategy, you can plug in by taking one minus the softmax score of the true class to be the score function s(x,y) in Stephen's general development.
      See also David Liu's comment below.
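
A minimal sketch of the recipe in this reply, plugging one minus the softmax score of the true class into the general procedure as s(x, y). The data is synthetic and the helper names (`conformal_quantile`, `prediction_sets`) are hypothetical; the finite-sample rank ceil((n+1)(1-alpha)) is the usual split-conformal correction, assumed here rather than taken from this thread.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """k-th smallest calibration score, with k = ceil((n + 1) * (1 - alpha))."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:                      # too few calibration points for this alpha
        return np.inf
    return np.sort(scores)[k - 1]

def prediction_sets(softmax, qhat):
    """Boolean mask: class k is in the set when 1 - softmax[k] <= qhat."""
    return (1.0 - softmax) <= qhat

# Synthetic stand-in for a classifier's softmax outputs (assumption).
rng = np.random.default_rng(0)
n_total, n_classes, alpha = 500, 5, 0.1
logits = rng.normal(size=(n_total, n_classes))
labels = rng.integers(n_classes, size=n_total)
logits[np.arange(n_total), labels] += 2.0          # favor the true class
softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

cal_sm, cal_y = softmax[:400], labels[:400]        # calibration split
test_sm, test_y = softmax[400:], labels[400:]      # held-out split

# Nonconformity score: one minus the softmax of the true class.
cal_scores = 1.0 - cal_sm[np.arange(400), cal_y]
qhat = conformal_quantile(cal_scores, alpha)

sets = prediction_sets(test_sm, qhat)
coverage = sets[np.arange(100), test_y].mean()
print(f"qhat={qhat:.3f}, empirical coverage={coverage:.2f}")
```

The printed coverage fluctuates around 1 - alpha from one split to another, since the guarantee holds on average rather than for every individual split.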

    • @RajeevVerma-su6hs
      @RajeevVerma-su6hs 2 years ago +1

      @anastasiosangelopoulos Thanks a lot Anastasios. I get it now. Also great to see all the work you're doing to make conformal prediction more mainstream.

  • @davidliu5075
    @davidliu5075 3 years ago +2

    Great presentation! Thanks for sharing!

  • @james2396
    @james2396 2 years ago +1

    I'm ill right now so I'm missing Volodymyr Vovk's lecture on this and I'm watching this video instead, thanks for making it!

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago +1

      Haha! Don't tell Vovk ;)
      Hope you feel better.
      Are you a RHUL student?

    • @james2396
      @james2396 2 years ago +1

      @anastasiosangelopoulos haha, I'll keep it quiet. I am from RHUL though!

  • @brianwalsh7040
    @brianwalsh7040 1 year ago +1

    Hey great video! One portion I’m stuck at, now that I have a prediction set - now what? How do I get this into a single prediction?
    Also - I am using an ensemble approach to my modeling using multiple classification techniques, then averaging the softmax scores across the models to get my “final” prediction. Is this something I would do for each model, or just once the softmax scores are combined/averaged? Thanks!

  • @ysig
    @ysig 2 years ago +5

    Amazing work!
    It would be interesting to see whether you can extend this work to segmentation networks.
    The way I envision it, it could even give you emergent maps of object parts, in which you could identify areas of high ambiguity (lots of classes) and areas of very low ambiguity, which is in the end a form of emergent visual semantics.
    As such it can have a lot of value even for the whole interpretability literature of Computer Vision (what Grad CAM tries to show but much more informative)

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago +3

      You’re totally right! And it’s definitely possible.
      Here are some resources on the topic… we wrote about it in the paper version of this gentle intro!
      For example, see Section 5.3 of arxiv.org/abs/2107.07511
      It follows Section 5.4 of arxiv.org/abs/2101.02703
      And also see Section 6 of arxiv.org/abs/2110.01052

  • @TheCrmagic
    @TheCrmagic 11 months ago +1

    Thank you for this lecture.

  • @nickbishop7315
    @nickbishop7315 7 months ago +1

    Is the small diagram Anastasios draws at ~30:00 slightly incorrect? Shouldn't the quantile line be horizontal?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 7 months ago +1

      Good point, the diagram is a little weird. If we were on the scale of the CDF, it would be a horizontal line. Here, I'm trying to get across the idea that you should stop after the cumulative mass of the bars hits \hat{q}.
      (So it's more like a horizontal line if you take the _integral_ of the X axis on that plot.)
      :) thanks for the question!

    • @nickbishop7315
      @nickbishop7315 7 months ago +1

      @anastasiosangelopoulos Thanks for the response! The point you were conveying was super clear in any case - great tutorial!

  • @71sephiroth
    @71sephiroth 11 months ago +1

    Wonderful and concise tutorial! Could you please elaborate on what 'finite sample validity' means? Is it something like this: 'Given that the training data is finite, and a portion of it is calibration data, which is also finite but smaller, one can create a conformal framework around that 'small' sample and still achieve coverage of (1-alpha)?'

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 8 months ago

      Sorry for the very late response! Finite-sample validity means
      "Given a finite number of calibration data points, one can use that calibration data to run conformal prediction, and still achieve a coverage guarantee of 1-alpha."
      Nothing about the training data is assumed. It can be finite or infinite.
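
For concreteness, the finite-sample statement is usually written as follows. This is a sketch in assumed notation, with s_1, ..., s_n the calibration scores; the upper bound additionally assumes the scores are almost surely distinct.

```latex
\hat{q} \;=\; \text{the } \tfrac{\lceil (n+1)(1-\alpha) \rceil}{n} \text{ empirical quantile of } s_1, \dots, s_n,
\qquad
1 - \alpha \;\le\; \mathbb{P}\bigl( s(X_{\text{test}}, Y_{\text{test}}) \le \hat{q} \bigr) \;\le\; 1 - \alpha + \frac{1}{n+1}.
```

Both inequalities hold for every fixed n, with no limit as n grows, which is what "finite-sample" refers to. A larger calibration set only tightens the gap between the two bounds.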

  • @BillSun-sk6si
    @BillSun-sk6si 9 months ago +1

    Hey there! Nice video.
    I recently ran some experiments on conformal prediction distillation and I was curious if you had any ideas or suggestions. In a nutshell, the goal was to see if problem-specific conformal prediction guarantees (e.g. coverage in classification) could be instilled into a model without the calibration step.
    I tested some knowledge distillation techniques (e.g. KL divergence for a "soft" distillation loss) to train a smaller model on a dataset of (X,Y) pairs, where X is the initial input and Y is a multiclass binary vector representing the prediction set from a calibrated prediction-set generator f on X. This approach seems to somewhat (?) work and is better than a naive model without any calibration, but it does not come close to preserving the coverage guarantees obtained by actually performing the calibration step. Any suggestions or thoughts on how to expand on this?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 8 months ago

      Hmm, interesting thought! And sorry for the late response!
      In short, I don't think it's possible to provide a guarantee without the calibration step.
      If you want to try something, I would try running a quantile regression trained to predict the residuals of your model. Conformal prediction can be thought about as an adjusted form of quantile regression anyway, so this is likely to work better if done in the right way.

    • @BillSun-sk6si
      @BillSun-sk6si 8 months ago

      @anastasiosangelopoulos Thanks for the response! I'll try that

    • @BillSun-sk6si
      @BillSun-sk6si 1 month ago

      @anastasiosangelopoulos Hi again, hope all is well. I was curious if you were familiar with any literature on learning a conformal score function? I am looking at classification.
      From my understanding, if the calibration set and X_test are IID, then for any arbitrary conformal score function S, if we construct C(X_test) = { y | S(X_test, y)

  • @davidliu5075
    @davidliu5075 3 years ago +3

    I have a question about qhat.
    Why do we take the 10% quantile at 09:17 and the 90% quantile at 25:00?
    Is there anything I missed?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 3 years ago +1

      Thanks for asking, this is a good question.
      The quantiles are different because in the first example of conformal prediction, when E_i is small, that means the model is more uncertain. That means to comply with the conformal procedure we outlined at 15:28, we would need to set the conformal score s_i=1-E_i, and take the 90% quantile of the s_i. But that is the same thing as just taking the 10% quantile of the E_i.
      In the example at 25:00, when E_i is large, that means the model is more uncertain. Therefore we can just set s_i=E_i and take the 90% quantile directly.
      In summary, it has to do with whether your uncertainty measure gets large or small when your model is uncertain.
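
The equivalence in this reply (the 90% quantile of s_i = 1 - E_i versus the 10% quantile of E_i) can be checked numerically. This is a small sketch on random stand-in values for E_i, assuming the finite-sample rank k = ceil((n+1)(1-alpha)) in place of an exact 90% quantile.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1
E = rng.uniform(size=200)          # stand-in for true-class softmax values E_i
n = len(E)
k = int(np.ceil((n + 1) * (1 - alpha)))   # conformal rank (roughly the 90% point)

s = 1.0 - E                        # nonconformity scores s_i = 1 - E_i
qhat_s = np.sort(s)[k - 1]         # k-th smallest s_i  (about the 90% quantile)
qhat_E = np.sort(E)[n - k]         # k-th largest E_i   (about the 10% quantile)

# The two thresholds are mirror images of each other...
assert np.isclose(qhat_s, 1.0 - qhat_E)

# ...so both rules select the same classes for new softmax values p.
p = rng.uniform(size=1000)
same_sets = np.array_equal(1.0 - p <= qhat_s, p >= qhat_E)
print(qhat_s, qhat_E, same_sets)
```

Either convention therefore gives the same prediction sets; what matters is keeping the direction of the inequality consistent with the direction of the score.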

    • @davidliu5075
      @davidliu5075 3 years ago +1

      @anastasiosangelopoulos Got it! Thank you so much!

  • @monkyspnk777
    @monkyspnk777 2 years ago +1

    @28:11 is this chart you show similar to a scree plot for PCA?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago

      Kind of! It’s a plot of the sorted softmax scores. Scree is a plot of the sorted eigenvalues. One difference is in our plot, the values sum to 1.

  • @questions-n8f
    @questions-n8f 6 months ago +1

    Hi,
    My question is why it makes sense to assume that X and Y come from a joint distribution P, which to me at least would imply that X is random. In the frequentist setting like regression I am familiar with, one assumes that the predictors X are deterministic and only y is random, i.e. the conditional expectation of y given x is modelled. Can anyone explain the intuition behind this assumption? Is it just a more general assumption that doesn't contradict the frequentist view?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 5 months ago

      Hey! Good question. There are certainly some settings where you model X as fixed, or even analyze worst-case behavior over X, but that’s a harder problem setup. Random X is also a standard frequentist setup, although it’s easier than the one you brought up. (Actually most of the standard frequentist theory on, e.g., M-estimation is done using random X.)
      This assumption might be suitable when the inputs to your algorithm can be thought of as coming from a consistent population.

  • @mariofigueiredo5124
    @mariofigueiredo5124 1 year ago +1

    Excellent lecture!

  • @jeffreyalidochair
    @jeffreyalidochair 1 year ago +1

    So I know to get the upper and lower quantiles for regression, you use the pinball loss function. But does the loss for the point/mean prediction in regression have to be a specific kind of loss? Or can we use mse, mae, CE?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 1 year ago +1

      If you’re making a point prediction, you can use whatever you want! The pinball loss is only for the purpose of quantile prediction.
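
For reference, here is a small sketch of the pinball loss mentioned in this exchange, with a numerical check that the constant prediction minimizing it sits at the target quantile. The function name `pinball_loss`, the synthetic data, and the grid search are illustrative assumptions, not code from the video.

```python
import numpy as np

def pinball_loss(y, y_pred, tau):
    """Mean pinball (quantile) loss at level tau in (0, 1).

    Under-predictions (y > y_pred) are weighted by tau,
    over-predictions by (1 - tau).
    """
    diff = y - y_pred
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

rng = np.random.default_rng(0)
y = rng.normal(size=2000)
tau = 0.9

# Search over constant predictions on a grid with step 0.01.
grid = np.linspace(-3.0, 3.0, 601)
losses = [pinball_loss(y, c, tau) for c in grid]
best_constant = grid[int(np.argmin(losses))]

empirical_quantile = np.quantile(y, tau)
print(best_constant, empirical_quantile)   # the two should be close
```

Swapping in squared error would pull the best constant toward the mean instead, which is why a point predictor can use any loss while the quantile heads need this one.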

    • @jeffreyalidochair
      @jeffreyalidochair 1 year ago +1

      @anastasiosangelopoulos great thank you, anastasios!

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 1 year ago +1

      @jeffreyalidochair my pleasure!

  • @vishalahuja2502
    @vishalahuja2502 2 years ago +1

    Hi @Anastasios, I am going through the video and am quite eager to try it out. First question: does this method handle scenarios where a CNN predicts incorrectly but with high softmax scores?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago

      Conformal will always work, for any model.
      If your model is often confidently incorrect, then the sets will account for that by growing such that they obtain the marginal coverage guarantee.
      However, in the subgroup with confident misclassifications, it may not obtain correct coverage. That subgroup is close to adversarial and will probably have poor coverage, even if the marginal guarantee is satisfied.

  • @akshayparanjape4646
    @akshayparanjape4646 1 year ago +1

    Thanks for the video and the great work, Anastasios and Stephen. This video really helped me understand the concept behind conformal prediction quickly. I do have one question regarding CP for regression models. In the video, you mentioned training quantile regression models for (alpha/2) and (1-alpha/2). While this is possible by re-training the model with the pinball loss (I assume the process is the same for non-NN models as well), is there a way to get this quantile regression without re-training the model?
    I am particularly concerned with the cases where (1.) re-training is either very costly, or (2.) the model training is performed in an automated fashion (where changing the loss function is not possible).

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 1 year ago

      You can train a new quantile regression on the errors of the pre-trained model!
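
One way to read this tip is sketched below under loud assumptions: `pretrained_model` is a hypothetical frozen predictor, the data is synthetic, and the "quantile regression" is deliberately the simplest possible one (per-bin empirical quantiles of the residuals), standing in for whatever quantile regressor you would actually fit. The last step conformalizes the resulting interval in the style of conformalized quantile regression.

```python
import numpy as np

def pretrained_model(x):
    # Hypothetical frozen point predictor; it is never retrained here.
    return 2.0 * x

def fit_binned_quantiles(x, residuals, taus, edges):
    """Piecewise-constant quantile regression on residuals:
    one empirical quantile per bin of x and per level in taus."""
    bins = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)
    return np.array([[np.quantile(residuals[bins == b], t) for t in taus]
                     for b in range(len(edges) - 1)])

def predict_interval(x, table, edges):
    """Point prediction plus the fitted lower/upper residual quantiles."""
    bins = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)
    center = pretrained_model(x)
    return center + table[bins, 0], center + table[bins, 1]

rng = np.random.default_rng(0)
alpha = 0.1
x = rng.uniform(0.0, 1.0, size=3000)
y = 2.0 * x + rng.normal(scale=0.1 + x, size=3000)   # noise grows with x

x_fit, y_fit = x[:1500], y[:1500]            # fits the residual quantiles
x_cal, y_cal = x[1500:2500], y[1500:2500]    # conformal calibration
x_test, y_test = x[2500:], y[2500:]

edges = np.linspace(0.0, 1.0, 6)             # five bins over x
table = fit_binned_quantiles(x_fit, y_fit - pretrained_model(x_fit),
                             (alpha / 2, 1 - alpha / 2), edges)

# Conformalize: score = how far y falls outside the raw interval.
lo, hi = predict_interval(x_cal, table, edges)
scores = np.maximum(lo - y_cal, y_cal - hi)
n = len(scores)
k = int(np.ceil((n + 1) * (1 - alpha)))
qhat = np.sort(scores)[k - 1]

lo_t, hi_t = predict_interval(x_test, table, edges)
coverage = np.mean((y_test >= lo_t - qhat) & (y_test <= hi_t + qhat))
print(f"qhat={qhat:.3f}, test coverage={coverage:.2f}")
```

Nothing in this sketch touches the original model's loss or training loop, which is the point for the costly or automated retraining cases raised in the question.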

    • @akshayparanjape4646
      @akshayparanjape4646 1 year ago

      @anastasiosangelopoulos thanks for the tip, that would also work!

  • @blakete
    @blakete 1 year ago +1

    Fantastic video, thank you!

  • @BastiSeen
    @BastiSeen 2 years ago +1

    Very good video. I did enjoy it, thanks. With respect to adaptive prediction sets and in the case of binary classification problems, Ei will be either

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago +1

      Yes, the intuition is correct - but for binary classification you might instead look at selective classification (check out the GitHub repo) instead of APS.
      Regarding the Mondrian question, the p value will have the same form, but it will depend only on the subgroup across which you are stratifying.

    • @BastiSeen
      @BastiSeen 2 years ago +1

      @anastasiosangelopoulos Thanks for your answer.

  • @trolzzdrizzt5674
    @trolzzdrizzt5674 2 years ago +1

    First of all, thank you for the great presentation!
    I have some questions concerning your examples for the classification tasks.
    1. Did I understand correctly that the heuristic output (in your case, the softmax output) must add up to the same value (1 in your case) for all data points when all classes of the label space are considered?
    2. I tried to adapt your adaptive prediction sets method to my binary classification problem (more precisely, it is an anomaly detection problem). I observed that the q threshold value becomes very high, with the result that in almost all cases the prediction set contains both classes. Obviously this doesn't add much value for me.
    I'd like to outline my approach. I used a Variational Autoencoder and trained it on normal data only. I have a mixed test set and computed the reconstruction probability (RecProb) for every data point. I was able to find an optimal decision threshold for anomaly detection by maximizing the Matthews Correlation Coefficient, but unfortunately the RecProbs of anomalous data points are only marginally greater than those of normal data points, so there is a significant amount of overlap.
    That's why I am looking for an (ideally not very complicated) approach to indicate the uncertainty of my model in some way. Now I thought about comparing the relative frequencies of normal and anomalous data for a given RecProb by looking at the corresponding histogram bin.
    It would be very helpful for me, if you could maybe give me some advice.
    Many thanks in advance and keep up the great work!
    Regards, Tobias.

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago +1

      1. That's actually not needed! They sum to one in this example, but that's not actually required for conformal to work. (See arxiv.org/abs/2009.14193 for an example score where the regularized probabilities do not sum to 1.)
      2. In binary classification, prediction sets are not usually useful. Instead, you might try using selective classification (i.e. the model learns to say "I don't know", in such a way that it achieves an accuracy of, say, 95% when it chooses to speak, even if its marginal accuracy is lower). We'll soon release a V3 of the gentle intro that describes how to do this; for now, you can see how to do distribution-free selective classification here: arxiv.org/abs/2110.01052

    • @trolzzdrizzt5674
      @trolzzdrizzt5674 2 years ago +1

      Thank you very much Anastasios!

  • @Herdogan80
    @Herdogan80 2 years ago +1

    Great one. Really inspiring. Thanks for putting that tutorial online. ^-^

  • @EvanH-pn4sq
    @EvanH-pn4sq 1 year ago +1

    Great presentation! I'm curious, when applied to a multi-label problem, is the quantile function just setting a new decision threshold? The quantile is basically a decision threshold for which labels are predicted. And given an alpha, the quantile is a constant for all new predictions regardless of their matrix of dependent variables. So isn't this process basically just updating your decision threshold to account for a desired coverage probability?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 1 year ago +1

      Yes, that’s right!
      If you want to incorporate dependencies, you can build them into the score function too, but at the end of the day, no matter what you do it’s just thresholding :)

    • @EvanH-pn4sq
      @EvanH-pn4sq 1 year ago

      @anastasiosangelopoulos Thanks for the timely response!

  • @LukaszWiklendt
    @LukaszWiklendt 2 years ago +1

    Is there a meaningful/useful interpretation of the alpha at which each label switches between being included in and excluded from the prediction set?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago +1

      The liminal value of alpha for a label y is a p-value for the null hypothesis “This X-y pair is exchangeable with the calibration data.”
      In practice this interpretation probably only makes sense to think about if your score is good.
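
The "liminal alpha" in this reply can be computed directly as a conformal p-value. This is a sketch under assumptions: the calibration scores are made up, larger scores mean worse conformity, and the helper name `conformal_p_value` is hypothetical.

```python
import numpy as np

def conformal_p_value(cal_scores, candidate_score):
    """Fraction of scores at least as extreme as the candidate's,
    counting the candidate itself: (1 + #{s_i >= s}) / (n + 1)."""
    n = len(cal_scores)
    return float((1 + np.sum(cal_scores >= candidate_score)) / (n + 1))

# Made-up calibration scores for illustration (n = 9).
cal_scores = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.50, 0.60, 0.70, 0.85])
alpha = 0.1

p_typical = conformal_p_value(cal_scores, 0.25)    # six scores are >= 0.25
p_atypical = conformal_p_value(cal_scores, 0.90)   # no score is >= 0.90

# A label stays in the prediction set exactly when its p-value exceeds alpha.
print(p_typical, p_typical > alpha)
print(p_atypical, p_atypical > alpha)
```

Sweeping alpha upward, each label drops out of the set at the moment alpha reaches its p-value, which is the switch point asked about in the question.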

  • @paulscemama6517
    @paulscemama6517 2 years ago +1

    Really wonderful complement to your paper "Gentle Intro to Conformal Prediction..". One question! - for each of the example algorithms, you create a three-box summary of what the algorithm does. In the first of these three boxes (for each of the algorithms), you have "score" on the y-axis of the histogram. However, I feel as though this should be named "heuristic-output" or, for example, "softmax output" (for multi-class classification). To me, "score" means the "E" in that first box, i.e., it is defined differently for each algorithm, and encodes the properties we want the prediction set function to have. Correct me if I'm wrong! I may be misunderstanding. Thanks again!

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago +1

      Yeah, that's right. If I could go back and edit the slides, I'd put "softmax output" on the Y axis of the first box.
      It's unfortunate that the "softmax score" language clashes with the "conformal score" language. In this case we meant the first.

    • @paulscemama6517
      @paulscemama6517 2 years ago +1

      @anastasiosangelopoulos No worries at all, I just wanted to clarify so that I wouldn't be under the wrong impression. Seriously though, I admire this type of extra length taken to spread your knowledge on the subject. I haven't seen a better presentation in a long time.

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago

      @paulscemama6517 Thank you so much😊

  • @janlauzy1404
    @janlauzy1404 2 years ago +1

    Is there a library supporting this in Python/R?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago +1

      Yeah! First I would look at the paper version of the Gentle Introduction to Conformal Prediction, which has Python code samples in it. arxiv.org/abs/2107.07511
      Conformalized quantile regression is implemented here: github.com/yromano/cqr
      The “nonconformist” library is quite popular, but is currently missing some of the better modern methods: github.com/donlnz/nonconformist
      There is a PyTorch implementation of classification in github.com/aangelopoulos/conformal_classification
      I would look here for a more comprehensive list of software… this GitHub repo is very actively maintained: github.com/valeman/awesome-conformal-prediction

  • @mughairamir9200
    @mughairamir9200 2 years ago +1

    Hi, really amazing explanation. I had a question, How do you do it for binary classification?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago

      See the "selective classification" setting of the gentle intro: arxiv.org/abs/2107.07511

    • @mughairamir9200
      @mughairamir9200 2 years ago +1

      @anastasiosangelopoulos Thank you

    • @mughairamir9200
      @mughairamir9200 2 years ago

      @anastasiosangelopoulos I want to use it for my predictions from GCNs but I am confused by the code given. Is there a resource you can point me towards? That'd be helpful

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 2 years ago

      @mughairamir9200 Have you seen this notebook? github.com/aangelopoulos/conformal-prediction/blob/main/notebooks/imagenet-selective-classification.ipynb

    • @mughairamir9200
      @mughairamir9200 2 years ago

      @anastasiosangelopoulos Yes, that's the one. I want to use it for the outputs of a GCN model on a node prediction task

  • @jimshtepa5423
    @jimshtepa5423 1 year ago +2

    How is this a gentle introduction?

  • @TianyiChen-f6i
    @TianyiChen-f6i 1 year ago +1

    Awesome! I have been working on clinical trial outcome prediction recently. Do you think it is sensible or necessary to apply conformal prediction to a binary classification problem? I ask because I sometimes doubt the value of doing this. A binary classification problem has only two classes, so intuitively it's kind of "small". If it is worth doing, is the method in the video suitable, or do you think there are other ways of quantifying the uncertainty for two-class classification?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 1 year ago

      For binary problems, I generally recommend conformal selective classification. See the relevant section in the Gentle Intro on arXiv!

  • @sankalpgilda6818
    @sankalpgilda6818 3 years ago +3

    Very cool! What does 'conformal' mean here?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos 3 years ago +2

      In conformal prediction, the score s(X_{n+1},Y_{n+1}) on the slide at time 16:00 measures how similar---or _conforming_---the new test example (X_{n+1},Y_{n+1}) is to all the previous samples, (X_1,Y_1),(X_2,Y_2),...,(X_n,Y_n). If the score is large, the pair (X_{n+1},Y_{n+1}) does not "conform" to the previous data---meaning it is unlikely to happen, in a heuristic sense. This is the origin of the word conformal. Vladimir Vovk actually answered the question in a very nice MathOverflow post here: mathoverflow.net/questions/266921/how-is-the-conformal-prediction-conformal
      Notably, the use of "conformal" here has nothing to do with conformal mappings from complex analysis.

  • @DistortedV12
    @DistortedV12 3 years ago +1

    Hi, great video! Does this work in sequential contexts, with users who maybe had label 1 at time point 1 and label 2 at time point 2, where label 2 may depend on label 1's data at time point 1? I may be missing something, but I'm curious whether it is possible to apply these methods with sequential models, given that examples selected from the calibration set may not meet the i.i.d. assumption (though I'm unsure). Is there a method to overcome this challenge if that is the case?
    Also, how does one interpret these prediction sets? Are they like frequentist confidence intervals or Bayesian credible intervals?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos  3 years ago +1

      Technically, all of the material I presented above requires exchangeability. As you said, sequential settings with data dependencies don't always satisfy these assumptions, although there may be a way to shoehorn the problem into a workable form depending on the context. (For example, you could see if it makes sense to consider each sequence exchangeable, or something.)
      There are papers that address the sequential setting specifically, like this one: proceedings.mlr.press/v139/xu21h/xu21h.pdf
      I have not read it carefully myself, but it might be helpful to you.

    • @noahjones4925
      @noahjones4925 2 years ago +1

      @@anastasiosangelopoulos Thanks!

  • @demohub
    @demohub 6 months ago +1

    Thank you

  • @MrMONODA
    @MrMONODA 3 years ago +1

    It seems challenging to choose the step size dzeta: if chosen too small, you run into a multiple testing issue, but if chosen too large, you might choose a very suboptimal lambda. Any recommendations or thoughts on this point?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos  3 years ago

      Which step size are you talking about? In the versions of conformal prediction presented above, you don't really need to choose a step size or deal with multiple p-value comparisons.
      There are methods in distribution-free statistics where you do need to handle such multiple testing issues, namely when you expand your purview to general loss functions as in arxiv.org/abs/2110.01052 --- but that is a different topic that will soon get its own video ;)
      Maybe you can help me understand where you found dzeta? I may not remember my own video correctly (it has been a while since we recorded it).

    • @MrMONODA
      @MrMONODA 3 years ago +1

      @@anastasiosangelopoulos Thanks for the quick and careful response! I should have been more specific since, as you noticed, there is no dzeta mentioned here. I found this video after reading the paper `Distribution-Free, Risk-Controlling Prediction Sets`, where dzeta is used to iterate through the values of lambda \in Lambda that index the set-valued predictors.
      My comment overlooked that at the start of the paper it is shown that the decreasing and continuous assumptions on the risk allow you to apply the pointwise bound to show consistency. Really great work! I also like that you showed an extension to ranking.

    • @anastasiosangelopoulos
      @anastasiosangelopoulos  3 years ago +1

      @@MrMONODA nice! Yeah, in RCPS you can discretize as finely as you want, with no penalty (so pick dzeta as small as is feasible for your computer... usually it doesn't need to be smaller than 0.001). :)
      In Learn then Test, our follow-up, we show how to do it with non-monotone risks, where you do need a multiple testing correction, but there are various ways to engineer the multiple testing so the inequalities are basically tight!
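
      A rough sketch of the grid scan discussed in this thread, under several assumptions: the loss is miscoverage of the set {y : softmax_y >= 1 - lambda} (so the risk is non-increasing in lambda), the upper confidence bound is the simple Hoeffding bound (one of several possible choices), and the data are synthetic. The names `hoeffding_ucb` and `rcps_lambda` are hypothetical and not taken from any codebase.

```python
import math
import random


def hoeffding_ucb(emp_risk, n, delta):
    """Upper confidence bound on the risk for losses bounded in [0, 1]."""
    return emp_risk + math.sqrt(math.log(1 / delta) / (2 * n))


def rcps_lambda(true_probs, alpha, delta, dzeta=0.001):
    """Smallest grid lambda whose UCB, and that of every larger lambda, is below alpha."""
    n = len(true_probs)
    steps = round(1 / dzeta)
    lam_hat = None
    for j in range(steps, -1, -1):  # scan from the largest lambda downward
        lam = j * dzeta
        # Miscoverage: the true class has softmax mass below 1 - lambda.
        risk = sum(p < 1 - lam for p in true_probs) / n
        if hoeffding_ucb(risk, n, delta) >= alpha:
            break  # by monotonicity, smaller lambdas fail as well
        lam_hat = lam
    return lam_hat  # None if even the largest lambda fails


# Synthetic softmax mass on the true class for 2000 calibration points.
rng = random.Random(0)
true_probs = [rng.betavariate(5, 2) for _ in range(2000)]
alpha, delta = 0.1, 0.1

lam_fine = rcps_lambda(true_probs, alpha, delta, dzeta=0.001)
lam_coarse = rcps_lambda(true_probs, alpha, delta, dzeta=0.1)
print(lam_fine, lam_coarse)
```

      Because the risk is monotone, the scan needs no multiple testing correction, so the finer grid costs only computation and can only bring lambda closer to the smallest valid value.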

  • @kobibas
    @kobibas 3 years ago +1

    Thanks for this great tutorial!
    I have a question regarding the distribution-free assumption.
    In what sense is conformal prediction distribution-free? Is there a probabilistic connection between the calibration set and the test set?
    Does this method still work if the sets are very different (due to out-of-distribution data or an adversarial attack)?

    • @sergiogarrido5111
      @sergiogarrido5111 3 years ago +3

      Yes, they have to be i.i.d., that is, drawn independently from the same distribution. Conformal prediction also works under the slightly weaker condition of exchangeability, but for practical purposes just consider the samples i.i.d. For a reference, check Section 3 (exchangeability) of Shafer and Vovk (2008). I should add that there is work on extending conformal prediction to non-i.i.d. settings, but the details escape my understanding. See, for example, Tibshirani et al. (2019).
      References:
      - Shafer, G., & Vovk, V. (2008). A Tutorial on Conformal Prediction. Journal of Machine Learning Research, 9(3).
      - Tibshirani, R. J., Barber, R. F., Candès, E. J., & Ramdas, A. (2019). Conformal prediction under covariate shift. arXiv preprint arXiv:1904.06019.

  • @anaborovac2767
    @anaborovac2767 3 years ago +1

    Great tutorial! I am just wondering if the adaptive prediction sets also work nicely in the binary classification problem. Do you happen to have some experience with such problems? Thank you.

    • @anastasiosangelopoulos
      @anastasiosangelopoulos  3 years ago +3

      Hi Ana!
      APS will work for binary classification problems. It will work "nicely" in the sense that it will do exactly as promised---give you a prediction set that provides coverage. If that's what you want, there is no problem. The prediction set will either be {0}, {1}, or {0,1}. If it is {0} or {1}, it means the classifier is certain, and if it is {0,1}, then the classifier is uncertain. APS is guaranteed to give you that kind of behavior.
      However, there may be more informative quantities in the binary classification problem. For example, you might want the classifier's top probability to be calibrated. Loosely, this means that when the classifier puts, say, 90% of its mass on the top label, it is correct 90% of the time. There are distribution-free ways of doing calibration of the top label --- check out this paper by Chirag Gupta and Aaditya Ramdas: arxiv.org/abs/2107.08353
      As a final reflection, top-label calibration is in some sense a harder problem than the prediction set problem that APS addresses. It is more data hungry because you're calibrating an entire histogram, not just one threshold. However, in binary classification, I think top-label calibration is still the best way to go, unless you really only care about achieving 90% coverage. This is because it gives you a more granular sense of at what level your model is confident about a particular prediction.
      Hope this answers your question, and feel free to follow up.

    • @anaborovac2767
      @anaborovac2767 3 years ago +1

      @@anastasiosangelopoulos thank you for the detailed response. Top-label calibration seems like the thing I am looking for.

    • @sanketwebsites9764
      @sanketwebsites9764 3 years ago

      Exactly the question I had in mind. Also, if the target is skewed in a binary classification problem, the dominant class will get the higher probability most of the time. It seems hard to implement APS in that case.

  • @rock2crack
    @rock2crack 2 years ago

    Excuse me for being naive, but how do you get the prediction intervals/sets on unknown data with no labels? I understand that you calibrate the probabilities with the calibration set, but how do I know the prediction interval for model prediction yhat(X_i) where the ground truth y_i is not known?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos  2 years ago

      In classification, the model gives you a softmax vector, and the prediction set is all classes with a high enough softmax value.
      More generally, at prediction time, the model gives you some heuristic notion of uncertainty that you use to build a set.
      Hope this helps!
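
      To make the point concrete, here is a minimal sketch of the test-time step, assuming the score is one minus the softmax value of a class. The threshold `qhat` is a hypothetical number standing in for the output of the calibration step, and `prediction_set` is an illustrative name. Note that no ground-truth label appears anywhere.

```python
def prediction_set(softmax, qhat):
    """Classes whose score 1 - softmax[k] is at most qhat; no label needed."""
    return [k for k, p in enumerate(softmax) if 1 - p <= qhat]


qhat = 0.85  # hypothetical threshold computed earlier on the calibration set
softmax = [0.70, 0.20, 0.07, 0.03]  # model output for an unlabeled test input
pred_set = prediction_set(softmax, qhat)
print(pred_set)
```

      The labels are used only once, during calibration, to choose `qhat`; after that the set depends only on the model output for the new input.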

  • @christianhower8059
    @christianhower8059 2 years ago

    Great video! Anyone recognize the app they are using?

  • @jeffreyalidochair
    @jeffreyalidochair 2 years ago

    Does this conformal prediction method for regression take covariance into account?

    • @anastasiosangelopoulos
      @anastasiosangelopoulos  2 years ago

      Yes, the covariance between X and Y is normally factored in by the quantile regression model.
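
      A minimal sketch of how conformalized quantile regression could use such a model, assuming the lower and upper quantile predictions have already been produced by some fitted quantile regressor. All numbers below are hypothetical, and `cqr_qhat` and `cqr_interval` are illustrative names. The dependence on X enters only through the model's two quantile predictions; conformal calibration then shifts both ends by a single constant.

```python
import math


def cqr_qhat(cal_lo, cal_hi, cal_y, alpha):
    """Conformal quantile of the scores max(lo(x) - y, y - hi(x))."""
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    return scores[k - 1] if k <= n else float("inf")


def cqr_interval(lo, hi, qhat):
    """Widen (or shrink, if qhat is negative) the model's interval by qhat."""
    return (lo - qhat, hi + qhat)


# Hypothetical calibration data: model quantile predictions and true responses.
cal_lo = [1.0, 2.0, 3.0, 4.0, 5.0]
cal_hi = [2.0, 3.0, 4.0, 5.0, 6.0]
cal_y = [1.5, 3.5, 2.5, 4.5, 6.25]

qhat = cqr_qhat(cal_lo, cal_hi, cal_y, alpha=0.2)
interval = cqr_interval(10.0, 12.0, qhat)  # quantile predictions for a new x
print(qhat, interval)
```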

  • @alexanderbalinsky5807
    @alexanderbalinsky5807 3 years ago +1

    On the second page, in the formula P[•••••] \geq 90%, what is the probability space?
    Suppose I'm a patient: what exactly is this probability telling me? Will 90% of doctors produce this diagnosis, or something else? Should we select a new calibration set for each new patient?
    I think CP promises more than it can deliver in practice.

    • @anastasiosangelopoulos
      @anastasiosangelopoulos  3 years ago +3

      The probability is over all randomness --- {X_i, Y_i}_{i=1}^{n+1}.
      The guarantee is marginal --- a point that we clarify in close detail in the written form of the gentle intro (Section 4 of arxiv.org/abs/2107.07511 ).
      Zooming out, this has nothing to do with the doctor's diagnoses, only those produced by the prediction algorithm. The sets output by an algorithm calibrated via conformal prediction will contain the true response with marginal probability 90%.
      Looking more closely, the guarantee says that the process of picking a calibration set, calibrating, and then forming a prediction set has 90% probability of coverage. There is also a tail-bound version of conformal prediction that DOES condition on the calibration set. You can think of it as a Risk-Controlling Prediction Set (arxiv.org/abs/2101.02703 ), with the loss function being 0-1, and instead of a concentration bound, you use an exact binomial tail bound.
      Conformal prediction promises exactly what it delivers. However, it is important to have *calibrated* expectations ;) The guarantee is on average, not conditional. Nonetheless, I would say that conformal prediction often outperforms the guarantee in practice. When used with a good prediction algorithm, like quantile regression, that offers approximately conditional coverage, the conformalized version of the algorithm will also offer approximately conditional coverage. You can think about conformal prediction as an insurance policy on such algorithms---if they work well, the procedure will retain their good qualities, and slightly compensate them to guarantee marginal coverage. If the algorithms work badly, at least you will get marginal coverage at the end.
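
      A small simulation sketch of what "marginal" means here, assuming i.i.d. uniform scores purely for illustration (the sample sizes, seed, and the names `conformal_quantile` and `simulate` are hypothetical). Each trial draws a fresh calibration set and a fresh test point, matching the statement that the probability is over all n+1 pairs.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[k - 1] if k <= n else float("inf")


def simulate(n=50, alpha=0.1, trials=5000, seed=0):
    rng = random.Random(seed)
    covered = 0
    qhats = []
    for _ in range(trials):
        cal_scores = [rng.random() for _ in range(n)]  # fresh calibration set
        qhat = conformal_quantile(cal_scores, alpha)
        test_score = rng.random()  # fresh test point
        covered += test_score <= qhat
        # For uniform scores, P(test_score <= qhat | calibration set) = qhat,
        # so qhat is itself the coverage conditional on this calibration set.
        qhats.append(qhat)
    return covered / trials, qhats


coverage, qhats = simulate()
print(f"coverage averaged over calibration sets: {coverage:.3f}")
print(f"coverage given one calibration set ranges over [{min(qhats):.3f}, {max(qhats):.3f}]")
```

      Averaged over calibration draws, the coverage sits at roughly the nominal 90% level, while the coverage for any single calibration set fluctuates around it; that fluctuation is what the tail-bound versions mentioned above are designed to control.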

  • @sankalpgilda6818
    @sankalpgilda6818 3 years ago +1

    What software did you use to make the presentation, and how did you comment in real time (i.e., what stylus and pad did you use)? I absolutely love it!

    • @anastasiosangelopoulos
      @anastasiosangelopoulos  3 years ago +1

      The notes were hand-drawn on an iPad Pro with an Apple Pencil 2 and the GoodNotes app. Stephen and I recorded the presentation together in a Zoom meeting, but without sharing our screens, so it captured only our faces; meanwhile, we were each screen-recording our own iPads. When we were done, I took all three videos and edited them together using Adobe Premiere.
      If you're recording a presentation on your own, I also recommend OBS Studio!

  • @abrahamowos
    @abrahamowos 2 years ago

    Please, what writing tool is he using?

  • @jeffreyalidochair
    @jeffreyalidochair 2 years ago

    Is there conformal prediction for regression problems?