Applied ML 2020 - 10 - Calibration, Imbalanced data

  • Published: 6 Oct 2024

Comments • 40

  • @mabk1196
    @mabk1196 3 years ago +5

    @Andreas Mueller at the very beginning: where do those numbers 0.16, 0.5, 0.84 come from? If they are averaged probabilities, then they should be 0.26, 0.5 and 0.85...

  • @jeandersonbc
    @jeandersonbc 4 years ago +9

    I was just discussing this topic with my advisor. This is what I call perfect timing :D Thank you very much for sharing high-quality content on the internet! +1 subscribed

    • @AndreasMueller
      @AndreasMueller  4 years ago

      I'm glad if it helps! This lecture still needs a bit of polish, though I hope it has some good pointers.

    • @walkingdad1806
      @walkingdad1806 4 years ago +1

      @@AndreasMueller , could you please explain the difference between underconfident and overconfident classifiers in terms of predicting classes 0 and 1?

    • @JoaoVitorBRgomes
      @JoaoVitorBRgomes 3 years ago

      @@walkingdad1806 at around 11:10 he says the data point on the x axis is the bin center, so I think there is an error in the slide.

  • @majusumanto9016
    @majusumanto9016 4 years ago +3

    Hi sir, can you explain how to calculate the numbers inside the parentheses (0.16, 0.5, 0.84)?

  • @elvisdias5094
    @elvisdias5094 4 years ago

    Didn't get much of the multiclass calibration, but the balancing with that extra library was what I needed!! Thank you so much for these recorded lectures!
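
    (The extra library referenced here is presumably imbalanced-learn. A minimal sketch of what rebalancing with it can look like, assuming imbalanced-learn and scikit-learn are installed; the data and estimator choices are purely illustrative.)

    # Resampling inside imbalanced-learn's Pipeline, so undersampling only
    # happens on the training part of each cross-validation split.
    from imblearn.pipeline import make_pipeline
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_classification

    # Synthetic 9:1 imbalanced data, purely for illustration.
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

    pipe = make_pipeline(RandomUnderSampler(random_state=0),
                         LogisticRegression(max_iter=1000))
    print(cross_val_score(pipe, X, y, scoring="roc_auc").mean())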

  • @Users_291w
    @Users_291w 4 years ago +1

    I was working on imbalanced data. The video is great. Thanks for making the content publicly available.

  • @offchan
    @offchan 2 years ago

    32:40 I've been trying to get my head around this fitting and I have the exact same question about these points that are stuck at the top and bottom of the plot. Thanks for mentioning that.

  • @AndreasMueller
    @AndreasMueller  4 years ago +2

    As I mentioned, there was a bug in the balanced bagging classifier and the results are better than undersampling. The updated results are at amueller.github.io/COMS4995-s20/slides/aml-10-calibration-imbalanced-data/#45
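
    A minimal sketch of the balanced bagging approach referred to here, assuming imbalanced-learn is installed (the base estimator is passed positionally because the keyword name changed from base_estimator to estimator across library versions):

    from imblearn.ensemble import BalancedBaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_classification

    # Synthetic 95:5 imbalanced data, purely for illustration.
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

    # Each bagged tree is fit on a bootstrap sample that is undersampled
    # so that both classes are equally represented.
    clf = BalancedBaggingClassifier(DecisionTreeClassifier(),
                                    n_estimators=100, random_state=0)
    print(cross_val_score(clf, X, y, scoring="average_precision").mean())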

  • @marianazari8301
    @marianazari8301 3 years ago

    Really great explanation, I loved the video, thank you so much for this!

  • @shnibbydwhale
    @shnibbydwhale 2 years ago

    Great lecture. One thing I am struggling with is the part at the beginning where you said you can have a model with very well calibrated probabilities, but the model can also be bad at making predictions or have low accuracy/recall etc. If the probabilities are well calibrated and are representative of true probabilities, how can the model be bad at correctly classifying the data?

    • @AndreasMueller
      @AndreasMueller  1 year ago

      Not sure why I missed this question. Basically if you have two balanced classes, and a classifier predicts a probability of 0.5 for class one for each data point, the classifier is perfectly calibrated. For every point, it says it's 50% certain that it's class 1, and it's correct in 50% of cases, so it perfectly reflects its own uncertainty.
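
      A toy numeric version of this argument (a sketch, not taken from the lecture): a constant 0.5 prediction on a balanced problem is perfectly calibrated yet classifies no better than chance.

      import numpy as np

      rng = np.random.default_rng(0)
      y = rng.integers(0, 2, size=10_000)   # balanced binary labels
      p = np.full(10_000, 0.5)              # model is always "50% certain" of class 1

      print(p.mean())                  # mean predicted probability: 0.5
      print(y.mean())                  # observed fraction of class 1: ~0.5 -> calibrated
      print(((p > 0.5) == y).mean())   # accuracy at the 0.5 threshold: ~0.5, chance level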

  • @yussy552
    @yussy552 3 years ago

    Thank you so much for making these lectures public. Great lecture! If I am training my model with stratified cross-validation, doesn't it deal with the imbalance? How are these more elaborate techniques different? Thanks

    • @AndreasMueller
      @AndreasMueller  3 years ago +1

      That depends a lot on what you mean by "dealing with". In scikit-learn, stratified cross validation actually does not do any undersampling or oversampling but instead ensures that the class proportions are stable across the folds. That means that if the data is imbalanced, then each split will be imbalanced in the same way. The goal of that is to provide a more stable and reliable estimate of generalization performance given the imbalance.
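
      A small sketch of that behaviour, assuming scikit-learn: stratification keeps the minority-class rate roughly constant in every fold, but it does not resample anything.

      from sklearn.model_selection import StratifiedKFold
      from sklearn.datasets import make_classification

      # Synthetic data with ~10% positives, purely for illustration.
      X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

      for _, test_idx in StratifiedKFold(n_splits=5).split(X, y):
          # The positive rate in each test fold stays close to 0.1.
          print(y[test_idx].mean())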

  • @Han-ve8uh
    @Han-ve8uh 3 years ago +1

    Why do we need to calibrate? I can't find any sources explaining its practical use. Since calibration is a monotonic transformation that doesn't change the ranks of the results, I would expect it does not affect decision making at all? (I'm assuming people make decisions simply based on ranked choices.) What are some real-life scenarios where getting the exact probability right is so important? Or is it something of a "making the stats fit some theory better" kind of thing?

    • @AndreasMueller
      @AndreasMueller  3 years ago +2

      There are two very common practical use-cases: one is communicating predictions. Imagine going to a hospital and the diagnosis is "of the 100 people we looked at today, you ranked 89th in likelihood to have cancer". That seems basically useless as far as information goes. Similarly practically important is making cost-based decisions (where cost could be dollars or hours worked or lives saved). Imagine knowing the cost of making a false negative or a false positive - or the win you get from making a true positive or true negative. It's actually quite common to have at least approximate knowledge of these costs. In this case, you need probabilities to translate the costs into a decision rule. Hth!
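
      A minimal sketch of the cost-based decision rule (the cost numbers are made up): with a false-positive cost C_fp and a false-negative cost C_fn, acting on a case is worthwhile whenever p * C_fn > (1 - p) * C_fp, i.e. when p exceeds C_fp / (C_fp + C_fn). This only makes sense if p is a calibrated probability.

      # Assumed example costs: a missed positive is ten times as expensive
      # as a false alarm.
      C_fp, C_fn = 1.0, 10.0
      threshold = C_fp / (C_fp + C_fn)   # ~0.09
      print("act on every case with predicted probability above", threshold)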

    • @Han-ve8uh
      @Han-ve8uh 3 years ago

      @@AndreasMueller Thinking about this again, I have some ideas. Maybe one reason for calibrating is when the same person is presented with 2 different probabilities from 2 different classifiers, and he needs to resolve this inconsistency to know which number to trust. Another reason is that people may have a personal threshold for taking action, maybe 70%, and if calibrating moved a prediction from 65 to 75 or vice versa, that may change whether action is taken.
      Great point, I forgot about incorporating costs. Can I see accurate probabilities as important for calculating the Expected Value of a single customer (represented by a single input vector to be predicted), like EV = Prob response x profit + (1 - Prob response) x cost? In this case, over/under-estimating probabilities could lead to worse decisions compared to calculating EV from probabilities provided by calibrated models?
      What do you think of the above 2 paragraphs?

    • @AndreasMueller
      @AndreasMueller  3 years ago

      @@Han-ve8uh yes that's it.
      I was a bit abstract with the point about costs but you got it exactly right!
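
      A worked version of the expected-value idea above, with made-up profit and cost numbers; an over- or under-estimated probability can flip the sign of the EV and therefore the decision.

      profit, cost = 100.0, -20.0   # assumed payoff if the customer responds / does not respond

      def expected_value(p_response):
          return p_response * profit + (1 - p_response) * cost

      print(expected_value(0.25))   # calibrated estimate:     10.0 -> contact the customer
      print(expected_value(0.10))   # underconfident estimate: -8.0 -> (wrongly) skip them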

    • @Corpsecreate
      @Corpsecreate 10 months ago

      It's not needed. The idea of calibration comes from a very pervasive misunderstanding of the basics of classification modelling.

  • @AkshayKumar-xo2sk
    @AkshayKumar-xo2sk 2 years ago

    @Andreas Mueller - In the topmost bin, should the frequency of 1s be two? There are two 1s.

  • @shubhamtalks9718
    @shubhamtalks9718 3 years ago +1

    6:57 I did not understand how the expected positive is 0.16 for bin 0, 0.5 for bin 1, and 0.84 for bin 2?

    • @AndreasMueller
      @AndreasMueller  3 years ago

      it's the mid-points of the bins (which is the same as their average value), for bins [0, 1/3], [1/3, 2/3], [2/3, 1].

    • @shubhamtalks9718
      @shubhamtalks9718 3 years ago

      @@AndreasMueller Are the bins created at equal intervals, or does each bin contain the same number of data points?

    • @AndreasMueller
      @AndreasMueller  3 years ago

      @@shubhamtalks9718 Equal intervals, they are just uniformly spaced. And in an actual application you would usually use at least 10 bins, but I simplified to 3 here for illustration purposes.
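
      A one-line check of those midpoints, assuming numpy:

      import numpy as np

      edges = np.linspace(0, 1, 4)          # three equal-width bins: [0, 1/3, 2/3, 1]
      print((edges[:-1] + edges[1:]) / 2)   # [0.1667, 0.5, 0.8333] -> the 0.16 / 0.5 / 0.84 quoted on the slide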

    • @shubhamtalks9718
      @shubhamtalks9718 3 years ago

      @@AndreasMueller Got it. Thanks for the wonderful lecture.

  • @chiragsharma9430
    @chiragsharma9430 2 years ago

    Can we use calibrated classifiers for multi-class classification problems?
    If yes, can you please provide a Jupyter notebook demonstrating that?
    And thanks for uploading these videos.

    • @AndreasMueller
      @AndreasMueller  2 years ago

      There's an example in the scikit-learn documentation: scikit-learn.org/stable/auto_examples/calibration/plot_calibration_multiclass.html - you can download it as a notebook at the bottom.
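
      A minimal multiclass sketch with scikit-learn's CalibratedClassifierCV (the dataset and base estimator are arbitrary choices here; see the linked example for the full treatment):

      from sklearn.calibration import CalibratedClassifierCV
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.datasets import load_iris
      from sklearn.model_selection import train_test_split

      X, y = load_iris(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      # Sigmoid (Platt) calibration is fit per class one-vs-rest and the
      # resulting probabilities are renormalized to sum to one.
      calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                          method="sigmoid", cv=3)
      calibrated.fit(X_train, y_train)
      print(calibrated.predict_proba(X_test)[:3])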

  • @danielbaena4691
    @danielbaena4691 2 years ago

    Thanks for the video!

  • @AkshayKumar-xo2sk
    @AkshayKumar-xo2sk 2 years ago

    How did you get the 16, 50 and 84% values? I mean, for each bin you have different percentage values. How did you get those?

    • @AnujKatiyal
      @AnujKatiyal 2 years ago +1

      3 equal buckets: 0-33, 33-67, 67-100. The means of these buckets are 16, 50 and 84.

  • @mdichathuranga1
    @mdichathuranga1 2 years ago

    So if you had 10 data points which the model predicted as True, and the mean probability of those 10 data points is 0.95, but when we manually check those 10 data points we find that only 8 of them are actually true, which gives a fraction of 0.8, then we can conclude that for the data points in the 0.8 - 1 bin the model was overconfident... Am I right?
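
    This per-bin comparison is what scikit-learn's calibration_curve computes; a sketch of the example in the comment above:

    import numpy as np
    from sklearn.calibration import calibration_curve

    y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])   # 8 of the 10 points are actually positive
    y_prob = np.full(10, 0.95)                           # the model predicted 0.95 for all of them

    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
    print(mean_pred, frac_pos)   # predicted 0.95 vs observed 0.8 -> overconfident in that bin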

  • @teetanrobotics5363
    @teetanrobotics5363 4 years ago

    Sir, could you please upload the theoretical machine learning course counterpart?

    • @yuhuang8447
      @yuhuang8447 4 years ago +2

      Hi, I think he only gives lectures on the application part, and the theoretical part is given by other professors.

  • @tahirullah4786
    @tahirullah4786 3 years ago

    That's great, but where can we get the code for this video?

    • @AndreasMueller
      @AndreasMueller  3 years ago +1

      Link to the material is in the description. This lecture is at github.com/amueller/COMS4995-s20/tree/master/slides/aml-10-calibration-imbalanced-data

  • @Corpsecreate
    @Corpsecreate 10 months ago

    You don't need calibration ever. This is so silly haha

  • @Corpsecreate
    @Corpsecreate 10 months ago

    47:80 I can help you with that. These methods NEVER help.