Automated Machine Learning - Tree Parzen Estimator (TPE)

Поделиться
HTML-код
  • Опубликовано: 14 янв 2025

Комментарии • 34

  • @dfv2814
    @dfv2814 8 месяцев назад +1

    Such a clear explanation! Out of the woods, at last. Thank you for the time you put in making this video.

  • @fabiansturman7404
    @fabiansturman7404 10 месяцев назад +1

    Thank you! I was reading papers about TPEs and it seems they all either are very high level in how the calculations work (the original paper seems to explain some things but not others at all) or like 70 pages long and with tons of less useful information added in… this is such a great video :)

    • @aixplained4763
      @aixplained4763  10 месяцев назад +1

      Thank you for your kind comment :) Glad that it was helpful!

  • @ritwikpm
    @ritwikpm 2 месяца назад

    Beautiful Explanation - Reminded me of the good old Expectation Maximisation algorithms :)

  • @luistalavera6392
    @luistalavera6392 Год назад +2

    The explanation was amazing and very clear, thank you so much :D

  • @Vyshu5
    @Vyshu5 3 месяца назад

    Thank you, you saved my life😢❤

  • @TheLazyEngineer
    @TheLazyEngineer 10 месяцев назад +1

    Any rule of thumb for how many observations needed for optimizing N hyperparameters?

    • @aixplained4763
      @aixplained4763  10 месяцев назад +2

      That is a great question! Unfortunately, I do not know of such a rule of thumb. This really depends on the problem and the importance of the different hyperparameters. Normally, hyperparameter tuning is very expensive and we have to go with a small number of observations. TPE can more effectively explore the search space than, for example, random search. However, TPE is often warm-started with random search: you can for example test between 10 and 30 randomly selected configurations and use TPE to select new promising configurations.

    • @TheLazyEngineer
      @TheLazyEngineer 10 месяцев назад

      @@aixplained4763 Thanks! And great video btw.

  • @deema_c
    @deema_c 6 месяцев назад +1

    For the mixture of Gaussians, are we taking the mixture of both bad and good distributions for each observations?

    • @aixplained4763
      @aixplained4763  6 месяцев назад +1

      Good question! No, we are computing a mixture for the good observations and a mixture for the bad observations independently of each other. For example, we put a Gaussian distribution over every observation for the good distribution, and combine these Gaussians for every good point to fit a "good" distribution. Likewise for the bad points. Hope this helps!

  • @settyg
    @settyg 5 месяцев назад

    So well explained, thank you so much for sharing.

  • @JW-iw9fe
    @JW-iw9fe 2 года назад +2

    Amazing!.Thank you.As a newbie, I would like to ask a few questions.
    What is the point of using Gaussian mixture for each hyperparameter in this paper?
    I know the distribution of good point and bad point is calculated by kernel density estimation(Parzen Estimator).
    But I don't understand how the mixture of gaussian is related to the distribution of good point and bad point.
    I probably don't know much about TPE yet so the question might be a little weird.
    I hope you can understand. Looking forward to your reply !
    Thank you !

    • @aixplained4763
      @aixplained4763  2 года назад

      That's a good question! In TPE, we create two partitions: the good partition and bad partition. These contain the observations with either good or bad performance, respectively. As you mention, we do kernel density estimation for both partitions. Within kernel density estimation, we basically perform a weighted sum over all observations in a given partition, where the weight is given by the kernel. This summation naturally also occurs in a mixture of Gaussians (we sum up individual the probabilities of the individual Gaussians). So when each Gaussian corresponds to a given observation, we retrieve a summation over all data points, as in kernel density estimation. Hope this helps! :)

    • @JW-iw9fe
      @JW-iw9fe 2 года назад

      @@aixplained4763 Thank you for your answering! You help me a lot. I have another question.In Alois Bissuel Blog, he said 'One of the great drawback of tree-structured Parzen estimators is that they do not model interactions between the hyper-parameters. ' How do I understand this?
      Does it mean there is no relationship between hyperparameters?
      Looking forward to your reply !
      Thank you!

    • @aixplained4763
      @aixplained4763  2 года назад +1

      ​@@JW-iw9fe TPE models the densities for hyperparameters separately (for every hyperparameter, we fit l(x) and g(x)). As a consequence, it does not know how the individual densities are affected by other hyperparameter values, because it models them in isolation. This is a limitation of this method as in reality, the optimal value of a hyperparameter depends on the values of other hyperparameters.

    • @JW-iw9fe
      @JW-iw9fe 2 года назад

      @@aixplained4763 You are so nice. Thank you so much!
      Your answer completely solved my confusion!
      Even with these drawbacks, is TPE used more in HPO?
      And can you tell me some of the most popular technologies in HPO?
      Thank you!

    • @aixplained4763
      @aixplained4763  2 года назад +1

      @@JW-iw9fe In spite of this limitation, TPE is quite commonly used as a surrogate model for computing the expected improvement of configurations. Some popular HPO methods include SMAC (uses random forests instead of TPE as surrogate model) and BOHB (does use TPE). In BOHB, however, they used a multidimensional TPE such that interactions between hyperparameters can be modeled. If you are interested, you can read more here: www.automl.org/automated-algorithm-design/algorithm-configuration/smac/ and arxiv.org/pdf/1807.01774.pdf

  • @amirmn7
    @amirmn7 Год назад

    really nice explanation! thank you!

  • @Raven-bi3xn
    @Raven-bi3xn 2 года назад +2

    Nicely done. Thank you. Do you know why it’s called Tree Parzen Estimators?

    • @aixplained4763
      @aixplained4763  2 года назад +3

      Thank you. Great question! It is called "Parzen Estimtor" because it builds on Parzen windows for building probabilistic models (stats.stackexchange.com/questions/244012/can-you-explain-parzen-window-kernel-density-estimation-in-laymans-terms). The "tree" part refers to the fact that the configuration space of hyperparameters has a tree structure.

  • @no-lagteardown3558
    @no-lagteardown3558 Месяц назад

    Great video! I just have one question. How does this method prevent predictions from getting "stuck"? Can a prediction be the same point as an already existing one even if that point isnt a global or even local minimum? Is perhaps noise used when predicting or do we use the actual function of g(x)/l(x), scale it to be a valid distribution and take a random point from it?
    Sorry if this a silly question Im pretty new to these probabilistic optimizers.

    • @aixplained4763
      @aixplained4763  Месяц назад

      Great question! In general, Bayesian optimization, such as TPE, attempts to balance exploitation (searching around points known to be good) and exploration (searching in areas where we have not explored many points). For example, the probability l(x) can be very small in under-explored areas, leading to a very high expected improvement. Nonetheless, it is indeed still possible for the optimizer to get stuck in a non-global optimum. Many practical tricks aim to prevent this, such as randomly sampling a point every now and then (or using a larger random warmup), which would give a theoretical probability of 1 of eventually discovering the global optimum (given an infinite computation budget). But in practice, yes it is definitely possible for the optimizers to get stuck in non-global optima!

  • @jacobogerardogonzalezleon2161
    @jacobogerardogonzalezleon2161 2 года назад

    Amazing course! What theorethically implications there exists when using TPE as a sampler and Hyperband as a prunner?

    • @aixplained4763
      @aixplained4763  2 года назад +1

      Thank you :) That's a great question! Hyperband has a theoretical guarantee to find the optimal configuration with probability 1 given an infinite budget (because it does a random search in the final bracket). Using TPE as a sampler changes this, as TPE can get stuck in local optima and never explore parts of the search space (thus potentially missing out on the optimal configuration). A workaround to this would be to choose a configuration at random with a small probability (and TPE otherwise) so that we retrieve the guarantees of random search. These theoretical guarantees are, however, not very useful in practice.

  • @Maciek17PL
    @Maciek17PL 2 года назад

    great explenation!!!!! thank you

  • @icelover7166
    @icelover7166 Год назад

    Your video shows the optimisation of a hyperparameter using TPE. When the TPE optimises a set of parameters, how does it work?
    optimising each hyperparameter one by one with the other parameters fixed, or all hyperparameters together? I hope you can answer this, I am confused.

    • @aixplained4763
      @aixplained4763  Год назад

      That is a great question! When you have multiple hyperparameters you want to tune, you fit P(x|good) and P(x|bad) for every individual hyperparameter x (for example, we could do this for the learning rate and batch size). This means that every hyperparameter is optimized independently of the other hyperparameters and that interactions cannot be modeled. This is one of the main shortcomings of TPE, but can be addressed by using multivariable distributions like P(x_1, x_2|good), although that brings about new issues :)

    • @icelover7166
      @icelover7166 Год назад

      @@aixplained4763 Thank you for your answer, does this mean that using TPE for hyperparameter optimisation will ignore the link between the hyperparameters. For example, when tuning for learning rate and batch size, the results they get will be significantly different between optimising learning rate before batch size using TBE and optimising batch size before learning rate?

    • @aixplained4763
      @aixplained4763  Год назад

      @@icelover7166 Yes, exactly. TPE cannot learn the interaction effects (or links) between different hyperparameters. Concerning the latter example you mention: this can indeed be the case but it depends on how the tuning is performed. In the regular case, you perform a random search for a few iterations (a warm-up) where you select random hyperparameter configurations (one configuration = a certain value for the learning rate and batch size in our case). These can be used to fit the densities used by TPE, and for selecting promising configurations to try out next.
      If instead you decide to first tune the batch size and fixing all other hyperparameters, and afterward tune the learning rate, fixing all other hyperparameters, then the outcome can indeed be very different compared to the case described above. Hope this helps!

    • @icelover7166
      @icelover7166 Год назад

      @@aixplained4763 This is very helpful. Thank you

  • @andrewvirtual
    @andrewvirtual Год назад

    Very well explained

  • @jk-cw8ix
    @jk-cw8ix Год назад

    Good shit