Bayesian Deep Learning and Probabilistic Model Construction - ICML 2020 Tutorial

  • Published: 12 Sep 2024
  • Bayesian Deep Learning and a Probabilistic Perspective of Model Construction
    ICML 2020 Tutorial
    Bayesian inference is especially compelling for deep neural networks. The key distinguishing property of a Bayesian approach is marginalization instead of optimization. Neural networks are typically underspecified by the data, and can represent many different but high-performing models corresponding to different settings of parameters, which is exactly when marginalization will make the biggest difference for accuracy and calibration. A toy Monte Carlo sketch of this marginalization appears just after the description below.
    The tutorial has four parts:
    Part 1: Introduction to Bayesian modelling and overview (Foundations, overview, Bayesian model averaging in deep learning, epistemic uncertainty, examples)
    Part 2: The function-space view (Gaussian processes, infinite neural networks, training a neural network is kernel learning, Bayesian non-parametric deep learning)
    Part 3: Practical methods for Bayesian deep learning (Loss landscapes, functional diversity in mode connectivity, SWAG, epistemic uncertainty, calibration, subspace inference, K-FAC Laplace, MC Dropout, stochastic MCMC, Bayes by Backprop, deep ensembles)
    Part 4: Bayesian model construction and generalization (Deep ensembles, MultiSWAG, tempering, prior-specification, posterior contraction, re-thinking generalization, double descent, width-depth trade-offs, more!)
    Slides: cims.nyu.edu/~...
    Associated Paper: "Bayesian Deep Learning and a Probabilistic Perspective of Generalization" (NeurIPS 2020)
    arxiv.org/pdf/...
    Thanks to Kevin Xia (Columbia) for help in preparing the video.
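The description above stresses marginalization over optimization. As a rough illustration only (not part of the tutorial materials), the sketch below approximates a Bayesian model average with simple Monte Carlo in a toy conjugate linear-regression setting, where the posterior over the weights can be sampled exactly; every number and name in it is made up for illustration.

```python
# A minimal sketch (not from the tutorial) of "marginalization instead of optimization":
# the Bayesian model average
#     p(y | x, D) = ∫ p(y | x, w) p(w | D) dw  ≈  (1/J) Σ_j p(y | x, w_j),  w_j ~ p(w | D)
# Toy conjugate Bayesian linear regression is used so the posterior over weights
# can be sampled exactly; all data and hyperparameters below are made up.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 1.5 * x + noise
X = rng.uniform(-1.0, 1.0, size=(20, 1))
y = 1.5 * X[:, 0] + 0.1 * rng.standard_normal(20)

alpha, beta = 1.0, 100.0                 # prior precision, noise precision
Phi = np.hstack([np.ones((20, 1)), X])   # features: [bias, x]

# Standard conjugate Gaussian posterior over the weights
S = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ y

# Marginalize: average the predictions of many sampled models at a test input
x_test = np.array([1.0, 0.5])            # [bias, x = 0.5]
w_samples = rng.multivariate_normal(m, S, size=2000)
preds = w_samples @ x_test               # one prediction per sampled model

# Optimization alone would report a single point prediction (the MAP weights);
# marginalization also quantifies epistemic uncertainty about that prediction.
print("point prediction (MAP weights):", m @ x_test)
print("BMA mean:", preds.mean(), "  epistemic std:", preds.std())
```

The same recipe, averaging p(y | x, w_j) over posterior samples w_j, is what the approximate methods covered in Parts 3 and 4 (SWAG, MC Dropout, stochastic MCMC, deep ensembles, MultiSWAG) try to make tractable for deep networks.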

Comments • 17

  • @yeahzisue
    @yeahzisue 3 years ago +12

    This is gold. Thank you professor

  • @gyeonghokim
    @gyeonghokim 2 years ago +7

    0:00 - Part 1: Introduction to Bayesian modelling and overview
    30:51 - Part 2: The function-space view
    59:55 - Part 3: Practical methods for Bayesian deep learning
    1:26:29 - Part 4: Bayesian model construction and generalization

  • @anmolsmusings6370
    @anmolsmusings6370 1 year ago +1

    What a *beautiful* tutorial! The professor seems to really *feel* every aspect of the lecture. Hats off, Professor.

  • @meisherenow
    @meisherenow 1 year ago

    Nicely explained, with lots of pointers to papers. Thanks!

  • @andreselias5863
    @andreselias5863 2 years ago

    The professor is incredible and the lecture brilliant

  • @foremarke
    @foremarke 8 months ago +1

    🎯 Key Takeaways for quick navigation:
    00:00 🤖 *Andrew Wilson introduces the ICML 2020 tutorial on Bayesian Deep Learning, acknowledging contributors and collaborators.*
    03:21 🔄 *Flexibility and structured inductive biases are crucial for constructing models with good generalization; however, flexibility should not be confused with complexity.*
    05:19 🔄 *Bayesian model averaging, emphasizing marginalization over optimization, is a key distinction in Bayesian approaches, especially relevant in deep learning.*
    08:52 🧩 *Bayesian Deep Learning demystifies neural networks, providing insights through probability theory, addressing issues like double descent, model construction, and fitting random labels.*
    09:33 🚀 *Bayesian Deep Learning has seen significant empirical progress, with methods offering practical advantages over classical training on various problems.*
    10:18 🗂️ *The tutorial covers four parts: Bayesian foundations, a function space perspective, practical methods, and Bayesian model construction, exploring concepts and modern approaches.*
    14:09 🔍 *Bayesian approaches provide interpretability in machine learning, such as addressing overfitting through regularization and expressing uncertainty with posterior probabilities.*
    16:35 🔄 *Bayesian model averaging, involving marginalization over parameters, represents epistemic uncertainty, allowing consideration of all possible parameter settings.*
    18:16 📈 *Bayesian model averaging contrasts with some model combination approaches, as it assumes a collapse onto a correct parameter setting with increasing data, different from enriching the hypothesis space.*
    22:47 🎲 *Bayesian inference in a coin-flipping example illustrates the use of conjugate priors, providing a principled way to handle uncertainty and make predictions.*
    23:15 🔄 *Bayesian marginalization over a beta-distribution posterior yields more reasonable predictions than the maximum likelihood estimate, especially when data is limited.* (A numeric sketch of this example appears after this summary.)
    24:11 🔄 *The Maximum A Posteriori (MAP) estimate, derived from the log posterior over parameters, may not always align with Bayesian marginalization, emphasizing the importance of honest representation in modeling beliefs.*
    25:12 🔄 *Introduction to an example: estimating an unknown density with a mixture-of-Gaussians observation model, and asking which parameter settings give the data high likelihood.*
    26:51 🔄 *Unbelievable solutions in modeling may occur due to incomplete representation of beliefs; the introduction of a prior or regularizer can address issues like point mass solutions.*
    27:20 🔄 *The goal in Bayesian methods is to compute a Bayesian model average for the unconditional predictive distribution, considering all possible settings of parameters weighted by their posterior probabilities.*
    28:16 🔄 *For models with non-analytic integrals, such as Bayesian neural nets, simple Monte Carlo approximations are common; methods like variational methods and Markov Chain Monte Carlo (MCMC) are discussed.*
    30:25 🔄 *Estimating the unconditional predictive distribution is treated as an active learning problem under computational constraints; Deep Ensembles method is introduced as an approximate Bayesian method.*
    30:55 🔄 *Part 2 begins with a focus on a function space perspective in Bayesian deep learning, covering topics like Gaussian processes, infinite neural nets, and Bayesian non-parametric deep learning.*
    32:46 🔄 *Derivation of moments of an induced distribution over functions in a linear model with a Gaussian process, leading to the definition of a Gaussian process and its properties.*
    37:40 🔄 *Introduction to the Radial Basis Function (RBF) kernel in Gaussian processes and its role in controlling correlations between function values based on distance in input space.* (A small code sketch of an RBF-kernel GP prior appears after this summary.)
    43:21 🔄 *Derivation of the RBF kernel from an infinite number of basis functions, showcasing the flexibility of Gaussian processes and their ability to use models with an infinite number of parameters.*
    47:46 🔄 *Discussion on how an infinite neural net converges to a Gaussian process, highlighting the excitement and debates in the machine learning community about using Gaussian processes versus neural nets.*
    50:29 🧠 *Bayesian Deep Learning combines Gaussian processes (GPs) and neural nets, leveraging neural nets for adaptive basis functions and GPs for an infinite number of basis functions with finite computation.*
    51:27 🔄 *Deep Kernel Learning integrates neural nets and GPs, creating an end-to-end model trained through the marginal likelihood of a GP, providing infinite adaptive basis functions.*
    53:05 📊 *Non-Euclidean metric learning in neural nets and GPs reveals their flexibility in capturing similarity, emphasizing that Euclidean distance may not be suitable for representation learning.*
    54:44 🔍 *Gaussian processes with deep kernels, like spectral mixture kernels, can effectively fit discontinuous data, demonstrating the importance of kernel choice in handling diverse data patterns.*
    56:27 ⚙️ *Advances in hardware, like GPU acceleration and parallelization, enable scalable exact Gaussian processes, overcoming previous computational constraints and showcasing their benefits.*
    57:12 🚀 *Non-parametric Gaussian processes with infinite basis functions automatically scale model capacity with data, contrasting with parametric models determined by a fixed set of parameters.*
    59:32 🌐 *Combining Bayesian non-parametrics with deep learning offers exciting prospects for future research, emphasizing the need for inductive biases from neural nets in Gaussian processes.*
    01:00:16 🔧 *Practical Bayesian deep learning methods, such as SWAG (Stochastic Weight Averaging Gaussian), address challenges like overconfidence in classical training by incorporating uncertainty.*
    01:03:23 🔄 *Mode connectivity in neural net landscapes reveals that even seemingly isolated solutions are connected in subspaces, suggesting the need for exploration beyond local optima.*
    01:05:18 🎨 *Visualizing mode connectivity helps understand the structure of neural net loss landscapes, leading to methods like SWAG, which samples from a region containing diverse low-loss models.*
    01:08:41 📈 *The SWAG procedure, a simple baseline for Bayesian uncertainty in deep learning, provides scalable and improved predictions, enhancing model generalization compared to classical training.* (A toy SWAG-Diag sketch appears after this summary.)
    01:12:57 🔄 *Bayesian marginalization in low-dimensional subspaces, as demonstrated in SWAG, challenges the notion that high-dimensional parameter spaces are essential for capturing functional variability in neural nets.*
    01:14:34 📊 *Bayesian inference enhances epistemic uncertainty representation, crucial for predictive distribution in deep learning.*
    01:17:11 🔄 *Bayesian marginalization, specifically using PCA subspace, leads to non-trivial gains in accuracy compared to classical training in neural networks.*
    01:19:29 🌐 *Laplace approximation simplifies Bayesian inference but is constrained to unimodal approximations, limiting its ability to capture global uncertainty.*
    01:21:12 🎭 *MC Dropout, a popular Bayesian approach, utilizes dropout during both training and testing to create an ensemble, addressing some limitations of model averaging.*
    01:23:46 🔄 *Stochastic MCMC methods, like stochastic gradient Langevin dynamics, offer scalability and applicability in Bayesian deep learning, exploring complex loss surfaces effectively.*
    01:24:35 🔄 *Deep ensembles involve retraining neural networks multiple times from different initializations, creating diverse models for Bayesian model averaging, offering a practical and effective heuristic.* (A toy deep-ensemble sketch appears after this summary.)
    01:26:56 🌊 *Deep ensembles provide a better approximation to Bayesian model averaging than many single-basin marginalization methods, especially in the presence of computational constraints.*
    01:37:25 📈 *Bayesian methods, like Multi-SWAG, exhibit a monotonic improvement with increased model flexibility, contrasting with the double descent observed in classical training.*
    01:37:51 📉 *Bayesian methods, especially multimodal marginalization, show significant empirical benefits in terms of both uncertainty representation and accuracy in deep neural networks.*
    01:38:31 📊 *Gaussian priors in weight space induce a reasonable prior in function space, as demonstrated by the deep image prior and results from the paper by Zhang et al.*
    01:39:56 🔥 *Tempering in Bayesian deep learning, altering the likelihood with a temperature parameter, is essential, and a standard Gaussian prior may lead to poor performance without proper tempering.*
    01:40:39 🤔 *Prior misspecification, specifically with a standard Gaussian prior, can result in sample functions assigning a single class label to most inputs, highlighting the importance of tuning the prior variance parameter (alpha).*
    01:41:35 🌐 *Prior biases are quickly modulated by data, and the induced covariance function in the distribution over functions plays a crucial role in generalization, impacting it more than the signal variance parameter.*
    Made with HARPA AI
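To accompany the 22:47-23:15 items above, here is a tiny numeric sketch of the conjugate coin-flipping example: with a Beta prior, the marginalized posterior predictive is more reasonable than the maximum likelihood estimate when data are scarce. The prior parameters and flip counts below are illustrative, not taken from the tutorial.

```python
# Illustrative numbers for the conjugate coin-flipping example (22:47-23:15):
# with a Beta(a, b) prior on theta = P(heads) and k heads in n flips, the posterior
# is Beta(a + k, b + n - k), and marginalizing over theta gives the posterior
# predictive P(heads) = (a + k) / (a + b + n).  The counts here are made up.
a, b = 1.0, 1.0   # uniform Beta(1, 1) prior over the heads probability
k, n = 2, 2       # observed: 2 heads in 2 flips (very limited data)

mle = k / n                         # maximum likelihood: 1.0 (coin "always" lands heads)
post_pred = (a + k) / (a + b + n)   # marginalized prediction: 0.75

print(f"MLE P(heads)                  = {mle:.2f}")
print(f"Posterior predictive P(heads) = {post_pred:.2f}")
```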
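For the Gaussian-process material around 37:40, this minimal sketch (mine, not the tutorial's) builds an RBF kernel and draws sample functions from the corresponding GP prior; the length-scale, signal variance, and input grid are arbitrary illustrative choices.

```python
# A small sketch of the RBF-kernel Gaussian process from around 37:40: nearby
# inputs have highly correlated function values, and draws from the GP prior are
# smooth functions.  All hyperparameters are arbitrary.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.5, signal_var=1.0):
    """k(x, x') = s^2 * exp(-||x - x'||^2 / (2 * l^2))."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sqdist / lengthscale ** 2)

x = np.linspace(-3.0, 3.0, 200)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)  # 3 prior functions
print(samples.shape)  # (3, 200): three sample functions evaluated on the grid
```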
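For the SWAG discussion at 01:05:18-01:08:41, the sketch below shows only the diagonal ("SWAG-Diag") idea on a toy logistic-regression problem: accumulate first and second moments of late SGD iterates, fit a Gaussian over the weights, then sample weights and average predictions. The published method also maintains a low-rank covariance term, so treat this as an assumption-laden toy rather than the authors' implementation; the data, learning rate, and snapshot schedule are made up.

```python
# Hedged SWAG-Diag sketch: fit a diagonal Gaussian to late SGD iterates, then
# sample weights from it and average the resulting predictions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.standard_normal(200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
lr, n_steps, collect_from = 0.1, 2000, 1000
mean, sq_mean, n_collected = np.zeros(2), np.zeros(2), 0

for step in range(n_steps):
    idx = rng.integers(0, len(X), size=32)                       # mini-batch
    grad = X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / len(idx)  # logistic NLL gradient
    w -= lr * grad                                               # SGD step
    if step >= collect_from and step % 10 == 0:                  # collect snapshots
        n_collected += 1
        mean += (w - mean) / n_collected
        sq_mean += (w ** 2 - sq_mean) / n_collected

var = np.maximum(sq_mean - mean ** 2, 1e-8)    # diagonal posterior covariance

# Bayesian model average over sampled weights at a test point
x_test = np.array([1.0, 1.0])
w_samples = mean + np.sqrt(var) * rng.standard_normal((100, 2))
p = sigmoid(w_samples @ x_test).mean()
print("SWAG-Diag averaged P(y=1 | x_test) =", round(p, 3))
```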
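For the deep-ensembles item at 01:24:35, the sketch below retrains the same small network from several random initializations and averages the predictive distributions. scikit-learn's MLPClassifier stands in for a deep network purely to keep the example short; the dataset and all settings are illustrative.

```python
# Hedged deep-ensembles sketch: train the same architecture from several random
# initializations and average the members' predictive distributions.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

ensemble = []
for seed in range(5):   # 5 independently trained ensemble members
    net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=seed)
    net.fit(X, y)
    ensemble.append(net)

x_test = np.array([[0.0, 0.5]])
member_probs = np.stack([net.predict_proba(x_test)[0] for net in ensemble])
print("per-member P(class 1):      ", member_probs[:, 1].round(3))
print("ensemble-averaged P(class 1):", member_probs[:, 1].mean().round(3))
```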

  • @piratehussam
    @piratehussam 3 years ago +11

    Part 1: 00:00:00
    Part 2: 00:30:51

  • @David-rb9lh
    @David-rb9lh 2 years ago

    Great explanation, thanks!
    Small mistake at 16:27: we should sum over y, not over x, to obtain the marginal of x.

  • @heyjianjing
    @heyjianjing 2 years ago

    What an outstanding lecture!

  • @bytearray4456
    @bytearray4456 1 year ago

    Superb! Thank you.

  • @imanmossavat9383
    @imanmossavat9383 2 years ago

    Really nice talk, thank you.

  • @shishuaiwang7407
    @shishuaiwang7407 1 year ago

    thanks!👏

  • @MrStrangerinthewind
    @MrStrangerinthewind 3 years ago

    Is it possible to share the slides (PPT)? Many thanks.