REFS:
[0:02:25] **Academic position and research focus at ETH Zurich** | Jonas Hübotter is a doctoral researcher in the Learning and Adaptive Systems Group at ETH Zurich, working with Professor Andreas Krause on machine learning and local learning. | Jonas Hübotter
jonhue.github.io/
[0:02:50] **The Pile benchmark dataset for language model evaluation** | The Pile is an 825 GiB English text corpus used for training large-scale language models, consisting of 22 diverse high-quality subsets including academic writing, Stack Exchange, and other sources. | Leo Gao et al.
arxiv.org/abs/2101.00027
[0:05:52] **Framework for making machine learning accessible through teaching-focused approach** | Machine Teaching: A New Paradigm for Building Machine Learning Systems - Microsoft Research paper introducing the concept of machine teaching as a discipline focused on teachers rather than learners | Patrice Y. Simard et al.
arxiv.org/abs/1707.06742
[0:07:35] **Foundational paper introducing RAG architecture combining pre-trained models with explicit memory access** | RAG (Retrieval-Augmented Generation) paper by Patrick Lewis et al. introducing the concept of combining parametric and non-parametric memory for language generation | Patrick Lewis et al.
arxiv.org/abs/2005.11401
[0:09:50] **Comprehensive documentation of The Pile dataset including its mathematical components** | The Pile dataset including DeepMind Mathematics component, containing school-level math questions and other diverse text data | Stella Biderman et al.
arxiv.org/pdf/2201.07311
[0:11:25] **Survey paper analyzing knowledge conflicts in LLMs between pre-training and in-context information** | Research on conflicts between in-context learning and pre-training knowledge in large language models | Chen, Zhixiu and Wang, Yuchen and Zhang, Zhihao and Wang, Xu and Li, Zhiwei
arxiv.org/html/2403.08319v2
[0:13:40] **Study of ant foraging rules and pheromone trail network properties** | Research on ant colony foraging behavior and pheromone trail networks | Czaczkes, Tomer J
pmc.ncbi.nlm.nih.gov/articles/PMC3291321/
[0:16:05] **Theory of instrumental convergence in superintelligent AI systems** | Instrumental convergence thesis in AI safety, discussing how superintelligent AI systems might develop predictable sub-goals regardless of their final goals | Nick Bostrom
nickbostrom.com/superintelligentwill.pdf
[0:18:45] **Seminal paper defining universal intelligence and its relationship to compression** | Marcus Hutter's fundamental work on universal intelligence and its relationship to compression, particularly in his collaboration with Shane Legg defining machine intelligence | Shane Legg and Marcus Hutter
arxiv.org/pdf/0712.3329.pdf
[0:20:50] **Paper connecting active inference, free energy principle, and maximum entropy methods in machine learning** | Discussion of active inference as a form of maximum entropy inverse reinforcement learning, which relates to the paper 'The Free Energy Principle for Perception and Action: A Deep Learning Perspective' discussing the relationship between active inference and maximum entropy methods | Pietro Mazzaglia et al.
www.mdpi.com/1099-4300/24/2/301/pdf
[0:23:10] **Paper explaining how active inference leads to autonomous organization in biological systems** | Discussion of emergence of self-sustaining behaviors through active inference relates to 'The Markov blankets of life: autonomy, active inference and the free energy principle', which explores how active inference leads to autonomous behavior | Karl J. Friston
royalsocietypublishing.org/doi/10.1098/rsif.2017.0792
[0:23:30] **Technical framework for implementing intentional behavior in active inference agents** | Active Inference and Intentional Behaviour (2024) discusses how active inference frameworks can be used to create AI systems with constrained agency and specific preferences. | Karl J. Friston
arxiv.org/html/2312.07547v2
[0:24:10] **Research on genetic constraints and behavioral plasticity in intelligence** | The Paradox of Intelligence: Heritability and Malleability Coexist in Hidden Gene-Environment Interplay (2018) explores how genetic constraints interact with environmental plasticity. | Bruno Sauce, Louis D. Matzel
www.ncbi.nlm.nih.gov/pmc/articles/PMC5754247/
[0:26:55] **Foundational work establishing dual-process theory of cognition with System 1/2 framework** | System 1 (fast, intuitive, and emotional) and System 2 (slower, deliberative, and logical) thinking framework from 'Thinking, Fast and Slow'. Context: Discussion of cognitive architectures and their applicability to AI systems. | Daniel Kahneman
www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555
[0:28:55] **Analysis of computational mechanisms behind in-context learning versus fine-tuning** | Computational differences between in-context learning and fine-tuning, with ICL requiring forward computation for each token while fine-tuning uses back-propagation. Context: Discussion of efficiency in different learning approaches. | Wei et al.
arxiv.org/pdf/2212.10559
[0:30:55] **Foundational paper introducing nearest neighbor pattern classification** | Cover and Hart's 1967 paper on the nearest neighbor rule for pattern recognition and classification, building on earlier work from the 1950s | Cover, T., Hart, P.
ieeexplore.ieee.org/document/1053964
PART 2:
[0:32:05] **Fundamental work establishing theoretical framework for transductive learning** | Vladimir Vapnik's work on transductive inference and statistical learning theory | Vladimir Vapnik
www.springer.com/gp/book/9780387987804
[0:35:35] **Leading researcher in conformal prediction and machine learning at Royal Holloway** | Reference to Vladimir Vovk at Royal Holloway University, pioneer of conformal prediction | Vladimir Vovk
pure.royalholloway.ac.uk/en/persons/vladimir-vovk
[0:36:30] **Neuroscientist exploring consciousness and its relationship with emotional processing** | Reference to Mark Solms' work on consciousness and its relationship with ambiguity processing | Mark Solms
ruclips.net/video/CmuYrnOVmfk/видео.html
[0:40:00] **Foundational paper establishing active inference as a model of agency and choice behavior** | Karl Friston's active inference model of agency, which describes how biological systems maintain their state through prediction and action | Karl Friston
www.frontiersin.org/journals/human-neuroscience/articles/10.3389/fnhum.2013.00598/full
[0:41:45] **Novel approach for efficient test-time adaptation of language models** | SIFT data-selection paper on active test-time fine-tuning, discussing local distribution learning in language models | Jonas Hübotter et al.
arxiv.org/pdf/2410.08020
[0:43:35] **Research on improving LLM performance through test-time adaptation using nearest neighbors** | Test-Time Training on Nearest Neighbors for Large Language Models (2024). The paper discusses how updating models at test time with relevant data can improve performance, aligning with the speaker's points about local learning benefits. | Moritz Hardt, Yu Sun
arxiv.org/html/2305.18466v3
[0:48:25] **Survey of active learning techniques addressing domain shift and multi-domain sampling** | Concept of active learning addressing distribution shift in machine learning systems. Active learning systems continuously retrain on shifting data distributions to maintain model performance over time. | Shayne Longpre et al.
arxiv.org/abs/2202.00254
[0:50:55] **Research on combining retrieval and fine-tuning for in-context learning models** | Discussion of nearest neighbor retrieval and fine-tuning approach for local model adaptation. This relates to the naive approach mentioned in the conversation about retrieving nearest neighbors for fine-tuning. | Thomas et al.
arxiv.org/abs/2406.05207
[0:54:05] **Original RoBERTa paper introducing the improved BERT-based model for NLP tasks** | RoBERTa (A Robustly Optimized BERT Pretraining Approach) improves on BERT's masking strategy and training methodology. It is commonly used for generating embeddings in information retrieval tasks. | Yinhan Liu et al.
arxiv.org/pdf/1907.11692.pdf
[0:58:45] **Comprehensive guide to deep learning that includes detailed discussion of fine-tuning practices** | Deep Learning with Python by François Chollet discusses the challenges and best practices of fine-tuning neural networks, particularly regarding learning rate selection and gradient steps | François Chollet
www.amazon.com/Learning-Python-Second-Fran%C3%A7ois-Chollet/dp/1617296864
[1:01:55] **Research paper examining the Linear Representation Hypothesis in neural networks** | Linear Representation Hypothesis (LRH) in neural networks, which posits that networks encode concepts as directions in activation space | Róbert Csordás et al.
arxiv.org/abs/2408.10920
[1:03:10] **ML researcher specializing in mechanistic interpretability of neural networks** | Neel Nanda - Machine Learning Researcher at DeepMind, previously at Anthropic, known for work in mechanistic interpretability | Neel Nanda
www.neelnanda.io/about
[1:05:40] **Foundational paper introducing LIME for model interpretability** | LIME (Local Interpretable Model-agnostic Explanations) - A technique for explaining predictions of any classifier using local linear approximations | Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
arxiv.org/abs/1602.04938
[1:09:35] **Seminal paper on using influence functions for understanding black-box model predictions** | Influence Functions in machine learning as described in 'Understanding Black-box Predictions via Influence Functions' by Koh & Liang. The paper demonstrates how influence functions can trace model predictions back to training data. | Pang Wei Koh, Percy Liang
arxiv.org/abs/1703.04730
[1:11:45] **Comprehensive overview of dataset security vulnerabilities including data poisoning and backdoor attacks** | Data poisoning attacks in machine learning security, where training data manipulation can create backdoors and vulnerabilities in ML systems | Micah Goldblum et al.
arxiv.org/abs/2012.10544
[1:16:05] **Fundamental textbook covering Bayesian linear regression with closed-form solutions** | Bayesian Linear Regression as described in 'Pattern Recognition and Machine Learning' by Bishop. The text discusses closed-form solutions for posterior computation with Gaussian priors and likelihood. Context: Speaker explains how linear surrogate models with Gaussian priors enable tractable posterior computation. | Christopher M. Bishop
www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738
[1:17:10] **Paper demonstrating Bayesian neural networks for uncertainty quantification** | Discussion of uncertainty quantification in neural networks using Bayesian methods, referencing 'Bayesian Deep Convolutional Encoder-Decoder Networks for Surrogate Modeling and Uncertainty Quantification'. Context: Speaker contrasts traditional neural networks with Bayesian approaches for uncertainty estimation. | Yinhao Zhu
arxiv.org/abs/1801.06879
[1:18:55] **Comprehensive review of variational inference methods including closed-form solutions with conjugate priors** | Closed-form Bayesian inference with Gaussian distributions (conjugate priors), as detailed in 'Variational Inference: A Review for Statisticians'. The paper discusses how conjugate priors lead to analytically tractable posterior distributions, particularly in the case of Gaussian distributions. | David M. Blei, Alp Kucukelbir, Jon D. McAuliffe
arxiv.org/pdf/1601.00670
[1:26:15] **MindsAI's breakthrough in ARC challenge using test-time fine-tuning** | MindsAI team's achievement in the ARC (Abstraction and Reasoning Corpus) Challenge, reaching 54.5% performance using test-time fine-tuning approach in late 2024 | Mohamed Osman & MindsAI Team
www.reddit.com/r/singularity/comments/1gexvmj/new_arcagi_high_score_by_mindsai_545_prize_goal/
[1:29:50] **Research on active inference for collaborative AI systems in unknown environments** | Active Inference in distributed AI systems as described in 'Collaborative AI Teaming in Unknown Environments via Active Goal Inference'. The paper discusses how active inference can be used in distributed AI systems for collaborative tasks. | Jaya Krishna Thota et al.
arxiv.org/pdf/2403.15341
[1:32:55] **Introduction of OpenAI o1 model with novel inference-time scaling properties** | OpenAI o1 model's inference-time scaling capabilities, introduced in September 2024, showing performance improvements with both train-time and test-time compute allocation | OpenAI
openai.com/index/learning-to-reason-with-llms/
[1:33:55] **Framework for active inference and uncertainty minimization in AI systems** | Active inference framework for AI systems, focusing on uncertainty minimization through predictive coding and exploration | Abdelrahman Sharafeldin
www.sciencedirect.com/science/article/pii/S2666389924000977
[1:36:05] **Theoretical analysis of convergence in uncertainty-based active learning** | Research on convergence guarantees in uncertainty-based active learning, discussing how selecting informative data points based on uncertainty reduction can lead to optimal convergence | Yingzhen Yang et al.
arxiv.org/pdf/2312.13927
[1:37:50] **Information-theoretic analysis of transductive learning generalization bounds** | Discussion of transductive learning theory and its relationship to inductive learning, particularly relevant to the proposed hybrid approach | Huayi Tang et al.
arxiv.org/abs/2311.04561
[1:38:50] **Research paper establishing theoretical foundations of transductive learning in machine learning** | Discussion of transductive learning in machine learning context, referencing the theoretical framework of transductive vs inductive learning approaches | Mathieu Chalvidal
arxiv.org/abs/2302.00328
[1:40:35] **Foundational paper establishing scaling laws for neural language models** | Reference to scaling laws in language model training, discussing the relationship between compute budget and model size | Jared Kaplan et al.
arxiv.org/abs/2001.08361
[1:42:20] **Latest Apple Silicon chip optimized for ML workloads** | Apple M4 chip, announced in May 2024 (debuting in the iPad Pro), represents a significant advancement in on-device ML processing capabilities | Apple Inc.
www.apple.com/newsroom/2024/05/apple-introduces-m4-chip/
[1:42:40] **Advanced language model with improved performance and speed** | Claude 3.5 Sonnet by Anthropic, released in 2024, providing enhanced capabilities for model verification and complex reasoning tasks | Anthropic
www.anthropic.com/news/claude-3-5-sonnet
[1:45:45] **Information-based transductive active learning research with applications to safe exploration** | Reference to transductive fine-tuning and its future impact in AI systems, connecting to active learning and uncertainty estimation methods | Jonas Hübotter et al.
arxiv.org/pdf/2405.05890
Good references! Much quality! Craftsmanship is appreciated!!
ETH Zurich's work on ANYmal (the four-legged/dog robot platform) always amazes with their improvements
Reason can emerge in a connectionist system, but loops and algorithms can only be rolled out to something less deep than the model's depth. So we need System 2 thinking for longer algorithms.
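A toy sketch of the commenter's point (purely illustrative, not from the video): a fixed-depth model can only unroll an iterative rule for as many steps as it has layers, whereas a System 2 style outer loop can keep iterating until done. The `step` rule and function names below are hypothetical.

```python
def step(x):
    # hypothetical one-step update rule (a Collatz step, purely for illustration)
    return x // 2 if x % 2 == 0 else 3 * x + 1

def fixed_depth_model(x, depth=4):
    # stands in for a feed-forward net: at most `depth` unrolled iterations
    for _ in range(depth):
        x = step(x)
    return x

def system2_loop(x, is_done):
    # an outer deliberative loop can apply the same rule arbitrarily many times
    while not is_done(x):
        x = step(x)
    return x

print(fixed_depth_model(27))                  # stops after 4 steps
print(system2_loop(27, lambda v: v == 1))     # runs until the rule terminates
```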
So much groundwork to cover, bit snoozy, then about the hour mark it really starts to get meaty. Good questioning Tim! Jonas is a machine, so eloquent!
Tim looks great without hair! Jonas is a great guest. Thanks for bringing him back❤
How does one implement test-time adaptation when a deployed model is often quantized, while training is often only stable at higher precisions?
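One common workaround, in the spirit of the QLoRA adapters mentioned further down in this thread, is to keep the quantized base weights frozen and train only a small higher-precision adapter at test time. A minimal PyTorch sketch of that pattern (illustrative assumptions, not something proposed in the video):

```python
import torch
import torch.nn as nn

class LowRankAdapterLinear(nn.Module):
    """Frozen (possibly quantized) base layer plus a trainable low-rank
    adapter kept in higher precision. Rank and init are illustrative."""
    def __init__(self, base_linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base_linear.requires_grad_(False)           # frozen deployed weights
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)   # higher-precision adapter
        self.B = nn.Parameter(torch.zeros(rank, d_out))         # starts as a no-op

    def forward(self, x):
        return self.base(x) + (x @ self.A) @ self.B             # base output + adapter delta

# usage sketch: only the adapter parameters receive gradient updates at test time
# layer = LowRankAdapterLinear(model.some_linear)
# optimizer = torch.optim.AdamW([layer.A, layer.B], lr=1e-4)
```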
That was an interesting point about the amortization of SFT vs in-context learning; SFT is able to parallelize the operations across the batch, whereas in-context learning has to operate sequentially on all of the examples.
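A back-of-envelope sketch of that amortization argument (the constants are rough assumptions, not measurements): fine-tuning pays roughly a one-off forward-plus-backward cost over the adaptation examples, while in-context learning re-pays the forward cost of the demonstration tokens on every query.

```python
def amortised_cost(n_examples, tokens_per_example, n_queries, flops_per_token=1.0):
    """Rough cost model: fine-tuning ~3x forward FLOPs per training token, once;
    in-context learning re-processes the demonstration tokens for every query."""
    demo_tokens = n_examples * tokens_per_example
    finetune_cost = 3 * demo_tokens * flops_per_token            # paid once
    in_context_cost = n_queries * demo_tokens * flops_per_token  # paid per query
    return finetune_cost, in_context_cost

print(amortised_cost(n_examples=32, tokens_per_example=200, n_queries=1000))
# fine-tuning cost is fixed; the ICL cost grows linearly with the number of queries
```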
We used similar techniques in our math lab and made a 7% improvement on MATH level 5 with a 7B model
1:00:00 You could use dropout only on biases, and train only the biases when you fine-tune on your test data; disable it otherwise.
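A minimal PyTorch sketch of that suggestion (akin to BitFit-style tuning; illustrative only, not something from the episode): freeze everything except bias terms before test-time fine-tuning.

```python
import torch

def trainable_biases_only(model: torch.nn.Module):
    """Freeze all parameters except bias terms and return the trainable set."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
    return [p for p in model.parameters() if p.requires_grad]

# usage sketch for test-time fine-tuning on retrieved data:
# bias_params = trainable_biases_only(model)
# optimizer = torch.optim.AdamW(bias_params, lr=1e-4)
```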
I'm quebono100, one of your first subscribers. You guys are still killing it. Such good, unique work that you are doing. Thank you
How does the current incarnation differ from Continual/Lifelong/Incremental learning?
Sutton, LeCun, M. Mitchell, I. Rish, and others have alluded to this, yet the field doesn't seem to focus on this fundamental gap at all.
Jonas talks about this in his paper "Retrieval and active learning can be seen as two extreme ends of a spectrum: retrieval selects relevant but potentially
redundant data, while active learning selects diverse but potentially irrelevant data." - also Jonas is advocating for transductive inference when we go from "particular to particular" i.e. build a new model for each prediction from the data (and models) we have access to
@ Ty! Looking forward to reading it!
Great talk! I want to hear your opinion on whether agentic AI is a form of test-time compute, since you're literally answering the request over a longer time than a single call to a current non-TTC LLM.
22:47 through 23:22
23:36 through 24:01
24:53 through 25:52
25:58 through 26:44
As someone forever living the aftermath of numerous severe to moderate TBI and ABI - living with an, "acquired communication disorder" - as one can see, I am taking notes.
I see my error of "severe to moderate." Correct terminology is "moderate to severe." That said, a snapshot of illogical sequencing (symptom manifestation) is more valuable to me than an edited comment.
This reply being a clear exception; totally editing it for the sole purpose of removing a line break.
Second edit to remove four words because... "aesthetic."
*deadpan jazz hands*
36:42 through 37:16
Oh! 38:38 through 39:11 reminds me of my neurofeedback sessions. The image on the screen is fuzzy but if you "do it correctly" the image becomes increasingly clear. Once clear, keep it clear.
40:15 "Situational computation" sends my mind to perceptual adaptation. Not sure if there's value in that, might be a "rhythmic association thing"
42:50 I think, "...really big base model.." is my chosen cue to revisit the remainder of this video at a later point in time.
I don't know if I misunderstand or if everyone is saying it wrong... for me, a non-linear sequence in 1D is composed of two linear inputs in 2D, and so on in 3D, nD... all are simplified to linear as a complex of linear transversals, where diagonals are a general combination... If everyone is in a dilemma about the approach regarding this, then I patent this as my definition 🤑
Includes abstraction as a state of entropy rooted in this way 🤯🤯
So basically dreambooth?
The million dollar question hehe
Thanks
aitutorialmaker AI fixes this. "Test-Time Adaptation in AI"
I do not see any meaningful impact from this talk or paper.
1. Embeddings are not a proven method of measuring relevance. Say you want to compare similar law cases, which requires similarity along multiple dimensions; embeddings do not work at all there.
2. A super highly flexible use case only exists in theory. In the real world we all have certain fixed use cases, and multiple QLoRA adapters that can be instantly plugged in are good enough.
I do not care how fancy your theory is: if you cannot guarantee that embeddings lead to successful retrieval on highly specialized cases, then your method is useless. For normal use cases, LoRA offers a compelling alternative because it's lightweight, adaptable, and allows for storing multiple adapters for different tasks. This avoids retraining or storing multiple fine-tuned models. Real-world applications will go for multiple QLoRA adapters, since their use cases are always fixed. A super highly flexible use case only exists in theory.
Now, tell me: what is the meaningfulness of your paper or method?
This is the paper we discussed - arxiv.org/pdf/2410.08020
You might want to get Claude to explain it to you to fill in some of your gaps in understanding.
The embeddings are used to retrieve nearest neighbours (to the test instance), then a local model is constructed (which loosely resembles a kernel ridge regression model) to iteratively select data points which maximise the information gain, i.e. balancing relevance and diversity. Embedding-only search does indeed suck; that's the whole point of this research. The key insight here is selecting an optimal set of examples using a local surrogate model to fine-tune the source model, but the fine-tuning itself is not a key part of the discussion.
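For readers who want to see the shape of that selection loop, here is a minimal NumPy sketch of the idea (a simplification under illustrative assumptions, not the paper's exact SIFT algorithm): greedily pick the candidates whose addition most reduces the surrogate model's predictive uncertainty at the query, which automatically penalises redundant near-duplicates.

```python
import numpy as np

def select_informative(query_emb, cand_embs, k=5, noise=0.1):
    """Greedy sketch: choose k candidates that most shrink the posterior
    variance at the query under a cosine-similarity kernel surrogate."""
    q = query_emb / np.linalg.norm(query_emb)
    X = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)

    selected = []
    for _ in range(k):
        best_i, best_var = None, np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            S = X[selected + [i]]                  # tentative selection
            K = S @ S.T + noise * np.eye(len(S))   # kernel matrix + noise
            k_qS = S @ q                           # similarity of query to selection
            var = 1.0 - k_qS @ np.linalg.solve(K, k_qS)  # residual uncertainty at query
            if var < best_var:
                best_i, best_var = i, var
        selected.append(best_i)
    return selected   # indices into cand_embs, in selection order
```

A pure nearest-neighbour retriever would just sort by `S @ q`; the `np.linalg.solve` term is what discounts candidates that add nothing new about the query given what has already been selected.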
Well said
giga chad with a giga brain. nice combo.
❤
It would be much more accessible for the general audience if you at least explain the “big” words that you’d like to use. For example, it would be good if you explain what transduction means before talking more about other things.
I think we do explain it and show a figure but perhaps a little way into the show. Most ML models are "inductive" where you train on data and build a general-purpose decision function which you re-use in many future situations. Transduction (test time learning is a form of transduction) is when in every prediction situation you use data (usually test data, or retrieved data "related" to test data) to build a new model on the fly for the sole purpose of that prediction. MLST is pitched at a technical audience, but I appreciate we could do better at explaining things.
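To make the contrast concrete, here is a toy scikit-learn sketch of the two regimes (purely illustrative; the models and hyperparameters are assumptions, not the method from the paper): the inductive version fits one global model and reuses it, while the transductive version fits a fresh local model for every single query.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import NearestNeighbors

def inductive_predict(X_train, y_train, X_test):
    # train once, reuse the same general-purpose model for every future query
    model = Ridge(alpha=1.0).fit(X_train, y_train)
    return model.predict(X_test)

def transductive_predict(X_train, y_train, X_test, k=20):
    # for each query: retrieve related data, fit a throwaway local model, predict once
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    preds = []
    for x in X_test:
        _, idx = nn.kneighbors(x.reshape(1, -1))
        local = Ridge(alpha=1.0).fit(X_train[idx[0]], y_train[idx[0]])
        preds.append(local.predict(x.reshape(1, -1))[0])
    return np.array(preds)
```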
@MachineLearningStreetTalk Oh, with this kind of content I tend to treat it like a podcast, where I usually listen rather than watch the video. But subjectively it would be much better if you could explain those terms in words as well.
@@MachineLearningStreetTalk You literally explained what it was after you initially mentioned the word. Ignore that dude
I'm definitely part of the non-technical audience. Listening to these interviews without a technical background is sort of like listening to a foreign language podcast, it's doable, but it takes a bit of effort initially. You can still get a lot out of it, but you may need to pause and look up a term here and there for a while to follow along. Just keep ChatGPT/Claude in a second tab! But after you do that a few times, you'll hear certain terms and concepts come up again and again in interviews and you'll be able to follow along with progressively less difficulty. By the way, the Mark Solms interview is much more approachable for those without a computer science/machine learning background.