AI Papers Academy
  • 59 videos
  • 226,611 views
Writing in the Margins: Better LLM Inference Pattern for Long Context Retrieval
In this video, we explain the Writing in the Margins (WiM) method, introduced in a recent research paper titled "Writing in the Margins: Better Inference Pattern for Long Context Retrieval".
With WiM, the researchers were able to achieve significant performance improvements on long input sequences for off-the-shelf large language models (LLMs) such as Phi-3, Qwen2, and Llama-3.1.
How does it work?
As part of the LLM inference process, the WiM method feeds the input context to the LLM in chunks, rather than all at once. As each chunk is processed, the LLM is also instructed to generate a note about the information in the current chunk. Finally, both the context and the notes (whic...
559 views
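
To make the pattern concrete, here is a minimal Python sketch of a WiM-style inference loop. The `generate` helper, the character-based chunking, and the prompt wording are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the Writing in the Margins (WiM) inference pattern.
# `generate` stands in for any LLM completion call; the character-based
# chunking and prompt wording are illustrative, not taken from the paper.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def wim_answer(context: str, query: str, chunk_size: int = 4000) -> str:
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    margins = []
    for chunk in chunks:
        # While each chunk is processed, ask the model for a "margin note"
        # about query-relevant information in that chunk.
        note = generate(
            f"Context chunk:\n{chunk}\n\nQuery: {query}\n"
            "Write a short note about any information in this chunk "
            "that is relevant to the query."
        )
        margins.append(note)
    # The final answer conditions on the context plus the accumulated notes.
    notes = "\n".join(margins)
    return generate(
        f"Context:\n{context}\n\nMargin notes:\n{notes}\n\n"
        f"Query: {query}\nAnswer:"
    )
```
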

Videos

Sapiens by Meta AI: Foundation for Human Vision Models
2.2K views · 1 month ago
In this video we dive into Sapiens, a new family of models for four fundamental human-centric tasks, presented by Meta AI in a recent research paper titled "Sapiens: Foundation for Human Vision Models". The models' architecture is based on the Vision Transformer (ViT), which is, for the first time, pushed to train on 1K-resolution images, 5x the size of DINOv2's input images! We cover the model's ...
Mixture of Nested Experts: Adaptive Processing of Visual Tokens | AI Paper Explained
517 views · 1 month ago
In this video we dive into a recent research paper by Google, titled "Mixture of Nested Experts: Adaptive Processing of Visual Tokens". While standard Mixture of Experts (MoE) is successfully applied in LLMs, and also in computer vision, to increase model capacity without a proportional increase in computational cost, it comes with a large memory footprint. The Mixture of Nested Experts (MoNE) whi...
Introduction to Mixture-of-Experts (MoE)
2.2K views · 2 months ago
In this video we go back to the extremely important Google paper which introduced the Mixture-of-Experts (MoE) layer, whose authors include Geoffrey Hinton. The paper is titled "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer". MoE is widely used today in various top large language models, and interestingly, the paper was published at the beginning of 2017, while the Atten...
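
As a concrete refresher on the mechanism, here is a minimal PyTorch sketch of a sparsely-gated MoE layer with top-k routing. The paper's noisy gating and load-balancing loss are omitted, and the expert MLP shape is an arbitrary assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal sketch of a sparsely-gated Mixture-of-Experts layer:
    a gating network scores all experts, only the top-k experts run
    per token, and their outputs are mixed by the softmaxed gate
    scores."""

    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim)
        scores = self.gate(x)                     # (tokens, num_experts)
        topv, topi = scores.topk(self.k, dim=-1)  # keep k experts per token
        weights = F.softmax(topv, dim=-1)         # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e         # tokens routed to expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

# Usage: only 2 of the 8 expert MLPs run for each token.
x = torch.randn(16, 64)
print(SparseMoE(dim=64)(x).shape)  # torch.Size([16, 64])
```
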
Mixture-of-Agents (MoA) Enhances Large Language Model Capabilities
2.3K views · 3 months ago
A new paper titled "Mixture-of-Agents Enhances Large Language Model Capabilities" shows a method that beats GPT-4o on AlpacaEval 2.0 using open-source large language models (LLMs). In this video we explain what the Mixture-of-Agents (MoA) method is by diving into that research paper. Mixture-of-Agents (MoA) is inspired by the well-known Mixture-of-Experts (MoE) method, but unlike MoE, which embeds ...
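
Here is a minimal sketch of that layered proposer/aggregator structure. The `ask` helper, the model names, and the prompt wording are placeholders, not the paper's exact configuration.

```python
# Minimal sketch of the Mixture-of-Agents (MoA) pattern: several proposer
# LLMs answer independently, each subsequent layer lets them refine using
# everyone's previous answers, and a final aggregator synthesizes the result.
# `ask` and the model names are placeholders for any chat-completion client.

def ask(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def mixture_of_agents(question: str,
                      proposers=("model-a", "model-b", "model-c"),
                      aggregator="model-d",
                      layers: int = 2) -> str:
    answers = [ask(m, question) for m in proposers]
    for _ in range(layers - 1):
        refs = "\n\n".join(f"Response {i + 1}:\n{a}" for i, a in enumerate(answers))
        prompt = (f"{question}\n\nResponses from other models:\n{refs}\n\n"
                  "Use them to write an improved response of your own.")
        answers = [ask(m, prompt) for m in proposers]
    refs = "\n\n".join(f"Response {i + 1}:\n{a}" for i, a in enumerate(answers))
    return ask(aggregator,
               f"{question}\n\nCandidate responses:\n{refs}\n\n"
               "Synthesize these into a single best final answer.")
```
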
Arithmetic Transformers with Abacus Positional Embeddings | AI Paper Explained
581 views · 4 months ago
In this video we dive into a recent research paper titled "Transformers Can Do Arithmetic with the Right Embeddings". The paper introduces Abacus Embeddings, a new type of positional embedding. Using Abacus Embeddings, the researchers were able to train state-of-the-art transformers for number addition, with impressive logical extrapolation capabilities - a model that was trained on 20-digit ...
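
A rough sketch of the idea: each digit token gets a positional embedding indexed by its place within its own number, with a random offset at training time so the embeddings extrapolate to longer numbers. The sketch below simplifies details (it counts digit positions from the start of each digit run), so treat it as an illustration under assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class AbacusEmbedding(nn.Module):
    """Sketch of Abacus-style positional embeddings: digit tokens are
    indexed by their position inside their own number (non-digit tokens
    get index 0), and a random offset is added during training so that
    positions unseen at short lengths still get trained embeddings."""

    def __init__(self, dim: int, max_pos: int = 128, max_offset: int = 64):
        super().__init__()
        self.emb = nn.Embedding(max_pos + max_offset, dim)
        self.max_offset = max_offset

    def forward(self, is_digit: torch.Tensor) -> torch.Tensor:
        # is_digit: (seq,) boolean mask marking digit tokens.
        pos = torch.zeros(is_digit.shape[0], dtype=torch.long)
        run = 0
        for i, d in enumerate(is_digit.tolist()):
            run = run + 1 if d else 0  # 1, 2, 3, ... within each digit run
            pos[i] = run
        if self.training:
            offset = int(torch.randint(0, self.max_offset, (1,)))
            pos = torch.where(is_digit, pos + offset, pos)
        return self.emb(pos)  # (seq, dim), added to the token embeddings

# Usage: mask for a sequence like "12+345=" marks which tokens are digits.
mask = torch.tensor([True, True, False, True, True, True, False])
print(AbacusEmbedding(dim=32)(mask).shape)  # torch.Size([7, 32])
```
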
CLLMs: Consistency Large Language Models | AI Paper Explained
927 views · 4 months ago
In this video we dive into Consistency Large Language Models (CLLMs), a new method introduced in a recent research paper to significantly improve the inference latency of large language models (LLMs). CLLMs efficiently decode multiple tokens in one forward pass, which makes response generation faster since there is no need for a forward pass per generated token. CLLMs rely...
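
CLLMs build on Jacobi decoding, which is easy to sketch: guess n future tokens, then repeatedly replace every guessed token with the model's greedy prediction in parallel until the guess stops changing; CLLMs are fine-tuned so this converges in very few iterations. The sketch below assumes a Hugging-Face-style causal LM whose output exposes `.logits`.

```python
import torch

def jacobi_decode(model, prefix_ids: torch.Tensor, n_tokens: int,
                  pad_id: int = 0, max_iters: int = 32) -> torch.Tensor:
    """Sketch of Jacobi decoding, the parallel fixed-point iteration that
    CLLMs are trained to converge on quickly. `model` is any causal LM
    whose output has `.logits` of shape (batch, seq, vocab);
    `prefix_ids` is the prompt, shape (1, prefix_len)."""
    guess = torch.full((1, n_tokens), pad_id, dtype=torch.long)
    for _ in range(max_iters):
        ids = torch.cat([prefix_ids, guess], dim=1)
        with torch.no_grad():
            logits = model(ids).logits
        # The prediction for guess position j is made at sequence position
        # prefix_len + j - 1 (the token immediately before it).
        start = prefix_ids.shape[1] - 1
        new_guess = logits[:, start:start + n_tokens].argmax(dim=-1)
        if torch.equal(new_guess, guess):
            break  # fixed point: all n tokens finalized in this pass
        guess = new_guess
    return guess  # at the fixed point, matches greedy autoregressive output
```
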
ReFT: Representation Finetuning for Language Models | AI Paper Explained
3K views · 5 months ago
Can LoReFT be a rival for LoRA? According to the ReFT paper, it has the potential to replace LoRA in various cases. In this video we dive into the research paper that presents ReFT and LoReFT. We'll explain what representation fine-tuning (ReFT) is, and how it differs from previous parameter-efficient fine-tuning (PEFT) methods, such as LoRA. ReFT is a family of methods that can be used to ada...
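
To show how a representation-level intervention differs from a weight update like LoRA, here is a PyTorch sketch of a LoReFT-style edit, h' = h + R^T(Wh + b - Rh), applied to frozen-model hidden states. The parametrization details here are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    """Sketch of a LoReFT intervention from the ReFT paper: instead of
    updating model weights, edit a hidden state h in a rank-r subspace,
        h' = h + R^T (W h + b - R h),
    where R (r x dim) has orthonormal rows and W, b are learned. Only
    R, W, b train; the base model stays frozen."""

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        # Orthogonal parametrization keeps R's rows orthonormal.
        self.R = nn.utils.parametrizations.orthogonal(
            nn.Linear(dim, rank, bias=False))
        self.W = nn.Linear(dim, rank)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (..., dim) hidden states at the intervened positions.
        Rh = self.R(h)                    # project h into the subspace
        delta = self.W(h) - Rh            # desired change, in subspace coords
        return h + delta @ self.R.weight  # map the change back to model space

# Usage: hook this onto selected layer positions of a frozen LLM.
h = torch.randn(2, 10, 768)
print(LoReFTIntervention(dim=768)(h).shape)  # torch.Size([2, 10, 768])
```
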
Stealing Part of a Production Language Model | AI Paper Explained
1.7K views · 6 months ago
Many of the top LLMs today are closed source. What if we could discover their internal weights? In this video we dive into a recent research paper from Google DeepMind which presents an attack on large language models. The attack targets transformer-based LLMs that expose log probabilities as part of their API, which include GPT-4 and PaLM-2. The researchers successfully used the attack to di...
The Era of 1-bit LLMs by Microsoft | AI Paper Explained
90K views · 7 months ago
In this video we dive into a recent research paper by Microsoft: "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits". This paper introduces an interesting and exciting architecture for large language models, called BitNet b1.58, which significantly reduces LLM memory consumption and speeds up inference. All of that while showing promising results that do not fall...
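
The weight quantization at the heart of BitNet b1.58 is simple to state: scale each weight matrix by its mean absolute value, then round every entry to the nearest of {-1, 0, +1}. Below is a minimal sketch of that absmean quantizer (inference-side only; the paper trains with quantization in the loop).

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Sketch of BitNet b1.58's absmean weight quantization: every weight
    becomes -1, 0, or +1 (~1.58 bits), with one scale per matrix so that
    w ~ scale * w_ternary. With ternary weights, the matmul needs only
    additions and subtractions, no weight multiplications."""
    scale = w.abs().mean().clamp(min=eps)
    w_ternary = (w / scale).round().clamp(-1, 1)
    return w_ternary, scale

# Usage:
w = torch.randn(4, 4)
wq, s = absmean_ternary(w)
print(wq)      # entries are only -1., 0., or 1.
print(s * wq)  # coarse approximation of w
```
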
V-JEPA by Meta AI - A Human-Like Computer Vision Video-based Model
5K views · 7 months ago
In this video we dive into V-JEPA, a new collection of vision models created by Meta AI. V-JEPA stands for Video Joint-Embedding Predictive Architecture, and is part of Meta AI's implementation of Yann LeCun's vision for a more human-like AI. We dive deep into the research paper which presented V-JEPA, titled: "Revisiting Feature Prediction for Learning Visual Representations ...
Self-Rewarding Language Models by Meta AI - Path to Open-Source AGI?
3.7K views · 8 months ago
In this video we review a new paper titled "Self-Rewarding Language Models" by Meta AI. The paper was published on the same day that Mark Zuckerberg announced that Meta AI is working toward building an open-source AGI, and it may be a step in that direction. The paper introduces a method to self-align a pre-trained large language model (LLM) that can replace standard RLHF and RLAIF. Th...
Fast Inference of Mixture-of-Experts Language Models with Offloading
1.3K views · 8 months ago
In this video we review a recent important paper titled "Fast Inference of Mixture-of-Experts Language Models with Offloading". Mixture of Experts (MoE) is nowadays an important strategy for improving the efficiency of transformer-based large language models (LLMs). However, MoE models usually have a large memory footprint, since we need to load the weights of all experts. This makes it hard to ru...
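
The core idea can be sketched as an LRU cache of experts on the GPU: all expert weights live in CPU RAM or on disk, and an expert is copied over only when the router selects it, evicting the least recently used one. The paper adds speculative prefetching, omitted here; `load_fn` is a placeholder for whatever actually moves the weights.

```python
from collections import OrderedDict

class ExpertCache:
    """Sketch of expert offloading for MoE inference: keep only a few
    experts resident on the GPU and fetch the rest on demand from
    CPU/disk, evicting in least-recently-used order."""

    def __init__(self, load_fn, capacity: int = 4):
        self.load_fn = load_fn       # expert_id -> expert weights on GPU
        self.capacity = capacity
        self.cache = OrderedDict()   # expert_id -> resident weights

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]

# Usage: the router's chosen experts are fetched through the cache.
cache = ExpertCache(load_fn=lambda i: f"weights-of-expert-{i}", capacity=2)
for eid in [0, 1, 0, 2, 1]:
    print(cache.get(eid))
```
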
TinyGPT-V: Small but Mighty Multimodal Large Language Model
1.5K views · 9 months ago
In this video we explain how the TinyGPT-V model was built by reviewing the research paper that presents it: "TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones". TinyGPT-V is a small multimodal large language model (MLLM) that is based on Phi-2 as its backbone LLM. By being based on Phi-2, TinyGPT-V has only 2.8B params, which makes it smaller compared to other MLLMs that are base...
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
4.1K views · 9 months ago
In this video we review a recent important paper from Apple, titled "LLM in a flash: Efficient Large Language Model Inference with Limited Memory". The paper presents a method to run large language models (LLMs) on devices that do not have enough memory to store the entire model's weights. This is exciting progress in LLM democratization, as it brings us closer to using top large language m...
Introduction to Vision Transformers (ViT) | An Image is Worth 16x16 Words
2.4K views · 9 months ago
Orca 2 by Microsoft: Teaching Small Language Models How to Reason
2.1K views · 10 months ago
LCM-LoRA: From Diffusion Models to Fast SDXL with Latent Consistency Models
2.9K views · 10 months ago
CODEFUSION by Microsoft: A Pre-trained Diffusion Model for Code Generation
1.1K views · 11 months ago
Table-GPT by Microsoft: Empower LLMs To Understand Tables
7K views · 11 months ago
Vision Transformers Need Registers - Fixing a Bug in DINOv2?
2.5K views · 11 months ago
Emu by Meta AI: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
793 views · 1 year ago
NExT-GPT: Any-to-Any Multimodal LLM
7K views · 1 year ago
Large Language Models As Optimizers - OPRO by Google DeepMind
3.1K views · 1 year ago
FACET by Meta AI - Fairness in Computer Vision Evaluation Benchmark
431 views · 1 year ago
Code Llama Paper Explained
2.1K views · 1 year ago
WizardMath from Microsoft - Best Open Source Math LLM with Reinforced Evol-Instruct
3.5K views · 1 year ago
Shepherd by Meta AI - A Critic for Large Language Models
696 views · 1 year ago
Soft Mixture of Experts - An Efficient Sparse Transformer
4.8K views · 1 year ago
Universal and Transferable LLM Attacks - A New Threat to AI Safety
2.6K views · 1 year ago

Comments

  • @thienthuoan1081
    @thienthuoan1081 9 days ago

    Nice video, easy to understand the principle.

  • @aamir122a
    @aamir122a 22 days ago

    Is there an implementation somewhere?

  • @armaneshaghi6732
    @armaneshaghi6732 28 days ago

    Are these models also good for segmentation?

  • @liangzijian4452
    @liangzijian4452 29 days ago

    nice video!

  • @wainrebGilad
    @wainrebGilad 1 month ago

    thank you for the clear explanation

  • @ariamehrmaleki8964
    @ariamehrmaleki8964 1 month ago

    Thanks for the video! Can you also do a video on how to use these models from GitHub in Google Colab, please?

  • @xxlvulkann6743
    @xxlvulkann6743 1 month ago

    Great explanation! It is interesting to see how attention matrices aid in interpretability research and in getting better representations! I wonder how this could be applied to other modalities (such as audio).

  • @TurboKoder
    @TurboKoder 2 months ago

    Sorry, but this paper is just a brief take on 1-bit LLMs, and the video itself doesn't explain anything more than reading it out loud. There are multiple open questions, like what the viable option for training such models is, how it influences activation functions, and what the real benefit here is, as it suggests that without multiplications today's GPUs would not be required, which is not really true. And requiring new optimized hardware is not really a cool path forward.

  • @marzi869
    @marzi869 2 months ago

    Thanks, but please remove the background music.

  • @OpenAITutor
    @OpenAITutor 2 months ago

    I love this approach. I created a version using groq and open-webui! It rocks!!

  • @geraldkenneth119
    @geraldkenneth119 2 months ago

    It reminds me of BYOL, but with an enhanced training scheme

  • @stevenkies802
    @stevenkies802 2 months ago

    Another excellent episode. Your channel is underappreciated.

  • @karthickdurai2157
    @karthickdurai2157 2 months ago

    I think the math wizard model was removed from Hugging Face.

  • @menkiguo7805
    @menkiguo7805 3 months ago

    What is the background music, btw?

  • @fallinginside3001
    @fallinginside3001 3 months ago

    Thank you

  • @gabrielsandstedt
    @gabrielsandstedt 3 months ago

    How feasible is it to adapt BitNet b1.58's ternary quantization (-1, 0, 1) for quantum computing using qutrits, given the current state of qutrit-based hardware, error correction, and the development of specialized quantum algorithms?

  • @SparshGarg-n8e
    @SparshGarg-n8e 3 months ago

    Thanks a lot!

  • @RobBrogan
    @RobBrogan 3 months ago

    A little bit like how I’m using Perplexity that lets me refresh a response with a different model. Except I’m using my human brain to choose an ideal model or draw info from different ones. Or how back in the day, the best search engine was this one that used multiple services (dogpile? Can’t remember). Maybe my comparison is bad, but definitely look forward to a tool that combines multiple LLMs.

  • @vladyslavkorenyak872
    @vladyslavkorenyak872 4 months ago

    This channel is amazing! I feel inspired by the simplicity of the ideas and their results. So many low-hanging fruits!

  • @TheSparkoi
    @TheSparkoi 4 months ago

    Thank you so much for all your explanations of complex topics :)

  • @TheSparkoi
    @TheSparkoi 4 months ago

    Hey, do you think we can get more than 0.7 frames per second if you render only 500x500 with a 4090 as hardware?

  • @eladwarshawsky7587
    @eladwarshawsky7587 4 months ago

    Great job on the video. I read this paper a while ago, and this is a great explanation. I hear the accent, so if you're ever in Tel Aviv I'd be happy to meet up.

  • @jameswhitaker4357
    @jameswhitaker4357 5 months ago

    So interesting! 👀

  • @StrugglingIdiot
    @StrugglingIdiot 5 months ago

    Is it over already? I was sleeping. 😴

  • @aryamanarora4967
    @aryamanarora4967 5 months ago

    Thank you for making this excellent video about our work! Minor note: at the end you mention 18-minute training time for our instruction-following ReFT, but that number is only for the small 1K subset of Ultrafeedback (last row in table). It takes a couple hours to train on the whole dataset, but we wanted to show that ReFT is also data-efficient through that number.

    • @aipapersacademy
      @aipapersacademy 5 months ago

      Thank you Aryaman for the kind feedback and for the correction 🙏

  • @xuantungnguyen9719
    @xuantungnguyen9719 5 months ago

    Thanks

  • @SuperCombatarms
    @SuperCombatarms 6 months ago

    Is there any code associated with this study?

  • @ameynaik2743
    @ameynaik2743 6 months ago

    I believe this is applicable only for a single request? If you have a change of experts, you will most likely have many experts active across various requests. Is my understanding correct? Thank you.

  • @TommyJefferson1801
    @TommyJefferson1801 6 months ago

    Or else let's do distillation + a lower-bit transformer 😅

  • @caseyalanjones
    @caseyalanjones 6 months ago

    Interesting, but what does "JOINT" mean in this context?

    • @兴宣大院君-h4s
      @兴宣大院君-h4s 6 months ago

      Maybe it means the embeddings of target and context. They are joint?

    • @caseyalanjones
      @caseyalanjones 6 months ago

      @兴宣大院君-h4s Yes, that could be. Thanks!

  • @caseyalanjones
    @caseyalanjones 6 months ago

    Thanks!

  • @imaginebaggins2691
    @imaginebaggins2691 6 months ago

    Good content, but I think it would be better if you went a little more in depth into all the smaller details. For me it felt like it was going a bit too fast, and I didn't understand some of what was happening, so I would prefer a more in-depth explanation of the technical details.

  • @sanesanyo
    @sanesanyo 6 months ago

    Is there benchmarking data available for larger LLMs like GPT4-Turbo or Claude-3-Opus?

  • @oryxchannel
    @oryxchannel 6 months ago

    The thinking around BitNet b1.58 is intimately tied to the .gif in the paper "Stanford engineers propose a simpler design for quantum computers." See the short .gif in action. Funding for that research began prior to 2021. Funding was provided largely by the US Department of Defense. Guess who virtually IS the US military by virtue of having a $3T market cap to keep secret projects secret? That's right. Microsoft.

  • @arjavgarg5801
    @arjavgarg5801 6 months ago

    Model weights will make a lot more sense

  • @burthacklin
    @burthacklin 6 months ago

    This is something I predicted would happen in AI. It's cool to see a concrete usage of it. Ternary computers are the most efficient computers and base 3 is the most efficient base. So this isn't surprising. Read up on radix economy to learn more.

    • @antonf.9278
      @antonf.9278 6 months ago

      How would you represent ternaries in hardware? Would you leave pins floating, force them to the middle with a voltage divider, or add a second pin?* Also, in general computing, multiplication by unknowns and division by non-powers of 2 are rare operations. All of that ignores the added complexity, which would nullify the advantages of radix economy, because it would increase the complexity of division by abandoning the simple check in binary long division for the guess-and-check needed in bases larger than 2. *In the first case you could not run at high clock speeds, because stray capacitance and inductance would cause errors. Second case: transistors become inefficient at the midpoint between high and low, causing massive energy consumption and heating. Third case: a second line allows you to use nibbles, meaning you just ignore certain states out of principle, wasting computational power.

    • @burthacklin
      @burthacklin 6 months ago

      @antonf.9278 Just use negative voltages. Also, division by non-powers of 2 is VERY common in computing, as most division in applications will not be by a power of 2, like in machine learning.

  • @anilaxsus6376
    @anilaxsus6376 6 months ago

    But how is the accuracy?

  • @giacintoboccia9386
    @giacintoboccia9386 6 months ago

    We had a lecture about single bit neural networks at one of my uni courses, some 5 years ago. It was interesting.

  • @maxvell77
    @maxvell77 6 months ago

    Thanks!

  • @maxvell77
    @maxvell77 6 months ago

    Well explained! Thanks for the well-written script, it helped me so much. Keep going!

  • @xianghaisheng7800
    @xianghaisheng7800 6 months ago

    It's a bit difficult to understand your accent, probably because I'm not a native speaker. Would you consider using an AI-synthesized voice?

    • @rkvkydqf
      @rkvkydqf 6 months ago

      Please don't. Most TTS engines have become my personal heuristic for low-effort spam (sometimes including automated content farms). Voice acting is a skill and will improve over time if you let it. Individuality, the subtle inflection and candidness of a person's interior thoughts matching the waveforms you hear, is something neither a hired voice actor nor a TTS model could replicate.

  • @hypervanse
    @hypervanse 6 months ago

    I wonder why people don't use this approach from the beginning. It's like LLMs in assembly language. And as far as I know, every linear operator has a kernel. The kernel means that a linear operator H always maps the zero vector to itself. When we use a computer, we represent the zero vector as a column matrix of n zeros. Since the layers of LLMs are in the same vector space, we have H\vec{0} = \vec{0} for any H. I apologize for my bad LaTeX, but \vec{0} is supposed to be a vector. It's important to remember that 0 is the trivial element in the kernel. For example, let Z be the set of all integers, and let H be the multiplication operator. Then, in ordinary algebra, we have positive, zero, and negative integers. The operator is \cdot, not x. The multiplication operator is often used in quantum mechanics of many particles, where the vector space grows exponentially, just like the number of bits for multiple objects.

  • @chodnejabko3553
    @chodnejabko3553 6 months ago

    This might be even more of an advantage when we get dedicated hardware, since tri-state logic is already a thing in CMOS. A dedicated tri-state matrix multiplication architecture for this type of network should be easy to engineer with modern processes. NVIDIA should be all over that.

  • @Tohidkhan-lt4pd
    @Tohidkhan-lt4pd 6 months ago

    🎉😊❤

  • @adamhafchadi4924
    @adamhafchadi4924 7 months ago

    what is that accent?

    • @ilianos
      @ilianos 6 months ago

      was looking for the same

    • @Jonas-gm4my
      @Jonas-gm4my 6 months ago

      I would guess French.

  • @forheuristiclifeksh7836
    @forheuristiclifeksh7836 7 months ago

    2:13

  • @fernandos-bs6544
    @fernandos-bs6544 7 months ago

    I just found your channel. It is amazing. Congratulations. Your numbers will grow soon, I am sure. Great quality and great content.

  • @ntal5859
    @ntal5859 7 months ago

    So in summary, everything is either Yes = 1, Never mind = 0, No = -1. If only women were so simple to work out.

  • @yash1152
    @yash1152 7 months ago

    1:56 Is the "same performance" with "Pareto improvement" just an illustration of a theoretical prediction, or actual weight data from a real trit-model?

  • @pmarreck
    @pmarreck 7 months ago

    This is great! FYI, you can create a model of your voice in ElevenLabs, do a voice-to-voice transformation, and out would come perfectly pronounced English. I found this out by accident because I created a model of Arnold Schwarzenegger's voice, but everything I made it say LOST the accent but kept his tone of voice, LOL

    • @hypervanse
      @hypervanse 6 months ago

      That may be fun, but it can clearly be much more dangerous than a password leak. You trained it with your voice, right? Would you want someone to make calls with hate speech using your voice, for example?