Synthetic Data: AI Model Collapse!

  • Published: Sep 8, 2024
  • Today, let's explore critical challenges in AI training as we dive into synthetic data, input bias, and the risks they pose to the stability of AI models. This video, inspired by recent research, aims to unpack the complexity of AI development.
    Check them out: myaicofounder.... & submind.leogui...
    📈 Subscribe for more analysis on the evolving AI landscape. Hit the bell for notifications on our latest videos.
    💡 Like and share to spark a conversation about the future of the channel, our discussions, and where this journey goes.
    #data #article #ai

Comments • 5

  • @novantha1
    @novantha1 15 days ago

    While it is true that model collapse is an issue when taking a model's outputs and directly piping them back into the model in training, that's not realistically how people are creating or using synthetic data. Ideally, synthetic data isn't "fake" data, it's a stylized output of an LLM based on some form of seed data. As an example, you might ask an LLM to summarize a Wikipedia article from a novel perspective. By virtue of including the seed data, you more or less avert the major model collapse issues seen in that paper.
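    A minimal sketch of that seed-conditioned approach (hedged: the `call_llm` helper and the persona list are hypothetical stand-ins for a real LLM API and real prompt design, not any specific pipeline):
```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    raise NotImplementedError

# Assumed persona list; any set of "novel perspectives" would do.
PERSPECTIVES = ["a skeptical historian", "a curious teenager", "a domain expert"]

def synthesize_from_seed(seed_article: str) -> list:
    """Generate stylized summaries of one seed document.
    Every synthetic sample stays grounded in real seed text."""
    samples = []
    for persona in PERSPECTIVES:
        prompt = (
            f"Summarize the following article from the perspective of "
            f"{persona}:\n\n{seed_article}"
        )
        samples.append({"seed": seed_article, "persona": persona,
                        "synthetic": call_llm(prompt)})
    return samples
```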
    Adding onto that, there are other strategies, too. Even Meta noticed when training Llama 3 and 3.1 that the model didn't improve when trained naively on its own outputs, but when trained on its own code outputs together with compiler feedback (feedback from the environment), the model continued to improve. There were also still improvements from things like StaR and more advanced agentic workflows (Tree of Thought, MCTS, etc.), which let a model analyze an output before it's placed back into the training pool, so a model can generate its own data in a self-improving loop.
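    A sketch of that environment-feedback idea under the same assumptions (`call_llm` is again hypothetical, and a real pipeline would sandbox execution rather than `exec` in-process): only candidates that pass their tests enter the training pool.
```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    raise NotImplementedError

def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Run a candidate plus its tests in a throwaway namespace.
    Any exception (syntax error, failed assert) means reject."""
    scope = {}
    try:
        exec(candidate_code, scope)
        exec(test_code, scope)
        return True
    except Exception:
        return False

def build_training_pool(tasks, n_samples=4):
    """tasks: iterable of {"prompt": str, "tests": str}.
    Keep only model outputs the environment accepts."""
    pool = []
    for task in tasks:
        for _ in range(n_samples):
            candidate = call_llm(task["prompt"])
            if passes_tests(candidate, task["tests"]):
                pool.append({"prompt": task["prompt"],
                             "completion": candidate})
    return pool
```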
    And even if the above paragraph were untrue, that doesn't mean there aren't other sources of scalable synthetic data. There are things like physics simulations and agent simulations (Chess, Go, Settlers of Catan, Minecraft, etc.), all of which could probably be leveraged in novel ways to generate useful neural patterns and priors for use in other tasks, limiting the burden that natural data must carry in future models.
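    For the simulation angle, here is a self-contained toy: labeling tic-tac-toe positions with minimax-optimal moves, so the simulator itself supplies ground truth with no human or LLM data involved.
```python
from functools import lru_cache

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score for X, best move index) under optimal play."""
    w = winner(board)
    if w:
        return (1 if w == "X" else -1), None
    if "." not in board:
        return 0, None
    results = []
    for m in (i for i, c in enumerate(board) if c == "."):
        child = board[:m] + player + board[m + 1:]
        score, _ = minimax(child, "O" if player == "X" else "X")
        results.append((score, m))
    return max(results) if player == "X" else min(results)

def labeled_positions(board="." * 9, player="X", out=None):
    """Collect (board, player) -> optimal move for every reachable state."""
    if out is None:
        out = {}
    if winner(board) or "." not in board or (board, player) in out:
        return out
    _, move = minimax(board, player)
    out[(board, player)] = move
    for i, c in enumerate(board):
        if c == ".":
            labeled_positions(board[:i] + player + board[i + 1:],
                              "O" if player == "X" else "X", out)
    return out

dataset = labeled_positions()
print(len(dataset), "labeled positions")  # thousands of (state, move) pairs
```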
    I'm not really sure why so many people have been so taken in by this paper; we have, depending on your opinion of exactly which body of research is applicable, anywhere from two years of hyperbolically accelerating research on this topic to a decade of RL research, which shows that to scale beyond human capabilities a model can't just be trained to imitate humans; it has to play against itself. We know more or less which direction to take research to scale models with synthetic data, and we have an industry full of brilliant people establishing best practices on the matter as we speak.
    This paper is a nothing burger, and while it is a good cautionary tale that naively training a model on its own outputs isn't useful, just about everyone has known that for the last two years, and the industry has moved on from this approach.

  • @davidlloyd1526
    @davidlloyd1526 1 month ago +2

    Using an AI's output to train itself is a little like putting the microphone next to the loudspeaker.

    • @ideasupplychain
      @ideasupplychain  1 month ago

      Yeah, that's definitely one way to look at it haha.

  • @Nope-qt1wj
    @Nope-qt1wj 1 month ago

    How could they possibly think synthesising data could end well? It seems so short-sighted.

    • @ideasupplychain
      @ideasupplychain  1 month ago

      In some ways, yes. But there are ways synthetic data can be used effectively. I like to use a version of synthetic data that expands small comments into bigger profiles. I've found it doesn't need to be fully accurate when done across a whole dataset: the errors it makes tend to average out, so the aggregate stays directionally correct.
      And when fine-tuning, I can see how it would be enticing to use a small number of inputs to generate a much larger dataset. But you can't train a model entirely on synthetic data.
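      A hedged illustration of that expansion idea (the `call_llm` helper is hypothetical, not any particular API, and the prompt wording is only a guess at the technique described above):
```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    raise NotImplementedError

def expand_comment(comment: str) -> str:
    """Grow one short comment into a richer synthetic profile."""
    prompt = (
        "Based only on the comment below, write a short profile of the "
        "author's likely interests, expertise, and concerns. Hedge anything "
        "uncertain.\n\nComment: " + comment
    )
    return call_llm(prompt)

def expand_dataset(comments):
    # Individual profiles will contain errors; the bet is that statistics
    # aggregated over many profiles stay directionally correct.
    return [{"comment": c, "profile": expand_comment(c)} for c in comments]
```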