Synthetic Data: AI Model Collapse!

  • Published: Sep 8, 2024
  • Today, let's explore critical challenges in AI training as we dive into synthetic data, input bias, and the risks they pose to the stability of AI models. This video, inspired by recent research, aims to unpack the complexity of AI development.
    Check them out: myaicofounder.... & submind.leogui...
    📈 Subscribe for more analysis on the evolving AI landscape. Hit the bell for notifications on our latest videos.
    💡 Like and share to spark a conversation about the future of the channel, our discussions, and where this journey goes.
    #data #article #ai

Comments • 5

  • @novantha1
    @novantha1 15 days ago

    While it is true that model collapse is an issue when taking a model's outputs and directly piping them back into the model in training, that's not realistically how people are creating or using synthetic data. Ideally, synthetic data isn't "fake" data, it's a stylized output of an LLM based on some form of seed data. As an example, you might ask an LLM to summarize a Wikipedia article from a novel perspective. By virtue of including the seed data, you more or less avert the major model collapse issues seen in that paper.
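    A minimal sketch of that seed-conditioned approach (hedged: the `call_llm` helper and the persona list are hypothetical stand-ins for a real LLM API and real prompt design, not any specific pipeline):
```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    raise NotImplementedError

# Assumed persona list; any set of "novel perspectives" would do.
PERSPECTIVES = ["a skeptical historian", "a curious teenager", "a domain expert"]

def synthesize_from_seed(seed_article: str) -> list:
    """Generate stylized summaries of one seed document.
    Every synthetic sample stays grounded in real seed text."""
    samples = []
    for persona in PERSPECTIVES:
        prompt = (
            f"Summarize the following article from the perspective of "
            f"{persona}:\n\n{seed_article}"
        )
        samples.append({"seed": seed_article, "persona": persona,
                        "synthetic": call_llm(prompt)})
    return samples
```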
    Adding onto that, there are other strategies, too. Even Meta noticed when training Llama 3 and 3.1 that the model didn't improve when trained naively on its own outputs, but when trained on its own code outputs together with compiler feedback (feedback from the environment), the model continued to improve. There were also still improvements from things like StaR and more advanced agentic workflows (Tree of Thought, MCTS, etc.), which let a model analyze an output before it's placed back into the training pool, so a model can generate its own data in a self-improving loop.
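    A sketch of that environment-feedback idea under the same assumptions (`call_llm` is again hypothetical, and a real pipeline would sandbox execution rather than `exec` in-process): only candidates that pass their tests enter the training pool.
```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    raise NotImplementedError

def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Run a candidate plus its tests in a throwaway namespace.
    Any exception (syntax error, failed assert) means reject."""
    scope = {}
    try:
        exec(candidate_code, scope)
        exec(test_code, scope)
        return True
    except Exception:
        return False

def build_training_pool(tasks, n_samples=4):
    """tasks: iterable of {"prompt": str, "tests": str}.
    Keep only model outputs the environment accepts."""
    pool = []
    for task in tasks:
        for _ in range(n_samples):
            candidate = call_llm(task["prompt"])
            if passes_tests(candidate, task["tests"]):
                pool.append({"prompt": task["prompt"],
                             "completion": candidate})
    return pool
```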
    And even if the above paragraph were untrue, that doesn't mean there aren't other sources of scalable synthetic data. There are things like physics simulations and agent simulations (Chess, Go, Settlers of Catan, Minecraft, etc.), all of which could probably be leveraged in novel ways to generate useful neural patterns and priors for use in other tasks, limiting the burden that natural data must carry in future models.
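    For the simulation angle, here is a self-contained toy: labeling tic-tac-toe positions with minimax-optimal moves, so the simulator itself supplies ground truth with no human or LLM data involved.
```python
from functools import lru_cache

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score for X, best move index) under optimal play."""
    w = winner(board)
    if w:
        return (1 if w == "X" else -1), None
    if "." not in board:
        return 0, None
    results = []
    for m in (i for i, c in enumerate(board) if c == "."):
        child = board[:m] + player + board[m + 1:]
        score, _ = minimax(child, "O" if player == "X" else "X")
        results.append((score, m))
    return max(results) if player == "X" else min(results)

def labeled_positions(board="." * 9, player="X", out=None):
    """Collect (board, player) -> optimal move for every reachable state."""
    if out is None:
        out = {}
    if winner(board) or "." not in board or (board, player) in out:
        return out
    _, move = minimax(board, player)
    out[(board, player)] = move
    for i, c in enumerate(board):
        if c == ".":
            labeled_positions(board[:i] + player + board[i + 1:],
                              "O" if player == "X" else "X", out)
    return out

dataset = labeled_positions()
print(len(dataset), "labeled positions")  # thousands of (state, move) pairs
```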
    I'm not really sure why so many people have been so taken in by this paper; we have, depending on your opinion of exactly which body of research is applicable, anywhere from two years of hyperbolically accelerating research on this topic to a decade of RL research, which shows that to scale beyond human capabilities a model can't just be trained to imitate humans; it has to play against itself. We know more or less which direction to take research to scale models with synthetic data, and we have an industry full of brilliant people establishing best practices on the matter as we speak.
    This paper is a nothing burger, and while it is a good cautionary tale that naively training a model on its own outputs isn't useful, just about everyone has known that for the last two years, and the industry has moved on from this approach.

  • @davidlloyd1526
    @davidlloyd1526 1 month ago +2

    Using an AI's output to train itself is a little like putting the microphone next to the loudspeaker.

    • @ideasupplychain
      @ideasupplychain  1 month ago

      Yeah, that's definitely one way to look at it haha.

  • @Nope-qt1wj
    @Nope-qt1wj 1 month ago

    How could they possibly think synthesising data could end well? It seems so short-sighted.

    • @ideasupplychain
      @ideasupplychain  1 month ago

      In some ways, yes. But there are ways synthetic data can be used effectively. I like to use a version of synthetic data that expands small comments into bigger profiles. I've found it doesn't need to be fully accurate when done across a whole dataset: the errors it makes tend to average out, so the aggregate stays directionally correct.
      And when fine-tuning, I can see how it would be enticing to use a small number of inputs to generate a much larger dataset. But you can't train a model entirely on synthetic data.
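      A hedged illustration of that expansion idea (the `call_llm` helper is hypothetical, not any particular API, and the prompt wording is only a guess at the technique described above):
```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    raise NotImplementedError

def expand_comment(comment: str) -> str:
    """Grow one short comment into a richer synthetic profile."""
    prompt = (
        "Based only on the comment below, write a short profile of the "
        "author's likely interests, expertise, and concerns. Hedge anything "
        "uncertain.\n\nComment: " + comment
    )
    return call_llm(prompt)

def expand_dataset(comments):
    # Individual profiles will contain errors; the bet is that statistics
    # aggregated over many profiles stay directionally correct.
    return [{"comment": c, "profile": expand_comment(c)} for c in comments]
```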