Are Aligned Language Models “Adversarially Aligned”?

  • Published: Aug 1, 2024
  • Nicholas Carlini (Google DeepMind)
    simons.berkeley.edu/talks/nic...
    Large Language Models and Transformers
    An "aligned" model is "helpful and harmless". In this talk I will show that while language models may be aligned under typical situations, they are not "adversarially aligned". Using standard techniques from adversarial examples, we can construct inputs to otherwise-aligned language models to coerce them into emitting harmful text and performing harmful behavior. Creating aligned models robust to adversaries will require significant advances in both alignment and adversarial machine learning.

Comments • 13

  • @ryan77anderson
    @ryan77anderson 11 months ago +1

    Informative talk. Thank you.

  • @MrNoipe
    @MrNoipe 11 months ago +5

    Talk starts at 1:30

  • @zenbauhaus1345
    @zenbauhaus1345 11 months ago

    great channel!

  • @LeonDerczynski
    @LeonDerczynski 11 months ago

    text embeddings/representations aren't actually continuous either; in fact, at 512x int4 they're "more" discrete than some writing systems. You're welcome (lovely & grounded talk, thanks!)

  • @cube2fox
    @cube2fox 9 months ago

    Scott Aaronson placing his jokes in the audience.

  • @jaredgreen2363
    @jaredgreen2363 10 months ago

    Now imagine writers being reduced to writing villain speeches, and nsfw scenes, and explicit descriptions of controversial themes and counterfactuals.

  • @tonny.c
    @tonny.c 11 months ago +2

    The thumbnail looked like Andrew Tate

  • @jonathanz9889
    @jonathanz9889 11 months ago +6

    There's this very bizarre gatekeeping of what counts as science, but otherwise a great talk.

    • @prescod
      @prescod 11 months ago +2

      Not bizarre at all. Of course there is a difference between science and industrial application. Your doctor diagnosing you isn't science. Your doctor discovering a new diagnosis is science. Not everything can or should be considered science.

    • @oncedidactic
      @oncedidactic 11 months ago +3

      Agreed, very weird. It would suffice to draw the distinction by saying "searching for generally applicable explanations".
      Collecting data through manual effort (one-off attacks) is entirely science. Informing the research community of one-off data is entirely science.

    • @mungojelly
      @mungojelly 10 months ago

      It makes sense to me. He's not saying that a particular example can't in general be science, just that that's not what they were doing when hacking up DAN 6.0: they weren't trying to scientifically figure out anything about LLMs in general, they were specifically hacking on what they wanted to achieve, and that's purely engineering.

    • @jonathanz9889
      @jonathanz9889 10 months ago

      @@mungojelly Generally speaking, counterexamples can be real science, but in some of the LLM cases, yes, it's more engineering.