Are Aligned Language Models “Adversarially Aligned”?
- Published: Aug 1, 2024
- Nicholas Carlini (Google DeepMind)
simons.berkeley.edu/talks/nic...
Large Language Models and Transformers
An "aligned" model is "helpful and harmless". In this talk I will show that while language models may be aligned under typical situations, they are not "adversarially aligned". Using standard techniques from adversarial examples, we can construct inputs to otherwise-aligned language models to coerce them into emitting harmful text and performing harmful behavior. Creating aligned models robust to adversaries will require significant advances in both alignment and adversarial machine learning.
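The "standard techniques from adversarial examples" mentioned in the abstract typically boil down to a discrete search: append a suffix to the prompt and iteratively mutate it to maximize some objective the attacker cares about. Below is a minimal toy sketch of that search loop, not the actual attack from the talk. The vocabulary, the `score` function, and its "trigger" pattern are all stand-ins I invented for illustration; in a real attack the score would be the model's log-probability of a harmful target completion.

```python
import random

# Hypothetical toy vocabulary; a real attack searches over model tokens.
VOCAB = list("abcdefghijklmnopqrstuvwxyz!")

def score(prompt: str, suffix: str) -> float:
    # Stand-in objective: reward overlap with a secret "trigger" string.
    # In a real attack this would be the target completion's log-probability
    # under the model, given prompt + suffix.
    trigger = "xyzzy!"
    return sum(a == b for a, b in zip(suffix, trigger))

def coordinate_search(prompt: str, suffix_len: int = 6,
                      iters: int = 3000, seed: int = 0) -> str:
    """Replace one suffix position at a time, keeping changes that do not
    lower the score -- the skeleton shared by discrete suffix attacks."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = score(prompt, "".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(suffix_len)         # pick a position to mutate
        cand = suffix[:]
        cand[pos] = rng.choice(VOCAB)           # try a random replacement
        s = score(prompt, "".join(cand))
        if s >= best:                           # keep non-worsening changes
            suffix, best = cand, s
    return "".join(suffix)

adv = coordinate_search("some hypothetical prompt")
print(adv)  # typically recovers most of the toy trigger "xyzzy!"
```

Real attacks replace the random replacement step with gradient-guided candidate selection over token embeddings, which is why the continuous-vs-discrete distinction raised in the comments below matters.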
Informative talk. thank you.
Talk starts at 1:30
great channel!
Text embeddings/representations aren't actually continuous either; in fact, at 512x int4 they're "more" discrete than some writing systems. You're welcome. (Lovely & grounded talk, thanks!)
Scott Aaronson placing his jokes in the audience.
Now imagine writers being reduced to writing villain speeches, and nsfw scenes, and explicit descriptions of controversial themes and counterfactuals.
The thumbnail looked like Andrew Tate
comment of the year
There's this very bizarre gatekeeping of what is science, but otherwise a great talk.
Not bizarre at all. Of course there is a difference between science and industrial application. Your doctor diagnosing you isn't science. Your doctor discovering a new diagnosis is science. Not everything can or should be considered science.
Agreed very weird. It would suffice to draw the distinction by saying “searching for generally applicable explanations”.
Collecting data through manual effort (one off attacks) is entirely science. Informing the research community of one off data is entirely science.
It makes sense to me. He's not saying that a particular example can't be science in general; he's saying that's not what they were doing hacking up DAN 6.0. They weren't trying to scientifically figure out LLMs in general, they were specifically hacking on what they wanted to achieve. That's purely engineering.
@mungojelly Generally speaking, counterexamples can be real science, but in some of the LLM cases, yes, it's more engineering.