Are Aligned Language Models “Adversarially Aligned”?

  • Published: Aug 1, 2024
  • Nicholas Carlini (Google DeepMind)
    simons.berkeley.edu/talks/nic...
    Large Language Models and Transformers
    An "aligned" model is "helpful and harmless". In this talk I will show that while language models may be aligned under typical situations, they are not "adversarially aligned". Using standard techniques from adversarial examples, we can construct inputs to otherwise-aligned language models to coerce them into emitting harmful text and performing harmful behavior. Creating aligned models robust to adversaries will require significant advances in both alignment and adversarial machine learning.

Comments • 13

  • @ryan77anderson
    @ryan77anderson 11 months ago +1

    Informative talk. Thank you.

  • @MrNoipe
    @MrNoipe 11 months ago +5

    Talk starts at 1:30

  • @zenbauhaus1345
    @zenbauhaus1345 11 months ago

    great channel!

  • @LeonDerczynski
    @LeonDerczynski 11 months ago

    text embeddings/representations aren't actually continuous either; in fact, at 512x int4 they're "more" discrete than some writing systems. You're welcome (lovely & grounded talk, thanks!)

  • @cube2fox
    @cube2fox 9 months ago

    Scott Aaronson placing his jokes in the audience.

  • @jaredgreen2363
    @jaredgreen2363 10 months ago

    Now imagine writers being reduced to writing villain speeches, and nsfw scenes, and explicit descriptions of controversial themes and counterfactuals.

  • @tonny.c
    @tonny.c 11 months ago +2

    The thumbnail looked like Andrew Tate

  • @jonathanz9889
    @jonathanz9889 11 months ago +6

    There's this very bizarre gatekeeping of what counts as science, but otherwise a great talk.

    • @prescod
      @prescod 11 months ago +2

      Not bizarre at all. Of course there is a difference between science and industrial application. Your doctor diagnosing you isn't science. Your doctor discovering a new diagnosis is science. Not everything can or should be considered science.

    • @oncedidactic
      @oncedidactic 11 months ago +3

      Agreed, very weird. It would suffice to draw the distinction by saying "searching for generally applicable explanations".
      Collecting data through manual effort (one-off attacks) is entirely science. Informing the research community of one-off data is entirely science.

    • @mungojelly
      @mungojelly 10 months ago

      It makes sense to me. He's not saying that a particular example can't in general be science, just that that's not what they were doing when hacking up DAN 6.0: they weren't trying to scientifically figure out anything about LLMs in general, they were specifically hacking on what they wanted to achieve, and that's purely engineering.

    • @jonathanz9889
      @jonathanz9889 10 months ago

      @@mungojelly Generally speaking, counterexamples can be real science, but in some of the LLM cases, yes, it's more engineering.