Interpretability and Bias Identification in LMs - Vector Intern Talks

Поделиться
HTML-код
  • Опубликовано: 3 окт 2024
  • Chufei's talk examines how knowledge graphs can be used to automatically red-team LLMs for benchmarking safety towards social biases. She created a novel prompting method to convert natural language stereotype statements, such as "poor folks steal things," into a dynamic knowledge graph. Then, she uses retrieval-augmented generation (RAG) strategies to retrieve the most relevant stereotypes for a potentially biased scenario, and injects this knowledge to attack an LLM. This method of attack is automatic, interpretable, and shows increased bias in GPT-3.5 and GPT-4. Her work demonstrates the importance of safety training and alignment in LLMs.
    Do you want to gain hands-on ML experience? Explore Vector’s internship opportunities: vectorinstitut...

Комментарии •