"AI Safety Through Interpretable and Controllable Language Models" - Peter Hase, YRSS
- Published on Dec 26, 2024
- Originally presented on: Wednesday, November 20th, 2024 at 11:00am CT, TTIC, 6045 S. Kenwood Avenue, 5th Floor, Room 530
Title: "AI Safety Through Interpretable and Controllable Language Models"
Speaker: Peter Hase, University of North Carolina at Chapel Hill
Abstract: In a 2022 survey, 37% of NLP experts agreed that "AI decisions could cause nuclear-level catastrophe" in this century. This survey was conducted prior to the release of ChatGPT. The research community's now-common concern about catastrophic risks from AI highlights that long-standing problems in AI safety are as important as ever. In this talk, I will describe research on two core problems at the intersection of NLP and AI safety: (1) interpretability and (2) controllability. We need interpretability methods to verify that models use acceptable and generalizable reasoning to solve tasks. Controllability refers to our ability to steer individual behaviors in models on demand, which is helpful since pretrained models will need continual adjustment of specific knowledge and beliefs about the world. This talk will cover recent work on (1) open problems in interpretability, including mechanistic interpretability and chain-of-thought faithfulness, (2) fundamental problems with model editing, viewed through the lens of belief revision, and (3) scalable oversight, with a focus on weak-to-strong generalization. Together, these lines of research aim to develop rigorous technical foundations for ensuring the safety of increasingly capable AI systems.
Bio: Peter Hase is an AI Resident at Anthropic, working on the Alignment Science team. He recently completed his PhD at the University of North Carolina at Chapel Hill. His research focuses on NLP and AI Safety, with a particular emphasis on techniques for explaining and controlling model behavior. He has previously worked at AI2, Google, and Meta.
Timestamps:
00:05 Intro
00:43 Lecture
58:05 Q&A
#lm #languagemodel #artificialintelligence #ai #machinelearning #algorithm #computervision #robotics #research