"AI Safety Through Interpretable and Controllable Language Models" - Peter Hase, YRSS

  • Published on Dec 26, 2024
  • Originally presented on: Wednesday, November 20th, 2024 at 11:00am CT, TTIC, 6045 S. Kenwood Avenue, 5th Floor, Room 530
    Title: "AI Safety Through Interpretable and Controllable Language Models"
    Speaker: Peter Hase, University of North Carolina at Chapel Hill
    Abstract: In a 2022 survey, 37% of NLP experts agreed that "AI decisions could cause nuclear-level catastrophe" in this century. This survey was conducted prior to the release of ChatGPT. The research community's now-common concern about catastrophic risks from AI highlights that long-standing problems in AI safety are as important as ever. In this talk, I will describe research on two core problems at the intersection of NLP and AI safety: (1) interpretability and (2) controllability. We need interpretability methods to verify that models use acceptable and generalizable reasoning to solve tasks. Controllability refers to our ability to steer individual behaviors in models on demand, which is helpful since pretrained models will need continual adjustment of specific knowledge and beliefs about the world. This talk will cover recent work on (1) open problems in interpretability, including mechanistic interpretability and chain-of-thought faithfulness, (2) fundamental problems with model editing, viewed through the lens of belief revision, and (3) scalable oversight, with a focus on weak-to-strong generalization. Together, these lines of research aim to develop rigorous technical foundations for ensuring the safety of increasingly capable AI systems.
    Bio: Peter Hase is an AI Resident at Anthropic, working on the Alignment Science team. He recently completed his PhD at the University of North Carolina at Chapel Hill. His research focuses on NLP and AI Safety, with a particular emphasis on techniques for explaining and controlling model behavior. He has previously worked at AI2, Google, and Meta.
    Timestamps:
    00:00
    00:05 Intro
    00:43 Lecture
    58:05 Q&A
    #lm #languagemodel #artificialintelligence #ai #machinelearning #algorithm #computervision #robotics #research
