AstroAI Lunch Talk - September 9, 2024 - Shivam Raval

  • Published on Sep 24, 2024
  • Speaker: Shivam Raval (Harvard)
    Title: If [0.32, 0.42, -0.18, ... 0.86] is Monday, [0.48, -0.27, 0.98, ... -0.22] is Interpretability, which direction is Shivam's AstroAI Lunch Talk?
    Abstract: Frontier language models have a unique ability to combine and connect seemingly unrelated concepts to produce novel, surprising, yet seemingly plausible responses. A natural question arises: do they really understand human-interpretable concepts, and if so, can we extract them from the model internals? One of the main goals of machine learning interpretability is to identify and disentangle complex representations of inputs into human-interpretable concepts for transparency, control, and safety. The main focus of this talk will be on techniques used to understand what a "brain scan" of a model encodes and how to decompose it into its most atomic units. Recent findings [1,2] suggest that interpretable features may, surprisingly, be represented as linear directions in the high-dimensional space of the model's activations. I will briefly discuss empirical findings that support this hypothesis and how they can be operationalized toward designing better, aligned AI systems [3]. This so-called linear representation hypothesis has led to the use of sparse coding to decode the internal activations of large language models, and to the introduction of Sparse Autoencoders (SAEs) for interpretability and model steering [4]. Using toy examples and synthetic datasets, I will highlight some benefits and challenges of using SAEs for interpretability, the effect of architectural choices on the learned features, and what the future may look like for language model interpretability (a minimal sketch of such a toy setup appears after the references below). Finally, I will describe Lumiscope, an in-development platform for interactive interpretability that would allow researchers to study the internals of frontier models without having to implement interpretability techniques themselves. Through a case study of Patchscopes [5], a recently introduced interpretability framework, I will describe some early findings on training-free approaches to studying entity-attribute extraction and bias quantification using Lumiscope.
    [1] S. Marks and M. Tegmark. "The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets." arXiv:2310.06824 (2023).
    [2] A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Rimsky, W. Gurnee, and N. Nanda. "Refusal in Language Models Is Mediated by a Single Direction." Mechanistic Interpretability Workshop at ICML (2024).
    [3] Y. Chen, A. Wu, T. DePodesta, C. Yeh, K. Li, N. C. Marin, O. Patel, J. Riecke, S. Raval, O. Seow, M. Wattenberg, and F. Viégas. "Designing a Dashboard for Transparency and Control of Conversational AI." arXiv:2406.07882 (2024).
    [4] A. Templeton, T. Conerly, et al. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." (2024). transformer-ci...
    [5] N. Hussein, A. Ghandeharioun, R. Mullins, E. Reif, J. Wilson, N. Thain, and L. Dixon. "Can large language models explain their internal mechanisms?" (2024). pair.withgoogl...
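
The abstract describes sparse autoencoders as a way to decompose model activations into linear, human-interpretable directions. The sketch below is a minimal, hypothetical illustration of that idea, not material from the talk: synthetic "activations" are built as sparse combinations of a few ground-truth concept directions (in the spirit of the linear representation hypothesis), and a small SAE with an L1 sparsity penalty is trained to recover them. All dimensions, hyperparameters, and names are illustrative assumptions.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: each "activation" is a sparse sum of a few unit-norm
# concept directions, mimicking the linear representation hypothesis.
d_model, n_concepts, n_samples = 64, 16, 4096
directions = torch.randn(n_concepts, d_model)
directions = directions / directions.norm(dim=1, keepdim=True)
coeffs = torch.rand(n_samples, n_concepts) * (torch.rand(n_samples, n_concepts) < 0.1).float()
activations = coeffs @ directions

class SparseAutoencoder(nn.Module):
    """Overcomplete linear encoder with ReLU, linear decoder, L1 penalty on codes."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))
        return self.decoder(codes), codes

sae = SparseAutoencoder(d_model, d_hidden=4 * n_concepts)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity strength: one of the training choices that shapes learned features

for step in range(2000):
    recon, codes = sae(activations)
    loss = ((recon - activations) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# If the SAE has recovered the concepts, some learned decoder column should
# align closely with each ground-truth direction (cosine similarity near 1).
learned = torch.nn.functional.normalize(sae.decoder.weight.T, dim=1)
cosine = learned @ directions.T
print("best cosine match per true concept direction:", cosine.max(dim=0).values)

On toy data like this the recovered directions typically match well; the abstract's point about architectural choices can be explored here by varying the expansion factor (d_hidden) and the L1 coefficient, which strongly affect which features the SAE learns.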
