What is mechanistic interpretability? Neel Nanda explains.
- Published on Feb 21, 2023
- Art by @hamishdoodles
Clipped from episode 19 of AXRP: • 19 - Mechanistic Inter...
Transcript of that episode: axrp.net/episode/2023/02/04/e...
---
AXRP patreon: / axrpodcast
AXRP ko-fi: ko-fi.com/axrpodcast - Science & Technology
The images are quite helpful, especially for a complete beginner to the field when it comes to terms like stochastic gradient descent. This channel is very underrated.
Thanks - nice to hear!
Thank you!
What if we have an AI that does this for us? And an AI that interprets the interpreter, and so on. Maybe an AI wave process to give us a constant state of interpretation of what is going on.
There are approaches that use this tactic for outer alignment. I highly recommend checking out the classics: Christiano IDA and debate, etc. It's definitely a common motif in this area of research. But then again, I've seen people raise concerns that automating interpretability tools may enable deceptively aligned policies/agents to further entrench themselves.
Check out "AGI-Automated Interpretability is Suicide" by RicG
That's great, but how would you know it's doing it correctly...
@user-vt4bz2vl6j That is a fair question, idk. But at least it's a step.