What is mechanistic interpretability? Neel Nanda explains.

Comments • 7

  • @antigonemerlin
    @antigonemerlin 1 year ago +4

    The images are quite helpful, especially for a complete beginner to the field when it comes to terms like stochastic descent. This channel is very underrated.

    • @axrpodcast
      @axrpodcast 1 year ago +1

      Thanks - nice to hear!

  • @Words-.
    @Words-. 6 months ago

    Thank you!

  • @Words-.
    @Words-. 6 months ago

    What if we have an AI that does this for us? And an AI that interprets the interpreter, and so on. Maybe an AI wave process in order to give us a constant state of interpretation of what is going on.

    • @reidelliot1972
      @reidelliot1972 3 months ago +1

      There are approaches that use this tactic for outer alignment. I highly recommend checking out the classics: Christiano IDA and debate, etc. It's definitely a common motif in this area of research. But then again, I've seen people raise concerns that automating interpretability tools may enable deceptively aligned policies/agents to further entrench themselves.
      Check out "AGI-Automated Interpretability is Suicide" by RicG

    • @user-vt4bz2vl6j
      @user-vt4bz2vl6j 2 months ago

      That's great, but how would you know it's doing it correctly...

    • @Words-.
      @Words-. 2 months ago

      @user-vt4bz2vl6j That is a fair question, idk. But at least it's a step.