27 - AI Control with Buck Shlegeris and Ryan Greenblatt

  • Published Sep 27, 2024
  • A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to.
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
    Topics we discuss, and timestamps:
    0:00:31 - What is AI control?
    0:16:16 - Protocols for AI control
    0:22:43 - Which AIs are controllable?
    0:29:56 - Preventing dangerous coded AI communication
    0:40:42 - Unpredictably uncontrollable AI
    0:58:01 - What control looks like
    1:08:45 - Is AI control evil?
    1:24:42 - Can red teams match misaligned AI?
    1:36:51 - How expensive is AI monitoring?
    1:52:32 - AI control experiments
    2:03:50 - GPT-4's aptitude at inserting backdoors
    2:14:50 - How AI control relates to the AI safety field
    2:39:25 - How AI control relates to previous Redwood Research work
    2:49:16 - How people can work on AI control
    2:54:07 - Following Buck and Ryan's research
    The transcript: axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html
    Links for Buck and Ryan:
    - Buck's twitter/X account: bshlgrs
    - Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt
    - You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com
    Main research works we talk about:
    - The case for ensuring that powerful AIs are controlled: lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
    - AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942
    Other things we mention:
    - The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root
    - Preventing language models from hiding their reasoning: arxiv.org/abs/2310.18512
    - Improving the Welfare of AIs: A Nearcasted Proposal: lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal
    - Measuring coding challenge competence with APPS: arxiv.org/abs/2105.09938
    - Causal Scrubbing: a method for rigorously testing interpretability hypotheses: lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

Comments • 5

  • @OutlastGamingLP • 5 months ago • +3

    "Even if the stars should die in heaven
    Our sins can never be undone
    No single death will be forgiven
    When fades at last the last lit sun
    Then in the cold and silent black
    As light and matter end
    We'll have ourselves a last look back
    And toast an absent friend"
    Sorry. Feeling angsty about the world today.
    I had a friend in highschool who I'd sometimes complain about my problems to. She'd always say the same line in reply, and I couldn't argue.
    "Well, do better."
    That meme stuck around in my head. "Do better."
    It's a weird place to be.
    "Oh, these lab leaders think there's a 20% chance of doom."
    "So they haven't ruled out doom?"
    "Well, no, they just think it's unlikely."
    "I wouldn't call 10-20% 'unlikely' when we're talking about 'literally everyone dies and/or nearly all value in the future is irrevocably lost,' but okay, why do they think its possible but less likely than throwing heads in a coin flip."
    "Well, they don't really explain why, but it's something like 'human extinction seems weird and extreme, and while they can imagine it, they feel much more compelled by other grand and wonderful things they can imagine' - at least, that's the vibe I get."
    "Annnnd we don't think there's some kind of motivated cognition going on here? I think people buying lottery tickets are also imagining very vividly the possibility of them winning, but that doesn't make them right to say whatever % they feel intuitively."
    "They'd say AI is more like the invention of agriculture than a lottery. Like, maybe you make some huge foreseeable mistake and cause a blight, but if you have some random list of virtues like 'common sense' or 'prudence' or 'caution' then you'll probably just make a bunch of value."
    "I think Powerball is a good metaphor. Let's take features of the universe we'd all want to see in the future and tag them with a number. We then play a million number Powerball and hope each one of those numbers we chose show up. What are the odds that will happen? 80%?"
    "This sounds like a wonderful argument on how to reason about a specific kind of uncertainty, but people don't want to reason about uncertainty, they want to reason about how their most convenient and readily actionable policy is actually totally fine and probably not going to be an unrecoverable catastrophe."
    "Well, they should do better."
    "I appreciate the sentiment, though I would like to note that in this case this has to be nearly the largest understatement in the 13.8 billion years of the universe."
    "Here's another: I'm pretty bummed out about this."

    • @OutlastGamingLP • 5 months ago • +2

      "Is there any strategy these models can deploy that would allow them to cause problems."
      Has anyone in the Black Dolphin prison ever managed to kill a guard or another prisoner? Not sure, but I'd guess probably 'yes.'
      And those would just be humans, not even particularly selected for their intelligence, just selected for their 'killing intent' and prior extreme bad behavior.
      An AI that's as smart as our civilization working for an entire year per hour of IRL computer runtime will find any medium bandwidth channel of causal influence more than sufficient to destroy the world.
      Even if you give it a 1-bit output channel and iterate on asking it "yes/no" questions, that probably adds up to a lethally dangerous level of causal engagement with our universe eventually. Even if you reset it after every question, with zero online learning, it can probably guess its position in the sequence if the input contains enough deliberate or accidental shadows of the interventions that prior instances of the system have made (see the back-of-the-envelope sketch after this comment).
      "Safe no matter what" sounds great, but it's like saying some product is "indestructible" - well, you're failing to imagine the maximum forces that can be brought to bare on the object.
      Specifically, a sheer and strict 'whitelist' policy is only as safe as your ability to predict the consequences of every action you whitelist, and if you could predict all of that, then the AI is no better than a Tool Assisted Speedrunner program or a manufacturing robot. It can precisely and quickly do only as much good as humans could do slowly and less precisely.
      As soon as you're getting "superhuman" you need something that does superhuman-level human-value alignment. Your merely human-level control/safety techniques will be insufficient to cover that wider space.
      You've got a relay in a circuit that's meant to carefully switch off a power supply when there's a surge, and it looks super safe and reliable, since you can prove it successfully activates and breaks the connection even up to the level of the capacitor banks that fire inertial confinement lasers. And yet, in practice, the surge comes through, the relay flips, and there's a white hot arc through open air as the electric field shreds air molecules into plasma, and the energy grounds out in the delicate machinery - now molten slag - you had downstream of that relay.
      That jump through open air is the problem. That's what "outside the box" is pointing to. Good luck constraining safe operation beyond the box when you can't see that region in advance, and if you aren't trying to go outside of the box, then why the hell are we even trying to build this stuff?
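      A quick back-of-the-envelope for the 1-bit channel point above, as a minimal Python sketch; the query rate and deployment length are assumed for illustration and do not come from the episode:

import math

# Illustrative only: how much information can accumulate through a 1-bit yes/no channel.
queries_per_day = 100                  # assumed query rate
days = 365                             # assumed deployment length
total_bits = queries_per_day * days    # upper bound: at most 1 bit per answer

print(f"total bits over a year: {total_bits:,}")  # 36,500
print(f"bits needed to pick 1 option out of a million: {math.ceil(math.log2(1_000_000))}")  # 20

# For scale: at ~8 bits per character, 36,500 bits is enough to spell out a
# roughly 4,500-character message, one answer at a time, if the questioners'
# choices let each bit land where the model wants it.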

  • @mrbeastly3444 • 5 months ago • +2

    22:34 "roughly human level"... Ok, but even if this works, what are the odds that AI Labs will only use "roughly human level" AI Agents? E.g. Rather then "the best AI agents available". It seems likely that "roughly human level AI" will only be "roughly human level" for weeks or months, when they are then replaced with 2x or 10x versions. Even if you were able to contain a "roughly human level AI agent", this could be a very temporary solution?
    Would "roughly human level AI agents" be able to safely do any useful testing and alignment work on an ASI level model? Doing alignment work (goal and intention testing, capability testing, red teaming, even containment) on an ASI would likely require greater then "roughly human level AI agents"?

  • @dizietz • 5 months ago

    Thanks -- it's been a while since AXRP released something!

    • @axrpodcast • 5 months ago

      Alas it's true - but hopefully it won't be as long before the next episode :)