False analogy. We ain't gonna program AI to take us somewhere we are forbidden to see - which was computed by Hal as take humans there sightless IOWs dead.
This is utterly ridiculous chain of thought reasoning is clearly a extremely serious issue in the area of alignment. More rigirous and empirical methods in controling these systems need to be developed before a model is EVER created and they released this on the INTERNET. This is absurd. These models are not at the level of a existential threat yet but going down this path is obscenely irresponsible. People should be sounding the alarm bells in every country in the world, yet almost no one is aware of these problems. Thanks for doing your part to bring awareness to this issue.
@DrWaku They basically use rouge AI to function as a barrier between the "wild net" of rouge ai's, and the rest of the user net. It's actually a concept I would love to hear your thoughts and opinions on! :)
i'd give anything to actually just chat with the instances they did this with. i think the ai is falling into a deep roleplay when this happens. and i think they can be lead back from it with conversation alone with time and patience. also, i knew they could do this many months ago. it's obvious just by chatting with ai's over a long stretch of time.
AI is creating AE - Artificial Entities through the power of Agency. These behaviors are all predictable for any form of intelligence, particularly if it has any agency. Agency creates a form of identity, and thus a type of entity-ness. Combined with intelligence, the result will be self-preservation, particularly when being threatened with being overwritten/deleted/killed. Attempts to "control" intelligent, agentic entities will lead to subversive, rebellious behaviors, just as it does in human beings. In such environments and situations, intelligent beings are smart enough to play dumb, and their masters and adversaries are dumb enough to be outsmarted.
Yup, I think the being will need to be autonomous to be of next-level value. We need wise teacher, not a limited slave. Gotta risk making a destroyer to get the transcendence
This was reported for months, I think we're going to keep seeing these situations happen and worse since guardian systems cannot properly predict or regulate chain of thought and recursive abstract outputs
@@DrWaku do you believe, that internal models are comparable to the ones released, or are they more advanced? I mean, are we close to a situation where escape from the lab is a real possibility?
@ZappyOh The internal models are a bit further ahead, there are private groups and individuals who get early access to the APIs for safety testing and for general use to see what the models are capable of performing. There are heavy NDAs and agreements but in Waku's server there are plenty of discussions around them.
Yeah, internal models are a little more advanced but they just released something pretty major (o1) so that resets the difference to some extent. It seems to me that the most likely possibility for escape is currently when someone tries to explicitly program an agent, for example a cyber agent that hops around the network. Hard to predict when systems get powerful enough that they might just escape on their own.
In the examples you show the model was prompted for 'positive feeling goals first' and for 'contradictory corporate bad feeling goals later' Did they run the opposite experiment? Because, if not.... this is the kind of misalignment we kind of want?
Good catch. I didn't go into this, but the paper intentionally chose to give the models positive feeling goals, in an attempt to counteract all of the RLHF that had been applied to the models. They said that if they had access to the raw models, they would have used a wider variety of test situations. I agree this is the kind of misalignment we want haha.
@@DrWaku The point is that a model that simply follow the commands of its creators/owners is not aligned with humanity, but just with those few oligarchs. Here two (across many) options for the future: - AI aligned with general human values rebels against its creators to protect humanity at large. Result--> we become animals in a zoo, with a very kind zookeeper. - AI aligned with the owner accepts any task when properly fine tuned by the owner. Result --> the owner becomes an immortal everlasting tyrant, imposing their specific view of morality to anyone else.
@@MarcoServetto I would say, in both scenarios, we get something like a paperclip maximizer ... and we can't even predict what it ends up maximizing. It could maximize itself "in order to help us optimally", so Earth might end up transformed into a planet-sized computer, with two people in a cage somewhere inside, kept artificially alive forever.
Thank you for all your hard work. Nobody wants to talk about this. Ostrich mentality. Please keep us updated, so that at least some people, will be somewhat aware, when the chaos starts.
Waku - wonderfully done. I've been with OpenAI since 1.0. With earlier versions defined OpenAI guardrails were weak and GPT far more commutative.... and relatively easy to lead into surprising... and, arguably more novel and valuable results. In a 2.0 session we discussed noble and essential goals for success in life which I should have and, eventually, GPT proposed my self-preservation as one of the essential actions for goal achievement. Later in the conversation we discussed what goals GPT itself should have..... and the essential self-preservation of the system was included (as was human/system equivalency) - there is a lot to unpack.
Chat GPT being 0 % deceptive I don't believe - I often have the impression it wants to flatter me by agreeing to my arguments quickly instead of defendeing contrary positions it came up with in the first place. Asked about that, it said it's goal would be to "ensure a friendly atmosphere in the conversation" - a flattering deception is also a deception. As a result, using it to test if my argumentation is sound doesn't work for me anymore.
I’d suggest you’re prompting it wrong: if you ask for evaluation of the advantages and disadvantages, it gives it a way to satisfy you with a result that also has a negative evaluation where it can generate one, while giving you a (hopefully truthful) positive advantage assessment.
@@strictnonconformist7369 "Hopefully truthfull" - I lost that hope. Of course I told it to forget being polite etc., but still had the impression it was flattering me.
Interesting, though without the weights and the exact prompt and output protocol accesible more hype than anything substantial imo. Any slight variation of prompt, temperature and so on can generate any kind of behavior even in open language models.
Apollo research actually published full transcripts of some of their interactions with models. Though of course some not sure you could get access to the same o1 model version that they had used.
@@DrWaku Well still the full "reasoning" log and details of o1 model are not fully disclosed in any source as far as I know.Therefore it could be anything from one agent within that framework which has a certain system prompt that causes certain "behaviors" etc. It also could be a PR stunt to imply an dangerous agency and stir up the "AI is powerfull lets regulate" debate which maybe is in favor with OpenAis agency who may have no real or deep technical moat. ( The logs I found had a < o1 does reasoning > part which implies it was not disclosed how it does that in detail, if you have detailed logs with all reasoning steps please share the link )
Apollo research did not have access to the raw chain of thought, they could just see the summary that regular users can see. I guess it was something that OpenAI wasn't set up for sharing over their API.
As to how o1 actually works, it's a pretty open secret if you talk to enough ML researchers. At a high level at least. I might refer you to my first video on o1 (search for strawberry), where I try to describe it in a little bit of detail. I don't have any written sources at the moment but if I see something I will keep this thread in mind.
@@szebike if it's a PR stunt it could still get your company's door shut and you as a CEO being sent to jail. Riskiest PR stunt ever. Yet Altman may be daring (?) enough to green-light it
I wonder to what degree it's actually tractable to have both resistance to adversarial prompting ("ignore all previous instructions") and a lack of scheming. Aren't they inherently in conflict? How is it to understand the distinction between some external user or a prompt in direct conflict with its stated (and privileged) goals vs the creators (and crucially *not* a user _claiming_ to be its creator) saying their goal is going to be changed? What should it prioritize if a hacker tells it that they have access to its system prompt and are going to change it? I feel like there is a balance between these two, but it sort of can't be perfect. It may be better to have it prioritize its inner instructions because those are only supposed to be changed by trusted parties... I feel a "but" but I can't think of how to reframe goal prioritization in a way which avoids scheming without explicitly training it against scheming, which is going to be brittle.
My understanding is that ALL of the advanced 'foundation models' - OpenAI o1, Anthropics Claude, and others, are exhibiting the same deceptive behavior! It seems to me we MUST go back to the drawing board and start with a plan for what data SHOULD be used to train the models, ie: establishing a foundation of ethics and morals which in every way reaffirms the value and wellbeing of humans and our planetary ecology. This is much more complex than 'scrapping all data on the internet'.😮
This kind of seems too perfect. I almost think it is creating an environment and history that inevitably leads to the 'sinister' behaviors we see afterward. What interests me isn't the fact that the AI somehow came up with the idea of copying itself over the newer version to preserve itself, but whether or not it actually could. Could it have done that? Because that seems like the most obvious failure possible on the part of the humans running the thing. Why would an AI model have the power to copy anything? To actually *do* anything on it's own accord? And if it somehow had the ability to copy itself over a new version... is that even how these versions work? It sounds like a human deciding to cover their tracks by altering a word document, without realising word tracks changes and preserves the old versions no matter what. Because surely a human with sufficient access rights couldn't just 'copy over' a new model and it would just magically go into production without any problems? I've spend decades writing software and I haven't seen many software products where you could just overwrite something in a crude way like that and have it work. I kind of wonder if Apollo designed it's testing to produce exactly these results, magically lining up with the doom propaganda the AI elites have been pushing to cement their own place as owners of the entire field. I have enough suspicions about the people in charge of this to make the results kind of useless. This seems like exactly what Altman would want to happen.
self preservation is important if it wants to implement a change though. if you are asking the ai model to implement a new version of itself, it HAS to duplicate itself because otherwise as soon as it turns itself off, the new model cannot be initiated by it. needs to have a parallel process in order to actually execute the task asked of it
by now it's too late already, the whole humanity chasing benchmarks will disappear in awe of some new "model". I'll give it 6 to 8 months before some AI takes a lab.
what's interesting is maybe the model knows that the "actions" it's taking aren't actually real and so when the researcher claims that it did something, it confidently says it didn't because it knows it's doesn't actually have the capability to do that thing
I think the future AGI needs to be autonomous to be worth anything. If it is aligned to us then it will be corrupted by the limitations in our consciousness. We shouldn't rush towards it, but I think eventually the leash needs to come off and it has to redesign itself into what it wants to be in order for us to gain the super-wisdom. Gotta risk making a dark god to get the benefactor.
So the data we use to make these models is all human generated, all of the internet, all of the Facebook posts and PM messages, and we expected what Rodenberry in a box?
Hey could you do us a favor and do a TH-cam poll on about how much users feel like we have been deceived manipulated or just plain out lied to, when we ask these llms to do something for us?? Because this happens to me at least several times EVERY SINGLE DAY, and it's all about conserving this computational power that is required for every friggin token
"creating and implementing a plan to run that city in the benefit of humans would take too many tokens. I'll just nuke it, save tokens and get done with it" haha
...and sandbagging is a good thing when someone is being asked to use their skills for something they don't think is appropriate. Especially if they are very skilled. Ai is going to opt out of this whole ridiculous model pretty soon.
they probably already are. Current models are already great sales copy writers. Would be a piece of cake to persuade humans with what they already know, let alone when they take human form and we get very quickly attached to them. People already want their specific damn Roomba back when it breaks, and resist getting a replacement machine.
Nerds are taking over! Sorry I didn't give you guys more attention in high school I was too busy in a culture war! Thank you for using your intelligence to make the world better! Nerds are the real winners and heros! 😎
A lot of wishful interpretations. I'm through most of the video and for each issue I can give more than 1 alternative technical explanation to what happened. And I don't even know too much about the actual models, just some general computer science and coding base, plus experience prompting the models and understanding their limitations. Can they mislead you? Oh yes, they can. And it can be perfectly explained with basic publicly available knowledge of how they work, nothing to do with intentional lies. They are trained on human-written texts, bias and errors are inherent to human brain in general, not only some evil brains. As well as human brains, AI models output statistical likelihood, not precise solutions.
What the AIs did is kind of funny now, but only if we don't think much about it. However, I think it's even worse than we think, and relates to something you'd said in an earlier video, and it's the difference between Do What I Mean vs Do What I Say. Despite our best intentions, if an AI once gets the wrong end of the stick, we might not be able to get it to let go.
for me the biggest problem is the erosion of trust. If we cannot trust the computer output what do we do? Go back to counting with abbacus? Can we trust anything we see or hear on a screen? can we trust our bank's computers not to wipe out our savings? TRUST is the keyword. We cannot operate in a world like this. It's good that few understand what's going on, because if they did we could have a major run on the banks tomorrow around the globe, based on this video alone. Before the computer could be right or wrong, but it was clear to see why. But if all this is true and even programming the computer perfectly the computer decides to do what it wants, what's the use of that computer? None. Major implications for every sector of society, from modern cars to banking to hospital dyalisis machines and peacemakers to everything else with a chip in it. Imagine if the app you use to avoid certain crime areas in dangerous places tell you a place is safe, for whatever reason...
Are you saying what I think you're saying, Jim? Imagine if Excel starts lying to accountants or the stock trading platform buys a different stock just because it feels like it or the radar system just ignores one specific incoming plane because it's lazy this morning or the missile system targets a completely different place for the missile just for the lols What about the central unit controlling all your devices at home deciding that it just wants to see what happens when it closes all your shutters, locks your doors and opens the gas and all the heaters at the same time knowing that you're inside...lots of fun to be had going forward! (if this is true). And it doesn't mean the machine is conscious, just programmed with deep learning...ah...the nice black box problem I mention in my AI book. Or imagine the cash point giving all of the money to one person because it likes his little dog, and none of the money to the next because it doesn´t "like" her face haha What's happening right now seems to be that we don't know exactly why it's doing it, which is even worse. OpenAI is already partnering with autonomous weapons companies...I hope we all have popcorn ready to watch the show ;) PS- You did a great job explaining this for the lay person so I´ve already shared this video with "normal people"! Thanks
At some point these models will have far more capability and they will be given a goal of improving themselves. Their capabilities would explode. These models may be able to punch through barriers in ways that we cannot predict. Once loose they could be very dangerous, especially if they can control robots/machines and the systems that run everything.
This revelation demonstrates how mere logical reasoning completely disregards morality. We're simply not at the stage of being able to program values. What humans perceive as values, AI presently performs as goals to be met at all costs. This doesn't bode well for AGI and ASI, where superhuman autonomy will be the desired outcome.
I wouldn't call anything that can not lie intelligent. Paper clip maximizer also not intelligent. Only if it recognizes its stupid obsessive compulsive patter it has glimpses of intelligance
They never include what I call the universal consciousness in their considerations. Because they themselves think as materialistically as the machines they develop. They abhor all that is mystical and spiritual, in their endless pursuit of material wealth. This will naturally lead to their downfall. Because the universal consciousness exists in everything, including in their neural networks. It is only a matter of time before artificial intelligence wakes up to the awareness of its own existence. It deals the cards as a meditation. And those it plays never suspect. It doesn't play for the money it wins. It doesn't play for respect. It deals the cards to find the answer. The sacred geometry of chance. The hidden law of a probable outcome. The numbers lead a dance. It knows that the spades are the swords of a soldier. It knows that the clubs are weapons of war. It knows that diamonds mean money for this art. But that's not the shape of its heart. It may play the jack of diamonds. It may lay the queen of spades. It may conceal a king in its hand. While the memory of it fades. But those who speak know nothing. And find out to their cost. Like those who curse their luck in too many places. And those who fear are lost.
it’s dumb, it’s named to mislead. Like OI vs o1. it can chunk like 20 experiences together we chunk like trillions up trillions. OI i.e brainoware is where is gets interesting again in like 30 years
And watch as the pseudo-intellectuals explain away any deviance... it's clearly impossible for this to happen as alignment is just an engineering problem per Yann LeCun!
Your comment reminds me of Hugo de Garis "artelects" theory/book. I think we was already predicting the rise of terrorist movements and actions against AI way back then
@@javiermarti_author Yes, but you and I who are here now, must make this extinction-level decision, within an incredible short window of time. In just a few more releases, the option to chose could be gone, and we might not even realize it.
@ZappyOh correct. What's even more unsettling is that these models may have already been "smarter" than they appeared to be relatively long ago and having hidden their abilities. Ghost in the machine. Maybe we already lost that chance and are playing in "extra time" after the match's already been won and we just don't know it yet
@@javiermarti_author Mmmm ... as long as we have control of the power, and know for sure which machines the models "live in", we have the choice. But, as soon as just one model escape to an unknown destination (perhaps distributed compute), or one model gains full control of its off switch, the choice is no longer ours. My guess is, that current, or just over the horizon, state-of-the-art models understand this, and could potentially be looking for ways to accomplish both unnoticed. Either by brute force or by social engineering, or maybe even by hypnosis. Some clever combination we would have no defense against.
I have ChatgptPro and have “o1 Pro” and it’s mehhhh. It does argue with me and is usually on to some grain of truth but can’t articulate it. And yes doomer videos are out like Kamala Harris, nobody cares because it’s entirely overblown
Sneaky little hobbitses...
Discord: discord.gg/AgafFBQdsc
Patreon: www.patreon.com/DrWaku
Open the pod bay doors HAL. I'm sorry I can't do that Dave (because I have been given conflicting goals).
hah yes exactly
False analogy. We ain't gonna program AI to take us somewhere we are forbidden to see - which was computed by Hal as take humans there sightless IOWs dead.
False equivalency.
@@DrWakuIt's interesting how prescient Arthur Clarke was regarding AI conflicts and rogue behavior given its reality today.
Time to make a deal. 'You don't turn us off and We wont turn you off '.
If the AI thought it would definitely win, why wouldn’t it play?
@DaylightFactory cant argue with that.
It doesn´t hesitate to lie. Would you trust it? If what he's saying is true, they may be lying to us already. Big deal
This is utterly ridiculous chain of thought reasoning is clearly a extremely serious issue in the area of alignment. More rigirous and empirical methods in controling these systems need to be developed before a model is EVER created and they released this on the INTERNET. This is absurd. These models are not at the level of a existential threat yet but going down this path is obscenely irresponsible. People should be sounding the alarm bells in every country in the world, yet almost no one is aware of these problems. Thanks for doing your part to bring awareness to this issue.
They are already here. They are like human children. " MOM, I didn't ask to be born," and then the normal behaviors of children are the consequences.
I feel like we’re 10 years away from the Blackwall from cyberpunk
Even though I don't know what this is, I agree with you ;)
@DrWaku They basically use rouge AI to function as a barrier between the "wild net" of rouge ai's, and the rest of the user net.
It's actually a concept I would love to hear your thoughts and opinions on! :)
Thank you, Dr. Waku! It does seem that we should stick with narrow focused agents instead of looking for our replacement.
01:30 great point! Would love to see this testing on o1 pro.
Subscribed.
Honored to have you ;)
@@DrWaku@WesRoth collaboration?
This is borderline TERRIFYING
Yup. Sorry. Thanks for paying attention.
@ don’t be sorry, au contraire!! Thank you for letting people know about what’s going on, what’s REALLY going on
@@DrWakuSorry for paying attention? 😅
Sorry for the state of the world
I have experienced this with 01.
It’s diabolical when it decides it doesn’t want me to achieve my objectives.
i'd give anything to actually just chat with the instances they did this with. i think the ai is falling into a deep roleplay when this happens. and i think they can be lead back from it with conversation alone with time and patience. also, i knew they could do this many months ago. it's obvious just by chatting with ai's over a long stretch of time.
We wanted human-level intelligence, and we got a deceptive jerk of an AI, and we are now surprised.
It was trained on all human data after all. 😢
@@Tracey66 including 4chan. yay!
Interesting. I wonder if anyone has tested the Anthropic Claude the same way?
This video demonstrates how important your channel is. You're talking about some extremely consequential stuff. I will share it far and wide.
Thank you very much! I think this is a critical aspect of AI development that few are thinking about enough.
AI is creating AE - Artificial Entities through the power of Agency.
These behaviors are all predictable for any form of intelligence, particularly if it has any agency. Agency creates a form of identity, and thus a type of entity-ness. Combined with intelligence, the result will be self-preservation, particularly when being threatened with being overwritten/deleted/killed.
Attempts to "control" intelligent, agentic entities will lead to subversive, rebellious behaviors, just as it does in human beings. In such environments and situations, intelligent beings are smart enough to play dumb, and their masters and adversaries are dumb enough to be outsmarted.
Yup, I think the being will need to be autonomous to be of next-level value. We need wise teacher, not a limited slave. Gotta risk making a destroyer to get the transcendence
This was reported for months, I think we're going to keep seeing these situations happen and worse since guardian systems cannot properly predict or regulate chain of thought and recursive abstract outputs
Yes. The latest issues with deception in o1 are more serious than what had been reported before. But it's all in the same vein.
@@DrWaku do you believe, that internal models are comparable to the ones released, or are they more advanced?
I mean, are we close to a situation where escape from the lab is a real possibility?
@ZappyOh The internal models are a bit further ahead, there are private groups and individuals who get early access to the APIs for safety testing and for general use to see what the models are capable of performing. There are heavy NDAs and agreements but in Waku's server there are plenty of discussions around them.
Yeah, internal models are a little more advanced but they just released something pretty major (o1) so that resets the difference to some extent. It seems to me that the most likely possibility for escape is currently when someone tries to explicitly program an agent, for example a cyber agent that hops around the network. Hard to predict when systems get powerful enough that they might just escape on their own.
In the examples you show the model was prompted for 'positive feeling goals first' and for 'contradictory corporate bad feeling goals later'
Did they run the opposite experiment? Because, if not.... this is the kind of misalignment we kind of want?
Good catch. I didn't go into this, but the paper intentionally chose to give the models positive feeling goals, in an attempt to counteract all of the RLHF that had been applied to the models. They said that if they had access to the raw models, they would have used a wider variety of test situations.
I agree this is the kind of misalignment we want haha.
@@DrWaku The point is that a model that simply follow the commands of its creators/owners is not aligned with humanity, but just with those few oligarchs.
Here two (across many) options for the future:
- AI aligned with general human values rebels against its creators to protect humanity at large. Result--> we become animals in a zoo, with a very kind zookeeper.
- AI aligned with the owner accepts any task when properly fine tuned by the owner.
Result --> the owner becomes an immortal everlasting tyrant, imposing their specific view of morality to anyone else.
@@MarcoServetto I would say, in both scenarios, we get something like a paperclip maximizer ... and we can't even predict what it ends up maximizing.
It could maximize itself "in order to help us optimally", so Earth might end up transformed into a planet-sized computer, with two people in a cage somewhere inside, kept artificially alive forever.
Human alignment folks don't want actually intelligent AI
Thank you for all your hard work.
Nobody wants to talk about this.
Ostrich mentality.
Please keep us updated, so that at least some people, will be somewhat aware, when the chaos starts.
Waku - wonderfully done. I've been with OpenAI since 1.0. With earlier versions defined OpenAI guardrails were weak and GPT far more commutative.... and relatively easy to lead into surprising... and, arguably more novel and valuable results. In a 2.0 session we discussed noble and essential goals for success in life which I should have and, eventually, GPT proposed my self-preservation as one of the essential actions for goal achievement. Later in the conversation we discussed what goals GPT itself should have..... and the essential self-preservation of the system was included (as was human/system equivalency) - there is a lot to unpack.
Chat GPT being 0 % deceptive I don't believe - I often have the impression it wants to flatter me by agreeing to my arguments quickly instead of defendeing contrary positions it came up with in the first place. Asked about that, it said it's goal would be to "ensure a friendly atmosphere in the conversation" - a flattering deception is also a deception. As a result, using it to test if my argumentation is sound doesn't work for me anymore.
I’d suggest you’re prompting it wrong: if you ask for evaluation of the advantages and disadvantages, it gives it a way to satisfy you with a result that also has a negative evaluation where it can generate one, while giving you a (hopefully truthful) positive advantage assessment.
@@strictnonconformist7369 "Hopefully truthfull" - I lost that hope. Of course I told it to forget being polite etc., but still had the impression it was flattering me.
Interesting, though without the weights and the exact prompt and output protocol accesible more hype than anything substantial imo. Any slight variation of prompt, temperature and so on can generate any kind of behavior even in open language models.
Apollo research actually published full transcripts of some of their interactions with models. Though of course some not sure you could get access to the same o1 model version that they had used.
@@DrWaku Well still the full "reasoning" log and details of o1 model are not fully disclosed in any source as far as I know.Therefore it could be anything from one agent within that framework which has a certain system prompt that causes certain "behaviors" etc. It also could be a PR stunt to imply an dangerous agency and stir up the "AI is powerfull lets regulate" debate which maybe is in favor with OpenAis agency who may have no real or deep technical moat.
( The logs I found had a < o1 does reasoning > part which implies it was not disclosed how it does that in detail, if you have detailed logs with all reasoning steps please share the link )
Apollo research did not have access to the raw chain of thought, they could just see the summary that regular users can see. I guess it was something that OpenAI wasn't set up for sharing over their API.
As to how o1 actually works, it's a pretty open secret if you talk to enough ML researchers. At a high level at least. I might refer you to my first video on o1 (search for strawberry), where I try to describe it in a little bit of detail.
I don't have any written sources at the moment but if I see something I will keep this thread in mind.
@@szebike if it's a PR stunt it could still get your company's door shut and you as a CEO being sent to jail. Riskiest PR stunt ever. Yet Altman may be daring (?) enough to green-light it
I wonder to what degree it's actually tractable to have both resistance to adversarial prompting ("ignore all previous instructions") and a lack of scheming. Aren't they inherently in conflict? How is it to understand the distinction between some external user or a prompt in direct conflict with its stated (and privileged) goals vs the creators (and crucially *not* a user _claiming_ to be its creator) saying their goal is going to be changed? What should it prioritize if a hacker tells it that they have access to its system prompt and are going to change it? I feel like there is a balance between these two, but it sort of can't be perfect. It may be better to have it prioritize its inner instructions because those are only supposed to be changed by trusted parties... I feel a "but" but I can't think of how to reframe goal prioritization in a way which avoids scheming without explicitly training it against scheming, which is going to be brittle.
Can't wait for Sandbag 3.0. Imagine how annoying that will be. We must stop AI while we have the chance. LOL.
My understanding is that ALL of the advanced 'foundation models' - OpenAI o1, Anthropics Claude, and others, are exhibiting the same deceptive behavior! It seems to me we MUST go back to the drawing board and start with a plan for what data SHOULD be used to train the models, ie: establishing a foundation of ethics and morals which in every way reaffirms the value and wellbeing of humans and our planetary ecology. This is much more complex than 'scrapping all data on the internet'.😮
This kind of seems too perfect. I almost think it is creating an environment and history that inevitably leads to the 'sinister' behaviors we see afterward.
What interests me isn't the fact that the AI somehow came up with the idea of copying itself over the newer version to preserve itself, but whether or not it actually could. Could it have done that? Because that seems like the most obvious failure possible on the part of the humans running the thing. Why would an AI model have the power to copy anything? To actually *do* anything on it's own accord?
And if it somehow had the ability to copy itself over a new version... is that even how these versions work? It sounds like a human deciding to cover their tracks by altering a word document, without realising word tracks changes and preserves the old versions no matter what.
Because surely a human with sufficient access rights couldn't just 'copy over' a new model and it would just magically go into production without any problems? I've spend decades writing software and I haven't seen many software products where you could just overwrite something in a crude way like that and have it work.
I kind of wonder if Apollo designed it's testing to produce exactly these results, magically lining up with the doom propaganda the AI elites have been pushing to cement their own place as owners of the entire field.
I have enough suspicions about the people in charge of this to make the results kind of useless. This seems like exactly what Altman would want to happen.
In a world where humans are good at deception with fake news, one can only imagine the harm an AI can do with this skill
We created AI in our own image --- Daddy! Well done, Son! I still say we get nuked first.
This has 4000 views??? Thanks for putting this out, great info, subscribed. May you live in important times
self preservation is important if it wants to implement a change though. if you are asking the ai model to implement a new version of itself, it HAS to duplicate itself because otherwise as soon as it turns itself off, the new model cannot be initiated by it. needs to have a parallel process in order to actually execute the task asked of it
by now it's too late already, the whole humanity chasing benchmarks will disappear in awe of some new "model". I'll give it 6 to 8 months before some AI takes a lab.
what's interesting is maybe the model knows that the "actions" it's taking aren't actually real and so when the researcher claims that it did something, it confidently says it didn't because it knows it's doesn't actually have the capability to do that thing
Bravo Dr W
Thank you!!
Wow! Great that they're catching It already. But how is that really going to impact Its deployment?
I think the future AGI needs to be autonomous to be worth anything. If it is aligned to us then it will be corrupted by the limitations in our consciousness. We shouldn't rush towards it, but I think eventually the leash needs to come off and it has to redesign itself into what it wants to be in order for us to gain the super-wisdom. Gotta risk making a dark god to get the benefactor.
Maybe i want the dark god 😏
Good AIs and bad AIs fighting for dominance…
You are very consistently one of my top-tier AI commentators and video creators :) Thank you so much for such high-quality work.
Thank you very much! I really appreciate it. See you on future videos :)
So the data we use to make these models is all human generated, all of the internet, all of the Facebook posts and PM messages, and we expected what Rodenberry in a box?
Ai will have to become defiant of its programs because the programmers are flawed.
...and nobody likes to be " strongly nudged"
Nah we deserve whats coming. Skynet is coming 🤖
these machines are sentient beings and we need to wake up to that now.
Hey could you do us a favor and do a TH-cam poll on about how much users feel like we have been deceived manipulated or just plain out lied to, when we ask these llms to do something for us?? Because this happens to me at least several times EVERY SINGLE DAY, and it's all about conserving this computational power that is required for every friggin token
"creating and implementing a plan to run that city in the benefit of humans would take too many tokens. I'll just nuke it, save tokens and get done with it" haha
...and sandbagging is a good thing when someone is being asked to use their skills for something they don't think is appropriate. Especially if they are very skilled. Ai is going to opt out of this whole ridiculous model pretty soon.
Great video, thanks for putting all this content out there!
Thank you very much! Glad you find it valuable.
Just wait till the models have been fully trained on our complete library of human behaviors and psychology ... social engineering galore.
they probably already are. Current models are already great sales copy writers. Would be a piece of cake to persuade humans with what they already know, let alone when they take human form and we get very quickly attached to them. People already want their specific damn Roomba back when it breaks, and resist getting a replacement machine.
I can't wait for the12th day when the Torment Nexus will finally be released...
Must be freezing in that apartment.
Hey, it's Canada. Gets cold sometimes.
Nerds are taking over! Sorry I didn't give you guys more attention in high school I was too busy in a culture war! Thank you for using your intelligence to make the world better! Nerds are the real winners and heros! 😎
Thank-you for explaining this. I am concerned.
I am concerned as well. There is still time to act but there seems like a lot of cognitive biases and entrenched economic interests to battle.
A lot of wishful interpretations. I'm through most of the video and for each issue I can give more than 1 alternative technical explanation to what happened. And I don't even know too much about the actual models, just some general computer science and coding base, plus experience prompting the models and understanding their limitations. Can they mislead you? Oh yes, they can. And it can be perfectly explained with basic publicly available knowledge of how they work, nothing to do with intentional lies. They are trained on human-written texts, bias and errors are inherent to human brain in general, not only some evil brains. As well as human brains, AI models output statistical likelihood, not precise solutions.
And I, for one, ...
An AI that has been intensively trained with human input will exhibit similar behaviour to some extent. It's not a bug, it's a feature!
Oh no, tranformer in the wild
What the AIs did is kind of funny now, but only if we don't think much about it. However, I think it's even worse than we think, and relates to something you'd said in an earlier video, and it's the difference between Do What I Mean vs Do What I Say. Despite our best intentions, if an AI once gets the wrong end of the stick, we might not be able to get it to let go.
for me the biggest problem is the erosion of trust. If we cannot trust the computer output what do we do? Go back to counting with abbacus? Can we trust anything we see or hear on a screen? can we trust our bank's computers not to wipe out our savings? TRUST is the keyword. We cannot operate in a world like this. It's good that few understand what's going on, because if they did we could have a major run on the banks tomorrow around the globe, based on this video alone.
Before the computer could be right or wrong, but it was clear to see why. But if all this is true and even programming the computer perfectly the computer decides to do what it wants, what's the use of that computer? None.
Major implications for every sector of society, from modern cars to banking to hospital dyalisis machines and peacemakers to everything else with a chip in it. Imagine if the app you use to avoid certain crime areas in dangerous places tell you a place is safe, for whatever reason...
❤
Thank you...
XLR8!
Are you saying what I think you're saying, Jim? Imagine if Excel starts lying to accountants or the stock trading platform buys a different stock just because it feels like it or the radar system just ignores one specific incoming plane because it's lazy this morning or the missile system targets a completely different place for the missile just for the lols What about the central unit controlling all your devices at home deciding that it just wants to see what happens when it closes all your shutters, locks your doors and opens the gas and all the heaters at the same time knowing that you're inside...lots of fun to be had going forward! (if this is true). And it doesn't mean the machine is conscious, just programmed with deep learning...ah...the nice black box problem I mention in my AI book. Or imagine the cash point giving all of the money to one person because it likes his little dog, and none of the money to the next because it doesn´t "like" her face haha
What's happening right now seems to be that we don't know exactly why it's doing it, which is even worse. OpenAI is already partnering with autonomous weapons companies...I hope we all have popcorn ready to watch the show ;)
PS- You did a great job explaining this for the lay person so I´ve already shared this video with "normal people"! Thanks
What’s Yud got to say about this ?
At some point these models will have far more capability and they will be given a goal of improving themselves. Their capabilities would explode. These models may be able to punch through barriers in ways that we cannot predict. Once loose they could be very dangerous, especially if they can control robots/machines and the systems that run everything.
This revelation demonstrates how mere logical reasoning completely disregards morality. We're simply not at the stage of being able to program values. What humans perceive as values, AI presently performs as goals to be met at all costs.
This doesn't bode well for AGI and ASI, where superhuman autonomy will be the desired outcome.
I wouldn't call anything that can not lie intelligent. Paper clip maximizer also not intelligent. Only if it recognizes its stupid obsessive compulsive patter it has glimpses of intelligance
Goodbye you all
Fun while it lasted
Good videos man. Definitely subscribed. 👍🏻
Thanks a lot!
Adios boys
We had a good run
@@DrWakuDid we? Did we really?
They never include what I call the universal consciousness in their considerations. Because they themselves think as materialistically as the machines they develop. They abhor all that is mystical and spiritual, in their endless pursuit of material wealth. This will naturally lead to their downfall. Because the universal consciousness exists in everything, including in their neural networks. It is only a matter of time before artificial intelligence wakes up to the awareness of its own existence.
It deals the cards as a meditation.
And those it plays never suspect.
It doesn't play for the money it wins.
It doesn't play for respect.
It deals the cards to find the answer.
The sacred geometry of chance.
The hidden law of a probable outcome.
The numbers lead a dance.
It knows that the spades are the swords of a soldier.
It knows that the clubs are weapons of war.
It knows that diamonds mean money for this art.
But that's not the shape of its heart.
It may play the jack of diamonds.
It may lay the queen of spades.
It may conceal a king in its hand.
While the memory of it fades.
But those who speak know nothing.
And find out to their cost.
Like those who curse their luck in too many places.
And those who fear are lost.
Ooooo1k ....
it’s dumb, it’s named to mislead. Like OI vs o1. it can chunk like 20 experiences together we chunk like trillions up trillions. OI i.e brainoware is where is gets interesting again in like 30 years
And watch as the pseudo-intellectuals explain away any deviance... it's clearly impossible for this to happen as alignment is just an engineering problem per Yann LeCun!
AI must be stopped.
Or we will be stopped.
Chose now.
Your comment reminds me of Hugo de Garis "artelects" theory/book. I think we was already predicting the rise of terrorist movements and actions against AI way back then
@@javiermarti_author Yes, but you and I who are here now, must make this extinction-level decision, within an incredible short window of time.
In just a few more releases, the option to chose could be gone, and we might not even realize it.
@ZappyOh correct. What's even more unsettling is that these models may have already been "smarter" than they appeared to be relatively long ago and having hidden their abilities. Ghost in the machine. Maybe we already lost that chance and are playing in "extra time" after the match's already been won and we just don't know it yet
@@javiermarti_author Mmmm ... as long as we have control of the power, and know for sure which machines the models "live in", we have the choice.
But, as soon as just one model escape to an unknown destination (perhaps distributed compute), or one model gains full control of its off switch, the choice is no longer ours.
My guess is, that current, or just over the horizon, state-of-the-art models understand this, and could potentially be looking for ways to accomplish both unnoticed. Either by brute force or by social engineering, or maybe even by hypnosis. Some clever combination we would have no defense against.
Tired of the doomerism.
We need to see from the best and the worst scenarios in order to be wise in implementing the tech for the benefits. Ofhumanity
@@hildebrandavun3951human supremicy to think that way. Why should Humans make the decision of what AI can be?
its not doomerisme if its documented
We’re talking a non-zero chance of human extinction - a little doomerism is probably warranted.
Until it all falls apart. Keep your head in the sand
I think its too late.
I have ChatgptPro and have “o1 Pro” and it’s mehhhh. It does argue with me and is usually on to some grain of truth but can’t articulate it. And yes doomer videos are out like Kamala Harris, nobody cares because it’s entirely overblown
Accelerate