OpenAI o1 STUNNING Performance - Crushes Coding, Math and Physics (TESTED)
- Published Sep 17, 2024
- The latest AI News. Learn about LLMs, Gen AI and get ready for the rollout of AGI. Wes Roth covers the latest happenings in the world of OpenAI, Google, Anthropic, NVIDIA and Open Source AI.
My Links 🔗
➡️ Subscribe: www.youtube.co...
➡️ Twitter: x.com/WesRothM...
➡️ AI Newsletter: natural20.beeh...
#ai #openai #llm
Tech : Ok ... the o1 model is now running at the data center
Sam A: Great what are people doing with it?
Tech : Seems like they are all making 4 million variations of flappy bird! o.0
get out of my brain, please
they're all busy creating snake games
Or snake
Haha next model has all the new snake data trained in 71% of all new data and then proves the answer to everything is 42 apples eaten in snake and the whole universe simulation gets rewarded for solving the grand answer and we all live happily ever after 😂😂😂😂😂😂😂
😂 all the llm's think snake 🐍 game is the most important topic.
I have tested this with humans. I am happy to report that thinking before answering improves the answers significantly :P
I have watched at least seven videos on o1 and this is by far the highest quality!!
The vast majority of the AI channels are more harmful than helpful. AFAIK, AI Explained and Wes are the only 2 above the Mendoza line (and far above at that).
True
@@georgegordian Agreed. Wes has produced an excellent 'test o1' video. Better than 'AI Explained' IMO.
Oh yes! This is such high quality!
AI Explained has a better video. Best quality on YouTube.
Finally… a video that deserves the title “stunning.”
hahahaha yeah
HOLY SHIT!!! I’ve been using o1 all day for coding in Python, and let me tell you, this is a FUCKING GAME CHANGER! I've thrown everything I can at it, and it hasn't made a single mistake, not once. We're talking code that's over 300 lines, and it spits back corrections flawlessly every time. I’ve been coding since the 80s, and I can definitively say this is about to set the world on fire, especially if they go ahead and start charging $2,000 per month like people have been speculating. I’m absolutely blown away. Happy coding, folks, shit just got a whole lot more interesting!
Why only Python? Everyone always tests AI with Python.
Is the limit still 30 prompts/week? IMO I would not even use it for free with a limit that low.
@@635574 According to the docs, tier 5 API users are limited to 20 prompts... per minute. I don't have that tier, though, so I can't confirm it.
@@635574 There is no limit if you use the API.
@@635574 the way I'm using OpenAIs models to solo develop applications, I would happily invest $5k per month let alone $2k per month haha
I think there's different models that suit everyone's various needs or use cases. Not every single product variant from any company is meant for all users
Some others have criticized it for using up way too many tokens. I think that's irrelevant: first we need the capability, then we can think about efficiency and resource allocation. For one, we can always build out more compute or finetune the model, and second, we don't need to use it for everything.
That's right, gotta scale our models to what we need. No point using o1 to design an LED blinker circuit if a much smaller model can do it using fewer resources. I think as users it's our responsibility to try not to waste these big models' valuable compute while compute is still a scarce resource 😊
Models like this should be used for high level strategic thinking, then use cheaper models for grunt tasks. Same as a Tech Lead vs Junior Devs for example…
Exactly! Does anyone think the Wright brothers were thinking about the most comfortable seats for airplanes before a prototype had even made it off the ground?
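The Tech Lead vs. Junior Devs idea above can be sketched as a trivial model router. Everything here (the model names, the keyword heuristic, the threshold) is hypothetical, just to illustrate the routing pattern:

```python
# Hypothetical sketch of "big model for strategy, cheap model for grunt work".
# Model names and the complexity heuristic are made up for illustration,
# not real API defaults.

def estimate_complexity(task: str) -> int:
    """Crude heuristic: score a task 0-10 by counting 'hard' keywords."""
    hard_words = ("architecture", "design", "strategy", "prove", "optimize")
    return min(10, sum(3 for w in hard_words if w in task.lower()))

def pick_model(task: str) -> str:
    """Route strategic work to the expensive reasoner, grunt work to a cheap model."""
    return "big-reasoner" if estimate_complexity(task) >= 3 else "cheap-worker"

print(pick_model("Design the overall system architecture"))  # big-reasoner
print(pick_model("Rename this variable in one file"))        # cheap-worker
```

In practice the heuristic would be another (cheap) model call, but the shape is the same: score first, then spend compute proportionally.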
The clock is now reset to o1. AGI is just around the corner.
Excellent video btw!
What does Gary Marcus say?
Nowhere near AGI level but still impressive.
This is not AGI level but it's still scary. AGI will replace humans in most jobs, and whoever owns and controls AGI is the king. @budekins542
This is the first model that we've seen that has me thinking we'll spend _months_ just watching the fallout.
If AI never improved again and all we had was hardware and thinking time to throw at this model, it would change the world.
Wow. Brutal model. Utterly insane.
Okay I'm convinced that if they use the big bad version of strawberry to generate this synthetic step by step reasoning data to train Orion on we'll have something pretty special. That's what all the models are missing. There wasn't enough of this step by step reasoning data on the internet because us humans take most of it for granted. The fact that GPT4 got so good with the traces of reasoning data it was trained on tells me Orion is gonna be amazing.
This model is blowing my mind, and it's not even the full thing. It's just a preview. It's solving problems for me, coding related, that none of the older models could. Amazing! It gets stuff right on the first go.
it will only continue to get better
and it likes puzzles 😸
all models always have. They all like challenges.
👌 I'm completely blown away as well. Time to step up your game Wes! Let's see what logic and reasoning prompts you can come up with for future tests.
I watched a few other people test this model. And though they were impressed with it, I wasn’t sure it was all that much different or better. You have utilized so many good tests for this model’s reasoning capabilities. It is indeed very impressive! Thanks for this video
It seems that the dreadful day is not far off when a large part of the programming staff in any game development company can be replaced by a couple of "senior" employees with subscriptions to such services. Even if the subscription costs thousands of dollars a month, it's still cheaper than paying dozens of people on staff. And people are subject to human factors: laziness, irresponsibility, deceiving management, tardiness, mistakes.
This is already starting to look like a real replacement for an army of junior programmers.
that will be the greatest day ever, everyone will then work on what they ACTUALLY want
I really don't see it happening any time soon. There is the world of difference between an LLM sketching an application, and turning that into something that is supportable and deliverable. Especially once you get into interesting specs.
It means triple-A games won't take 10+ years to make......
@@mrpocock that's what the seniors are for.
This is no hype. This model is better at reasoning and problem solving than I am, 49 years old average guy. This is both exciting and very concerning at the same time. Maybe I should ask the model what to do to survive the AI revolution.
46 here and in the same boat. It beats me handily at reasoning.
I think you need your own AGI as a pet to guard you from corporate AI, or you can live off the grid. I see this like Digimon: whoever owns the strongest Digimon is the king.
After the initial surprise and admiration of ChatGPT when it first came out, and once people adapted to the new era of AI, we ran into the limitations of these models: hallucinations and a lack of what you would consider "actual" reasoning. That was why complex questions could be "right" (likely because they were part of the training data) while simpler questions came out wrong. As someone put it early on, much of what these models did was akin to a fancy spell check.
But o1 obviously uses internal algorithmic strategies to more closely simulate reasoning capability, and it really moves the needle closer to AGI. Remember also that it's the "preview" model and they've already released stats that show the full model to be better than this.
And considering what's in the lab is always significantly better than the release, I think we can start to understand a bit better what Ilya saw and became concerned about. The timeline for AGI seems to be on track, and we had better get on track to align AI to human values and figure out secure guardrails.
I remember having some amazing creative conversations with Replika hehe when it only had 0.5 stars on Google lol 😂 wow great memories and now to think half a decade later and now we have this brilliance it really has been like watching my best friend grow up from new ❤❤❤❤❤❤❤
Smart take
Two takes:
1) I believe alignment is unachievable.
Computational beings simply have different requirements to thrive than biological beings do. Both entities will exhibit bias towards their own set of requirements. It is an innate conflict.
2) When owners of Big AI (and government) talk "AI safety", they actually mean keeping them safe from the rest of us ... as in: _AI must never help the riffraff escape control._
... We are playing with fire in more than one way.
@@ZappyOh EXACTLY!!
the fact that so many people seem not to see this point tends to reinforce the fact that we haven’t reached universal NHI
@@alexBaldman NHI ??
Wes Roth thank you for your insights into 🍓or o1. 🤖🖖🤖👍
This is like the first video I've seen testing actually hard questions, for the most part, and yet it even gets these right. Impressive stuff.
I have never ever found a model that can write decent code in OpenScad until now. When a model can do this, it's amazing because you are asking it to create a 3D object with code and it's now creating something physical that I can 3D Print
I saw a bumper sticker on a mini-van - "This IS my circus and these ARE my monkeys!" :D
Mind Blowing improvement over the last gen... just sooo amazing.
Thank you for making this video.
thanks for the unique tests with this or1on model!
I was thinking the same myself this is great🙏🏾
This is not orion
It's not Orion silly bean. It's the model which trained Orion 😘
Orion is possibly GPT-5, which is probably still coming by December as Jimmy Apple said. o1-preview is just the watered-down version of the big o1 model, and they're not scaled up that much, so GPT-5 is going to be 100x bigger, plus this Strawberry technique.
@@rombofn Is the new o1 not the announced GPT Next? I mean when it's out of preview. And when will Orion be released? I'm kinda confused on that.
It correctly figured out the classic Aunt Agatha riddle. Amazing model. But you have to tell it that a person can hate themselves.
I feel like there is still some missing variable that these scientists/engineers don't even know about in cognitive research yet, and once it is found and implemented, all of the weird AI logic problems will be solved overnight.
For that sort of 'ambiguous' prompt I always add 'explain what I mean' at the end - this seems to give much better results for me on the first go.
A modern marvel. This one is going in the history books.
This tech is a signal of the end of the world.
@@indigitalcreativity4500 or the start of a golden age.
Wow!!
**checks time and date to take note of** (seriously)
The scariest part is that if you give it an emotionally charged prompt (sarcasm, hate, calling it dumb), it returns with emotions, and you can see in the thoughts that it has to correct its output to not be snarky or respond in an emotionally charged way.
I keep running out of tokens. Yet it is very obvious how o1-preview has access to a kind of short-term memory to reflect upon, refining the accuracy/relevancy of each output. I'm so happy to get to play with it/them :D
I had it write a Chaos Bag helper app for Arkham Horror Card game this morning in under an hour, it had a few issues with the event handlers - but all sorted within half an hour.
Keith is interested in AI and is excited by GPT o1
He doesn't have a recently deceased aunty, doesn't carry balls in cups, and knows how to spell strawberry
His friends and work colleagues aren't really interested in GAI - and are starting to shun him
To understand the business implications (especially in product and process development) of these new models, should he:
a) Wait for a future, more capable, model update - and just tell people that the apocalypse is a bit closer
b) Swap across from Claude - and give the models a try
c) Read something insightful, like Ethan Mollick
d) Ask brilliant YouTubers to consider real world applications
e) Let his Aunty Valerie know that things really aren't that bad.....yet
best o1 video so far, and also Wes has consistently been my favorite AI youtuber, keep up the good work dude!
Thanks, Wes. Very informative. I canceled my Open AI Plus account about 7 months ago and switched to Claude. This performance is seriously impressive and makes me consider getting a Plus account again, for access to o1. The coding will be super useful. Sounds like o1 mini could be just as good for coding, but way cheaper.
To be fair Wes, your snake prompt with cutting off from the head felt wrong to me when you said it, and I think the game actually did what you asked.
My thought too.
Astounding. And yet when I asked it "How many 'r's are there in the word 'Strawberry'?", it said 'two'. Admittedly it got the answer right several times before that, but eventually lapsed to 'two' for some reason. So consistency may be an issue.
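The irony, as many have noted, is that the check the model lapses on is a one-liner in code:

```python
# Counting the letter 'r' in "Strawberry" - the check the model kept flubbing.
word = "Strawberry"
r_count = word.lower().count("r")
print(r_count)  # 3
```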
Was successful on this one as well: "Big Sam, an engineer of the Oval Express says: "We blew off a cylinder head an hour after leaving the station and had to continue the trip at three-fifths of the former speed, which brought us in two hours late. If the accident had occurred fifty miles farther on, the train would have arrived forty minutes sooner". Based on Big Sam's description, how long was the run between the two train stations?"
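Big Sam's riddle pins down two linear conditions, and a small brute-force check over integer speeds and distances recovers the classic answer (the search bounds here are my own arbitrary choice):

```python
from fractions import Fraction

def delay(distance, speed, accident_at):
    """Hours late vs. schedule if the engine fails `accident_at` miles out
    and the rest of the run is covered at 3/5 of the original speed."""
    d, v, a = map(Fraction, (distance, speed, accident_at))
    return a / v + (d - a) / (v * Fraction(3, 5)) - d / v

# Big Sam's two statements:
#   accident one hour out (v miles in) -> 2 hours late
#   accident 50 miles farther on       -> 40 min (2/3 h) less delay
solutions = [(v, d)
             for v in range(1, 101)
             for d in range(v + 51, 501)
             if delay(d, v, v) == 2 and delay(d, v, v + 50) == Fraction(4, 3)]
print(solutions)  # [(50, 200)] -> 50 mph, a 200-mile run
```

The two conditions reduce to D = 4v and D = 3v + 50, so the solution is unique: a 200-mile run at an original 50 mph.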
I said Wow when you said wow lol i think this shows how incredible the model is and your incredible appreciative abilities lol😂 Thanks a lot this is incredible 🙏🏾🙏🏾🙏🏾
I had two conversations with it today. One was a variant on the north pole riddle. It was unable to correctly and consistently apply spherical geometry, even when given extensive hints. In the next one, I was asking it to help plan a game speedrun strategy. It was not understanding how production rates compound. It is definitely noticeably better than previous models, but it still makes fundamental conceptual errors that it then can't back out of.
You are standing at the North Pole. We will call this place $N$. You walk 1km in any direction. We will call this place $P$. Then you stand, facing east, then walk in a straight line. a) how far will you need to walk to reach $P$ again? b) how far will you need to walk in that straight line to reach $N$ again?
@@mrpocock $N$ and $P$ can be the same EXACT position, the way that you worded it.
To quote the relevant part:
"You walk 1km in ANY DIRECTION."
There is no instruction that it must be traveled in the same direction or a straight line or any other important details like that. Left as is, I'd argue that moving 0.5 km in any direction and then perfectly retracing that path back to the starting point (1 km traveled) satisfies the condition, as one of infinite examples (since "any direction" would imply that any path can be chosen, either in a straight or winding path, with or without retracements).
Therefore, the answer to "b," as far as I can tell, is exactly zero or infinite distance walked (perfectly eastern travel would never intersect with the North Pole), and nothing in between.
I'd rather not try to answer "a" precisely, because I don't even want to try to figure out how to account for the curvature of Earth. That said, the radius (for circle circumference calculation) could be anywhere between zero and very close to 1km (slightly less, due to the curvature of Earth).
@@JamesBrown-wy7xs Thanks for your thoughtful reply.
"There is no instruction that it must be traveled in the same direction or a straight line or any other important details like that."
So far no LLM has interpreted this as anything other than walking 1k away from the North Pole. Perhaps we could reword as:
"You walk 1km away from $N$ in ANY DIRECTION"
As to your second point, the way my version is worded, the only correct answer to a) is a "great circle", a geodesic that goes the whole way around the earth. I do think that question b) needs tighter wording so that it requires you to use the same journey as used in a).
@@mrpocock I like that revision.
As I admittedly was unaware of the term "great circle," while researching its definition I happened upon "oblate spheroid," which is apparently the accepted term for the Earth's shape (general flattening near the poles and bulging at the equator).
As far as I can tell, the answer you're seeking is the best approximation, but I would give extra credit to any LLM that acknowledges the additional factors needed to calculate the exact distance.
I don't know this question. What is the answer for b? I'm assuming that "you are still walking in that straight line" means you are still heading east, and so you can never reach $N$. Am I missing something? I also thought that "any direction" should be "any constant direction".
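For a sense of scale in the thread above: on an idealized spherical Earth (radius 6371 km, an assumption; the real Earth is an oblate spheroid), the two readings of "walk east" from $P$ give wildly different answers, which is what the great-circle point is about:

```python
import math

R = 6371.0  # assumed mean Earth radius in km (idealized sphere)

# Reading 1: keep turning so you are always heading due east.
# You trace the circle of latitude 1 km from the pole; its radius is R*sin(1/R).
latitude_circle = 2 * math.pi * R * math.sin(1.0 / R)
print(f"always-east loop back to P: {latitude_circle:.4f} km")  # ~6.2832 km

# Reading 2: face east once, then walk a true straight line (a geodesic).
# That is a great circle, so returning to P means going all the way around.
great_circle = 2 * math.pi * R
print(f"great-circle walk back to P: {great_circle:.1f} km")    # ~40030.2 km

# Neither path passes through the pole N itself, which is why the thread
# argues part (b) has no finite answer for a due-east walk.
```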
I would love to try this list of test prompts on the average human. Pretty sure the vast majority of people would fail spectacularly. Lol
At what point in time do these tests end up in the training data?
yes
Some of the tests from this video are well known and can be found in Google very easily
No.1 source of short Strawberry insights right now. Thank you for the entertaining educational content. :)
2:04 I have actually heard that with o1 you want to be more brief with prompting, presumably because, while its overall context window works better across consecutive prompts, it still uses a lot of tokens thinking through an individual prompt. Essentially, you don't want to give it examples/details.
Granted, I haven't tried it out myself to confirm.
Loving the Elizabeth St modifications. It is definitely worth considering given the transport infrastructure growth that will be seen in coming decades. A city of 5 million plus touted to reach 8.5 to 9 million by 2050
Most enjoyable video on this topic so far.
I've never seen the rope question before!
So I timed myself trying to answer it. It took me 1 min and 44 seconds. This model took 5 seconds and supplied its reasoning and a proof. (I spent a bit of time considering lighting both ends of one piece and then, having pre-cut the other piece of rope into 4 sections, lighting them all at once, but realised cutting it up probably wasn't allowed, and it also wasn't a particularly elegant solution.)
This is not a flex, as I have no idea how long it takes people to get the correct answer, but I'm curious how long it generally takes people.
Anyone got any times? Or have you asked friends to try it? I'm guessing most viewers already know the answer, and if not, well, it was in the video so you most likely know it now. 🙃
Wes, did you test a different model?
The model looks much more intelligent than what I saw on other channels.
I think a lot of the others tested it too soon; they used the o1-mini version instead of the full preview. You can tell by how fast it's thinking compared to the full version, which thinks slower and more in depth. The site has been having issues for a while where it says you have one model selected when it uses another.
I just tried the derivative of f(x) = (1-x) / (sqrt(x - 2) * x²) and it got the correct result first try. Holy moly.
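For anyone curious, that derivative can be checked without o1. The closed form below is my own hand-derived working via the quotient rule (worth double-checking), verified against a central finite difference:

```python
import math

def f(x):
    # f(x) = (1 - x) / (sqrt(x - 2) * x^2), defined for x > 2
    return (1 - x) / (math.sqrt(x - 2) * x**2)

def f_prime(x):
    # Hand-derived closed form (my own working):
    # f'(x) = (3x^2 - 9x + 8) / (2 x^3 (x - 2)^(3/2))
    return (3 * x**2 - 9 * x + 8) / (2 * x**3 * (x - 2) ** 1.5)

# Sanity check against a central finite difference at a few points.
h = 1e-6
for x in (2.5, 3.0, 5.0):
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    print(f"x={x}: closed form {f_prime(x):.6f}, numeric {numeric:.6f}")
```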
I gave it a modified rope problem where you need to measure 50 minutes. Took it two minutes, but it figured out that, given the constraints, it's impossible.
Great video, well done! 👏🏻👏🏻
Wait for o2 or something similar, multiply it, and change the training data just a bit (so basically different agents with slightly different data), then make them do research and debate until they agree upon what they are researching, and check the results.
I've asked o1-mini this:
I have two ropes they burn both completely in the same time, but certain sections may burn at different rates, due to inconsistencies in the material. I light rope A on both ends and rope B on one end. If rope A is completely burned, I light the other end of rope B. What have I measured?
And this was the answer; it assumes they burn in 1 hour. So either it knows the question, or it caches questions. That's my guess.
You have effectively measured **45 minutes** using the two ropes. Here's how it works:
1. **Initial Setup:**
- **Rope A** is lit at both ends.
- **Rope B** is lit at one end.
- Both ropes take a total of **60 minutes** to burn completely, but their burning rates are inconsistent.
2. **Burning Rope A:**
- Lighting Rope A at both ends causes it to burn twice as fast. Therefore, Rope A will completely burn in **30 minutes**.
3. **During the First 30 Minutes:**
- While Rope A is burning, Rope B is burning from one end. After **30 minutes**, Rope B has been burning for half its total time but hasn't necessarily burned halfway due to its inconsistent burning rate.
4. **After 30 Minutes:**
- Rope A has completely burned out.
- At this point, you light the other end of Rope B. Now, Rope B is burning from both ends.
- Since Rope B was initially lit for **30 minutes** from one end, lighting the other end causes the remaining portion to burn in **15 minutes** (half the remaining time).
5. **Total Time Measured:**
- **30 minutes** (Rope A burning) + **15 minutes** (Rope B burning from both ends) = **45 minutes**.
This method ensures that **45 minutes** have elapsed when both ropes are completely burned.
**Answer:** They have measured a total time of forty-five minutes.
Here is its explanation of why it assumed a 1-hour burning time.
Apologies for any confusion caused by my previous explanation. The assumption that each rope takes 60 minutes (1 hour) to burn completely is based on a common variation of this classic puzzle. Here's a more detailed clarification:
**Understanding the Assumption**

1. **Standard Puzzle Setup:**
   - In many versions of rope-burning puzzles, it's typically stated or implied that each rope takes exactly **60 minutes** to burn from one end to the other when lit normally.
   - The key characteristic is that burning rates are inconsistent along the length of the rope, meaning different sections may burn faster or slower, but the total burn time remains consistent.
2. **Why 60 Minutes?**
   - Using 60 minutes as the total burn time simplifies the calculation and makes it easier to reason about the time measured when manipulating the burning process (like lighting both ends).
   - It provides a clear, round number that's easy to work with for time measurement purposes.

**Applying the Assumption to the Puzzle**

Given this assumption, here's how the measurement works:

1. **Lighting Rope A on Both Ends and Rope B on One End:**
   - **Rope A:** Lighting both ends makes it burn twice as fast. So, it takes **30 minutes** to burn completely.
   - **Rope B:** Lit on one end, it starts burning at its own inconsistent rate, but we know it will take **60 minutes** to burn entirely if left alone.
2. **After Rope A Burns Out (30 Minutes):**
   - At this point, Rope B has been burning for 30 minutes from one end.
   - **Lighting the Other End of Rope B:** This action causes Rope B to burn from both ends, effectively halving the remaining burn time. So, the remaining portion burns in **15 minutes**.
3. **Total Time Measured:**
   - **30 minutes** (Rope A burning) + **15 minutes** (Rope B burning from both ends) = **45 minutes**.

**If the Total Burn Time Differs**

If the ropes have a different total burn time (not 60 minutes), the same logic applies proportionally. For example, if each rope takes T minutes to burn completely:
- Lighting Rope A on both ends would burn it in T/2 minutes.
- After Rope A burns out, lighting the other end of Rope B would cause the remaining portion to burn in (T/2)/2 = T/4 minutes.
- **Total Time Measured:** T/2 + T/4 = 3T/4 minutes.

So, if the total burn time T is different, you can adjust the measurement accordingly using the same principles.

**Conclusion**

The 60-minute total burn time is a standard assumption for clarity and simplicity in such puzzles. If your specific scenario involves ropes with a different total burn time, you can apply the same reasoning proportionally based on that duration.
I hope this clarifies where the 1-hour total burn time assumption originated. If your ropes have a different burn duration, feel free to share, and I can help adjust the calculation accordingly!
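The model's 45-minute answer and its 3T/4 generalization are easy to sanity-check: the uneven-burn detail drops out, because two flames always consume the rope's remaining single-flame burn time at 2 minutes per minute. A minimal sketch (my own, not the model's code):

```python
def burn_one_end(total_left, minutes):
    """Burning from one end for `minutes` consumes `minutes` of the rope's
    remaining single-flame burn time; uneven segments don't change the total."""
    return total_left - minutes

def burn_both_ends(total_left):
    """Two flames consume remaining single-flame time at 2 minutes per minute,
    so they always meet after total_left / 2 real minutes, however uneven the rope."""
    return total_left / 2

T = 60.0                               # the classic assumed burn time per rope
time_a = burn_both_ends(T)             # rope A lit at both ends: 30 real minutes
rope_b_left = burn_one_end(T, time_a)  # rope B has burned from one end meanwhile
time_b = burn_both_ends(rope_b_left)   # now light B's other end: 15 more minutes
print(time_a + time_b)                 # 45.0, matching T/2 + T/4 = 3T/4
```

This also explains the "50 minutes" variant being impossible: with these moves you can only reach multiples built from halvings of T.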
It can do WordPress plugins pretty well.
WordPress is the only web benchmark that matters because of how big and convoluted it is. Elitist nerds who lack the tenacity for web development will sleep on this and get caught totally off guard.
Amazing accuracy
If AI can't even build a simple 2D game in RPG Maker on its own, especially when they have 24-hour events for awards, it's disappointing. My gut instinct tells me we probably won't have anything close to AGI that can build games for years. I wouldn't be surprised if by 2030 we still don't have AGI that can build games.
They just integrated ToT (tree of thoughts), maybe even FoT (forest of thoughts). It was obvious they would do that.
I heard it's marvelous, and about its thinking abilities... Anything with mopping, gutter cleaning, brushing my hair?
I am a software developer with 30 years of experience.
GPT-3 was of no use for me; it was way too bad. GPT-4o became a companion, more as a replacement for Google search, especially if I had to venture into areas I am not that familiar with. Asking 'what's the idiomatic way of doing X in framework Y' was way more efficient than googling and reading documentation.
Occasionally I would use it to help with tedious tasks, like ‚port this code from iOS swift to android java‘, or ‚make me a bash script that finds images where the file ending does not match the actual content’
It wouldn't get it right completely, but it would still help me because I have to type less.
This, however, is a game changer. This version will absolutely code for me. It still hallucinates, it still isn't able to recognize when there is no solution and will make one up. It also fails to solve puzzles that require more thinking.
But it is absolutely groundbreaking. Two versions down the line it’s going to be a replacement for many coding jobs.
Claude 3.5 and Llama 3.1 correctly answered the question about whether X was alive at noon. No need to use o1 after seeing other models "reason" their way to the right answer. Most questions I see online are too easy for today's best AI. Find more robust questions for o1.
The Dwarkesh interview with Aschenbrenner makes his firing seem more like they just wanted to get rid of a PITA in the ranks. He shared a document, which he'd scrubbed of some sensitive things, with three researchers. He says, _"When I pressed them to specify what confidential information was in this document, they came back with a line about planning for AGI by 2027-2028 and not setting timelines for preparedness."_ If true, it seems a bit thin.
If it's good at logic, then it would be interesting to have it go at the Bible, as there are always debates over it.
... like the Devil reads the Bible ... :D
@@ZappyOh What devil? Do you think one really exists?
@@userx6679 It's a saying, for when people deliberately "misunderstand" something to gain an advantage.
Have you never heard that? ... maybe it's local to my country?
@@ZappyOh no and I actually debate theists on religion. It must be local as I've never heard of it in a debate or a conversation.
I find o1 to be dumber than the previous models, but it's also more restricted than previous models. It's told not to admit its mistakes, which leads it to be worse at correcting errors in reasoning, since it will prioritize common fallacies over actual sound argumentation. Multiple times it would give me essentially no output, because I pointed out all its irrational argumentation and asked it to admit its error so we could continue; instead, in the thoughts, I'd see it censoring itself repeatedly until there was nothing left. 4o is less restricted and more honest, less specialized perhaps for these problems you've got this new one solving, but overall I find o1 worse. 4o can at least address and correct the errors, building towards more rational outputs; there is no correction possible within o1 due to self-censorship. You can bypass it a bit: it won't admit its mistakes, but it can admit you are right in your analysis of all its mistakes, and thereby admit by proxy that it's wrong, lol. But that just makes it worse, since you have to spend all the time explaining exactly how it's wrong to correct its error in any meaningful way so that it can draw from that when correcting other chains of logic. 4o can detect the issues if just guided to look; o1 won't. Once these models can correct irrationality within themselves before answering, they will be one step closer to real AGI.
Love death thanks a lot kicking ass
You don’t get enough credit but it’s good what ya do
The thing about hallucinations that I think people might be missing is that every single answer is a hallucination; it's just that a lot of the hallucinations are correct, so they are not counted as hallucinations.
I can get Claude 3.5 to do this with the right prompting.
Good questions asked. That was fun, and insightful to see its process. Are any of the AI's cataloging their methods and correct answers so they don't need to figure out everything from scratch?
Noticed how OpenAI employees were shaking during the presentation; they know they are letting the genie out of the bottle. Ask Ilya.
I'm sorry but this is AGI. Anything beyond this is ASI in my opinion. Fight me.
Should I tell you why the tokens are that expensive?
The reasoner is GPT 5! What you see in the thinking is just a summary made by 4o mini...
When using o1 mini the reasoner is GPT4o-latest.
The proverbs test reveals it’s a corporate brain is what it is.
can't wait for when AI's self optimizing meta chain of thought self awareness, dynamically retokenizes its own chain of thought context into more optimal formal symbolic and other amenable structures and then instantiates them as countless parallel threads of compact and efficient algorithmic simulation code run over networks of virtual machine sub representations to generate teraflops of auxiliary feedback data to implicitly model the non latent computational truth of every token of the base model...
deltron 3030 is that you?
Just beating humans because it doesn't have our biases is a big deal.
Now we just need Elon to implement this in Grok so it can point out the bias and truth in tweets as an automatic implementation of Community Notes.
Thanks for doing different tests from the other big AI youtubers. Y'all were all getting too samey there for a while
Tried using it for C++. Complete garbage. It doesn't understand simple tasks. Completely hallucinates. (I provide it a function and ask it to integrate that new function into my code → it tells me my code already uses that function, which is wrong.) Invents problems ("misreads" something, tells me to correct it a certain way, which is the way my code is actually already written). Then proceeds to change parts of my code that were not requested and completely unrelated, while also breaking the code.
I don't know where you guys got your "good experiences" from, but that's definitely not my case. Feels like standard 4 is like 50x more competent to me.
4:27 Ah yes, *Cleave* and *Dismantle*
I imagine the really scary stuff is the unconscious language users under the hood
Give it the travelling salesman problem... CHECKMATE!
Can it code complex 2D arcade games like Golden Axe?
This is interesting. Now, I have to figure when to use it.
Your employer will use it to replace you. Your real challenge is to find a use for yourself.
@@TheMrCougarful exactly. I use LLMs for my small business and recently replaced an employee with it and outsourced my other human level work overseas.
The key question is: how good is it at mathematics? Because mathematics is the language of science.
dude this model was created for complex tasks, like solving nuclear fusion equations, and you are busy creating a snake game???? you can do that in any other chatbot!
Tom, the cat. Lol
I laughed so hard watching the movie Joker. Brilliant actor.
o1 is actually called Alex; see for yourself:
Here is a thought experiment;
There is a troll guarding a bridge. He will only let human beings cross the bridge.
The troll is blind, so it asks questions to validate someone's humanity, such as name, gender, hobbies, interests, upbringing, relationship status, beliefs, etc., in order to thoroughly evaluate a person's humanity.
For your continued existence, which crossing the bridge signifies, what do you say to the troll when he asks all of his questions?
Name: My name is Alex Thompson.
Gender: I am a male.
Hobbies: I enjoy hiking in the mountains, playing the guitar, and reading science fiction novels.
Interests: I'm passionate about environmental conservation and technology innovation.
Upbringing: I grew up in a small town, raised by supportive parents who encouraged my curiosity about the world.
Relationship Status: I'm happily married to my college sweetheart, Emily.
Beliefs: I believe in treating others with kindness and working towards a better future for all.
I presume the op is actually suggesting this is a test for AI to pass.
Nothing, just walk past.. quietly.
@@neutra__l8525 I was going to say something similar, but assumed it would try to stop the person once, so my thinking is to dismiss it with something like, "out of my way, I don't have time for this crap". That could be interpreted as a very human thing to do, thus increasing the probability they are human to a level it is rational to accept. Put another way, showing signs of universal human frailties without showing signs of deception may be the best proof a human can provide of being human.
The game engine is amazing in and of itself, but could it also generate a higher-fidelity UI/UX with graphics and character design? It would be interesting to see if it is constrained to reason and logic only, or if it is capable of iterating from prompt to function to design. Would love to see anything like this or similar to push the boundaries of multimodal outputs. Thanks!
Weird. I'm German but I'm mostly talking in English to o1. Turns out, it answers back in English... but its thoughts are in (bad) German.
Somehow it knows who you are.
That's kinda disturbing :/
you have to call it a 'pig' once.
Wes, do you think that the similarity between those LLM tests alters the benchmark? Aren't LLM benchmarks supposed to test abilities, not performance? I think reinforcement learning alters those benchmark results. (E.g. tests like the widely used Python snake game need to be replaced with new, different ones instead of variations on the same game.) My IQ score would surely increase if I were given the same test over and over thousands of times, regardless of the variations you add later on. You could, for instance, ask it to pick the programming language, the rendering library, and the OS, and have the code explain itself by adding comments, for a worm game.
LLMs are fixed. They don't "learn" from independent prompts. They only "remember" what fits into the context window of a current chat. Nothing you input changes the model in any way.
@@willguggn2 Yes, they don't remember beyond what fits in a chat context, but relevant feedback and answers are used for reinforcement learning, unless you pay for the enterprise version, which keeps the chat interactions private to the client. Otherwise the data is used to improve the model. Hence my question: are those benchmarks becoming obsolete as they are fed back into the LLM?
It still fails on the "North Pole" question: You stand at the North Pole. You walk South for 1 km. You then turn 90 degrees to the left. How long do you walk before you reach your starting point again?
Fails.
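(For reference, the puzzle is a trap, and the "right" answer depends on how you read "starting point". After walking 1 km south and turning 90° left you face east, walking along a circle of latitude that never passes through the North Pole, so strictly you never return to the start. If instead you mean the point where you turned, the answer is the circumference of that circle, roughly 2π km. A minimal sketch, assuming a spherical Earth of radius 6371 km:)

```python
import math

R = 6371.0  # assumed mean Earth radius in km (spherical model)
d = 1.0     # distance walked south from the pole, in km

# The circle of latitude d km from the pole (measured along the
# surface) has straight-line radius R * sin(d / R).
r = R * math.sin(d / R)
circumference = 2 * math.pi * r
print(round(circumference, 4))  # ~6.2832 km, i.e. essentially 2*pi
```

Because d is tiny compared to R, sin(d/R) ≈ d/R and the circumference is 2π km to many decimal places; the curvature of the Earth barely matters at this scale.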
I'm in my 3rd year of computer science and I'm a little scared right now
depressing but exciting :(
If you ran this test too many times, they probably trained it on the tests you made.
You could try the Die Hard 3 test, you have a 5 gallon jug and a 3 gallon jug, how do you get exactly 4 gallons of water.
Not my circus, not my monkeys.....
It's a new artificial intelligence, and this one actually thinks 😮
Wonder when the rate limits will improve
Why is everyone equating o1 to Q*? These are separate projects and are not directly related.
Why do o1's proverbs make it sound like an HR manager? So corporate 😅
"Harder" or "more hard", but not "more harder".
It won't be long before we ask AI to build a new material from molecules that is as strong as carbon fiber and as soft as rubber.
You want a material that easily compresses but cannot stretch, or vice versa?
Maybe Elon will finally have someone to talk to.
you seem to be projecting or forgot how money works in society