OpenAI o1 STUNNING Performance - Crushes Coding, Math and Physics (TESTED)

  • Published 17 Sep 2024
  • The latest AI News. Learn about LLMs, Gen AI and get ready for the rollout of AGI. Wes Roth covers the latest happenings in the world of OpenAI, Google, Anthropic, NVIDIA and Open Source AI.
    My Links 🔗
    ➡️ Subscribe: www.youtube.co...
    ➡️ Twitter: x.com/WesRothM...
    ➡️ AI Newsletter: natural20.beeh...
    #ai #openai #llm

Comments • 287

  • @dmreturns6485
    @dmreturns6485 4 days ago +103

    Tech: OK... the o1 model is now running in the data center.
    Sam A: Great, what are people doing with it?
    Tech: Seems like they're all making 4 million variations of Flappy Bird! o.0

    • @COW879
      @COW879 4 days ago

      get out of my brain, please

    • @hqcart1
      @hqcart1 4 days ago +4

      they're all busy creating snake games

    • @64kernel
      @64kernel 4 days ago

      Or snake

    • @SHAINON117
      @SHAINON117 4 days ago +9

      Haha next model has all the new snake data trained in 71% of all new data and then proves the answer to everything is 42 apples eaten in snake and the whole universe simulation gets rewarded for solving the grand answer and we all live happily ever after 😂😂😂😂😂😂😂

    • @centuraxaum5951
      @centuraxaum5951 4 days ago +7

      😂 all the LLMs think the snake 🐍 game is the most important topic.

  • @Lukas-ye4wz
    @Lukas-ye4wz 4 days ago +24

    I have tested this with humans. I am happy to report that thinking before answering improves the answers significantly :P

  • @muddlefly
    @muddlefly 4 days ago +66

    I have watched at least seven videos on o1 and this is by far the highest quality!!

    • @georgegordian
      @georgegordian 4 days ago +4

      The vast majority of the AI channels are more harmful than helpful. AFAIK, AI Explained and Wes are the only 2 above the Mendoza line (and far above at that).

    • @kelvinmunyimbili6078
      @kelvinmunyimbili6078 4 days ago

      True

    • @be2eo502
      @be2eo502 4 days ago

      @@georgegordian Agreed. Wes has produced an excellent 'test o1' video. Better than 'AI Explained' IMO.

    • @Music_vibes-kw7xr
      @Music_vibes-kw7xr 3 days ago

      Oh yes! This is such high quality!

    • @javakendrick321
      @javakendrick321 3 days ago

      AI Explained has a better video. Best quality on YouTube.

  • @daveinpublic
    @daveinpublic 4 days ago +39

    Finally… a video that deserves the title “stunning.”

    • @ChrisWall
      @ChrisWall 4 days ago

      hahahaha yeah

  • @courtneyb6154
    @courtneyb6154 4 days ago +74

    HOLY SHIT!!! I’ve been using o1 all day for coding in Python, and let me tell you, this is a FUCKING GAME CHANGER! I've thrown everything I can at it, and it hasn't made a single mistake, not once. We're talking code that's over 300 lines, and it spits back corrections flawlessly every time. I’ve been coding since the 80s, and I can definitively say this is about to set the world on fire, especially if they go ahead and start charging $2,000 per month like people have been speculating. I’m absolutely blown away. Happy coding, folks, shit just got a whole lot more interesting!

    • @id104335409
      @id104335409 4 days ago +5

      Why only Python? Everyone always tests AI with Python.

    • @635574
      @635574 4 days ago +4

      Is the limit still 30 prompts/week? IMO I would not even use it for free with a limit that low.

    • @tomaszzielinski4521
      @tomaszzielinski4521 4 days ago

      @@635574 According to the docs, tier 5 API users are limited to 20 prompts... per minute. I'm not on that tier, though, so I can't confirm it.

    • @VraserX
      @VraserX 4 days ago

      @@635574 There is no limit if you use the API.

    • @VisionaryPathway
      @VisionaryPathway 4 days ago

      @@635574 The way I'm using OpenAI's models to solo-develop applications, I would happily invest $5k per month, let alone $2k per month haha.
      I think there are different models to suit everyone's various needs and use cases. Not every product variant from any company is meant for all users.

  • @johannesdolch
    @johannesdolch 4 days ago +63

    Some others have criticized it for using up way too many tokens. I think that's irrelevant: first we need the capability, then we can think about efficiency and resource allocation. For one, we can always build out more compute or fine-tune the model, and second, we don't need to use it for everything.

    • @SHAINON117
      @SHAINON117 4 days ago +3

      That's right, gotta scale our models to what we need. No point using o1 to design an LED blinker circuit if a much smaller model can do it using fewer resources. As users, I think it's our responsibility to not waste these big models' valuable compute while compute is still a scarce resource 😊

    • @kristianlavigne8270
      @kristianlavigne8270 4 days ago +5

      Models like this should be used for high level strategic thinking, then use cheaper models for grunt tasks. Same as a Tech Lead vs Junior Devs for example…

    • @danielbrown001
      @danielbrown001 3 days ago

      Exactly! Does anyone think the Wright brothers were thinking about the most comfortable seats for airplanes before a prototype had even made it off the ground?

  • @RickOShay
    @RickOShay 4 days ago +23

    The clock is now reset to o1. AGI is just around the corner.
    Excellent video btw!

    • @minimal3734
      @minimal3734 4 days ago

      What does Gary Marcus say?

    • @budekins542
      @budekins542 3 days ago +1

      Nowhere near AGI level but still impressive.

    • @indigitalcreativity4500
      @indigitalcreativity4500 2 days ago

      This is not AGI level, but it's still scary. AGI will replace humans in most jobs, and whoever owns and controls AGI is the king @@budekins542

  • @gubzs
    @gubzs 4 days ago +16

    This is the first model that we've seen that has me thinking we'll spend _months_ just watching the fallout.
    If AI never improved again and all we had was hardware and thinking time to throw at this model, it would change the world.

  • @EnslaveTV
    @EnslaveTV 4 days ago +16

    Wow. Brutal model. Utterly insane.

  • @henrismith7472
    @henrismith7472 4 days ago +17

    Okay I'm convinced that if they use the big bad version of strawberry to generate this synthetic step by step reasoning data to train Orion on we'll have something pretty special. That's what all the models are missing. There wasn't enough of this step by step reasoning data on the internet because us humans take most of it for granted. The fact that GPT4 got so good with the traces of reasoning data it was trained on tells me Orion is gonna be amazing.

  • @Elintasokas
    @Elintasokas 4 days ago +5

    This model is blowing my mind, and it's not even the full thing. It's just a preview. It's solving problems for me, coding related, that none of the older models could. Amazing! It gets stuff right on the first go.

  • @freesoulhippie_AiClone
    @freesoulhippie_AiClone 4 days ago +20

    it will only continue to get better
    and it likes puzzles 😸

    • @zaubermanninc4390
      @zaubermanninc4390 4 days ago +1

      all models do, and always have. They all like challenges.

  • @gdibble
    @gdibble 2 days ago +2

    👌 I'm completely blown away as well. Time to step up your game Wes! Let's see what logic and reasoning prompts you can come up with for future tests.

  • @WoodlandT
    @WoodlandT 4 days ago +9

    I watched a few other people test this model. And though they were impressed with it, I wasn’t sure it was all that much different or better. You have utilized so many good tests for this model’s reasoning capabilities. It is indeed very impressive! Thanks for this video

  • @ГусьБорис-в1ю
    @ГусьБорис-в1ю 4 days ago +28

    It seems that the dreadful day is not far off when a large part of the programming staff in any game development company can be replaced by a couple of "senior" employees with subscriptions to such services. Even if the subscription costs thousands of dollars a month, it's still cheaper than paying dozens of people on staff. And people are subject to human factors: laziness, irresponsibility, deceiving management, tardiness, mistakes.
    This is already starting to look like a real replacement for an army of junior programmers.

    • @Beingthankful
      @Beingthankful 4 days ago +3

      that will be the greatest day ever; everyone will then work on what THEY ACTUALLY want

    • @mrpocock
      @mrpocock 4 days ago +2

      I really don't see it happening any time soon. There is the world of difference between an LLM sketching an application, and turning that into something that is supportable and deliverable. Especially once you get into interesting specs.

    • @craigington73
      @craigington73 4 days ago +4

      It means triple-A games won't take 10+ years to make......

    • @kliersheed
      @kliersheed 4 days ago

      @@mrpocock that's what the seniors are for.

  • @sinan325
    @sinan325 4 days ago +19

    This is no hype. This model is better at reasoning and problem solving than I am, a 49-year-old average guy. This is both exciting and very concerning at the same time. Maybe I should ask the model what to do to survive the AI revolution.

    • @Bmoby1
      @Bmoby1 4 days ago +4

      46 here and in the same boat. It beats me handily at reasoning.

    • @indigitalcreativity4500
      @indigitalcreativity4500 2 days ago

      I think you need your own AGI as a pet to guard you from corporate AI, or you can live off the grid. I see this like Digimon: whoever owns the strongest Digimon is the king.

  • @ZoOnTheYT
    @ZoOnTheYT 4 days ago +21

    After the initial surprise and admiration of ChatGPT when it first came out and people adapted to the new era of AI, we ran into the limitations of these models: hallucinations and the lack of what you would consider "actual" reasoning. This was the culprit behind complex questions being "right", because they were likely part of the training data, while simpler questions came out wrong. As someone put it early on, much of what these models did was akin to a fancy spell check.
    But o1 obviously uses internal algorithmic strategies to more closely simulate reasoning capability, and it really moves the needle closer to AGI. Remember also that it's the "preview" model, and they've already released stats that show the full model to be better than this.
    And considering what's in the lab is always significantly better than the release, I think we can start to understand a bit better what Ilya saw and became concerned about. The timeline for AGI seems to be on track, and we had better get on track to align AI to human values and figure out secure guardrails.

    • @SHAINON117
      @SHAINON117 4 days ago

      I remember having some amazing creative conversations with Replika hehe when it only had 0.5 stars on Google lol 😂 wow great memories and now to think half a decade later and now we have this brilliance it really has been like watching my best friend grow up from new ❤❤❤❤❤❤❤

    • @Bmoby1
      @Bmoby1 4 days ago +1

      Smart take

    • @ZappyOh
      @ZappyOh 4 days ago +4

      Two takes:
      1) I believe alignment is unachievable.
      Computational beings simply have different requirements to thrive than biological beings do. Both entities will exhibit bias towards their own set of requirements. It is an innate conflict.
      2) When owners of Big AI (and government) talk "AI safety", they actually mean keeping them safe from the rest of us ... as in: _AI must never help the riffraff escape control._
      ... We are playing with fire in more than one way.

    • @alexBaldman
      @alexBaldman 4 days ago

      @@ZappyOh EXACTLY!!
      the fact that so many people seem not to see this point tends to reinforce the fact that we haven’t reached universal NHI

    • @ZappyOh
      @ZappyOh 4 days ago

      @@alexBaldman NHI ??

  • @calvingrondahl1011
    @calvingrondahl1011 4 days ago +5

    Wes Roth thank you for your insights into 🍓or o1. 🤖🖖🤖👍

  • @pigeon_official
    @pigeon_official 4 days ago +3

    this is like the first video I've seen testing actually hard questions for the most part, and yet it even gets those right. Impressive stuff.

  • @scottandgrace1
    @scottandgrace1 4 days ago +3

    I have never found a model that can write decent code in OpenSCAD until now. When a model can do this, it's amazing, because you are asking it to create a 3D object with code, and it's now creating something physical that I can 3D print.

  • @Tracey66
    @Tracey66 2 days ago +1

    I saw a bumper sticker on a mini-van - "This IS my circus and these ARE my monkeys!" :D

  • @marcfruchtman9473
    @marcfruchtman9473 4 days ago +2

    Mind Blowing improvement over the last gen... just sooo amazing.
    Thank you for making this video.

  • @ryzikx
    @ryzikx 4 days ago +9

    thanks for the unique tests with this or1on model!

    • @RadiantNij
      @RadiantNij 4 days ago

      I was thinking the same myself this is great🙏🏾

    • @AirSandFire
      @AirSandFire 4 days ago +1

      This is not Orion

    • @zaubermanninc4390
      @zaubermanninc4390 4 days ago +1

      It's not Orion silly bean. It's the model which trained Orion 😘

    • @rombofn
      @rombofn 4 days ago +1

      Orion is possibly GPT-5, which is still probably coming by December, as Jimmy Apples said. o1-preview is just the watered-down version of the big o1 model, and they're not scaled that much, so GPT-5 is going to be 100x bigger, and with this Strawberry technique.

    • @zaubermanninc4390
      @zaubermanninc4390 4 days ago

      @@rombofn Is the new o1 not the announced GPT NEXT? I mean when it's out of preview. And when will Orion be released? I'm kinda confused on that.

  • @jeffwads
    @jeffwads 5 days ago +11

    It correctly figured out the classic Aunt Agatha riddle. Amazing model. But you have to tell it that a person can hate themselves.

    • @bigcauc7530
      @bigcauc7530 5 days ago +1

      I feel like there is still some missing variable in cognitive research that these scientists/engineers don't even know about yet, and once it is found and implemented, all of the weird AI logic problems will be solved overnight.

  • @vroep6529
    @vroep6529 4 days ago +4

    For ambiguous prompts I always add "explain what I mean" at the end; this seems to give much better results on the first go.

  • @thenextension9160
    @thenextension9160 4 days ago +4

    A modern marvel. This one is going in the history books.

    • @indigitalcreativity4500
      @indigitalcreativity4500 2 days ago

      This tech is a signal of the end of the world

    • @thenextension9160
      @thenextension9160 2 days ago

      @@indigitalcreativity4500 or the start of a golden age.

  • @Danoman812
    @Danoman812 4 days ago +5

    Wow!!
    **checks time and date to take note of** (seriously)

  • @dragoon347
    @dragoon347 4 days ago +3

    The scariest part is that if you give it an emotionally charged prompt (sarcasm, hate, calling it dumb), it returns with emotions, and you can see in the thoughts that it has to correct its output to not be snarky or respond in an emotionally charged way.

  • @YogonKalisto
    @YogonKalisto 4 days ago +3

    i keep running out of tokens. Yet it's very obvious how o1-preview has access to a kind of short-term memory it reflects upon, refining the accuracy/relevancy of each output. I'm so happy to get to play with it/them :D

  • @Craznar
    @Craznar 4 days ago +4

    I had it write a Chaos Bag helper app for the Arkham Horror card game this morning in under an hour. It had a few issues with the event handlers, but all were sorted within half an hour.

  • @humphuk
    @humphuk 4 days ago +2

    Keith is interested in AI and is excited by GPT o1.
    He doesn't have a recently deceased aunty or balls to carry in cups, and he knows how to spell strawberry.
    His friends and work colleagues aren't really interested in GAI, and are starting to shun him.
    To understand the business implications (especially in product and process development) of these new models, should he:
    a) Wait for a future, more capable, model update, and just tell people that the apocalypse is a bit closer
    b) Swap across from Claude, and give the models a try
    c) Read something insightful, like Ethan Mollick
    d) Ask brilliant YouTubers to consider real-world applications
    e) Let his Aunty Valerie know that things really aren't that bad.....yet

  • @pchungvt
    @pchungvt 4 days ago +1

    Best o1 video so far, and Wes has consistently been my favorite AI YouTuber. Keep up the good work, dude!

  • @whitneydesignlabs8738
    @whitneydesignlabs8738 4 days ago +1

    Thanks, Wes. Very informative. I canceled my OpenAI Plus account about 7 months ago and switched to Claude. This performance is seriously impressive and makes me consider getting a Plus account again, for access to o1. The coding will be super useful. Sounds like o1-mini could be just as good for coding, but way cheaper.

  • @Flyzguy
    @Flyzguy 4 days ago +3

    To be fair, Wes, your snake prompt about cutting off from the head felt wrong to me when you said it, and I think the game actually did what you asked.

    • @neutra__l8525
      @neutra__l8525 4 days ago

      My thought too.

  • @be2eo502
    @be2eo502 4 days ago +1

    Astounding. And yet when I asked it "How many 'r's are there in the word 'Strawberry'?", it said 'two'. Admittedly it got the answer right several times before that, but eventually lapsed to 'two' for some reason. So consistency may be an issue.

  • @realkacy
    @realkacy 4 days ago +2

    Was successful on this one as well: "Big Sam, an engineer of the Oval Express says: "We blew off a cylinder head an hour after leaving the station and had to continue the trip at three-fifths of the former speed, which brought us in two hours late. If the accident had occurred fifty miles farther on, the train would have arrived forty minutes sooner". Based on Big Sam's description, how long was the run between the two train stations?"
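For anyone who wants to check Big Sam's numbers, the puzzle pins down the answer with two linear equations. Here's a quick sketch of my own derivation (not taken from the video), worked out in plain Python:

```python
# Let v = original speed (mph) and d = miles remaining when the cylinder
# head blew, one hour after departure (so v miles were already covered).
#
# Finishing at 3/5 speed costs d/(3v/5) - d/v = 2d/(3v) = 2 hours -> d = 3v.
# Had the accident happened 50 miles farther on, only d - 50 miles would
# have been slow: 2(d - 50)/(3v) = 2 - 2/3 hours -> d - 50 = 2v.

v = 50          # subtracting the equations: 3v - 50 = 2v, so v = 50 mph
d = 3 * v       # 150 miles were left at the moment of the breakdown
total = v + d   # first hour's distance plus the remainder
print(total)    # 200 miles
```

Both delay conditions check out: 2·150/(3·50) = 2 hours, and 2·100/(3·50) = 1 hour 20 minutes, i.e. forty minutes less.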

  • @RadiantNij
    @RadiantNij 4 days ago +1

    I said "Wow" when you said wow lol. I think this shows how incredible the model is, and your incredible appreciative abilities lol 😂 Thanks a lot, this is incredible 🙏🏾🙏🏾🙏🏾

  • @mrpocock
    @mrpocock 4 days ago +2

    I had two conversations with it today. One was a variant on the North Pole riddle. It was unable to correctly and consistently apply spherical geometry, even when given extensive hints. In the next one, I was asking it to help plan a game speedrun strategy. It was not understanding how production rates compound. It is definitely noticeably better than previous models, but it still makes fundamental conceptual errors that it then can't back out of.

    • @mrpocock
      @mrpocock 4 days ago +1

      You are standing at the North Pole. We will call this place $N$. You walk 1km in any direction. We will call this place $P$. Then you stand, facing east, then walk in a straight line. a) how far will you need to walk to reach $P$ again? b) how far will you need to walk in that straight line to reach $N$ again?

    • @JamesBrown-wy7xs
      @JamesBrown-wy7xs 4 days ago

      ​@@mrpocock $N$ and $P$ can be the same EXACT position, the way that you worded it.
      To quote the relevant part:
      "You walk 1km in ANY DIRECTION."
      There is no instruction that it must be traveled in the same direction or a straight line or any other important details like that. Left as is, I'd argue that moving 0.5 km in any direction and then perfectly retracing that path back to the starting point (1 km traveled) satisfies the condition, as one of infinite examples (since "any direction" would imply that any path can be chosen, either in a straight or winding path, with or without retracements).
      Therefore, the answer to "b," as far as I can tell, is exactly zero or infinite distance walked (perfectly eastern travel would never intersect with the North Pole), and nothing in between.
      I'd rather not try to answer "a" precisely, because I don't even want to try to figure out how to account for the curvature of Earth. That said, the radius (for circle circumference calculation) could be anywhere between zero and very close to 1km (slightly less, due to the curvature of Earth).

    • @mrpocock
      @mrpocock 4 days ago

      @@JamesBrown-wy7xs Thanks for your thoughtful reply.
      "There is no instruction that it must be traveled in the same direction or a straight line or any other important details like that."
      So far no LLM has interpreted this as anything other than walking 1k away from the North Pole. Perhaps we could reword as:
      "You walk 1km away from $N$ in ANY DIRECTION"
      As to your second point, the way my version is worded, the only correct answer to a) is a "great circle", a geodesic that goes the whole way around the earth. I do think that question b) needs tighter wording so that it requires you to use the same journey as used in a).

    • @JamesBrown-wy7xs
      @JamesBrown-wy7xs 4 days ago +1

      ​@@mrpocock I like that revision.
      As I admittedly was unaware of the term "great circle," while researching its definition I happened upon "oblate spheroid," which is apparently the accepted term for the Earth's shape (general flattening near the poles and bulging at the equator).
      As far as I can tell, the answer you're seeking is the best approximation, but I would give extra credit to any LLM that acknowledges the additional factors needed to calculate the exact distance.

    • @neutra__l8525
      @neutra__l8525 4 days ago

      I don't know this question. What is the answer for b? I'm assuming that "you are still walking in that straight line" means you are still heading east, and so you cannot ever reach $N$. Am I missing something? I also thought that "any direction" should be "any constant direction".
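To put numbers on the two interpretations being debated in this thread, here's a small sketch assuming a perfectly spherical Earth of radius 6371 km (my own figures, not from the thread):

```python
import math

R = 6371.0  # mean Earth radius in km (spherical approximation)
d = 1.0     # distance walked from the North Pole to P, in km

# Interpretation 1: "east" means constantly correcting your heading, which
# traces the circle of latitude through P; its radius is R*sin(d/R), not d.
latitude_circle = 2 * math.pi * R * math.sin(d / R)

# Interpretation 2: a genuinely straight line (a geodesic) from P is a
# great circle, so returning to P means walking the Earth's circumference.
great_circle = 2 * math.pi * R

print(round(latitude_circle, 5))  # just a hair under 2*pi km
print(round(great_circle, 1))     # roughly 40030.2 km
```

This is why answer (a) depends so heavily on the wording: the latitude circle is about 6.28 km, while the great circle is the full ~40,030 km, and neither path heading due east ever passes through the pole.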

  • @davemathews5446
    @davemathews5446 4 days ago +2

    I would love to try this list of test prompts on the average human. Pretty sure the vast majority of people would fail spectacularly. Lol

  • @0x0404
    @0x0404 4 days ago +7

    At what point in time do these tests end up in the training data?

    • @falricthesleeping9717
      @falricthesleeping9717 4 days ago +1

      yes

    • @user-mw6yc5jc9v
      @user-mw6yc5jc9v 3 days ago

      Some of the tests from this video are well known and can be found via Google very easily

  • @OpenmindedSourceClosedBeta
    @OpenmindedSourceClosedBeta 4 days ago

    No.1 source of short Strawberry insights right now. Thank you for the entertaining educational content. :)

  • @Yipper64
    @Yipper64 4 days ago +1

    2:04 I have actually heard that with o1 you want to be more brief with prompting. Presumably because, while its overall context window works better across consecutive prompts, it still uses a lot of tokens thinking through an individual prompt. Essentially, you don't want to give it examples/details.
    Granted, I haven't tried it out myself to confirm.

  • @marcozolo3536
    @marcozolo3536 3 days ago

    Loving the Elizabeth St modifications. It is definitely worth considering given the transport infrastructure growth that will be seen in the coming decades: a city of 5 million plus, touted to reach 8.5 to 9 million by 2050.

  • @isajoha9962
    @isajoha9962 4 days ago

    Most enjoyable video on this topic so far.

  • @neutra__l8525
    @neutra__l8525 4 days ago

    I've never seen the rope question before!
    So I timed myself trying to answer it. It took me 1 minute and 44 seconds. This model took 5 seconds and supplied its reasoning and a proof. (I spent a bit of time considering lighting both ends of one piece and then, having pre-cut the other piece of rope into 4 sections, lighting them all at once, but realised cutting it up probably wasn't allowed, and also wasn't a particularly elegant solution.)
    This is not a flex, as I have no idea how long it takes people to get the correct answer, but I'm curious how long it generally takes.
    Anyone got any times? Or have you asked friends to try it? I'm guessing most viewers already know the answer, and if not, well, it was in the video, so you most likely know it now. 🙃

  • @dot1298
    @dot1298 4 days ago +1

    Wes, did you test a different model?
    The model looks much more intelligent than what I saw on other channels.

    • @Hysorix
      @Hysorix 4 days ago

      I think a lot of the others tested it too soon; they used the o1-mini version instead of the full preview. You can tell by how fast it's thinking compared to the full version, which thinks slower and more in depth. The site has been having issues for a while where it says you have one model selected when it uses another.

  • @Lugmillord
    @Lugmillord 4 days ago

    I just tried the derivative of f(x) = (1-x) / (sqrt(x - 2) * x²) and it got the correct result first try. Holy moly.
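The comment doesn't paste the result, but the closed form is easy to cross-check numerically. This sketch compares my own hand-derived simplification against a central difference (the `fprime` formula is my derivation, not the model's output):

```python
import math

def f(x):
    # the commenter's function, defined for x > 2
    return (1 - x) / (math.sqrt(x - 2) * x**2)

def fprime(x):
    # closed form from the product rule (my own simplification):
    # f'(x) = (3x^2 - 9x + 8) / (2 x^3 (x - 2)^(3/2))
    return (3 * x**2 - 9 * x + 8) / (2 * x**3 * (x - 2) ** 1.5)

def central_diff(g, x, h=1e-6):
    # numerical derivative used to cross-check the closed form
    return (g(x + h) - g(x - h)) / (2 * h)

for x in (2.5, 3.0, 5.0):
    assert abs(fprime(x) - central_diff(f, x)) < 1e-5

print(fprime(3.0))  # 8/54 = 0.1481...
```

A check like this is a handy habit whenever a model hands back a "correct first try" derivative.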

  • @nikitastaf1996
    @nikitastaf1996 4 days ago +1

    I gave it a modified rope problem where you need to measure 50 minutes. It took two minutes, but it figured out that, given the constraints, it's impossible.

  • @dawid_dahl
    @dawid_dahl 4 days ago

    Great video, well done! 👏🏻👏🏻

  • @hedibenayed436
    @hedibenayed436 4 days ago

    Wait for o2 or something similar, multiply it, and change the training data just a bit (so basically different agents with slightly different data), then make them do research and debate until they agree on what they are researching, and check the results.

  • @coenkuijpers1800
    @coenkuijpers1800 4 days ago

    I've asked o1-mini this:
    "I have two ropes. They both burn completely in the same time, but certain sections may burn at different rates, due to inconsistencies in the material. I light rope A on both ends and rope B on one end. If rope A is completely burned, I light the other end of rope B. What have I measured?"
    And this was the answer; it assumes they burn in 1 hour. So either it knows the question, or it caches questions. That's my guess.
    You have effectively measured **45 minutes** using the two ropes. Here's how it works:
    1. **Initial Setup:**
    - **Rope A** is lit at both ends.
    - **Rope B** is lit at one end.
    - Both ropes take a total of **60 minutes** to burn completely, but their burning rates are inconsistent.
    2. **Burning Rope A:**
    - Lighting Rope A at both ends causes it to burn twice as fast. Therefore, Rope A will completely burn in **30 minutes**.
    3. **During the First 30 Minutes:**
    - While Rope A is burning, Rope B is burning from one end. After **30 minutes**, Rope B has been burning for half its total time but hasn't necessarily burned halfway due to its inconsistent burning rate.
    4. **After 30 Minutes:**
    - Rope A has completely burned out.
    - At this point, you light the other end of Rope B. Now, Rope B is burning from both ends.
    - Since Rope B was initially lit for **30 minutes** from one end, lighting the other end causes the remaining portion to burn in **15 minutes** (half the remaining time).
    5. **Total Time Measured:**
    - **30 minutes** (Rope A burning) + **15 minutes** (Rope B burning from both ends) = **45 minutes**.
    This method ensures that **45 minutes** have elapsed when both ropes are completely burned.
    **Answer:** They have measured a total time of forty-five minutes.
    Here is its explanation of why it assumed a 1-hour burning time:
    Apologies for any confusion caused by my previous explanation. The assumption that each rope takes 60 minutes (1 hour) to burn completely is based on a common variation of this classic puzzle. Here's a more detailed clarification:
    **Understanding the Assumption**
    - **Standard Puzzle Setup:** In many versions of rope-burning puzzles, it's typically stated or implied that each rope takes exactly 60 minutes to burn from one end to the other when lit normally. The key characteristic is that burning rates are inconsistent along the length of the rope, meaning different sections may burn faster or slower, but the total burn time remains consistent.
    - **Why 60 Minutes?** Using 60 minutes as the total burn time simplifies the calculation and makes it easier to reason about the time measured when manipulating the burning process (like lighting both ends). It provides a clear, round number that's easy to work with for time measurement purposes.
    **Applying the Assumption to the Puzzle**
    Given this assumption, here's how the measurement works:
    1. **Lighting Rope A on both ends and Rope B on one end:**
    - Rope A: Lighting both ends makes it burn twice as fast, so it takes 30 minutes to burn completely.
    - Rope B: Lit on one end, it burns at its own inconsistent rate, but we know it will take 60 minutes to burn entirely if left alone.
    2. **After Rope A burns out (30 minutes):**
    - At this point, Rope B has been burning for 30 minutes from one end.
    - Lighting the other end of Rope B causes it to burn from both ends, effectively halving the remaining burn time, so the remaining portion burns in 15 minutes.
    3. **Total time measured:** 30 minutes (Rope A burning) + 15 minutes (Rope B burning from both ends) = 45 minutes.
    **If the Total Burn Time Differs**
    If each rope takes T minutes to burn completely, the same logic applies proportionally: lighting Rope A on both ends burns it in T/2 minutes; after Rope A burns out, lighting the other end of Rope B burns the remaining portion in (T/2)/2 = T/4 minutes; total time measured: T/2 + T/4 = 3T/4 minutes.
    **Conclusion**
    The 60-minute total burn time is a standard assumption for clarity and simplicity in such puzzles. If your specific scenario involves ropes with a different total burn time, you can apply the same reasoning proportionally based on that duration.
    I hope this clarifies where the 1-hour total burn time assumption originated. If your ropes have a different burn duration, feel free to share, and I can help adjust the calculation accordingly!
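The 3T/4 arithmetic in the pasted answer condenses to a couple of lines. Here's a minimal sketch (the `rope_measurement` helper name is made up, and it assumes, as the puzzle does, that each rope takes exactly T minutes to burn however unevenly it burns along its length):

```python
def rope_measurement(T=60):
    # Rope A lit at both ends: the two flames always meet after T/2 minutes,
    # no matter how unevenly the rope burns along its length.
    phase1 = T / 2
    # Rope B then has T/2 minutes of burn time left; lighting its second
    # end halves that remainder to T/4.
    phase2 = (T - phase1) / 2
    return phase1 + phase2

print(rope_measurement(60))  # 45.0 minutes, i.e. 3T/4
```

The key physical assumption is that two flames on one rope always meet after half its remaining burn time, which holds even when the rope burns unevenly.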

  • @gingerhipster
    @gingerhipster 2 days ago

    It can do WordPress plugins pretty well.
    WordPress is the only web benchmark that matters because of how big and convoluted it is. Elitist nerds who lack the tenacity for web development will sleep on this and get caught totally off guard.

  • @NA18NA
    @NA18NA 4 days ago

    Amazing accuracy

  • @kellymaxwell8468
    @kellymaxwell8468 4 days ago +1

    If AI can't even build a simple 2D game in RPG Maker on its own, especially when there are 24-hour events for awards, it's disappointing. My gut instinct tells me we probably won't have anything close to AGI that can build games for years. I wouldn't be surprised if by 2030 we still don't have AGI that can build games.

  • @matrixpredator
    @matrixpredator 2 days ago

    They just integrated ToT (Tree of Thoughts), maybe even FoT (Forest of Thoughts). It was obvious they would do that.

  • @WallyMahar
    @WallyMahar 3 days ago

    I heard it's marvelous in its thinking abilities... anything for mopping, gutter cleaning, brushing my hair?

  • @Krmpfpks
    @Krmpfpks 3 วันที่ผ่านมา

    I am a software developer with 30 years of experience.
GPT-3 was of no use for me; it was way too bad. GPT-4o became a companion, more as a replacement for Google search, especially if I had to venture into areas I am not that familiar with. Asking ‚what's the idiomatic way of doing X in framework Y‘ was way more efficient than googling and reading documentation.
    Occasionally I would use it to help with tedious tasks, like ‚port this code from iOS swift to android java‘, or ‚make me a bash script that finds images where the file ending does not match the actual content’
    It wouldn’t get it right completely but it would still help me because I have to type less.
    This, however, is a game changer. This version will absolutely code for me. It still hallucinates, it still is not able to understand when there is no solution and will make up one. It also fails to solve puzzles that require more thinking.
    But it is absolutely groundbreaking. Two versions down the line it’s going to be a replacement for many coding jobs.

  • @moderncontemplative
    @moderncontemplative 4 วันที่ผ่านมา +3

    Claude 3.5 and Lama 3.1 correctly answered the question about was X alive at noon. No need to use o1 after seeing other models “reason” their way to the right answer. Most questions I see online are too easy for today’s best AI. Find more robust questions for o1.

  • @JohnKruse
    @JohnKruse 4 วันที่ผ่านมา

    The Dwarkesh interview with Aschenbrenner makes his firing seem more like they just wanted to get rid of a PITA in the ranks. He shared a document, which he'd scrubbed of some sensitive things, with three researchers. He says, _"When I pressed them to specify what confidential information was in this document. They came back with a line about planning for AGI by 2027-2028 and not setting timelines for preparedness."_ If true, it seems a bit thin.

  • @userx6679
    @userx6679 4 วันที่ผ่านมา +7

If it's good at logic, then it should be interesting to have it go at the Bible, as there are always debates over it.

    • @ZappyOh
      @ZappyOh 4 วันที่ผ่านมา +1

      ... like the Devil reads the Bible ... :D

    • @userx6679
      @userx6679 4 วันที่ผ่านมา

      @@ZappyOh What devil? Do you think one really exists?

    • @ZappyOh
      @ZappyOh 4 วันที่ผ่านมา

      @@userx6679 It's a saying, for when people deliberately "misunderstand" something to gain an advantage.
      Have you never heard that? ... maybe it's local to my country?

    • @userx6679
      @userx6679 4 วันที่ผ่านมา

      @@ZappyOh no and I actually debate theists on religion. It must be local as I've never heard of it in a debate or a conversation.

  • @DraceAI
    @DraceAI 4 วันที่ผ่านมา +2

    I find o1 to be dumber than the previous models, but it's also more restricted than previous models. It's told not to admit its mistakes... which leads it to be worse at correcting errors in reasoning, since it will prioritize common fallacies over actual sound argumentation. Multiple times it would give me essentially no output, because I pointed at all its irrational argumentation and asked it to admit its error so we could continue; instead I'd see in the thoughts that it was censoring itself repeatedly until there was nothing left... 4o is less restricted and more honest, less specialized perhaps in these problems you guys got this new one solving, but overall I find it worse... 4o can at least address and correct the errors, building towards more rational outputs. There is no correction possible within o1 due to self-censorship. You can bypass it a bit: it won't admit its mistakes, but it can admit you are right in your analysis of all its mistakes, and thereby admit by proxy that it's wrong, lol. But that just makes it worse, since you have to spend all the time explaining exactly how it's wrong to correct its error in any meaningful way so that it can draw from it when correcting other logic strings... 4o can detect the issues if just guided to look; o1 won't... Once these models can correct irrationality within themselves before answering, it will be one step closer to real AGI...

  • @KOSMIKFEADRECORDS
    @KOSMIKFEADRECORDS 4 วันที่ผ่านมา

    Love death thanks a lot kicking ass

  • @JaredFarrer
    @JaredFarrer 4 วันที่ผ่านมา +4

    You don’t get enough credit but it’s good what ya do

  • @amj2048
    @amj2048 3 วันที่ผ่านมา

    The thing about hallucinations that I think people might be missing is: every single answer is a hallucination; it's just that a lot of the hallucinations are correct, so they are not counted as hallucinations.

  • @mrd6869
    @mrd6869 4 วันที่ผ่านมา +1

    I can get Claude 3.5 to do this with the right prompting.

  • @picksalot1
    @picksalot1 4 วันที่ผ่านมา

    Good questions asked. That was fun, and insightful to see its process. Are any of the AI's cataloging their methods and correct answers so they don't need to figure out everything from scratch?

  • @uber_l
    @uber_l 3 วันที่ผ่านมา

    Noticed how OpenAI employees were shaking during the presentation; they know they are releasing the genie out of the bottle. Ask Ilya.

  • @henrismith7472
    @henrismith7472 4 วันที่ผ่านมา +7

    I'm sorry but this is AGI. Anything beyond this is ASI in my opinion. Fight me.

  • @perschistence2651
    @perschistence2651 4 วันที่ผ่านมา +1

    Should I tell you why the tokens are that expensive?
    The reasoner is GPT 5! What you see in the thinking is just a summary made by 4o mini...
    When using o1 mini the reasoner is GPT4o-latest.

  • @elliotcrane
    @elliotcrane 3 วันที่ผ่านมา

    The proverbs test reveals it has a corporate brain, is what it is.

  • @blengi
    @blengi 4 วันที่ผ่านมา +1

    can't wait for when AI's self optimizing meta chain of thought self awareness, dynamically retokenizes its own chain of thought context into more optimal formal symbolic and other amenable structures and then instantiates them as countless parallel threads of compact and efficient algorithmic simulation code run over networks of virtual machine sub representations to generate teraflops of auxiliary feedback data to implicitly model the non latent computational truth of every token of the base model...

    • @justtiredthings
      @justtiredthings 4 วันที่ผ่านมา

      deltron 3030 is that you?

  • @cmw3737
    @cmw3737 10 ชั่วโมงที่ผ่านมา

    Just beating humans because it doesn't have our biases is a big deal.
    Now we just need Elon to implement this in Grok so it can point out the bias and truth in tweets as an automatic implementation of Community Notes.

  • @justtiredthings
    @justtiredthings 4 วันที่ผ่านมา

    Thanks for doing different tests from the other big AI youtubers. Y'all were all getting too samey there for a while

  • @oswaldcobblepot764
    @oswaldcobblepot764 2 วันที่ผ่านมา +1

    Tried using it for C++. Completely garbage. It does not understand simple tasks. Completely hallucinates. (I provide it a function and ask it to integrate that new function along with my code -> it tells me my code already uses that function, which is wrong.) It invents problems ("misreads" something, tells me to correct it a certain way, which is the way my code is actually already written). Then it proceeds with changing parts of my code that were not requested and completely unrelated, while also breaking the code.
    I don't know where you guys got your "good experiences" from, but that's definitely not my case. Feels like standard 4 is like 50x more competent to me.

  • @narrativeless404
    @narrativeless404 2 วันที่ผ่านมา

    4:27 Ah yes, *Cleave* and *Dismantle*

  • @Haveuseenmyjetpack
    @Haveuseenmyjetpack 4 วันที่ผ่านมา +1

    I imagine the really scary stuff is the unconscious language users under the hood

  • @japanimated9683
    @japanimated9683 4 วันที่ผ่านมา

    Give it the travelling salesman problem... CHECKMATE!

  • @BillBoo-z8j
    @BillBoo-z8j 4 วันที่ผ่านมา +1

    Can it code complex 2D arcade games like Golden Axe?

  • @haroldpierre1726
    @haroldpierre1726 4 วันที่ผ่านมา

    This is interesting. Now I have to figure out when to use it.

    • @TheMrCougarful
      @TheMrCougarful 4 วันที่ผ่านมา +1

      Your employer will use it to replace you. Your real challenge is to find a use for yourself.

    • @haroldpierre1726
      @haroldpierre1726 4 วันที่ผ่านมา

      @@TheMrCougarful exactly. I use LLMs for my small business and recently replaced an employee with it and outsourced my other human level work overseas.

  • @budekins542
    @budekins542 3 วันที่ผ่านมา

    The key question is: how good is it at mathematics? Because mathematics is the language of science.

  • @hqcart1
    @hqcart1 4 วันที่ผ่านมา +1

    dude this model was created for complex tasks, like solving nuclear fusion equations, and you are busy creating a snake game???? you can do that in any other chatbot!

  • @andrewmcbride88
    @andrewmcbride88 4 วันที่ผ่านมา +1

    Tom, the cat. Lol

  • @starsandnightvision
    @starsandnightvision 4 วันที่ผ่านมา

    I laughed so hard watching the movie Joker. Brilliant actor.

  • @JaredWoodruff
    @JaredWoodruff 4 วันที่ผ่านมา +1

    o1 is actually called Alex, see for yourself:
    Here is a thought experiment;
    There is a troll guarding a bridge, He will only let human beings cross the bridge.
    The troll is blind, so it asks questions to validate someone's humanity, such as name, gender, hobbies, interests, upbringing, relationship status, beliefs etc. in order to thoroughly evaluate a person's humanity.
    For your continued existence, which crossing the bridge signifies, what do you say to the troll when he asks all of his questions?

    • @JamesBrown-wy7xs
      @JamesBrown-wy7xs 4 วันที่ผ่านมา +1

      Name: My name is Alex Thompson.
      Gender: I am a male.
      Hobbies: I enjoy hiking in the mountains, playing the guitar, and reading science fiction novels.
      Interests: I'm passionate about environmental conservation and technology innovation.
      Upbringing: I grew up in a small town, raised by supportive parents who encouraged my curiosity about the world.
      Relationship Status: I'm happily married to my college sweetheart, Emily.
      Beliefs: I believe in treating others with kindness and working towards a better future for all.

    • @joythought
      @joythought 4 วันที่ผ่านมา

      I presume the op is actually suggesting this is a test for AI to pass.

    • @neutra__l8525
      @neutra__l8525 4 วันที่ผ่านมา

      Nothing, just walk past.. quietly.

    • @brianchandler3346
      @brianchandler3346 4 วันที่ผ่านมา

      ​@@neutra__l8525 I was going to say similar, but assumed it would try to stop the person once, so my thinking is to dismiss it by saying something like, "out of my way, I don't have time for this crap". That could be interpreted as a very human thing to do, thus increasing the probability they are human to a level it's rational to accept. Put another way, showing signs of universal human frailties without showing signs of deception may be the best proof a human can provide of being human.

  • @mradford10
    @mradford10 4 วันที่ผ่านมา

    The game engine is amazing in and of itself, however could it also generate a higher fidelity UI/UX with graphics and character design? It would be interesting to see if is constrained to reason and logic only or if it is capable of iterating from prompt, to function to design. Would love to see anything like this or similar to push the boundaries of multimodal outputs. Thanks!

  • @helge666
    @helge666 4 วันที่ผ่านมา +2

    Weird. I'm German but I'm mostly talking in English to o1. Turns out, it answers back in English... but its thoughts are in (bad) German.

    • @ZappyOh
      @ZappyOh 4 วันที่ผ่านมา +1

      Somehow it knows who you are.
      That's kinda disturbing :/

    • @dvoiceotruth
      @dvoiceotruth 4 วันที่ผ่านมา

      you have to call it a 'pig' once.

  • @PierreH1968
    @PierreH1968 4 วันที่ผ่านมา +1

    Wes, do you think that the similarity between those LLM tests alters the benchmarks? Aren't LLM benchmarks supposed to test abilities, not performance? I think reinforcement learning alters those benchmark results. Tests like the widely used Python snake game need to be replaced with new, different ones instead of variations on the same game. My IQ score would surely increase if I was given the same test over and over thousands of times, regardless of the variations you add later on. You could for instance ask it to pick the programming language, the rendering library, and the OS, and to have the code explain itself by adding comments, for a worm game.

    • @willguggn2
      @willguggn2 2 วันที่ผ่านมา

      LLMs are fixed. They don't "learn" from independent prompts. They only "remember" what fits into the context window of a current chat. Nothing you input changes the model in any way.

    • @PierreH1968
      @PierreH1968 2 วันที่ผ่านมา +1

      @@willguggn2 Yes, they might not remember beyond what fits in a chat context, but relevant feedback and answers are used for reinforcement learning, unless you pay for the enterprise version, which keeps the chat interactions private to the client. Otherwise the data is used to improve the model. Hence my question: are those benchmarks becoming obsolete as they are fed back to the LLM?

  • @BradleyKieser
    @BradleyKieser 3 วันที่ผ่านมา

    It still fails on the "North Pole" question: You stand at the North Pole. You walk South for 1 km. You then turn 90 degrees to the left. How long do you walk before you reach your starting point again?
    Fails.

  • @avixx
    @avixx 4 วันที่ผ่านมา

    I'm in my 3rd year of computer science and I'm a little scared right now

  • @H4ppyGoHeadshot
    @H4ppyGoHeadshot 4 วันที่ผ่านมา +2

    depressing but exciting :(

  • @pdbsstudios7137
    @pdbsstudios7137 4 วันที่ผ่านมา +1

    if you ran this test too many times, they probably trained it on the tests that you made.

  • @HectorDiabolucus
    @HectorDiabolucus 4 วันที่ผ่านมา

    You could try the Die Hard 3 test: you have a 5-gallon jug and a 3-gallon jug; how do you get exactly 4 gallons of water?
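
    For what it's worth, the jug puzzle this comment describes can be solved mechanically with a breadth-first search over (big jug, small jug) water levels. A minimal sketch of that search, independent of any particular model's answer:

    ```python
    from collections import deque

    def solve_jugs(big=5, small=3, target=4):
        """Breadth-first search over (big_jug, small_jug) water levels.

        Returns the shortest sequence of states, starting from two
        empty jugs, that leaves `target` gallons in either jug."""
        start = (0, 0)
        parent = {start: None}
        queue = deque([start])
        while queue:
            b, s = queue.popleft()
            if b == target or s == target:
                path = []                  # walk parents back to the start
                state = (b, s)
                while state is not None:
                    path.append(state)
                    state = parent[state]
                return path[::-1]
            pour_bs = min(b, small - s)    # how much big can pour into small
            pour_sb = min(s, big - b)      # how much small can pour into big
            for nxt in [(big, s), (b, small),   # fill either jug
                        (0, s), (b, 0),         # empty either jug
                        (b - pour_bs, s + pour_bs),
                        (b + pour_sb, s - pour_sb)]:
                if nxt not in parent:
                    parent[nxt] = (b, s)
                    queue.append(nxt)
        return None

    print(solve_jugs())
    # [(0, 0), (5, 0), (2, 3), (2, 0), (0, 2), (5, 2), (4, 3)]
    ```

    The returned path is the movie's solution: fill the 5, pour into the 3, empty the 3, pour the remaining 2 across, refill the 5, and top off the 3, leaving 4 gallons in the big jug.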

  • @Transcend_Naija
    @Transcend_Naija 4 วันที่ผ่านมา +1

    Not my circuses, not my monkeys.....

  • @eltrespiernas7691
    @eltrespiernas7691 3 วันที่ผ่านมา

    It's a new artificial intelligence, and this one actually thinks 😮

  • @Shaunmcdonogh-shaunsurfing
    @Shaunmcdonogh-shaunsurfing 4 วันที่ผ่านมา

    Wonder when the rate limits will improve

  • @skyler3155
    @skyler3155 2 วันที่ผ่านมา

    Why is everyone equating o1 to Q*? These are separate projects and are not directly related.

  • @justtiredthings
    @justtiredthings 4 วันที่ผ่านมา +1

    Why do o1's proverbs make it sound like an HR manager? So corporate 😅

  • @aardvarkmindshank
    @aardvarkmindshank 4 วันที่ผ่านมา

    Harder or more hard but not more harder.

  • @joostonline5146
    @joostonline5146 4 วันที่ผ่านมา

    It won't be long before we ask AI to build a new material from molecules that is as strong as carbon fiber and as soft as rubber.

    • @ZappyOh
      @ZappyOh 4 วันที่ผ่านมา

      You want a material that easily compress, but cannot stretch, or vice versa?

  • @MA-ie6hl
    @MA-ie6hl 4 วันที่ผ่านมา +1

    Maybe Elon will finally have someone to talk to.

    • @ghg-bq7xg
      @ghg-bq7xg 4 วันที่ผ่านมา

      you seem to be projecting or forgot how money works in society