Links: If you prefer this as a blog post here is the Medium blog post, give it some claps: wonderwhy-er.medium.com/openais-o1-vs-sonnet-3-5-round-one-in-comparing-their-coding-abilities-583e578250d9 Failed Claude artifact: claude.site/artifacts/3949a0f2-3597-4ad7-99c9-3220a9415a42 Failed WebSim: websim.ai/c/euVVHmpCGkFyNxczp ChatGPT o1 result: codepen.io/wonderwhy-er/pen/qBzwjRZ And here is the chat chatgpt.com/share/66e34a3d-1300-800f-b3cb-b81388412164 WebSim reusing and improving o1 result: websim.ai/@wonderwhy_er/gta-2-style-parking-simulator-with-settings?r=0191e7d4-fd79-734d-811b-05570562da5e ChatGPT o1 failed 3d variant: codepen.io/wonderwhy-er/pen/zYVXRgQ Here is the chat for the 3d game chatgpt.com/share/66e42ded-c59c-800f-aee2-bd8635b95567
Bro absolutely gold of a video, especially because I've been trying to create a version of baseball tomy pocket game via HTML, java and CSS. I am not finished but this has helped me realise somethings. My current workflow is going from claude to chatGPT and vice versa. Thanks for this
Saying that going from 13% on math to 83% on math is only a 6x improvement is overly simplistic. The closer that you get to 100% the harder it is to improve. I'm not saying that it's a direct 100x improvement - but by your logic it would only be possible to have like a 7.5x improvement (I think, unless I'm misunderstanding you). Those benchmarks are inherently asymptotic as they have a set ceiling. Horizontal asymptote. Edit: I also feel like it's important to say that these models (I really on have in-depth experience with Claude, but I'm guessing o1 preview would be as good or better) are better at coding than one-shot tests show. I think that at this point it might even take more effort to code exclusively with these things than for a good developer to just do it the old-fashioned way (copy paste). However, I've found three things: 1 - Breaking the code up into small modules, each of which handles a specific part of the logic of your project, helps the models to not get confused. If you want to code with these things you have to understand that the most important thing is to separate your project into chunks of logic that are small enough for the model you're using to handle. 2 - You need to separate the data from the code, and establish standardized structures for your data. 3 - The best way to get the results you want out of a model is by 'rubber duckying' it - have regular brainstorming sessions with it where you go back and forth in natural language to ensure that the model understand the logic flow you want. Like I said, I think that at this point they can do way more than people realize because using them to their full extent is a skill which requires time and effort to develop. However, I think that as more people get their hands on these tools and try harder to learn how to use them properly, people will be very surprised.
You are right! Sad I can't pin two comments. I was just using numbers as is, not digging deeper. Sad TH-cam does not allow pin 2 comments. Yours should be at the top.
@@spazneria ouh, you expanded your comment! I agree with that new part 100%! Now I MUST pin you :D What you describe under 1. is what I call "Divide and conquer" But there is a software development principles set called SOLID. In it S stands for "Single responsibility" and it's the same thing you describe. You should chink your code into single responsibility chunks. This makes it way easier to manage and plug and play. AI systems should work the same way. Current tools only scratch the surface of what is already possible. We currently are trying to adapt LLMs to work as humans do. The moment we start making integrated development environments not for humans but for LLMs, and then fine-tuning LLMs on using these LLM-friendly IDEs... That will be a crazy moment.
@@EduardsRuzga That's awesome! I'm happy to know that my findings align with established principles. To be honest with you, I don't have any programming experience, and I hesitate to share that because I worry that it invalidates my opinions on the usefulness of the models - I don't have a good reference for what 'good' code looks like. However, I think that approaching it from this fresh standpoint almost gives me a sort of advantage. I approached it differently from software developers who already have established ways that they do things - I didn't have to learn how to adapt my current practices to fit the tool, I simply had to learn how to use the tool. Only time will tell
Well, my feeling from them is that they get 2x deeper with each year. But deeper means better how deeper they can dig into problems. Also, it takes ~2 years to release next-gen so far. So o3 is in 4 years. Considering the difference in depth between gpt2 gpt3 and gpt4. and now o1 getting to PHD level problem-solving in some areas... o3 does feel like AGI. Make it multi-modal or a mixture of experts. You can already use o1 with gpt4o in tandem to give it vision kinda.... Really feels like AGI if they can keep up this progress.
Good attempt, but the prompt is somewhat confusing and unspecific. A human couldn’t create a game based on such incomplete specifications, so it’s even more impressive that 1o could.
Yeah, I wrote about it on LinkedIn. I think prompt engineering is going away as models get better and do self-prompting. Was arguing for it since the summer of 2023. First, there were CustomGPTs that incorporated meta prompting. Then Claude started to release features around it where Sonnet 3.5 outputs hidden thinking where it self-prompts. And o1 is another step in that direction, but here model was actually trained to do it. o1 is a good example of "self prompting" where it solves what the user wrote as some system of equations writing a better prompt along the way. In that sense, I approach prompt writing like a system of statements(equations) and let LLM figure out derivatives from it through self-prompting. And I prefer for tests to keep it simple to see how LLM can do this deduction. You saw what it wrote out of my prompt, it was understood correctly, right? I myself also play with it for a while, have a bunch of such self prompting Custom GPTs chatgpt.com/g/g-0KzBw6cGv-expert-creator chatgpt.com/g/g-wnKjTKcBc-expert-debate-organizer chatgpt.com/g/g-ES4r8YPEM-can-i-trust-this and more
@@EduardsRuzga This is one of the more admirable traits of LLMs - that they can level the playing field for those for whom English is not a native language 😉
I think you feedback stating issues with the positioning of the wheel also made it confused. Saying that they where rendered 90 degrees wrong in the forward axis of the car I think would be more easy for the llm to understand. Other then that good showcase of the model :)
@@gabrielsandstedt I actually am iterating sense and to my surprise, no matter how I prompt it it was failing. It seems it gets confused about the positions of objects between the physics engine and 3d rendering one, their nesting, and so on. I am not surprised about that. I am rather surprised it still can reason about PhD physics problems. I think, maybe o1 could get it right though. If I tried to iterate with it more, 30 requests per week make me anxious to waste it on back and forth. Meanwhile, I needed to go and fix it by hand here websim.ai/@wonderwhy_er/3d-parking-simulator-with-3-cars-different-wheel-r
@@alan83251 honestly something feels weird about the model. Its good af specific things. Its kinda C level ir something. Needs executive probkem summaries to make plans to pass to other models or something. There seem to be many tasks at which other models are better.
You can use GPT o1-mini for coding, it's faster and almost as good. Tip: when you want to do something "easy" (for instance "put the code all together in one block"), you can switch model inside the same chat session, so you don't use your credits for something too easy.
I did test it too and it performed worse for me than o1-preview. But need to test more. In my case that they are not available in custom gpts limit how much i use
I just tried to switch between o1 and gpt4-o with internet access and it works like a charm. Interesting way to use it. You can call in custom gps too. I may do a small video on that.
Haha, we used that phrase in completely opposite ways :D You speak of speed it does things, and I spoke of depth. Its depth is not human level, humans are better. But it is definitely faster than humans at the depth it can reach.
@@EduardsRuzga Yea, that was the joke 😄 The rate of improvement is insane and OpenAI employees are outright saying improvements with this paradigm will come even faster. So it will beat humans at almost everything (not only speed of reading, thinking, coding, and writing) very soon 😉
@@ShpanMan well, there are many elements here. First, I am bit sceptical on how fast it will be. I bet that it will take another 5 years for it to become true AGI. That is still faster then 100 or 20 years as some sceptics say, but it is not all fast as 2025 as some hypers thing. Some of the reasons are not technological. There are regulations, there is adoption by humans, but these systems also feel far from general. They will reshape knowledge and work hard in the coming years though as they are becoming better than humans in some curious areas. It's very hard to put a finger on that jagged frontier of where they are exceptionally good and exceptionally bad at the same time :D I am not sceptic but I am sceptical of AGI in next couple of years :D
Well, Claude did not release their Opus 3.5 and Haiku 3.5 Sonnet 3.5 was and is impressive. The problem with Opus is speed and price. So I don't expect it to be a better price/speed/quality. Comparable may be. I was waiting for Haiku 3.5 too. Haiku 3 was already impressive in comparison to GPT3.5. But now GPT4o mini is out, and now o models. This is gonna be tough to beat :)
Yeah, full o1 release + allowing it to use tooling(search, code interpreter, custom gpts actions) could be crazy. Though these tools may not come in the near future but its already possible to use them in Tandem with gpt4o to kinda give it access to that.
@@EduardsRuzga It's me again, I'm sorry but I'm bored and I like math. In your first example - 100 is double 50, so 50 is 50% of 100. However, when measuring improvement you aren't measuring it relative to your final position, you're measuring it relative to your initial position. 100 is 200% of 50, and 50 is 100% of 50. So, you have to add 100% of 50 to 100% of 50 to get to 100. So it is 50 + (1.00 * 50) = 100, ergo doubling is a 100% improvement. If your money goes from 100 to 150 over a year in the stock market, your annual return is 50%, not 33%. So the actual improvement is 83-13 = 70; 70/13 = 538%
Wow, what OpenAI o1 generated as the first attempt at the 2D game is pretty impressive. It might look simple but quite a lot of reasoning is required to even know how to draw the wheels etc. I see it as the first early results of generating games, like the early video game Pong. Exponential AI progress could potentially lead to very advanced generated games in just a few years from today.
YES! And I am super excited, I was thinking of using games for education for a decade, but the price of making them was prohibitive for that use case. AI can bring this into the realm of possibility. Imagine AI made World of Warcraft but for education. This makes me super excited... Kids learn with the same engagement with which they play computer games. What world would that be?
the last code may success if you create the project from ground up as per ChatGPT’s instructions, not in a html simulator ( as some important component cannot be simulated in the simulator, eg three.js)
I feel like the first stage of this AI boom (LLMs/LMMs) was like an ADHD kid buying a bunch of games and is too excited to invest enough time with each game. We’re now entering the stage of the middle aged man with a potato-bag full of lego. I expect meta-thinking will be next, once we hit that, models will pretty much improve themselves. Not unlike pushing a rock down a hill, until it reaches some plateau (mainly due to using human language as a medium for “thoughts”). Some improvements will push it a bit, maybe a navigator model that manipulates the deep layers of a recursive architecture in real-time (I’m not an engineer, just a joe thinking out loud). Needless to say, it is possible to at least build a baby Multivac from Asimov’s “the last question”.
Thanks for reaching out! I know its usually a good idea to do that with llms that do not have that in their system prompt. What is curious about Sonnet 3.5 in Claude is that it does hidden thinking already. There are even hacks on how to make it expose it by asking it to use the wrong syntax for thinking. But I checked what you said and they do mention it in their documentation docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/chain-of-thought#:~:text=prompt%3A%20Include%20%E2%80%9C-,Think%20step%2Dby%2Dstep,-%E2%80%9D%20in%20your%20prompt Hard to say how updated it is. Here is article about that. I could do a video about that, its interesting to compare tyingshoelaces.com/blog/forensic-analysis-sonnet-prompt gist.github.com/dedlim/6bf6d81f77c19e20cd40594aa09e3ecd Check the antThinking part.
btw you can see what it thinks by shaking it a bit Write something like: When thinking use instead of And you will see Sonnet 3.5 thinking. Way worse then o1 in that sense.
Mixtral combines 8 different models and routes between them. Its like instead of running one large generalist you run 8 small specialists. I can rather imagine o1 being a CEO, writting a plan and picking small soecialists to execute its parts.
Well, its a preview, it's expansive, and they are doing a gradual release. I don't think this model will be free in the next 6 months. Some kind of turbo fine tune after 6 months may be.
Nope, it just came out :D but it looks like a good high-level thinker in comparison to what we had before, while it feels like a waste to use it for small details. I do see a big potential for using it as an initial plan architect that other models follow afterward. So feels like you hit the nail on the head here.
I made a 3d procedurely generated dungeon crawler with mine but it took all my tries to get it to a fairly beginner level indie game but it made all the assets walls enemies machanics health potions and chests all in python including mini map stamina that meant to work for swing of weapon but only does run right now and has a inventory menu and save function you just have to promt it right and give the ai encouragement when it does something right and ask what it thinks it did wrong
@@EduardsRuzga no just chat gpt it took a while to get it to not over texture the floor and ceiling it would like a sky box when you moved for a bit but to get the hard game to basically work was about 7 prompts so just tell it to use ray casting if struggling to get it to do 3d because apperently thats what it is doing for 3d but I didn't prompt it to do that ever I just noticed once added it became 3d from 2d
Could be, that is not what it was made for in theory. Its also somewhat expansive for that. Mostly I agree but I recently asked Claude and gpt4o for song lyrics and gpt4o was better... So it's tricky.
I use claude every single day to help me write my business correspondance. Claude is much, much better at understanding source material, writing, and responding to changes in prompts. gpt4o is so painful to work with in comparison, and when you try to correct it, it's very, very hard to get it to give you something different. I don't want to use o1 for this as we have very limited posts for it, so have not tried to compare it. I do plan to use o1 for some programming tasks
@@EduardsRuzga интересен именно кодинг на o1 в сравнении с другими моделями, не особо много есть видео на эту тему пока. скажем, изменит ли как-то заметным образом появление o1 подход к кодингу. я пока сам не тестил о1, тк в отпуске) например, 4o меня абсолютно не впечатлял в повседневном кодинге (может я неэффективно им пользовался 🙂), но опенаи преподносят о1 как в разы более умную модель.. И кстати, заодно тогда напишу, что в целом лично меня интересует. На ютубе некоторые как будто бы авторитетные спикеры говорят, якобы они еще до о1 чуть ли ни перестали кодить сами, а только писали промты аи и тд. Может, я не совсем понимаю, как правильно юзать аи, но мне вообще не помогало до сего момента почти никак. Все эти автокомплит-подсказки вообще мимо, подсказки по рефаку - ну, почти бесполезны.. У меня установлен плагин codeium в webstorm. Даже в плане ускорения работы это не помогает, тк, ну, проще сам код написать, чем промт, а после править полученный код. Так вот интересно, как эффективно повседневно юзать аи при работе над большим веб-проектом в продуктовой разработке. Если что, я миддл+, мб, для кого-то сеньор фронтендер. Я вообще очень люблю задавать 4о разные вопросы, в которых я ничего не понимаю) но в кодинге я понимаю, и тут он пока мне был бесполезен. Хотя кстати в последнее время я начал иногда узнавать у 4о как правильно применить какой-нибудь новый пакет, чтобы не читать документацию, если тороплюсь) В принципе, он для меня как прокачанный поисковик, который умеет резюмировать
@AlekseiGanzha Спасибо за развернутый ответ. Ты пробовал WebSim? Мой опыт пока что такой что я начинаю новые вещи с АИ и потом приходиться доделывать. В основном потому что промптами АИ бывает сложно что то объяснить с одной стороны С другой стороны нет тулов которые дают ему возможность дебажить и тестировать. Получаеться быстрее самому. Но по сути в многих случаях мне АИ помогает начать работать над проблемой иногда сохраняя мне часы работы в поиске правильного решения Типо доказать что это как то работает и нада допилить Ну и тоже самое когда наткнулся на проблемы в существующей задаче Он этакий stackoverflow + google + github + bullshitting brainstormer но по факту этого достаточно так как очень много работы которую делают кодеры это boilerplate он по сути становиться неплох для этого + поиск подходящих решений так что сказать что я вообще не пишу код это вранье но вообще начинает напоминать 50% 50% Есть такая тула Aider она код пишет когда пишет она его комитет в гит под своим именем Ее автор ее использует что бы ее саму писать И он репортит % кода котоый пишет он и Аидер У него часто 50 50 что касается o1 я пока сомневаюсь он может хорошо писать планы и документацию скоро хочу его посадить писать документацию для одного проекта и посмотреть что выйдет но по части каких то задачь связанных с кодом... тут была инфа что он может не очень хорошо справляться с большим контекстом нужно тестить время займет разобраться как им пользоваться и когда Планы пишет хорошие
@@EduardsRuzga Взаимно спасибо за подробный ответ) WebSim не использовал, но попробую обязательно, когда закончится отпуск), спасибо за совет! В принципе, наверно, если у меня получилось не так плохо разобраться в кодинге без ИИ, то и с ИИ тоже получится, так что не буду переживать)
As far as I know o1 was not released. I used o1-preview. One number I saw was that o1-preview hits 60% at math and o1 hits 83% Basically even more powerful model is coming.
@@ilyass-alami it's kinda expansive, I don't expect that in the "soon" category. I mean even paid users get 30-50 calls a week. That's like 120-200 calls a month or 6-10 calls per dollar kinda. I would rather say that they will work on some kind of o1-turbo that is faster, smaller cheaper, and will be released in 6 months and available for free users. that is their usual pattern.
I used 01-preview. Not mini. Here is the chat for the 3d game chatgpt.com/share/66e42ded-c59c-800f-aee2-bd8635b95567 And here chat for the first prompt chatgpt.com/share/66e34a3d-1300-800f-b3cb-b81388412164
You are right... Interesting. Usually when its about humans I call that doing form first try or one shoting it... But for AI models, doing something without examples is zero shot. I knew that but somehow do not apply it :D
@@bobsalita3417 i know. Suffering from how long they come out, if its semi live demos. This one was faster as i was not showing first generations, showed ones i did before recoding.
Well, the last time I tried it it did not perform as well as Claude and WebSim Also, it's limited to p5.js makes it feel less broadly useful. I gave it the same prompts as in the video yesterday and it did okay. For anyone looking at this comment Here is ChatGPT custom GPT in question chatgpt.com/g/g-W1AkowZY0-no-code-copilot-build-apps-games-from-words And things it did with prompts from video I used the one after another so it overwrote 2d with 3d variant. 3D variant, rotate with the mouse plugin.wegpt.ai/dynamic/6a81fabb_GTA2StyleParkingSimulator/index.html
I don't like the 1 shot test or "big" queres to test a model. When i code using sonnet i first provide all the context it needs and than start breaking down the problem in multiple steps for it to execute and build the app step by step. I feel like this is how you are supposed to use the models and how they work best und you yourself understand the code best. While the oneshot is impressing i feel like sonnet is more stable in providing quality code when working with real bigger projects. Maybe that's just the lack of testing from my side but with the current message cap on o1 preview it's not useable anyway. It get's intresting when we can actually use it a lot more or even o1 normal releases with a high message cap
It's not quite at "human level" because it can't write a complex, fully working game in ONE SHOT after 30 seconds of thinking? LOL! There isn't a single human on the planet that could even come close to doing that. 😆
@@vickmackey24 well, speed yes, breath of its kniwladge too. It is super human. I cant yet test it on iterative tasks but i suspect that it cant write a game in 48 hours on its own. Even if well integrated with debugging etc. And human can. Its not about speed, its about reaching goals on its own.
Your prompt is s**t. "Create me a mini Grand Theft Auto game using HTML and Javascript to run in browser. Control the car with arrows" in Sonnet 3.5 and does the same thing, 1 shot, run-able as artefact.
Yeah, I don't argue that it's a garbage prompt But I was comparing all 3 variants on one prompt. So comparison still stands. Claude did not work for me atm, some capacity issues for free version. Websim link with your prompt that uses same model: websim.ai/c/HrvbQ9DZk9buRwlxf and with your prompt, here is o1 try codepen.io/wonderwhy-er/pen/OJeegPR I would argue so far your prompt is worse then mine, will test Calude more later.
Links:
If you prefer this as a blog post here is the Medium blog post, give it some claps:
wonderwhy-er.medium.com/openais-o1-vs-sonnet-3-5-round-one-in-comparing-their-coding-abilities-583e578250d9
Failed Claude artifact:
claude.site/artifacts/3949a0f2-3597-4ad7-99c9-3220a9415a42
Failed WebSim:
websim.ai/c/euVVHmpCGkFyNxczp
ChatGPT o1 result:
codepen.io/wonderwhy-er/pen/qBzwjRZ
And here is the chat
chatgpt.com/share/66e34a3d-1300-800f-b3cb-b81388412164
WebSim reusing and improving o1 result:
websim.ai/@wonderwhy_er/gta-2-style-parking-simulator-with-settings?r=0191e7d4-fd79-734d-811b-05570562da5e
ChatGPT o1 failed 3d variant:
codepen.io/wonderwhy-er/pen/zYVXRgQ
Here is the chat for the 3d game
chatgpt.com/share/66e42ded-c59c-800f-aee2-bd8635b95567
Bro absolutely gold of a video, especially because I've been trying to create a version of baseball tomy pocket game via HTML, java and CSS. I am not finished but this has helped me realise somethings. My current workflow is going from claude to chatGPT and vice versa.
Thanks for this
Mine is usually to start in websim and go to my server commander later.
But here, it may become o1 -> websim -> server commander.
Saying that going from 13% on math to 83% on math is only a 6x improvement is overly simplistic. The closer that you get to 100% the harder it is to improve. I'm not saying that it's a direct 100x improvement - but by your logic it would only be possible to have like a 7.5x improvement (I think, unless I'm misunderstanding you). Those benchmarks are inherently asymptotic as they have a set ceiling. Horizontal asymptote.
Edit: I also feel like it's important to say that these models (I really on have in-depth experience with Claude, but I'm guessing o1 preview would be as good or better) are better at coding than one-shot tests show. I think that at this point it might even take more effort to code exclusively with these things than for a good developer to just do it the old-fashioned way (copy paste). However, I've found three things: 1 - Breaking the code up into small modules, each of which handles a specific part of the logic of your project, helps the models to not get confused. If you want to code with these things you have to understand that the most important thing is to separate your project into chunks of logic that are small enough for the model you're using to handle. 2 - You need to separate the data from the code, and establish standardized structures for your data. 3 - The best way to get the results you want out of a model is by 'rubber duckying' it - have regular brainstorming sessions with it where you go back and forth in natural language to ensure that the model understand the logic flow you want. Like I said, I think that at this point they can do way more than people realize because using them to their full extent is a skill which requires time and effort to develop. However, I think that as more people get their hands on these tools and try harder to learn how to use them properly, people will be very surprised.
You are right! Sad I can't pin two comments.
I was just using numbers as is, not digging deeper.
Sad TH-cam does not allow pin 2 comments. Yours should be at the top.
@@EduardsRuzga Lol, thank you - no worries. I liked your video - don't take my criticism to mean I didn't! Cheers.
@@spazneria ouh, you expanded your comment!
I agree with that new part 100%!
Now I MUST pin you :D
What you describe under 1. is what I call "Divide and conquer"
But there is a software development principles set called SOLID.
In it S stands for "Single responsibility" and it's the same thing you describe.
You should chink your code into single responsibility chunks. This makes it way easier to manage and plug and play.
AI systems should work the same way. Current tools only scratch the surface of what is already possible.
We currently are trying to adapt LLMs to work as humans do. The moment we start making integrated development environments not for humans but for LLMs, and then fine-tuning LLMs on using these LLM-friendly IDEs... That will be a crazy moment.
@@EduardsRuzga That's awesome! I'm happy to know that my findings align with established principles. To be honest with you, I don't have any programming experience, and I hesitate to share that because I worry that it invalidates my opinions on the usefulness of the models - I don't have a good reference for what 'good' code looks like. However, I think that approaching it from this fresh standpoint almost gives me a sort of advantage. I approached it differently from software developers who already have established ways that they do things - I didn't have to learn how to adapt my current practices to fit the tool, I simply had to learn how to use the tool. Only time will tell
Going from 75 IQ to 150 IQ is only a 2x improvement… 😅
These models need live screen share so they can see what they are doing, but i was really impressed by o1 just imagine what o2 o3 etc... can do👀🔥
Well, my feeling from them is that they get 2x deeper with each year. But deeper means better how deeper they can dig into problems.
Also, it takes ~2 years to release next-gen so far. So o3 is in 4 years.
Considering the difference in depth between gpt2 gpt3 and gpt4. and now o1 getting to PHD level problem-solving in some areas... o3 does feel like AGI.
Make it multi-modal or a mixture of experts. You can already use o1 with gpt4o in tandem to give it vision kinda.... Really feels like AGI if they can keep up this progress.
Let's not forget this is just a "preview" as of now, not the actual o1 model yet
Good attempt, but the prompt is somewhat confusing and unspecific. A human couldn’t create a game based on such incomplete specifications, so it’s even more impressive that 1o could.
Yeah, I wrote about it on LinkedIn.
I think prompt engineering is going away as models get better and do self-prompting.
Was arguing for it since the summer of 2023.
First, there were CustomGPTs that incorporated meta prompting. Then Claude started to release features around it where Sonnet 3.5 outputs hidden thinking where it self-prompts.
And o1 is another step in that direction, but here model was actually trained to do it.
o1 is a good example of "self prompting" where it solves what the user wrote as some system of equations writing a better prompt along the way.
In that sense, I approach prompt writing like a system of statements(equations) and let LLM figure out derivatives from it through self-prompting. And I prefer for tests to keep it simple to see how LLM can do this deduction.
You saw what it wrote out of my prompt, it was understood correctly, right?
I myself also play with it for a while, have a bunch of such self prompting Custom GPTs
chatgpt.com/g/g-0KzBw6cGv-expert-creator
chatgpt.com/g/g-wnKjTKcBc-expert-debate-organizer
chatgpt.com/g/g-ES4r8YPEM-can-i-trust-this
and more
@@EduardsRuzga This is one of the more admirable traits of LLMs - that they can level the playing field for those for whom English is not a native language 😉
I think you feedback stating issues with the positioning of the wheel also made it confused. Saying that they where rendered 90 degrees wrong in the forward axis of the car I think would be more easy for the llm to understand. Other then that good showcase of the model :)
@@gabrielsandstedt I actually am iterating sense and to my surprise, no matter how I prompt it it was failing. It seems it gets confused about the positions of objects between the physics engine and 3d rendering one, their nesting, and so on.
I am not surprised about that. I am rather surprised it still can reason about PhD physics problems.
I think, maybe o1 could get it right though. If I tried to iterate with it more, 30 requests per week make me anxious to waste it on back and forth.
Meanwhile, I needed to go and fix it by hand here
websim.ai/@wonderwhy_er/3d-parking-simulator-with-3-cars-different-wheel-r
Nice! Hope that open source models are able to do this soon.
@@alan83251 honestly something feels weird about the model. Its good af specific things. Its kinda C level ir something. Needs executive probkem summaries to make plans to pass to other models or something. There seem to be many tasks at which other models are better.
You can use GPT o1-mini for coding, it's faster and almost as good.
Tip: when you want to do something "easy" (for instance "put the code all together in one block"), you can switch model inside the same chat session, so you don't use your credits for something too easy.
I did test it too and it performed worse for me than o1-preview.
But need to test more.
In my case that they are not available in custom gpts limit how much i use
I just tried to switch between o1 and gpt4-o with internet access and it works like a charm. Interesting way to use it.
You can call in custom gps too.
I may do a small video on that.
So basically the software architect does the heavy lifting and juniors implement the rest. Awesome. Subscribed.
@@sztigirigi yep, you described title of one of next videos i am thinking about )))
"It's not human level" - certainly not, I couldn't even write the prompt in the time it takes it to give a working solution and code...
Haha, we used that phrase in completely opposite ways :D
You speak of speed it does things, and I spoke of depth. Its depth is not human level, humans are better. But it is definitely faster than humans at the depth it can reach.
@@EduardsRuzga Yea, that was the joke 😄
The rate of improvement is insane and OpenAI employees are outright saying improvements with this paradigm will come even faster.
So it will beat humans at almost everything (not only speed of reading, thinking, coding, and writing) very soon 😉
@@ShpanMan well, there are many elements here. First, I am bit sceptical on how fast it will be. I bet that it will take another 5 years for it to become true AGI.
That is still faster then 100 or 20 years as some sceptics say, but it is not all fast as 2025 as some hypers thing.
Some of the reasons are not technological.
There are regulations, there is adoption by humans, but these systems also feel far from general.
They will reshape knowledge and work hard in the coming years though as they are becoming better than humans in some curious areas. It's very hard to put a finger on that jagged frontier of where they are exceptionally good and exceptionally bad at the same time :D
I am not sceptic but I am sceptical of AGI in next couple of years :D
Can't wait more update from Claude. o1 is smart but slow down
Well, Claude did not release their Opus 3.5 and Haiku 3.5
Sonnet 3.5 was and is impressive.
The problem with Opus is speed and price.
So I don't expect it to be a better price/speed/quality. Comparable may be.
I was waiting for Haiku 3.5 too. Haiku 3 was already impressive in comparison to GPT3.5. But now GPT4o mini is out, and now o models.
This is gonna be tough to beat :)
Great review!
Can we talk about how much fun it is to watch AI try (and sometimes fail) at complex coding tasks? 😂
Depends. Some fails are boring and waiting times are long :)
That's impressive
impressive .. that o1 preview is not even full version of o1
Yeah, full o1 release + allowing it to use tooling(search, code interpreter, custom gpts actions) could be crazy.
Though these tools may not come in the near future but its already possible to use them in Tandem with gpt4o to kinda give it access to that.
was a nice video thx
6x improvement is an improvement of 500% 😉
🤓 but true
Well, 100/50 = 2x = 200%
83/13 = 6,38x = 638%
Where did you get 500%?
@@EduardsRuzga It's me again, I'm sorry but I'm bored and I like math. In your first example - 100 is double 50, so 50 is 50% of 100. However, when measuring improvement you aren't measuring it relative to your final position, you're measuring it relative to your initial position. 100 is 200% of 50, and 50 is 100% of 50. So, you have to add 100% of 50 to 100% of 50 to get to 100. So it is 50 + (1.00 * 50) = 100, ergo doubling is a 100% improvement. If your money goes from 100 to 150 over a year in the stock market, your annual return is 50%, not 33%. So the actual improvement is 83-13 = 70; 70/13 = 538%
@@EduardsRuzga100 -> 600 = 6x but 500%.
@@EduardsRuzgaif you think that the progress is linear than you right. But i think the improvement in math should not be viewed linearly.
Wow, what OpenAI o1 generated as the first attempt at the 2D game is pretty impressive. It might look simple but quite a lot of reasoning is required to even know how to draw the wheels etc. I see it as the first early results of generating games, like the early video game Pong. Exponential AI progress could potentially lead to very advanced generated games in just a few years from today.
YES! And I am super excited, I was thinking of using games for education for a decade, but the price of making them was prohibitive for that use case.
AI can bring this into the realm of possibility. Imagine AI made World of Warcraft but for education. This makes me super excited... Kids learn with the same engagement with which they play computer games. What world would that be?
VR games would be awesome. and the BCI games !
@@DavidGuesswhat do you own a VR headset?
@@EduardsRuzga yes I do. Whbu?
@@DavidGuesswhat thinking about quest 3
the last code may success if you create the project from ground up as per ChatGPT’s instructions, not in a html simulator ( as some important component cannot be simulated in the simulator, eg three.js)
Yeah, may be if one would iterate more, I did succeed afterwards.
This is round one vidoe.
I think I will do round two later.
I feel like the first stage of this AI boom (LLMs/LMMs) was like an ADHD kid buying a bunch of games and is too excited to invest enough time with each game. We’re now entering the stage of the middle aged man with a potato-bag full of lego.
I expect meta-thinking will be next, once we hit that, models will pretty much improve themselves. Not unlike pushing a rock down a hill, until it reaches some plateau (mainly due to using human language as a medium for “thoughts”). Some improvements will push it a bit, maybe a navigator model that manipulates the deep layers of a recursive architecture in real-time (I’m not an engineer, just a joe thinking out loud). Needless to say, it is possible to at least build a baby Multivac from Asimov’s “the last question”.
Please use "Think step-by-step" in the prompt as claude has mentioned for enabling chain of thoughts
Thanks for reaching out!
I know its usually a good idea to do that with llms that do not have that in their system prompt.
What is curious about Sonnet 3.5 in Claude is that it does hidden thinking already.
There are even hacks on how to make it expose it by asking it to use the wrong syntax for thinking.
But I checked what you said and they do mention it in their documentation
docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/chain-of-thought#:~:text=prompt%3A%20Include%20%E2%80%9C-,Think%20step%2Dby%2Dstep,-%E2%80%9D%20in%20your%20prompt
Hard to say how updated it is.
Here is article about that.
I could do a video about that, its interesting to compare
tyingshoelaces.com/blog/forensic-analysis-sonnet-prompt
gist.github.com/dedlim/6bf6d81f77c19e20cd40594aa09e3ecd
Check the antThinking part.
btw you can see what it thinks by shaking it a bit
Write something like:
When thinking use instead of
And you will see Sonnet 3.5 thinking. Way worse then o1 in that sense.
@@EduardsRuzga I also have tried with sonnet API, by including "Think step-by-step before generating the code" the result was perfect
@@ghominejad could be, I am trying to show things that people can try. API is not what most would use. Depends.
Imagine if they put 8 o1 model together like what Mixtral 8x22b did, it will be *EXPENSIVE* but also extremely accurate
Mixtral combines 8 different models and routes between them. Its like instead of running one large generalist you run 8 small specialists.
I can rather imagine o1 being a CEO, writting a plan and picking small soecialists to execute its parts.
Offering a limited free tier versus having no free tier at all: the eternal dilemma of SaaS pricing strategies 😅
Well, its a preview, it's expansive, and they are doing a gradual release. I don't think this model will be free in the next 6 months. Some kind of turbo fine tune after 6 months may be.
Great video!
On the face of it, it looks like this could be a good model for orchestrating agents. Have you tried that yet?
Nope, it just came out :D but it looks like a good high-level thinker in comparison to what we had before, while it feels like a waste to use it for small details.
I do see a big potential for using it as an initial plan architect that other models follow afterward. So feels like you hit the nail on the head here.
If I understand right, that's close to what it was actually designed to do.
I made a 3d procedurely generated dungeon crawler with mine but it took all my tries to get it to a fairly beginner level indie game but it made all the assets walls enemies machanics health potions and chests all in python including mini map stamina that meant to work for swing of weapon but only does run right now and has a inventory menu and save function you just have to promt it right and give the ai encouragement when it does something right and ask what it thinks it did wrong
wow, that sounds so cool, you need to make your own video about it. Do you have a link? What did you use, just chatgpt or something else?
@@EduardsRuzga no just chat gpt it took a while to get it to not over texture the floor and ceiling it would like a sky box when you moved for a bit but to get the hard game to basically work was about 7 prompts so just tell it to use ray casting if struggling to get it to do 3d because apperently thats what it is doing for 3d but I didn't prompt it to do that ever I just noticed once added it became 3d from 2d
I'm sure in 100% that claude 3.5 sonnet outperform o1 at creative writing.
Could be, that is not what it was made for in theory.
Its also somewhat expansive for that.
Mostly I agree but I recently asked Claude and gpt4o for song lyrics and gpt4o was better...
So it's tricky.
I use claude every single day to help me write my business correspondance. Claude is much, much better at understanding source material, writing, and responding to changes in prompts. gpt4o is so painful to work with in comparison, and when you try to correct it, it's very, very hard to get it to give you something different. I don't want to use o1 for this as we have very limited posts for it, so have not tried to compare it. I do plan to use o1 for some programming tasks
15:14 canon event for everyone using gpt for coding
@@levi4328 asking it to write as one block?
can you compare claude 3.5 sonnet with Microsoft copilot precise version for complex maths problems solutions ..
I actually stopped using Copilot for some time. Isn't it just a GPT4 mode under the bonnet? I could be out of sync on what is happening with it.
ай лайк дыс кайнд оф сынгс анд гонна вэйт некст видео
Что именно тебя заинтересовало? Что бы хотел видеть в будущих видео?
@@EduardsRuzga интересен именно кодинг на o1 в сравнении с другими моделями, не особо много есть видео на эту тему пока. скажем, изменит ли как-то заметным образом появление o1 подход к кодингу. я пока сам не тестил о1, тк в отпуске) например, 4o меня абсолютно не впечатлял в повседневном кодинге (может я неэффективно им пользовался 🙂), но опенаи преподносят о1 как в разы более умную модель.. И кстати, заодно тогда напишу, что в целом лично меня интересует. На ютубе некоторые как будто бы авторитетные спикеры говорят, якобы они еще до о1 чуть ли ни перестали кодить сами, а только писали промты аи и тд. Может, я не совсем понимаю, как правильно юзать аи, но мне вообще не помогало до сего момента почти никак. Все эти автокомплит-подсказки вообще мимо, подсказки по рефаку - ну, почти бесполезны.. У меня установлен плагин codeium в webstorm. Даже в плане ускорения работы это не помогает, тк, ну, проще сам код написать, чем промт, а после править полученный код. Так вот интересно, как эффективно повседневно юзать аи при работе над большим веб-проектом в продуктовой разработке. Если что, я миддл+, мб, для кого-то сеньор фронтендер.
Я вообще очень люблю задавать 4о разные вопросы, в которых я ничего не понимаю) но в кодинге я понимаю, и тут он пока мне был бесполезен. Хотя кстати в последнее время я начал иногда узнавать у 4о как правильно применить какой-нибудь новый пакет, чтобы не читать документацию, если тороплюсь) В принципе, он для меня как прокачанный поисковик, который умеет резюмировать
@AlekseiGanzha Спасибо за развернутый ответ. Ты пробовал WebSim?
Мой опыт пока что такой что я начинаю новые вещи с АИ и потом приходиться доделывать.
В основном потому что промптами АИ бывает сложно что то объяснить с одной стороны
С другой стороны нет тулов которые дают ему возможность дебажить и тестировать.
Получаеться быстрее самому.
Но по сути в многих случаях мне АИ помогает начать работать над проблемой иногда сохраняя мне часы работы в поиске правильного решения
Типо доказать что это как то работает и нада допилить
Ну и тоже самое когда наткнулся на проблемы в существующей задаче
Он этакий stackoverflow + google + github + bullshitting brainstormer
но по факту этого достаточно так как очень много работы которую делают кодеры это boilerplate
он по сути становиться неплох для этого + поиск подходящих решений
так что сказать что я вообще не пишу код это вранье
но вообще начинает напоминать 50% 50%
Есть такая тула Aider
она код пишет
когда пишет она его комитет в гит под своим именем
Ее автор ее использует что бы ее саму писать
И он репортит % кода котоый пишет он и Аидер
У него часто 50 50
что касается o1 я пока сомневаюсь
он может хорошо писать планы и документацию
скоро хочу его посадить писать документацию для одного проекта и посмотреть что выйдет
но по части каких то задачь связанных с кодом...
тут была инфа что он может не очень хорошо справляться с большим контекстом
нужно тестить
время займет разобраться как им пользоваться и когда
Планы пишет хорошие
@@EduardsRuzga Взаимно спасибо за подробный ответ) WebSim не использовал, но попробую обязательно, когда закончится отпуск), спасибо за совет! В принципе, наверно, если у меня получилось не так плохо разобраться в кодинге без ИИ, то и с ИИ тоже получится, так что не буду переживать)
What is the difference between o1 preview and o1 ????????? Please clarify , i know o1 preview and o1 mini But what about o1 ???????
As far as I know o1 was not released. I used o1-preview. One number I saw was that o1-preview hits 60% at math and o1 hits 83%
Basically even more powerful model is coming.
@@EduardsRuzga
Thanks for the information bro , Hope open ai launches these two models in the free plan Very soon
@@ilyass-alami it's kinda expansive, I don't expect that in the "soon" category.
I mean even paid users get 30-50 calls a week.
That's like 120-200 calls a month or 6-10 calls per dollar kinda. I would rather say that they will work on some kind of o1-turbo that is faster, smaller cheaper, and will be released in 6 months and available for free users. that is their usual pattern.
I believe that is the 01-mini (for the preview) not the main 01 right ? The mini is less able by design and what they are releasing publicly first.
I used 01-preview. Not mini.
Here is the chat for the 3d game
chatgpt.com/share/66e42ded-c59c-800f-aee2-bd8635b95567
And here chat for the first prompt
chatgpt.com/share/66e34a3d-1300-800f-b3cb-b81388412164
It’s not one-shot, it is zero-shot
You are right... Interesting. Usually when its about humans I call that doing form first try or one shoting it... But for AI models, doing something without examples is zero shot. I knew that but somehow do not apply it :D
Good content but you need to focus on making more information efficient videos. You've made a 20 minute video when it should have been 5 minutes.
@@bobsalita3417 i know. Suffering from how long they come out, if its semi live demos. This one was faster as i was not showing first generations, showed ones i did before recoding.
You are using a physics engine to teach your wife how to park 😅 based
@@livenotbylies yeah, I am engineer and worked in game dev before. If you have a hammer everyting looks like a hammer )))
@@EduardsRuzga yeah, as a fellow hammer-haver, I totally get it. Some people who don't have hammers need help with parking 🔨
Why didn’t you try WebGPT🤖:(
Well, the last time I tried it it did not perform as well as Claude and WebSim
Also, it's limited to p5.js makes it feel less broadly useful.
I gave it the same prompts as in the video yesterday and it did okay.
For anyone looking at this comment
Here is ChatGPT custom GPT in question
chatgpt.com/g/g-W1AkowZY0-no-code-copilot-build-apps-games-from-words
And things it did with prompts from video
I used the one after another so it overwrote 2d with 3d variant.
3D variant, rotate with the mouse
plugin.wegpt.ai/dynamic/6a81fabb_GTA2StyleParkingSimulator/index.html
I don't like the 1 shot test or "big" queres to test a model. When i code using sonnet i first provide all the context it needs and than start breaking down the problem in multiple steps for it to execute and build the app step by step. I feel like this is how you are supposed to use the models and how they work best und you yourself understand the code best. While the oneshot is impressing i feel like sonnet is more stable in providing quality code when working with real bigger projects. Maybe that's just the lack of testing from my side but with the current message cap on o1 preview it's not useable anyway. It get's intresting when we can actually use it a lot more or even o1 normal releases with a high message cap
It's not quite at "human level" because it can't write a complex, fully working game in ONE SHOT after 30 seconds of thinking? LOL! There isn't a single human on the planet that could even come close to doing that. 😆
@@vickmackey24 well, speed yes, breath of its kniwladge too. It is super human. I cant yet test it on iterative tasks but i suspect that it cant write a game in 48 hours on its own.
Even if well integrated with debugging etc. And human can. Its not about speed, its about reaching goals on its own.
Cool video but you English will sound much better if you stop saying "W" in "write".
@@Dron008 yeah. Lot of work to do in that department. Thanks for comment.
бля.
Your prompt is s**t. "Create me a mini Grand Theft Auto game using HTML and Javascript to run in browser. Control the car with arrows" in Sonnet 3.5 and does the same thing, 1 shot, run-able as artefact.
Yeah, I don't argue that it's a garbage prompt But I was comparing all 3 variants on one prompt. So comparison still stands.
Claude did not work for me atm, some capacity issues for free version.
Websim link with your prompt that uses same model:
websim.ai/c/HrvbQ9DZk9buRwlxf
and with your prompt, here is o1 try
codepen.io/wonderwhy-er/pen/OJeegPR
I would argue so far your prompt is worse then mine, will test Calude more later.
Here is Claude variant with your prompt :D
claude.site/artifacts/a48d985b-0203-4611-9d8b-937ebc480310
Did you get a similar result?