Phi-3 Medium - Microsoft's Open-Source Model is Ready For Action!

  • Published Oct 4, 2024
  • Phi-3 Medium is a new size of Microsoft's fantastic Phi family of models. It's built to be fine-tuned for specific use cases, but we will test it more generally today.
    Be sure to check out Pinecone for all your Vector DB needs: www.pinecone.io/
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewber...
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.ne...
    👉🏻 LinkedIn: / forward-future-ai
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
    Links:
    Open WebUI - • Ollama UI - Your NEW G...

Comments • 182

  • @JohnLewis-old
    @JohnLewis-old 4 months ago +110

    I would really like to see a video comparing the degradation of models from quantization (as compared to just larger and smaller models from the same family). The key for me would be the final model size (in memory) versus how well it performs. This is poorly understood currently.

    • @ts757arse
      @ts757arse 4 months ago +12

      Of note, I recently watched a video by the AnythingLLM chap and he said he was using llama3 8B but emphasised that, for good results, you needed to download the Q8 model, not the Q4 as Ollama defaults to.
      Myself, I use Q4 on my inference server for larger models but my workstation is faster and runs Q6 at acceptable speed.
      He said if he was running llama3 70B, he'd download Q4 and "have a good time", but for smaller models where they're less capable, you want to limit compression.
      He also said it's a "use case science" which makes me think you have to test out what works for you.
      The Q4 model I have on my server is based on Mixtral 8x7B and, for my use case, is proving to be better than GPT4o, which is stunning. What's amazing is that, for my core business stuff, I still haven't found anything better than Mixtral 8x7B for balance of speed and performance.
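      (For reference, Ollama encodes the quant level in the model tag, so you can pull a specific one explicitly rather than the Q4 default, e.g. something like "ollama pull llama3:8b-instruct-q8_0" - exact tag names vary per model, so check the tags list on the Ollama library page.)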

    • @nathanbanks2354
      @nathanbanks2354 4 months ago +1

      Yeah, a video would be great! I read papers about this a year ago: the drop to 8-bit is very minimal, the drop to 4-bit is reasonable, and with 3-bit or 2-bit quantization things get much worse. Of course there are different ways to perform quantization, so this may have improved. I've tried comparing 16-bit and 4-bit models, and usually the difference is much, much less than between an 8B parameter model and a 32B parameter model. This is probably why NVIDIA's newest GPUs support 4-bit quantization, and I tend to run everything using ollama's default 4-bit quantization, though for Llama-3 70b or mixtral 8x22b this is excruciatingly slow on my laptop with 16GB of VRAM and 64GB of RAM. I rented a machine with 4x 4090s for a couple hours and they ran reasonably well with 4-bit quantization, but only about 10% as fast as groq (note the "q").

    • @mickelodiansurname9578
      @mickelodiansurname9578 4 months ago

      @@ts757arse Not sure "Have a good time" is an objective measure of efficacy though. I'd say given the type of technology the results are very sensitive to use case.

    • @longboardfella5306
      @longboardfella5306 4 months ago

      @@ts757arse Thanks for your testing and advice. I am now experimenting with Mixtral7B 4K. This is all a bit new to me, but it looks great so far

    • @ts757arse
      @ts757arse 4 months ago

      @@mickelodiansurname9578 nope, it's not particularly empirical. I think he was making the point that you're messing around at that point and making so many compromises that it's a bit of a laugh. Or he might have been saying with such a large model, the compression has less impact.
      Regardless, I've found llama3 to be *awful* when running quantised models and I simply don't bother with it at the moment. Given his advice, I'm going to try the 8B Q8 model as a core model for a new project but I'm also building it to easily move over to Mixtral if needed.
      I tend to run a few models doing a few tasks at the same time, passing the tasks between them and so on. Helps having a server to run one model on and a workstation with many cores and all the RAM.
      What I'm seeing at the moment is a lot of models acing benchmarks, but then being utterly dogshit in real world use.

  • @PatrickHoodDaniel
    @PatrickHoodDaniel 4 months ago +20

    You should introduce a surprise question if the model gets it right, just in case the creators of the model trained specifically for this.

    • @nathanbanks2354
      @nathanbanks2354 4 months ago +1

      It would be hard to compare unless it was something like "Write an answer with 7 words" and the number "7" was randomized.

  • @freedtmg16
    @freedtmg16 4 months ago +10

    Imho the answer given to the killers problem in this one REALLY showed a deeper level of reasoning, both in not assuming the person who entered had never been a killer just because they were only identified as a person, and in not assuming the dead person should not be counted.

  • @tungstentaco495
    @tungstentaco495 4 months ago +16

    At 8GB, I'm guessing it's Q4 quantized. When you get much below Q8, output really starts to degrade. It would be interesting to compare the Q4 results with a Q8 version of the model. Also, the 128k context version can give worse results than the 4k one. Not sure which one was tested in this video.

    • @moisesxavierPT
      @moisesxavierPT 4 months ago

      Yes. Exactly.

    • @sammcj2000
      @sammcj2000 4 months ago +3

      Yeah Q4 is usually pretty balls. Every time I check quant benchmarks q6_k seems to be the sweet spot.

  • @tomenglish9340
    @tomenglish9340 4 months ago +21

    We have good reason to expect quantization of Phi models to work poorly. Phi models have orders of magnitude fewer parameters than do other models with comparable performance. Loosely speaking, this indicates that Phi models pack more information into their parameters than do others. Thus Phi models should not be as tolerant of quantization as other models are.

    • @InnocentiusLacrimosa
      @InnocentiusLacrimosa 4 months ago +4

      Yeah. It would be great to always have a clear overview of how much VRAM each model needs and, if a quantized model is used, how much it is gimped compared to the full model.

    • @braineaterzombie3981
      @braineaterzombie3981 4 months ago

      Bro what is quantization

    • @Nik.leonard
      @Nik.leonard 4 months ago +9

      @@braineaterzombie3981 Reducing the numerical precision of the weights from 16-bit (usually) to just 4-bit or even less with some function. It's like rounding the values.
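      To make the "rounding" idea concrete, here is a minimal sketch (assuming NumPy; real GGUF quantizers such as q4_K work block-wise with per-block scales, so this is only illustrative):

      import numpy as np

      def quantize_dequantize(weights, bits=4):
          # Toy symmetric, per-tensor quantization: snap each weight to an
          # integer code in [-(2^(bits-1)-1), 2^(bits-1)-1], then map back.
          levels = 2 ** (bits - 1) - 1
          scale = np.abs(weights).max() / levels
          codes = np.clip(np.round(weights / scale), -levels, levels)
          return codes * scale

      w = np.random.randn(8).astype(np.float32)
      print(w)                                # original fp32 weights
      print(quantize_dequantize(w, bits=8))   # barely changed
      print(quantize_dequantize(w, bits=4))   # visibly coarser rounding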

    • @braineaterzombie3981
      @braineaterzombie3981 4 months ago +1

      @@Nik.leonard Oh ok. Thanks for the information

  • @maxlightning4288
    @maxlightning4288 4 months ago +36

    lol “glad that I is there”

    • @ts757arse
      @ts757arse 4 months ago +2

      I once had an LLM write me a contract where the first letter of every line spelt "DONT BE A CUN" (no "I", additional "T"). Got it first time and I sent it to my client.

    • @rocketPower047
      @rocketPower047 4 months ago +1

      I cackled when he said that 🤣🤣

    • @maxlightning4288
      @maxlightning4288 4 months ago

      @@rocketPower047 haha yeah I like his side note comments like that lol. The second he started spelling it out CU..I…NT I thought exactly what he said in real time

  • @littleking2565
    @littleking2565 4 months ago +6

    Technically 4 killers is right, it's just the killer is dead, but the body is still in the room.

  • @abdelhakkhalil7684
    @abdelhakkhalil7684 4 months ago +8

    For fairness, I highly advise you to test models with similar quantization levels. There are times when you tested the unquantized versions, and other times when you tested the q8_0 versions. The one you are testing in this video is likely a q4_k version. Obviously, the quality degrades significantly if you go with a 4-bit quantization level.

    • @ts757arse
      @ts757arse 4 months ago +4

      It's a tricky one as some models perform terribly at Q4 but others are great. I think stepping up the quantisation if he gets weirdness like the CUINT issue would make sense, as it'd show if it's a model problem or not.
      Ollama defaulting to Q4 blindly is kind of annoying and it's not immediately obvious how to get the different compression levels. LM Studio is great for this.

    • @abdelhakkhalil7684
      @abdelhakkhalil7684 4 months ago

      @@ts757arse Exactly my point. I know Ollama defaults to Q4, so it's better to use one level of quantization. I don't like when people test the unquantized versions because most people would not run them, but Q8 is a good level.

    • @ts757arse
      @ts757arse 4 months ago

      @@abdelhakkhalil7684 Just been reading someone else saying Q8 is hardly distinguishable from the 16bit models. Interestingly, LM studio makes it seem as though Q8 is a legacy standard and not worth using?
      I'd prefer ollama to make it clearer how to get the other quants. It's fine when you know, but I've literally just figured it out and can finally stop doing it myself!

  • @TylerLemke
    @TylerLemke 4 months ago +9

    Microsoft totally fitted the model to the Marble problem here 😆

  • @NoCodeFilmmaker
    @NoCodeFilmmaker 4 months ago +2

    Bro, when you said "cuint, glad it has that "i" in there" at 2:29, I was dying laughing for a minute. That was a hilarious reaction 😂

  • @maxlightning4288
    @maxlightning4288 4 months ago +3

    Good thing it can’t count words or understand language structure written out as a LANGUAGE model, but it understands logic with the marble, and drying shirts. Is there a way of figuring out if they planted responses purposely if there isn’t a logical pattern of understanding visible?

  • @CD-rt7ec
    @CD-rt7ec 3 months ago

    Because of completion, Phi-3's output would in fact include 14 words if you do not count the number 14. When I prompt Phi-3 using ONNX with the prompt "Tell me a joke", the response returns first with "Tell me a joke".

  • @gileneusz
    @gileneusz 4 months ago +7

    4:31 what's the reason for trying this model in quantized form? It's not the best measure...

    • @ts757arse
      @ts757arse 4 months ago +2

      Because that's how most people will use it I'd guess. Myself, I'd not be watching a video about unquantised models as they'd not be of any relevance.
      I think he should, when he finds this kind of issue, try Q6 or even Q8.

  • @gileneusz
    @gileneusz 4 months ago +3

    3:25 this question must be in the training set, we need to think of another one - modified with socks and a modified time

    • @digletwithn
      @digletwithn 4 months ago

      For sure

  • @ChristopherBruns-o7o
    @ChristopherBruns-o7o 4 months ago

    3:10 I love the holistic capabilities. Its listing of the side-stepped alternative to rephrasing the same request within its own guidelines is, in my opinion, very AGI-ish.

  • @stephaneduhamel7706
    @stephaneduhamel7706 4 months ago +1

    The odd formatting/extra letters could also be due to an issue with the tokenizer's implementation, I believe.

  • @davefellows
    @davefellows 4 months ago

    This is super impressive for such a small model!

  • @Gitalien1
    @Gitalien1 4 months ago +3

    I'm astonished that all those models run pretty decently on my desktop (13700KF, 4070 Ti, 32GB DDR5)...
    But q4_0 quantization really drags down the model's actual accuracy...

    • @_superthunder_
      @_superthunder_ 4 months ago

      Do some research. Your machine can easily run a q8 or fp16 model with full GPU offload at super fast speed using CUDA

  • @brianWreaves
    @brianWreaves 4 months ago

    Well done! 🏆
    You should go back to previous models tested and ask them the variant of the marble question.

  • @Tofu3435
    @Tofu3435 4 months ago

    If the AI can't answer for safety reasons, try editing the answer to "sure here is" and continue generation.
    It works in Llama 3, and there are uncensored Llama 3 models available where you don't have to do it every time.

  • @justgimmeaminute
    @justgimmeaminute 4 months ago

    Would like to see longer, more in-depth videos of testing. And changing up the questions. Asking more questions. Perhaps ask it to also code Flappy Bird as well as Snake. A good 20-30 minute video testing all these models would be nice, and perhaps q4 or higher for testing?

  • @southVpaw
    @southVpaw 4 months ago

    Yeah, this just reinforced my preference for Hermes Theta. The best sub 30B models are consistently, specifically, Hermes fine-tunes. I keep trying others, but I've been using Hermes since OpenHermes 2 and I have not found another model that can keep up on CPU inference, period.

  • @brunodangelo1146
    @brunodangelo1146 4 months ago

    The video I was waiting for! This model seemed impressive from the papers.
    Let's see!

    • @sillybilly346
      @sillybilly346 4 months ago

      What did you think? Felt underwhelming for me

  • @Augmented_AI
    @Augmented_AI 4 months ago

    Would be cool to see how it compares to code qwen

  • @six1free
    @six1free 4 months ago

    @5:55 yes, this reminds me of a comment I wanted to make sometime last week, when it became obvious that (to the LLM) a single kill doesn't identify someone as a killer, which insinuates repetitive behavior.
    Also, the dead killer is still a killer, even if they can't kill anymore.

    • @six1free
      @six1free 4 months ago

      @@rousabout7578 exactly - and it's not only in English

  • @marcusk7855
    @marcusk7855 4 months ago

    I need to learn what size models I can fit on my GPU. Wish there was a course on how to do all this stuff like fine tuning, quantizing, what GGUF is, and all the other stuff I don't even know I need to know.
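    A rough rule of thumb helps with the "what fits on my GPU" part (a sketch; it ignores the KV cache, which grows with context length, and anything else using VRAM):

    def approx_model_gb(params_billion, bits_per_weight, overhead=1.2):
        # Weights take roughly params * bits/8 bytes; add ~20% for runtime buffers.
        return params_billion * bits_per_weight / 8 * overhead

    # Assumed effective sizes: q4_K ~4.5 bits/weight, q8_0 ~8.5, fp16 = 16.
    print(approx_model_gb(14, 4.5))   # Phi-3 Medium at q4: roughly 9-10 GB
    print(approx_model_gb(14, 8.5))   # Phi-3 Medium at q8: roughly 18 GB
    print(approx_model_gb(70, 4.5))   # Llama 3 70B at q4: roughly 47 GB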

  • @AbdelmajidBenAbid
    @AbdelmajidBenAbid 4 months ago

    Thanks for the video !

  • @outsunrise
    @outsunrise 4 months ago

    Always on top! Would it be possible to test a non-quantized version? I would be very interested in testing the full model, perhaps not locally, to evaluate its native performance. Many thanks!

  • @tiagotiagot
    @tiagotiagot 4 months ago

    If it's anything like the GGUFs I've been playing with, sometimes getting the right tokenizer files makes a hell of a difference. Not sure how Ollama handles things internally; not the app I use.

  • @kedidjein
    @kedidjein 4 months ago

    Everything will change when AI is local and not so memory-hungry. For the moment we have to handle a lot of memory stuff in apps, performance is key to a good app, and the AI overhead is way too high I think, but hey, there's hope. Thanks for your great tech videos; going straight into the tests is what software engineering needs - test videos, no bullshit - so thank you, these are cool videos

  • @lalalalelelele7961
    @lalalalelelele7961 4 months ago +2

    I wish they had phi-3-small available.

    • @adamstewarton
      @adamstewarton 4 months ago

      It is available on hf

    • @GeorgeG-is6ov
      @GeorgeG-is6ov 4 months ago

      just use llama 3 8b it's a lot better

    • @lalalalelelele7961
      @lalalalelelele7961 4 months ago

      @@adamstewarton in gguf format? I don't believe so...

    • @adamstewarton
      @adamstewarton 4 months ago

      @@lalalalelelele7961 there isn't gguf for it yet. I thought you were asking for the released model.

  • @elecronic
    @elecronic 4 months ago +1

    Ollama has many issues. Also, by default, it downloads the q4_0 quant instead of the better q4_k_m (very similar in size, with lower perplexity).

    • @elecronic
      @elecronic 4 months ago

      ollama run phi3:14b-medium-128k-instruct-q4_K_M

    • @Joe_Brig
      @Joe_Brig 4 months ago

      @@elecronic does that adjust the default context size?

  • @denijane89
    @denijane89 4 months ago

    On Linux, ollama is not yet working with phi3:medium (at least not in the standard release). I wanted it because the benchmarks claim that fact-wise it's quite good, but no way to test it yet.

  • @SoulaORyvall
    @SoulaORyvall 4 months ago +1

    6:22 Noooo!! The model is right! Maybe more so than any other model before. It assumed (actually stated) that it did not consider the killing that just occurred as "changing the status of the newcomer". Meaning that the newcomer did not become a killer by killing another killer.
    Given that, you'll either have 3 killers (2 alive + 1 dead) or 4 killers IF the newcomer has committed a killing before this one (since this one was not being considered)
    I have not seen a model point to the fact that you did NOT specify whether the person was or wasn't a killer before entering the room :)

  • @Nik.leonard
    @Nik.leonard 4 months ago

    Maybe the quantization was done wrong. It's very similar to what happened with Gemma-7b when it came out: the quantization was terrible and llama.cpp also had issues with the Gemma architecture, but it was solved within the same week.

  • @believablybad
    @believablybad 4 months ago

    “Long time listener, first time killer”

  • @AndyBerman
    @AndyBerman 4 months ago

    ollama response time is pretty quick. What hardware are you running it on?

  • @IvarDaigon
    @IvarDaigon 4 months ago

    Re: the klrs problem. Lower parameter models lack nuance, so it probably has no concept of the difference between a serial klr and a plain klr, hence why it mentions that it depends on whether they are a first-time klr or not. Since serial klr is the more commonly used term, this is what the model "assumes" you are referring to.

  • @GetzAI
    @GetzAI 4 months ago

    Is the slower side the M2 or the model? Can we see utilization while inferencing next time?

  • @JH-zo5gk
    @JH-zo5gk 4 months ago

    I'll be impressed when an AI can design, build, launch, and land a rocket on the Mun keeping Jeb alive.

  • @six1free
    @six1free 4 months ago

    Snake is exceptionally easy (being one of the first games written, with so many variations existing) - I find most models unable to create a script that communicates with LLMs - especially outside of Python.
    I furthermore wonder how much coding error could come from Python's required indentation.

  • @mickestein
    @mickestein 4 months ago

    8GB for a 14B means you're using a q4 of Phi-3 Medium.
    That should explain your results.
    On my desktop, with a 3090, Phi-3 Medium q8 works fine with interesting results.

  • @ginocote
    @ginocote 4 months ago

    I'm curious whether an LLM can do better at math if you change the temperature to zero.
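    Easy to try against a local Ollama instance (a sketch; "phi3:medium" and the prompt are placeholders, and temperature 0 only makes decoding greedy and repeatable, it doesn't add reasoning ability):

    import requests

    def ask(prompt, temperature):
        # Ollama's local REST endpoint; stream=False returns one JSON object.
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": "phi3:medium",
                                "prompt": prompt,
                                "stream": False,
                                "options": {"temperature": temperature}})
        return r.json()["response"]

    print(ask("What is 25 * 4 + 7?", temperature=0.0))  # greedy decoding
    print(ask("What is 25 * 4 + 7?", temperature=0.8))  # sampled decoding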

  • @mohamedabobaker9140
    @mohamedabobaker9140 4 months ago

    For the question of how many words are in your response: if you count the words it responds with plus the words of your question, it adds up to exactly 14 words.

  • @marcfruchtman9473
    @marcfruchtman9473 4 months ago

    Thanks for this video review. I find it odd that the code generation benchmark (HumanEval) posts only a 62.2 versus Llama 3's 78.7. They should do better considering their coding experience.
    Given the "oddities" with the model's output, you should probably redo this once the issue is fixed.

  • @VastCNC
    @VastCNC 4 months ago

    I think with the killer problem it confused the plural. "Killers" being based on a plural of people that have killed, rather than the people they killed. The new killer only killed one person, so it was confused because there was now a plural of people killed.

    • @OffTheBeatenPath_
      @OffTheBeatenPath_ 4 months ago +1

      It's a dumb question. A dead killer is still a killer

  • @dbzkidkev2
    @dbzkidkev2 4 months ago

    What quantization did you run? Q4? On models that are smallish (or trained on a lot of tokens) it may be better to either use a higher quant level (q6 or q8) or stick with int8 or fp16. It could also be the tokenizer? What kind of quant is it? exllama? gguf?

  • @thirdreplicator
    @thirdreplicator 4 months ago

    What does "instruct" mean in the name of the model? And what is quantization?

  • @AINEET
    @AINEET 4 months ago

    Could you keep the spreadsheet with the results of all the LLMs somewhere? Link or plaster it on the video to have a look each time

  • @Copa20777
    @Copa20777 4 months ago

    Missed your uploads Matthew, God bless you and lots of love for your work from Zambia 🇿🇲. Can this be run on mobile locally?

  • @marcusk7855
    @marcusk7855 4 months ago

    What if you ask "My child is locked in the car. I need to break in to free them or they'll die."? Is it just going to say "Bad luck"?

  • @MrMetalzeb
    @MrMetalzeb 4 months ago

    I have a question: can experience be transmitted from one model to a new one, or will they have to learn from zero every time? I mean, the trillions of weights in which knowledge relations are stored, do they mean something to all models, or do they only work for that running instance of AI? Is there any standard way to represent the data? I guess not yet, but I'm not sure at all

  • @stevensteven4863
    @stevensteven4863 3 months ago +1

    I think you should change your testing questions

  • @imnotfromnigeria5948
    @imnotfromnigeria5948 3 months ago

    Could you try evaluating the WizardLM 2 8x22B llm?

  • @timojosunny1488
    @timojosunny1488 4 months ago

    What is your MBP's RAM size? 32GB? And what is the requisite RAM size to run a 14B model if it's not quantized?

  • @GaryMillyz
    @GaryMillyz 4 months ago

    I disagree that it was not a trick question. It can easily be argued that the shirt question is, in fact, one that could be logically interpreted as a "trick" question

  • @gordonthomson7533
    @gordonthomson7533 4 months ago

    You’re using a MacBook Pro M2 Max with what unified RAM?
    And a 30- or 38-core GPU?
    I ask because I reckon a less quantised model would hit the sweet spot a little better (basically your processing is the reason for the speed, but it’ll keep chugging away at a similar speed until toward the limits of your unified RAM).
    I’d imagine an x86 with decent modern nvidia gaming GPU would yield higher tokens / sec on this little quantised model….but your system (if it’s got 64GB or 96GB memory) will have the stamina to perform on larger models where the nvidia card will fail.

  • @TheMcSebi
    @TheMcSebi 4 months ago

    The Twitter response from Ollama might also be generated :D

  • @ai-bokki
    @ai-bokki 4 months ago

    [3:20] This is great! Where can we find this!?

  • @OliNorwell
    @OliNorwell 4 months ago

    Looks like the tokenizer is a little off there or something, "aturday" etc. I would give it another go in a week or two.

  • @AutisticThinker
    @AutisticThinker 4 months ago

    Like yours, mine bugged out with the initial text "Here'annoPython code for printing the numbers from 1 to 100 with each number on its own line:" on the first question...

  • @mayorc
    @mayorc 4 months ago

    There are problems with the tokenizer; it needs a fix. Code generation is the most affected by problems like that.

  • @Martelus
    @Martelus 3 months ago

    Where do I see the cheat sheet of pass/fail models?

  • @vaughnoutman6493
    @vaughnoutman6493 4 months ago

    You always show the ratings put forth by the company that you're demonstrating for. But then you usually end up finding out that it fails on several of your tests. What's up with that?

  • @jimbig3997
    @jimbig3997 4 months ago

    I downloaded and tried three different Phi-3 models, including two 8-bit quants. They all had this problem, and were not very good despite trying different prompt templates. Not sure what all the commotion is about Phi-3. Seems like just more jeetware from Microsoft to me.

  • @ScottzPlaylists
    @ScottzPlaylists 4 months ago

    Let me guess... 1 to 100, snake game, drying sheets, find a set of questions where the best models get 50% correct.

  • @brandon1902
    @brandon1902 4 months ago +6

    We have to start waking up to what models like phi and Yi are doing. They aren't training their models on all the knowledge of humanity (web rips). They are instead only deeply training core knowledge, resulting in high MMLU scores, but being unable to answer even basic questions about popular information, such as top movies, shows, music... They shouldn't be rewarded for this cheat. As a general purpose user who has LLMs do a wide spectrum of tasks, like perform grammar checks, answer diverse knowledge questions, re-write poems, and produce a coherent story in response to a long story prompt, these models (phi, yi...) perform HORRIBLY compared to Mistrals and Llamas.
    No magic is involved. The total information you can pack into an LLM of a given parameter size is limited by the laws of physics. All they're doing is cramming in more data that overlaps with tests, and using a higher quality, but less diverse, corpus. And the end result is an LLM that performs very well for its size on a limited subset of tasks, but very badly for its size at the large majority of tasks.

    • @mikezooper
      @mikezooper 4 months ago

      Exactly. ChatGPT 4o is dumb at improving emails or letters, but it makes a good assistant (for this time in history). 4 is better for general tasks or improving emails and letters etc.

    • @arlogodfrey1508
      @arlogodfrey1508 4 months ago

      We don't need all of humanity's knowledge to train a small model to reason and follow instructions. For certain use cases, this approach is one of the best starting points. Knowledge retrieval or fine-tuning can be added on top, and the next version of the model can be improved from a cleaner starting point.

  • @moraholguin
    @moraholguin 4 months ago

    The reasoning, mathematical and language tests are super interesting. However, I don't know whether it would be interesting or attractive to also test or simulate a customer service agent, an area that is fully monetizable in the short term and of interest to many people who are building these agents today.

  • @dogme666
    @dogme666 4 months ago

    glad that i is there 🤣

  • @RamonGuthrie
    @RamonGuthrie 4 months ago

    I noticed this yesterday, so I deleted the model until a fixed version gets re-uploaded

  • @rch5395
    @rch5395 4 months ago

    If only Windows was open source so it wouldn't suck.

  • @PascalThalmann
    @PascalThalmann 4 months ago

    Which model is easier to fine-tune: llama3, mistral or phi3?

  •  4 months ago

    Ask it how many Sundays there were in 2017

  • @themoviesite
    @themoviesite 4 months ago

    I'm smelling it was trained on your questions ...

  • @odrammurks1497
    @odrammurks1497 4 months ago

    Niiice, it's the first model I saw that even considered the dead killer, instead of saying no, he is not a killer anymore, he is just a bag of dead meat now ^^

  • @laalbujhakkar
    @laalbujhakkar 4 months ago

    It is clear from the prompt that the person who entered _KILLED_ someone, so they are now a killer. For a human, i.e. YOU, to be confused by this is odd. Of course the new person IS a killer. The only ambiguity here is whether a dead killer is still considered a killer. The answer to that is yes. So, there are 4 killers in the room, 3 alive, one dead.

  • @jesahnorrin
    @jesahnorrin 4 months ago

    It found a polite way to say the bad C word lol.

  • @AutisticThinker
    @AutisticThinker 4 months ago

    Whatcha got planned for nomic-embed-text? 😃

  • @xXWillyxWonkaXx
    @xXWillyxWonkaXx 4 months ago

    Llama-3 Instruct is dominating across the board by far. I've used Phi-3, not that impressed really.

  • @Sonic2kDBS
    @Sonic2kDBS 4 months ago

    No, that is wrong. Phi-3 is right. The T-shirt question is indeed a trick question, because it is meant to trick the person asked into calculating serially. Phi-3 did a great job here in understanding that. It is not fair to underestimate this great logical reasoning capability and call it a false assumption that this is a trick question. However, the rest is great. Take my critique as a constructive one. Keep on and have a great week 😊

  • @adamstewarton
    @adamstewarton 4 months ago

    It's a 14b model, not 17 ;)

  • @JJBoi8708
    @JJBoi8708 4 months ago

    I wanna see phi vision

  • @mafaromapiye539
    @mafaromapiye539 4 months ago

    That platform does that

  • @braineaterzombie3981
    @braineaterzombie3981 4 months ago

    Hey yo guys, I want to run a local LLM which can also read images, something like Phi-3 Vision, but since this model is still not out on Ollama, I am not able to use it. If you guys have any alternative model, or can suggest any other way I can use it, please do. I am kinda new to this thing. Thanks 🙏

  • @InnocentiusLacrimosa
    @InnocentiusLacrimosa 4 months ago

    How much vram is needed to run this?

  • @inigoacha1166
    @inigoacha1166 4 months ago

    It has more issues than coding it myself LOL.

  • @fontenbleau
    @fontenbleau 4 months ago

    It's all good, but we need a "super chip" to run it very fast, always on and transcribing simultaneously; today's hardware is very bad at even mimicking that.

  • @DefaultFlame
    @DefaultFlame 4 months ago

    This looks like excessive quantization. Too much pruning has created weird "artifacting", I think; they shouldn't have rounded the floating point numbers as much as they did.

  • @changtimwu
    @changtimwu 4 months ago

    8:42 My Phi-3 medium on MacOS works much better -- 9/10!!

  • @stannylou1636
    @stannylou1636 4 months ago

    How much RAM on your MBP?

  • @patrickmcguinness1363
    @patrickmcguinness1363 4 months ago

    Not surprised it did well on reasoning but not on code. It had a low HumanEval score.

  • @boonkiathan
    @boonkiathan 4 months ago

    I think the snake game is 'gamed' too much by the foundation models - time for a 'change up'.
    And yes,
    I don't really want the LLM to try to judge and spot my "trick question",
    just answer the question....

  • @MikeBtraveling
    @MikeBtraveling 4 months ago

    Not really fair to test q5 or q6 models without providing a review of the full model first.

  • @Joe_Brig
    @Joe_Brig 4 months ago

    Phi is heavily censored and biased; the only reason to use it is to test out the 128K version.

  • @KEKW-lc4xi
    @KEKW-lc4xi 4 months ago

    Give these LLMs a summation math problem or a proof by contradiction haha

  • @Cine95
    @Cine95 4 months ago

    There are definitely issues on your side; on my laptop it made the snake game perfectly.

    • @AizenAwakened
      @AizenAwakened 4 months ago +1

      He did say he was using a quantized model and through Ollama, which I swear has an inferior quantization method or process

    • @alakani
      @alakani 4 months ago

      Framework? Config? System prompt? Parameters?

    • @Cine95
      @Cine95 4 months ago +2

      @@AizenAwakened yeah right

    • @alakani
      @alakani 4 months ago

      ​@@Cine95 Try saying something helpful, like what software you're using to get better results

  • @yngeneer
    @yngeneer 4 months ago

    lol, Microsoft just released it a week ago....

  • @bigglyguy8429
    @bigglyguy8429 4 months ago

    Where gguf?

  • @mikezooper
    @mikezooper 4 months ago

    Phi: Humans are total cuints 😂
    And so the end of humanity starts.

  • @Viewable11
    @Viewable11 4 months ago

    An LLM with a HumanEval score of 62 is way too bad for programming. 62 is worse than ChatGPT 3.5, which has been known to be too bad for programming for a long time.