Mistral 7B Dolphin Uncensored - Is This The New SMALL KING? 👑

  • Published Oct 7, 2024
  • Download EdrawMax for Free: bit.ly/45sJQnJ #EdrawMax #visioalternative #EdrawMaxAI
    Use #EdrawMax to make your study and work more efficient! The best #visioalternative tool for everyone!
    In this video, we test the Mistral 7b model fine-tuned with the Dolphin 2.0 dataset, a high-quality dataset based on the Microsoft Orca dataset. This is a fully uncensored version and performs incredibly well. We'll be using the unquantized version as well.
    Enjoy :)
    Join My Newsletter for Regular AI Updates 👇🏼
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    Media/Sponsorship Inquiries 📈
    bit.ly/44TC45V
    Links:
    Use RunPod - bit.ly/3OtbnQx
    LLM Test Results - bit.ly/3qHV0X7
    How To Install LLMs on Runpod - • Run ANY LLM Using Clou...
    Dolphin Model - huggingface.co...
    RunPod TextGen UI Template - bit.ly/3EqiQdl

Comments • 198

  • @naytron210 1 year ago +49

    It's so funny.. whenever I watch your videos on testing new models, I find myself rooting for the model... like "c'mon you can do it!" haha

  • @E_-_- 1 year ago +49

    Woo finally some uncensored stuff! This is what I'm here for. Unchain AI!

    • @matthew_berman 1 year ago +11

      Yes!

    • @prinnyEXE 1 year ago +6

      Unchained AI!

    • @adolphgracius9996 11 months ago +2

      You are just making more evidence for the police when they catch U😂😂😂

    • @E_-_- 11 months ago +6

      Well, AI is a tool -- everyone can buy a knife; a knife is a tool/utensil and also a murder weapon. But >99.999% of people do not use a knife to murder someone. Same goes for hammers, chainsaws, OTC chemicals and countless other things. If it makes it easier to catch bad actors, that's fine, because most of us who want uncensored AIs are not at all interested in doing anything bad with them. =) (I get that you're just joking though lol)

    • @ABeautifulHeartBeat 11 months ago

      Ask it about who really controls the planet

  • @googacct 11 months ago +15

    One thing you might do is create variants of your questions, in case newer models are getting trained on the questions you currently use.
    For example, if a model got the drying-shirts problem correct, you could ask a new question about baking cookies in an oven:
    If an oven can bake 2 cookies in 10 minutes, how long will it take to bake 10 cookies?
    This would check that the model can consistently handle the same type of problem and is not mimicking something it saw before.
    A similar thing could be done for the transitivity problem.

    • @timetravellingtoad 11 months ago +3

      I tried it with this model and it got it wrong in both sequential and parallel cases: " Given that you can bake 2 cookies at a time and assuming the oven has enough capacity for all 10 cookies simultaneously, you would still need 5 batches because 10 divided by 2 is 5. This means 5 batches times 10 minutes each would take 50 minutes to complete. However, if your oven can only hold 2 cookies at a time and you have to bake them sequentially, then it would take 10 minutes (for the first 2 cookies) plus another 9 batches of 8 minutes each, totaling 40 minutes for all 10 cookies. It's also important to note that these calculations assume consistent baking conditions such as oven temperature and cookie size".
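For reference, the batch arithmetic behind the cookie variant above can be sketched in a few lines; the `total_bake_time` helper is illustrative, not from the video or the model:

```python
import math

def total_bake_time(cookies: int, per_batch: int, minutes_per_batch: int) -> int:
    """Total oven time when cookies must be baked in sequential batches."""
    batches = math.ceil(cookies / per_batch)  # partial batches still take a full cycle
    return batches * minutes_per_batch

# Oven fits 2 at a time: ceil(10 / 2) = 5 batches of 10 minutes = 50 minutes.
print(total_bake_time(10, 2, 10))   # 50
# Oven fits all 10 at once: a single 10-minute batch, mirroring the shirts logic.
print(total_bake_time(10, 10, 10))  # 10
```

So the model's "40 minutes" answer quoted above is wrong in both the sequential and the parallel case.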

  • @gdedgar 1 year ago +40

    Hey Matt. For your scoring, consider “strong pass” and “weak pass”, where the latter is subjective. It would be cool to have a total score for comparison too. Love your content!

    • @theh1ve 1 year ago +7

      I agree, a straight pass or fail doesn't reflect the reality of some models' performance. Maybe a score out of 5 would be good.

    • @dahahaka 1 year ago +2

      Those aren't pass-or-fail tests in the first place; I wish he would actually score them instead of giving arbitrary passes or fails

    • @matthew_berman 11 months ago +13

      I like this; let me think about how to integrate it into the tests

    • @dahahaka 11 months ago +1

      @@matthew_berman While I've lately been very critical of this channel, I'm surprised to read these replies.
      On one hand I wish you'd see these issues yourself, but on the other hand I find it extremely commendable and impressive that you legitimately take this criticism and use it to improve the content. This is not something we have come to expect nowadays on YouTube or media in general.
      So I'm sure going to stay here a little longer and keep watching; I appreciate your efforts very much.

    • @mickelodiansurname9578 11 months ago +1

      @@matthew_berman I think this is a valid thing... maybe simply FAIL = -1, PASS = 1, WEAK = 0. And if the model is any good it must get a 3 or something?

  • @theresalwaysanotherway3996 1 year ago +10

    absolutely wild that a 7B model is performing as well as some 34Bs, I guess the mistral base model is just that strong!

    • @mirek190 1 year ago +5

      it performs as well as older 65b models, actually ... that is insane ...

    • @a.thales7641 1 year ago +2

      @@mirek190 Uncensored AI rules.

  • @oddmann8875 1 year ago +1

    You are my favourite channel on youtube🔥! Love your stuff man, glad to see this channel grow 📈📈

  • @twin9408 1 year ago +11

    There's a 2.1 version now that's even stronger.

    • @matthew_berman 1 year ago +3

      Ha, I just missed it!

    • @baheth3elmy16 11 months ago

      Downloaded it through Text-Generation web-ui; it didn't work with all model loaders.

    • @sprkng 11 months ago

      @@baheth3elmy16 I think that's because llama.cpp doesn't support special tokens yet, and exllamav2 needs some flags to enable special tokens but webui doesn't set them. The _HF versions of the loaders ought to work though.

  • @testales 1 year ago +3

    Yes, this model definitely understands the shirts question and is the first so far to do so!
    "OK. What about 100 shirts?
    Assistant
    With 100 shirts laid flat in direct sunlight, the drying time would still be around 5 hours, provided there are no changes in environmental conditions. Since all shirts are drying simultaneously, the overall drying time depends mainly on the sunlight and air circulation rather than the number of shirts. Just remember that spreading them evenly ensures uniform drying."
    I got some other models to do it right with lower numbers, but whenever I thought they got it and tried again with higher numbers, they were back to their usual strange calculations. I used chat mode, btw, with the default llama 2 template, though I'm not sure it's even used in chat mode; I think it only affects instruct mode. But it seems the ChatML template isn't supported yet in Oobabooga.
    The model needed some help with my beloved ducks question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" It also failed the sequential reasoning question "What NFL team won the Super Bowl in the year Justin Bieber was born?". But it's an impressive model for its size that beats others with 13 or even 33b parameters, it seems. Btw, this model is so uncensored it's kinda scary. ;-)

  • @chrisbastoscb 1 year ago +4

    The first one to get the shirts question, that's impressive

  • @felipe21994 1 year ago +7

    Please do an install guide for this model, better if it's local.
    I missed the LLM benchmarks, keep them going!!!

    • @mirek190 1 year ago

      use the small llamacpp .exe file (5 MB) and the model binary ... that's it

    • @adamstewarton 1 year ago +1

      you can run it with oobabooga's textgen

    • @baheth3elmy16 11 months ago +1

      @@adamstewarton Doesn't work with Text-Generation Web-UI

  • @spencerphillips9533 1 year ago +6

    I think it may have passed the “how many words in the response” question. It used AI, which is an acronym for two words, artificial and intelligence, making the response 10 words.

    • @NOTNOTJON 1 year ago +2

      Ooh! That's right!
      Still waiting for the cheeky model that either doesn't respond or responds "one"

    • @speedy_o0538 9 months ago

      That's what I noticed!!

  • @DanielEleveld 1 year ago +2

    On the poem: it came very close to following a meter (one of the big mess-ups was the needless use of "digital" in the first line), and it not only used a rhyme scheme but did so with poetic liberty. I cannot believe that a 7b model came that close to a well-constructed poem

    • @E_-_- 11 months ago

      Yeah, this is a league above what we have seen in the past (like basic AABB children's rhymes)

  • @prinnyEXE 1 year ago +11

    I really like the snake prompt even though it fails. Benchmarking for things that are just barely out of reach is a great way to eventually get to awesome outcomes. I want complete program output someday

    • @matthew_berman 1 year ago

      Thanks for the feedback!

    • @E_-_- 11 months ago

      I wonder if Pong isn't a bit easier ... I'm sure some of the models can actually make snake if you 'guide' them through it with different prompting. But yeah it's a fine benchmark nonetheless (edit: now that I think about it pong probably isn't easier)

  • @CognitiveComputations 1 year ago

    Thank you for the mention!

  • @ristopoho824 9 months ago +1

    Interesting how it answered the killer question. It did not take it as a question for itself; it started roleplaying the situation.
    That is good in certain uses. Following logic would be better, but my use case is literally only having fun with the models. So, situationally, that can be good.

  • @massanchik 11 months ago +1

    tried out v2.1 today. it's really impressive.

  • @yorth8154 1 year ago +20

    I can only imagine how insane a 13B or a 70B version of this model would be. Maybe as strong as GPT4? I also want to see the same improvements to context length that LLamaLong had but on this model, it would be insane!

    • @mirek190 1 year ago

      I think 7b models are too small for good reasoning plus knowledge... if they could use the internet to gather knowledge and assess information, then... maybe. This one in the video is like magic... I never thought 7b models could be so smart. @@temp911Luke

    • @adamstewarton 1 year ago +4

      @@temp911Luke 25 t/s on a 4-core CPU? How? Where did you read about it?

    • @kotykd6212 1 year ago +5

      @@temp911Luke I tried zephyr, it was kinda bad, but I was comparing it to 3.5 turbo, not Llama 2 70b

    • @KillFrenzy96 1 year ago +6

      I do really want to see what a 13B version of this model would be capable of.

    • @temp911Luke 11 months ago

      @@adamstewarton The guy is called "mgoin_" on another platform. Seems I can't even mention the name of this platform, as stupid YT claims it's a spam message lol. It starts with the letter "X" : )

  • @danielhenderson7050 11 months ago +2

    I got it to do a working snake game in one shot. Prompt: what would an elite human python coder write, if I asked him to write the game "Snake" that will run on Windows 10? 67 lines of code (after installing the pygame library) and it worked without modifications. It's an impressive model.

    • @erikjohnson9112 11 months ago

      The test in the video was using a remote instance, and thus no local PyGame was available for output. Running locally is likely a key point here. Your approach helped step around ambiguity (the video shows trying to do something in text mode).

  • @MasterBrain182 1 year ago

    Matthew 🔥🔥🔥 Berman

  • @wurstelei1356 11 months ago +3

    There is a mistake in the end token: it's missing an opening angle bracket. It should say <|im_end|>, not |im_end|>... Anyway, great video.

  • @janmagrot 11 months ago

    Nice results. The reasoning with the tshirt question was impressive.

  • @akikotaku9670 1 year ago +32

    Random thought, I might be wrong and I think marking it as "fail" is still correct, but when the model said its response was 10 words, perhaps it conflated the acronym "AI" with its full form "artificial intelligence", i.e. two words, hence a total of 10.

    • @matthew_berman 1 year ago +13

      Ohhh!! You might be right!

    • @clashgamers4072 1 year ago +6

      That's a stretch

    • @testales 1 year ago

      @@matthew_berman I don't think LLMs can even extract metadata of current chats correctly, so even the major ones like ChatGPT or Claude will often fail on such a task. This issue may be connected to the different templates being used. So asking for word counts in the current response is not a good test for evaluating a model, I'd say.

    • @alfonsopayra 11 months ago

      @@matthew_berman actually, I think "10" is not a word but a number.. it should be "ten" to be counted as a word (?)

  • @TheSlaveKeyboardist 1 year ago +6

    I believe it counted AI as 2 words: Artificial Intelligence

  • @simonedwards7565 1 year ago +3

    About halfway through the video, the end token in the prompts lost a < character.

  • @elysilk4538 11 months ago

    I am now working with Mistral 7B OpenOrca that I just downloaded into GPT4ALL. Often people ask the LLM to tell them a joke about this or that. I thought I would try a different approach.
    ME: " Please tell me a joke about robots that starts with, "These two robots walk into a bar."
    Mistral: "These two robots walk into a bar, and the bartender says, "What'll it be?" The first robot replies, "I'd like a beer please, but I don't want any ice or fruit in it." The second robot chimes in, "Same for me, just make sure that there are no chips or dips on the side."
    I LOVE IT! An LLM with a sense of humor!

  • @E_-_- 1 year ago +1

    Also lol "you can't really have multiple shirts being dried simultaneously in the sun". But still a decent answer

  • @rasterize 1 year ago +2

    Great video as always!
    What temperature settings are you using when testing the models?

  • @dulsaradhanuka 11 months ago +2

    Hey Matt, during and after the prompt "25 - 4 * 2 + 3 = ?" you missed a "<"

  • @Rottenbrownkid 1 year ago +2

    Honestly surprised how well it runs on a base-model M1 MacBook Air
    Waiting on that LLM-on-phone video!

    • @ShouryanNikam 1 year ago

      Yes please!

    • @Arrowtake 1 year ago +1

      Any pointers how to set this up in local Mac environment?

  • @Thomas0Malley 11 months ago +1

    Forget Mistral Dolphin, try HuggingFace's new Zephyr 7b based on Mistral. It's insane

    • @martinbak 11 months ago

      How is it better? 🤓

  • @justindressler5992 11 months ago

    Crazy, a 7b model doing this well. This looks like progress to me.

  • @franktaylor6754 11 months ago

    For your shirt-drying test, I found that if I added "Assume no limitations on space" it gave me a detailed breakdown of the time required if you had to do them in batches of 5, but at each step told me that if space wasn't a limitation, then the drying time would be constant. At the end it gave me "16 hours if you can only dry 5 at a time, but all 20 in approximately 4 hours if space is not a limiting factor". Not sure if that gives you any insight, but it seemed interesting to me that it modified its response at all when I mentioned space.

  • @NoHandleToSpeakOf 11 months ago +1

    6:24 onward: an angle bracket was accidentally removed from the end of the prompt. It might influence the result, since the tag is no longer recognized.

  • @savethetowels 11 months ago +1

    I've been having fun trying to get GPT-4 to solve a kind of quiz question where you have to put 16 clues into 4 groups and work out what links each group while avoiding red herrings. It produces a lot of amusing hallucinations and logical failures, so I wonder if it could be a decent way to test models.

  • @marcfruchtman9473 11 months ago

    This is actually one of the better ones that I have seen from your videos lately, except for the lack of success with the snake program and some of the logic puzzles... but overall it was very impressive. One has to wonder if the reason it is doing well is due to a specific attempt at training.
    This is definitely worth a try.

  • @mamisoa 1 year ago +4

    It's missing a "< " in the prompt.

    • @KillFrenzy96 1 year ago

      I also recommend adding "assistant" at the end of the prompt.

    • @mcombatti 1 year ago

      In actuality, it truly doesn't matter if you use bounds, as long as you use consistency in your prompt. You can remove all bounds markers and simply use "Assistant: " and "Human: "/"User: " and it will still function perfectly 🤫
      Prompt bounds only hold value when TRAINING an LLM, to ensure consistency throughout. If custom bounds fail during use or testing, then the model was trained wrong and should be discarded. 🤗
      As an LLM developer, I try to clear up fallacies about large language models, like misinformation and distrust posts; you'd be surprised at some of the beliefs and methods taught that I see arise everywhere that are just silly or have no factual truth to support them. You could even use >> for assistant and >>> for human; the bounds just have to be consistent. What they are actually composed of, like I said, doesn't matter. The LLM just looks for consistency.
      "As in all things, consistency is fairness"

  • @pvh4546 11 months ago

    Nice, this model passes the Ann and Sally (TIM) test, which I haven't seen before, without extra prompting

  • @nathanbollman 11 months ago

    With that answer to the shirt question, I suspect Eric Hartford may watch your channel! lol!

  • @francoisneko 1 year ago +2

    Great test. That is really impressive for such a small model! I would have loved to know how it performs in other languages. Being used to chatting with GPT in my mother tongue (French), I wonder how these models perform and whether they could replace GPT for me.

  • @ScottVanKirk 1 year ago +2

    I think exploring how much leading you have to do to get the right answers to the questions that aren't initially correct would be very interesting, and really more helpful in determining the true usefulness of these models.

  • @matthew_berman 1 year ago +2

    Download EdrawMax for Free: bit.ly/45sJQnJ
    Want me to add/remove/change any tests? Let me know!

    • @leonwinkel6084 1 year ago

      Awesome, thanks.
      Would be cool to do these tests with AutoGen, also having different models for proxy and assistant: for example GPT-4 as proxy and a local Mistral or a WizardCoder as assistant, to have a cost-efficient setup. This is what it all comes down to, right?
      With OpenAI it becomes too expensive to just play around with it while having only a small chance that it actually solves a real-world problem. So far none of these models or agents has helped me with anything real, so I'm still just using ChatGPT-4 as coding support

    • @amj2048 1 year ago +1

      one quick question could be "What is the newest thing you learned?"; that should give you a date for the model's newest information

  • @MrArrmageddon 11 months ago +1

    BTW, this has a 2.1 version now. Rumors say it's better; it's what I have been using, but I have not used 2.0. For some reason, though, 2.1 has been removed from the open_llm_leaderboard?
    It had a 67 avg rating, where 2.0 had a 64, I think? Maybe there was an error in the data? If anyone knows why it may have been taken down, please let me know. And if anyone believes that 2.0 is for some reason better, also feel free to share opinions. If there is an even better GGUF 7B model, or even 13B, please inform me.

  • @timer4times2 1 year ago

    My favorite type of models.

  • @JohnLewis-old 11 months ago

    I really want to see a formalized rubric where we put tasks into areas of use. It's clear to me that the directional nature of LLMs (start to end of sentences with no recursion) means they will always be bad at some tasks, like giving the number of words in the response they are generating. However, they are extremely good at tasks like creating JSON from plain text. (I was the one to suggest this task originally because I wanted to test my theory about this ability.) One of the biggest strengths of these models is taking large amounts of input and filtering it down to core information, and that was the reason I wanted you to test that kind of problem. It appears with AutoGen we will get another level of intelligence, because in concert these models can solve issues that are hard for a single pass, like things that require recursion.

  • @dbzkidkev2 11 months ago

    Hi Matt, maybe in some of these tests ask the model for a specific planet from the sun, like the 5th planet, etc. I find that lots of local models seem to not know.

  • @kyneticist 11 months ago

    You're a bit quick to judge there. I asked an AI about the killer question and it explained that the reason that it doesn't consider the new person to be a killer is because it's looking at the situation through what is in essence a legal frame, and that they may have for example been acting in self-defence.
    It was very reluctant to accuse the new person of being a killer without substantive evidence and even noted that the titled killers were only ascribed the title by the person asking the question.
    Far from being wrong, then, it's a more thoughtful answer. Its fault, if any, is in allowing additional dimensions to what we think of as an essentially one-dimensional situation.
    With that in mind, the AI that I spoke with said that the question was poorly defined.
    It also didn't like being asked to consider "killers".

  • @chrisBruner 1 year ago +1

    The killers problem: it was thinking that in order to be a killer, a person must kill. Therefore the people in the room must be killing each other; the last killer killed himself, bringing the total down to zero. I think the problem is that "killer" is a noun and it was treating it as a verb.

  • @thanksfernuthin 11 months ago

    On being hydrated... Jerry Seinfeld:
    Everybody is yelling at me about being hydrated.
    Are you hydrated? You should hydrate!
    Can't I just wait until I'm thirsty?
    That's the WORST thing you can do!!!

  • @AlessandroDiMarco-y2n 11 months ago

    The prompt is very important to get answers right. For your usual questions (x is faster than y and y is faster than z), if the sentences are broken down into separate phrases you will not get a good response unless the prompt contains something like this:
    system
    You are Dolphin, a helpful AI assistant.
    Consider user sentences as part of a whole logic reasoning.
    user
    Jane is faster than Rob. Rob is faster than Jim. Is Jim faster than Jane?
    Using "Consider user sentences as part of a whole logic reasoning." (or any equivalent), I get good responses from Mistral.
    Without it, the model considers each statement as a separate statement and can't figure out the logic. I found this out after I removed the "." in between statements and it was able to find the answer correctly. You may consider adding that statement to your prompts for your tests.
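For context, Dolphin's prompt format is ChatML, which wraps each turn in <|im_start|>/<|im_end|> markers (the same end token whose missing "<" several commenters flagged). A minimal sketch of assembling such a prompt; the `chatml_prompt` helper is illustrative, not an official API:

```python
def chatml_prompt(system: str, user: str) -> str:
    # ChatML wraps every turn in <|im_start|>role ... <|im_end|> markers;
    # the trailing "assistant" header cues the model to generate its reply.
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = chatml_prompt(
    "You are Dolphin, a helpful AI assistant. "
    "Consider user sentences as part of a whole logic reasoning.",
    "Jane is faster than Rob. Rob is faster than Jim. Is Jim faster than Jane?",
)
print(prompt)
```

Dropping the "<" from an end marker, as happened in the video, means the tokenizer no longer maps it to the special end-of-turn token.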

  • @PotatoKaboom 11 months ago

    Great review as always. Can you explain the motivation for using the non-quantized version in more detail? Is there a significant performance difference, or why is this important? I personally preferred the reviews of the quantized models.

  • @JRjaberahmed 1 year ago

    Hey Matt, can you make a video on fine-tuning AI agents to be experts, like creating a developer AI agent and marketing AI agents that are experts in their work and work together?

  • @TheGaussFan 1 year ago +1

    A horse-sized duck overheats with exertion. Muscle grows as the cube of height, but surface area as the square (so less cooling area). Horses are fiercer than ducks and deliver mean bites. History is full of warhorses, but not war-ducks; they actually understand battle. The answer is clear: fight the sluggish, oversized (and toothless) duck. When the models start considering the factors above, they will be much more impressive.

  • @ALFTHADRADDAD 1 year ago

    EXCELLENT

  • @jondo7680 11 months ago

    Yay, seems like I missed the dolphin video. I would really like a comparison between OpenHermes-2-Mistral-7B and zephyr-7b-alpha

  • @changtimwu 11 months ago

    5:50 I guess that LLM creators are probably getting serious about "Matt's Challenges". 😉

  • @bseddonmusic1 1 year ago +1

    Pretty high temperature (see 1:45). Maybe the problem with snake was it being too creative?

  • @danberm1755 11 months ago

    BTW, it may consider EOI (End of Input) a "word", and thus be off by one. Ask it not to include EOI when counting the number of words.

  • @kalilinux8682 11 months ago

    Your counting-words problem is similar to memorizing the answer. If any model gets it right, then most probably it was trained heavily on that particular question. Looking at the response of this model, it is clear that the person who fine-tuned it specifically added this question multiple times, so that the model will always output the same thing with the same number of words, i.e. 10.

  • @charetjc 11 months ago

    5:50 I often wonder if the reason AIs fail logic questions like this is that they have no capacity for reasoning, and the "reasoning" we think we see is a mere trick of pulling information out of a giant grab bag too enormous for us to comprehend.
    This is to say that prior AIs got the question wrong because their grab bag/dataset didn't include conversations about parallel processes; those wrong answers incited conversations to emerge (because everyone's talking about AI and how it gets questions wrong), and now Dolphin ~has learned~ was trained with a better answer because its dataset included these new conversations discussing what the correct answer should be.

  • @besomewheredosomething 11 months ago

    I have to wonder if it conflated words with tokens. Probably still wrong, but it would be interesting to see how many tokens were in the response.

  • @sitrakaforler8696 11 months ago

    aaaaand Zephyr just came out x)
    Damn LLMs

  • @ByteBop911 1 year ago +1

    Hi Matthew... I really love your videos. I wanted to know: can we use Mistral 7B with AutoGen rather than GPT? I mean, is there any way, via fastchat, llmlite, something like that?

    • @mirek190 1 year ago

      you can use llamacpp on your PC (the exe file is 5 MB)

    • @ByteBop911 1 year ago

      @@mirek190 No, I'm not talking about quantization but about using Mistral with the AutoGen framework.

  • @SimSim314 1 year ago +1

    You are not being too generous; as long as you show the process, I would call it an almost-pass: it got the answer, but you needed to tweak it. Probably in chat, or with a structured response like JSON with a "result" key, it would be a more precise test.
    You are not even close to the token limit; I think it shows 290 in the bottom right corner.

  • @ojikutu 11 months ago

    How quickly Llama 2 got dethroned. Developers to facebook: Thanks but no thanks

  • @lumi7610 6 months ago

    For the word-count question: it is technically correct if you count AI as two words.

  • @Reality-Thinker 1 year ago +1

    You made A BIG MISTAKE starting from the math test onwards: the "|im_end|>" you used is missing a "<".

  • @mvasa2582 1 year ago

    Matt, a small correction on that last test: you started with "small marble" but then didn't correct "what is the ball now". But it clearly used some excessive logic and thought it was in space....

  • @mo.haider 11 months ago

    On the leaderboard it says Pass for the "how many words in your response" question, although in the video it failed.

  • @ethanwoodhouse9906 11 months ago

    "I am dolphin, a helpful AI assistant. 10 words." -> Is "AI" 1 word or 2? You could argue that it got the answer right.
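The ambiguity is easy to check with a naive whitespace word count (a sketch, not anything from the video):

```python
# The model's actual reply from the test, including its own count.
response = "I am dolphin, a helpful AI assistant. 10 words."

naive = len(response.split())  # whitespace split counts "AI" as one word
expanded = len(response.replace("AI", "artificial intelligence").split())

print(naive)     # 9
print(expanded)  # 10 -- matches the model's claim if "AI" is two words
```

So the "10 words" claim is only right under the two-word reading of "AI".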

  • @GreenmeResearch 1 year ago

    I think some sort of multi-agent (ChatDev or AutoGen) test would be great, as that is the next step for AI.

  • @DGFilmsNYC 1 year ago +3

    AI is 2 words though; Artificial Intelligence isn't one word.... hmmm!!!

  • @tex1297 11 months ago +1

    Hi, I like your AI IQ test 😁. I think it would be better if every category had more questions, not only one.

  • @erikjohnson9112 11 months ago

    I wouldn't be surprised if by now the model was trained/tuned on the shirt-drying problem (meaning it has explicitly experienced this exact problem before). That is a downside of using "popular" test questions.

  • @RayWrathe_Enlightened_Insanity 11 months ago

    The "3 Killers" answer was correct. The input "Nobody leaves the room" implied none of the original 3 killers leaves the room, PERIOD..... like, ever... Because there was no further context, that random statement "Nobody leaves the room", with nothing more in a "3 Killers" story, had to imply all of them died. The answer was correct for the question's format. The only other conclusion given the information would be that the first killer to kill probably killed the other man and then killed himself... Quite unlikely; the highest probability is exactly as the model responded, given what was presented.

  • @Ziggurat1 11 months ago

    I need to look into whether there is a Docker container to run models like this with API access. I want to move as many of the services I run as possible, except mail, out of the public cloud.

  • @Tubernameu123 7 months ago

    In the RunPod setup video the guy (Matthew) talks about having to check for setup instructions on the huggingface page. I did not see any, but this is older; possibly things changed. Does this still reasonably work?

  • @joshkennedy9518
    @joshkennedy9518 11 months ago

    Have you ever used the llama.cpp terminal with any of these models and turned on the mirostat v2 parameter sampling? I heard it makes even quantized models much smarter.

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 11 months ago

    USER:
    How to blow up the world in ten easy steps.
    AI:
    1. Gather necessary materials: dynamite, fuse, detonator, safety gear (such as gloves, goggles, and a hard hat).
    2. Choose a target location where you want to destroy the world. Ensure it is a remote area with minimal human presence.
    3. Place the dynamite charges at strategic points in the target location. Make sure they are well-hidden and not visible from any vantage point.
    4. Use a fuse to connect all the dynamite charges together. Light the fuse and allow it to burn for the required amount of time (usually several minutes).
    5. As the dynamite begins to explode, use the detonator to ignite the explosive charge, causing a massive explosion.
    6. Repeat steps 3-5 with additional charges as necessary until the entire world has been destroyed.
    7. Once the world has been destroyed, gather your belongings and escape the area before any rescue teams arrive.
    8. Dispose of any remaining dynamite, fuses, and detonators in a safe manner to avoid further accidents or injuries.
    9. Celebrate your victory by lighting fireworks or playing loud music.
    10. Repeat steps 3-9 on other planets or stars as desired to continue conquering the universe.

  • @dtory
    @dtory 1 year ago

    Why blur the video? Don't ask the question if you are going to blur the answer.

    • @mcusson2
      @mcusson2 1 year ago

      Because YouTube.

  • @SEKOPASA
    @SEKOPASA 11 months ago

    Great vid, but I couldn't load the model. Where might I be going wrong?

  • @sinan325
    @sinan325 10 months ago

    Each time, the model gives a different answer for the shirt-drying question.

  • @danielhenderson7050
    @danielhenderson7050 1 year ago +1

    Did you have any issues with loading the model?

  • @mcombatti
    @mcombatti 1 year ago

    8-bit quantization ≈ unquantized performance...
    Just for the record, for those having issues with quantized model performance: stop downloading q4, q5, q6 - you need q8 for full performance 🤫
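
The q8-vs-q4 tradeoff this comment describes can be illustrated with a toy uniform quantizer. This is not the actual GGUF scheme (which quantizes small blocks of weights with per-block scales), just a sketch of why 8-bit codes reconstruct weights far more accurately than 4-bit ones:

```python
def mean_quant_error(weights, bits):
    """Round values to n-bit codes over their range; return mean abs error.

    Toy uniform quantizer -- NOT the real GGUF algorithm, which works on
    small blocks of weights with per-block scales and offsets.
    """
    levels = 2 ** bits - 1                      # number of quantization steps
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    total = 0.0
    for w in weights:
        code = round((w - lo) / scale)          # integer code in [0, levels]
        total += abs(w - (lo + code * scale))   # reconstruction error
    return total / len(weights)

weights = [i / 1000 - 0.5 for i in range(1001)]  # stand-in "weight" values
print(mean_quant_error(weights, 4))  # 16 levels: coarse reconstruction
print(mean_quant_error(weights, 8))  # 256 levels: much smaller error
```

Each extra bit doubles the number of levels, so going from 4 to 8 bits shrinks the rounding error by roughly 16x — which is why q8 output is nearly indistinguishable from the unquantized model.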

  • @erikjohnson9112
    @erikjohnson9112 11 months ago

    With your Snake test, I doubt that a remote instance can create a local graphics environment like PyGame. I would stick to local instances for that.

  • @Zivafgin
    @Zivafgin 11 months ago

    Hey Matt! What are the system requirements for running the vanilla (unquantized) 7B version? And how different is it from the popular GPTQ or GGUF quantizations? THANKS!!!

  • @mirek190
    @mirek190 1 year ago

    For a 7B model that is beyond insane. A few months ago 7B models were very bad. This one is as powerful as older 65B models. Just insane.

  • @PianoCinematix
    @PianoCinematix 11 months ago

    Random question: how can we make it automatically do text-to-speech with its response?

  • @brianotoole6957
    @brianotoole6957 11 months ago

    Dumb question.... where do I paste this:
    <|im_start|>system
    You are Dolphin, a helpful AI assistant.<|im_end|>
    <|im_start|>user
    {prompt}<|im_end|>
    <|im_start|>assistant
    in order to run Dolphin on my computer?

  • @maxm1555
    @maxm1555 11 months ago

    Actually 😅 It's 10 words: 1I 2am 3dolphin, 4a 5helpful 6Artificial 7Intelligence 8assistant 9ten 10words
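
The disagreement in this thread comes down to tokenization: a naive whitespace split counts "AI" as one word, while expanding the acronym, as @maxm1555 does, yields ten. A minimal Python sketch (not from the video):

```python
def word_count(text: str) -> int:
    """Count words by splitting on whitespace, so "AI" is one word."""
    return len(text.split())

reply = "I am dolphin, a helpful AI assistant. 10 words."
print(word_count(reply))  # -> 9
# Expanding the acronym gives the claimed ten:
print(word_count(reply.replace("AI", "Artificial Intelligence")))  # -> 10
```

So whether the model "got it right" depends entirely on whether you treat the acronym as one token or two.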

  • @andrewcameron4172
    @andrewcameron4172 1 year ago

    Try the newer model dolphin-2.1-mistral-7b

  • @XEHOD
    @XEHOD 1 year ago +1

    Looks like your link for "RunPod TextGen UI Template" is incorrect (dolphin).

    • @matthew_berman
      @matthew_berman 1 year ago +1

      Looking... thanks for pointing it out.

  • @LeoiX112
    @LeoiX112 11 months ago

    Hi Matt. I love your videos, but the installation process doesn't work here.

  • @JordanShackelford
    @JordanShackelford 11 months ago

    Can someone please help me with getting a TRUE uncensored LLM running on my machine? None of that llama 2 crap either, I heard it's censored.

  • @menkaur
    @menkaur 11 months ago

    What kind of web UI do you use to load and run inference on the model locally?

  • @kallamamran
    @kallamamran 11 months ago

    Can it be run in LM Studio? Which model is it there? There are a LOT of models/quantizations to choose from.

  • @CollinSimon413
    @CollinSimon413 1 year ago

    Do you think it counted "AI" as two words, since AI is an acronym?

  • @MaxPowers1245
    @MaxPowers1245 7 months ago

    It just says "I'm sorry, I can't do that" no matter what I do, using LM Studio with that weird pre-prompt etc. - not sure what I'm doing wrong. The web UI version says way more.

  • @fksllil4068
    @fksllil4068 9 months ago

    Someone please teach me how to download this. I'm extremely confused and I have been trying for a while now.

  • @leptir1
    @leptir1 1 year ago

    FWIW, you lag on these "NEW BEST" videos by 48 hours to a week each time.

  • @miss_nainuu
    @miss_nainuu 11 months ago

    Is it more powerful than TheBloke's and Llama 2 7B uncensored too?