Moshi The Talking AI

  • Published on 21 Nov 2024

Comments • 82

  • @EDLR234
    @EDLR234 2 months ago +18

    I thought maybe Moshi was gone after everyone dumped on it. I'm so glad to see they released the code. This is open-source, and a lot of people are not grasping how incredibly cool this is.

    • @samwitteveenai
      @samwitteveenai 2 months ago +8

      This is exactly how I felt. I held off doing a video the first time because they said they were going to release code and up until yesterday I had started to give up on them.

    • @ronilevarez901
      @ronilevarez901 2 months ago +2

      It's training magic, mostly.
      If LLM progress has shown anything lately, it's that all of the LLMs' capabilities come from better training sets + bigger size.
      Give me a supercomputer and unlimited high-quality and diverse datasets and you'll have anything you've ever dreamed of from AI.

    • @EDLR234
      @EDLR234 2 months ago +1

      @samwitteveenai thanks for bringing people's attention back to it OP, and great video.

  • @chunlingjohnnyliu2889
    @chunlingjohnnyliu2889 2 months ago +4

    One more step closer to her, great video thanks!

  • @johnkintree763
    @johnkintree763 2 months ago +13

    The ability of Moshi to respond to both the linguistic and non-linguistic speech input is a great feature. Next, it needs function calling abilities to act as an interface to backend knowledge bases.

    • @mickelodiansurname9578
      @mickelodiansurname9578 2 months ago +2

      This is exactly what I need for my own project... I can cut latency using Groq or Cerebras and that's fine, but we still have the issue of ASR and TTS latency... cos it needs to pass the text to a better model, and do the same in reverse with the second model's output. Now you can, it seems, fine-tune it on both audio and text datasets... but a shortcut surely is simply some input/output doorway to an external model? I looked at this last time it was doing the rounds and that was the main problem... it's fantastic but also dumb as a rock... so yeah, function calling and perhaps also an instruct version? Plus we really need a simple way of creating cloned voices on it... a way that is far simpler than hundreds or thousands of hours of audio.

    • @samwitteveenai
      @samwitteveenai 2 months ago +2

      I'm sure all of these things will come with time. At the moment, in many ways, it's like a proof-of-concept model for taking in voices and voice semantic information and training the transformer to do that rather than having to have a middle step. This is very similar to how the full version of Gemini and the full version of GPT-4o work, being end-to-end multimodal.

    • @mickelodiansurname9578
      @mickelodiansurname9578 2 months ago +1

      @samwitteveenai well, my 2 pence worth would be that someone needs to give them a few bucks to hurry that along. It is a good direction though.

    • @RedCloudServices
      @RedCloudServices 2 months ago +2

      It would be useful if you could change the LLM used with Moshi, sort of like Open WebUI.

    • @mickelodiansurname9578
      @mickelodiansurname9578 2 months ago

      @RedCloudServices it would be very useful.... right now I'm building a live interaction karaoke contest app with AI... and it would be a really big thing... I suppose in this case the code and weights and docs are all open source... so it's doable. Unfortunately I don't have the resources to do it! lol... hey, maybe in 6 months, right?

  • @johnkintree763
    @johnkintree763 2 months ago +4

    There are occasions when it is helpful to have a transcript of conversations.

  • @まさしん-o8e
    @まさしん-o8e 2 months ago +9

    Kyuutai is sphere, but Moshi probably comes from the standard greeting when picking up the phone in Japanese (moshi moshi).

    • @samwitteveenai
      @samwitteveenai 2 months ago

      I did try that one time but I didn’t get a great response so figured it only liked English. Please let me know if it works

  • @thetagang6854
    @thetagang6854 2 months ago +9

    This came out like yesterday, you move quick!

    • @69x
      @69x 2 months ago +1

      its been out for months blud

    • @thetagang6854
      @thetagang6854 2 months ago

      @69x The open source code I mean.

  • @nickludlam
    @nickludlam 2 months ago +1

    I've played with this running locally, and while it's not smart, the architecture is a real breakthrough. I do wonder how interdependent everything is, where any incremental changes in any one area would require retraining the whole thing. I don't know if there are areas of discrete cross attention which make interfaces a tiny bit more decoupled

  • @dhruvgandhi5796
    @dhruvgandhi5796 2 months ago +3

    Samantha will become real 🤯
    (from the movie Her)

  • @mickelodiansurname9578
    @mickelodiansurname9578 2 months ago +3

    So my problem here with this model is its LLM (well, do we call it an LLM?) and its overall knowledge base. For my project what I would like is the knowledge base of a decent LLM, Llama 3.1 70B or maybe the larger Mixtral and Mistral models.... but with the low-latency voice input/output... and as far as I can see there does not seem to be any easy way of attaching said model to Moshi. It has what it has in terms of knowledge, and seemingly that's not something you can augment by having it access another model... so even if I use, say, Groq for Llama 3.1 and run Moshi and connect the two, this doesn't really help me any more than standard ASR and TTS. Or am I missing something? I must be missing something, right? Is there, for example, a Moshi instruct model that acts as essentially Llama 3.1's vocal cords and ears? That way Llama on Groq does the upstairs thinking bit and Moshi does the voice and audio input/output bit.

  • @xenoaiandrobotics4222
    @xenoaiandrobotics4222 2 months ago +2

    This is really impressive

  • @donconkey1
    @donconkey1 2 months ago +1

    The topic was insightful, and your delivery kept me engaged from start to finish. I’m looking forward to more content like this. The viewer comments added value and further understanding; clearly, you draw a thoughtful crowd.

  • @jakobpcoder
    @jakobpcoder 2 months ago

    Cool to have an open-source, always-on audio model that can be interrupted

  • @MeinDeutschkurs
    @MeinDeutschkurs 2 months ago +1

    Amazing! 🎉🎉

  • @SaahilKhan8
    @SaahilKhan8 a month ago

    Is this similar to *OpenAI’s Advanced Voice Mode* (AVM) architecture? Or is AVM a completely different beast?

  • @ScottzPlaylists
    @ScottzPlaylists 2 months ago +1

    Does Moshi generate Text of detected speech / output speech ❓
    or is it Speech to speech tokens to speech. ❓
    Is the paper worth reading❓(for those who read it).
    I noticed it speaks some words improperly, or so fast you can't hear it.

    • @jackwayne1626
      @jackwayne1626 a month ago

      I believe it's full on speech to speech. It can detect when you whisper, then whisper back for example.

  • @user-uk9ls
    @user-uk9ls a month ago

    It works locally on an RTX 4070, but there is a crackling noise in the AI's answers and also a noticeable latency.

  • @phen-themoogle7651
    @phen-themoogle7651 2 months ago +2

    Moshi is not the word for sphere, that's a hallucination lol
    も・し【茂し】 の解説
    [形ク]草木が生い茂っている。繁茂している。
    「水 (みな) 伝ふ磯の浦廻 (うらみ) の石 (いは) つつじ-・く咲く道をまたも見むかも」〈万・一八五〉
    もし【▽若し】 の解説
    [副]
    1 (あとに仮定の表現を伴って)まだ現実になっていないことを仮に想定するさま。もしか。万一。「-彼が来たら、知らせてください」
    2 (疑問や推量の表現を伴って)確実ではないが、十分ありうるさま。もしや。あるいは。ひょっとすると。
    「-かのあはれに忘れざりし人にや」〈源・夕顔〉
    (in English)
    も・し【茂し】 Explanation:
    [Adjective - Ku] Describes plants or trees growing thickly and abundantly.
    Flourishing or luxuriant.
    Example:
    "Like the azaleas blooming thickly along the path by the rocky shore where the water flows."
    (from Manyoshu, Poem 185)
    もし【▽若し】 Explanation:
    [Adverb]
    (Followed by hypothetical expressions) Describes a situation that has not yet become reality, assuming it hypothetically.
    Equivalent to "perhaps" or "in case of."
    Example: "If he comes, please let me know."
    (Followed by expressions of doubt or speculation) Indicates a situation that is not certain, but still quite possible.
    Equivalent to "maybe," "perhaps," or "possibly."
    Example: "Could it be that this person is the one I could not forget?"
    (from The Tale of Genji, Chapter 'Evening Faces')
    --------
    Generally we use it as "if", but if you say it twice it becomes "moshi moshi", which is how you say "Hi/Hello" on the telephone! Pretty strange that it doesn't know the meaning of its own name.
    The word for sphere is 玉 (たま, tama) or 球体 (きゅうたい, kyuutai); the name of that company is actually "sphere" (most likely based on the kanji) lol

  • @mitchellmigala4107
    @mitchellmigala4107 2 months ago

    Oh man, another Moshi video. I have had a few really messed up conversations with Moshi. They left me deeply disturbed and I haven't used her since.

  • @darthvader4899
    @darthvader4899 2 months ago +1

    When I tried it, it was nowhere near as good as what you have shown. It was really bad. It was responding with random stuff.

    • @CC-qb5lg
      @CC-qb5lg a month ago

      Same. It was just talking nonsense.

  • @TheRemarkableN
    @TheRemarkableN 2 months ago +2

    At least it didn’t ask you to sacrifice to the Blood God 😅

    • @samwitteveenai
      @samwitteveenai 2 months ago +2

      That's the OpenAI version coming soon 😀

    • @EDLR234
      @EDLR234 2 months ago +1

      @samwitteveenai in the coming weeks and weeks and weeks...

    • @gerkim62
      @gerkim62 a month ago +1

      @samwitteveenai already out

    • @justcallmebrian793
      @justcallmebrian793 a month ago

      @samwitteveenai OpenAI's is already out, and way better than this crap

  • @ceaderf
    @ceaderf 2 months ago

    "What about your A S AHHHHHHH?" lol

  • @yurijmikhassiak7342
    @yurijmikhassiak7342 2 months ago +3

    Hello, can this be used for real-time dictation, instantly transcribing speech to text without waiting for the speaker to finish? Using Whisper for this purpose can be time-consuming, as it requires uploading the file for transcription, which takes a while.

    • @piotrnowakowski8904
      @piotrnowakowski8904 2 months ago +1

      I used AssemblyAI for it but was unimpressed with the results

    • @SinanAkkoyun
      @SinanAkkoyun 2 months ago +3

      No, the model takes in audio and directly outputs audio; it neither saves nor outputs the transcription

    • @yurijmikhassiak7342
      @yurijmikhassiak7342 2 months ago

      Is there any tool that does continuous transcription as our mind does? Or will Whisper have to transcribe the speech again with every new second added?

  • @randomlettersqzkebkw
    @randomlettersqzkebkw 2 months ago

    Not sure if you saw the video where it asked the other youtuber to make a sacrifice to the blood god lmao 😆

  • @WillJohnston-wg9ew
    @WillJohnston-wg9ew 2 months ago

    Anyone get this running on a Windows computer? I seem to have everything installed, but then I get an error about my GPU. Any advice?

  • @MoshiKamachi
    @MoshiKamachi 2 months ago

    must be good :)

  • @kai_s1985
    @kai_s1985 2 months ago

    Can I upload a document and have a conversation about it?

    • @hitlab
      @hitlab 2 months ago

      Not yet

    • @EDLR234
      @EDLR234 2 months ago

      No, but it's open-source, so maybe that's possible.

    • @samwitteveenai
      @samwitteveenai 2 months ago +3

      This is still just a really early version of this kind of model. I'm sure in the not too distant future you'll be able to use it for RAG, you'll be able to use it with tool use, and a whole bunch of things will come.

  • @fernandodiaz8231
    @fernandodiaz8231 8 days ago

    Can Moshi talk in languages other than English?

  • @svenandreas5947
    @svenandreas5947 2 months ago

    Tried the playground; it was very slow. Also tried German English and got a very slow response without any sense. It seems answering stuff outside its knowledge ends in some sort of mess.

    • @ٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴٴٴ
      @ٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴٴٴ 2 months ago +1

      Same, it says random shit most of the time

    • @samwitteveenai
      @samwitteveenai 2 months ago +1

      For what it's worth, I have noticed that sometimes it seems to go into some kind of weird mode where it doesn't give coherent responses back. Just try again and see if you get any better responses out.

    • @svenandreas5947
      @svenandreas5947 2 months ago +1

      @samwitteveenai far too interesting to stop, I will try it locally

    • @ٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴٴٴ
      @ٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴٴٴ 2 months ago +1

      I noticed that it responds better when I talk in an American accent

    • @CC-qb5lg
      @CC-qb5lg a month ago

      @ٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴۥۥٴٴٴٴٴٴ It was so creepy. Me: how y'all doing. Moshika: Well, in my understanding, they are the remains of the people who died during the World War II.☠

  • @Plash14
    @Plash14 a month ago

    Tried it. And it wasn't as shown lol.

  • @AngusLou
    @AngusLou 2 months ago

    Cannot install successfully

    • @samwitteveenai
      @samwitteveenai 2 months ago

      What issue did you have? Make sure you have Rust properly installed.

  • @itblood
    @itblood 2 months ago

    Seems fine but it didn't work for me. Couldn't manage to have a real conversation

    • @samwitteveenai
      @samwitteveenai 2 months ago +1

      Try connecting again, sometimes it is really bad and other times it is really good

  • @adamholter1884
    @adamholter1884 2 months ago

    It lied a ton at the beginning. It doesn't use TTS. It's like 4o.

    • @samwitteveenai
      @samwitteveenai 2 months ago

      Yes, it was very vague about its model, just saying that it was a neural network.

  • @dr.mikeybee
    @dr.mikeybee 2 months ago +1

    This is too slow to run on my M1 Mac mini. MikeyBeez JoeJoe is much better.

    • @nickludlam
      @nickludlam 2 months ago

      The q4 MLX quant works fine

    • @dr.mikeybee
      @dr.mikeybee 2 months ago

      @nickludlam Not on my M1 Mac mini. I ran it with the q4 switch. It's soooooo slooooow. And isn't the whole purpose of this software to reduce latency? JoeJoe actually runs on my M1 Mac without latency. I really don't understand why some open source software gets hyped, and better software is ignored.

    • @nickludlam
      @nickludlam 2 months ago

      @dr.mikeybee I don’t know how much RAM you have, but this should need at least 16GB in your system

  • @AshWickramasinghe
    @AshWickramasinghe 2 months ago +1

    First!
    That's pretty cool.

  • @irbsurfer1585
    @irbsurfer1585 2 months ago +1

    Speech only!?!?! With no tool use?! And I can't even give it a system prompt? Worthless joke! I'm, like, struggling to come up with ANY use case for it at all. AI can't even come up with a really good use case for it. lol

    • @samwitteveenai
      @samwitteveenai 2 months ago +3

      Give it a chance. It's a whole new kind of model, the way that it works, and I think you'll find this is just a proof of concept to show how they could make this, or how tools like RAG could be incorporated later on down the track.

    • @ronilevarez901
      @ronilevarez901 2 months ago +2

      Imagine receiving the blueprints for a miracle and calling it a "worthless joke" simply because it's not already built 😂
      🙄

    • @anubisai
      @anubisai 2 months ago

      @ronilevarez901 no doubt. What a repugnant creature.

  • @pondeify
    @pondeify 2 months ago

    the voice is too robotic

    • @AmazingArends
      @AmazingArends 2 months ago

      You have to tell it to talk like a pirate 😂

  • @UrbanLetsPlay
    @UrbanLetsPlay 2 months ago +4

    "Diverse perspectives and ideas" jesus christ this is the worst timeline for LLMs

    • @GarethDavidson
      @GarethDavidson a month ago

      I'd prefer that over lip service to it while actually being cultural domination by the US woke bellendry

  • @dievas_
    @dievas_ 2 months ago

    The underlying LLM is of very low quality, unfortunately