Cohere's Command-R: A Strong New Model for RAG

  • Published 5 Feb 2025

Comments • 32

  • @micbab-vg2mu · 11 months ago · +6

    In the corporate world I always use LLMs with RAG :) I will check this model - thank you for sharing :)

  • @reza2kn · 11 months ago · +5

    Nice job! I look forward to the day when we can either make smart models tiny or run huge models on regular hardware.

  • @fire17102 · 11 months ago · +6

    Would love it if you could showcase a working RAG example with live-changing data, for example an item price change or a policy update. Does it require manually managing chunks and embedding references, or are there better existing solutions? I think this really differentiates between a fun to-do and actual production systems and applications.
    Thanks and all the best! Awesome video ❤

    • @samwitteveenai · 11 months ago

      This is often done by having the RAG step return a variable, then looking up that variable for the latest price etc. You probably don't want to put info that changes frequently into a vector DB.
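
      A minimal sketch of this pattern in Python: keep volatile values (prices, policy numbers) out of the vector store, have the retrieved chunk carry a stable reference, and resolve that reference against a live source at answer time. All names here (PRICE_DB, get_current_price, the {{price:...}} marker) are hypothetical placeholders, not any specific library's API.

      ```python
      import re

      # Live source of truth for volatile values (could be a SQL table or an API).
      PRICE_DB = {"SKU123": 19.99, "SKU456": 4.50}

      def get_current_price(sku: str) -> float:
          """Look up the latest price at query time rather than from the vector DB."""
          return PRICE_DB[sku]

      # The retrieved chunk stores a stable reference instead of the price itself.
      retrieved_chunk = "The Acme Widget (item {{price:SKU123}}) ships in 2 days."

      def resolve_placeholders(text: str) -> str:
          """Replace {{price:<sku>}} markers with the current value before prompting."""
          return re.sub(
              r"\{\{price:(\w+)\}\}",
              lambda m: f"${get_current_price(m.group(1)):.2f}",
              text,
          )

      context = resolve_placeholders(retrieved_chunk)
      print(context)  # -> "The Acme Widget (item $19.99) ships in 2 days."
      # `context` is then passed to the LLM, so answers always reflect the live value.
      ```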

  • @sultan_93 · 10 months ago · +4

    Thanks for the video. It would be great to have a video on how to use this with LangChain, using tools and agents. (A rough sketch follows this thread.)

    • @samwitteveenai · 10 months ago · +1

      Thanks, I will put something like that together.

    • @brunodepaula9145 · 10 months ago

      @samwitteveenai I would love it!!

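      Picking up the LangChain question in this thread, here is a minimal, hedged sketch of Command-R as a LangChain chat model with a tool bound to it. It assumes the `langchain-cohere` package and a COHERE_API_KEY in the environment; the weather tool is a made-up example, not something from the video.

      ```python
      from langchain_core.tools import tool
      from langchain_cohere import ChatCohere

      @tool
      def get_weather(city: str) -> str:
          """Return a (fake) weather report for a city."""
          return f"It is 22C and sunny in {city}."

      # Command-R exposed through LangChain's standard chat-model interface.
      llm = ChatCohere(model="command-r")
      llm_with_tools = llm.bind_tools([get_weather])

      msg = llm_with_tools.invoke("What's the weather in Singapore?")
      print(msg.tool_calls)  # the model should request get_weather(city="Singapore")
      ```

      From here an agent loop can execute the requested tool and feed the result back to the model.
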
  • @AnimusOG · 11 months ago · +1

    Thanks for pumping out another one bro!

  • @Luxcium · 10 months ago

    I love your videos. I hope you can have community posts soon. In the meantime, I would love to see information about Open Devin, as well as some clever way to get the cheapest Anthropic agent (Haiku) to perform pre-processing or post-processing of messages: either in parallel, grouped with the help of another layer, or in series, with the context window kept in the smaller model. That smaller model could process queries and explain how a more capable model like Opus or GPT-4 would handle them, without the larger model needing the big context window or having to rely on images, which could instead be preprocessed by Sonnet or Haiku and described in a few words to the more capable agent (Sonnet, Opus, or other)…
    I am impressed that all three recent Claude models offer the same image capabilities and lengthy context window; only the price differs, along with their overall capabilities. But I have not experimented enough with them to really see where each has strategic advantages over the others…

  • @davidw8668 · 11 months ago · +2

    Nice video, thanks. Gotta check this; wondering how it does compared to the Perplexity online models, which performed a bit mixed in my tests.

    • @samwitteveenai · 11 months ago

      Good question. I haven't used Perplexity a lot, but my guess is you could build something like Perplexity with this pretty easily.

    • @davidw8668 · 10 months ago

      @samwitteveenai What I meant is that Perplexity has offered an API with 7B and 70B models, similar to this, for a while. Though this looks like a cleaner solution.

  • @joaooliveira7051 · 11 months ago · +1

    Very nice video, as always.
    I wonder how this works with Raptor retrieval.
    Thanks!

    • @samwitteveenai · 11 months ago · +1

      Ohh, I was playing with Raptor on the weekend; it is very cool. I haven't tried it with this model, but my guess is it will do well.

    • @joaooliveira7051 · 11 months ago

      I will try it within LlamaIndex... and let you know.
      My aim is to build summaries based on a predefined document structure. Maybe I will try to "coerce" or influence the clustering, inspired by something like HyDE... Not sure if Pydantic would also be useful, but it is probably less flexible...
      Thank you again.

  • @dejecj · 11 months ago · +1

    Interesting, although I am a bit confused. Isn't RAG itself just a code implementation? The model itself doesn't do the retrieval. So with that in mind, what about the model makes it a retrieval model? Is it just the needle-in-a-haystack performance and function calling?

    • @choiswimmer · 11 months ago · +2

      While this is true, the point is that this model was trained with RAG in mind. All other models are general generation models that can do RAG. The idea here is that this model should be able to do RAG better because it was trained to do so - at least hopefully.

    • @u4tiwasdead · 11 months ago

      I guess the final query in RAG is always going to look something like:

      Answer the query with the context below:
      {query}
      {context}

      Where context is a list of paragraphs.
      So if a model is trained on that style of prompt, then it's good for RAG.
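
      A small sketch of that prompt shape in Python; the wording and passage numbering are illustrative, not Command-R's actual internal template.

      ```python
      def build_rag_prompt(query: str, passages: list[str]) -> str:
          """Concatenate retrieved passages into a context block next to the query."""
          context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
          return (
              "Answer the query using only the context below, "
              "citing passage numbers where relevant.\n\n"
              f"Context:\n{context}\n\n"
              f"Query: {query}"
          )

      print(build_rag_prompt(
          "What is Command-R aimed at?",
          ["Command-R is a model aimed at RAG and tool use.",
           "It is served through Cohere's API."],
      ))
      ```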

  • @hiranga · 10 months ago · +1

    @samwitteveenai Great video as always, legend! Was wondering if you've used LangChain with Groq Cloud / Mixtral 8x7B? I'm trying to swap out ChatOpenAI for the Groq Mixtral but I'm not sure if it can work with 'bind_tools'. Any idea on this?

    • @samwitteveenai · 10 months ago · +2

      Thanks. Yes, I have got Groq working fine with LangChain, but I haven't tried the function calling. I don't think .bind works with the open-source models unless you use it with the OpenAI spec and change the endpoint (sketched below).

    • @hiranga · 10 months ago

      @samwitteveenai How do you do that??

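      A hedged sketch of what Sam describes above: point LangChain's OpenAI chat wrapper at Groq's OpenAI-compatible endpoint instead of api.openai.com. The model name and key are placeholders, and whether .bind_tools() then behaves like it does with GPT-4 depends on the provider's function-calling support, so treat that part as untested.

      ```python
      from langchain_openai import ChatOpenAI

      llm = ChatOpenAI(
          base_url="https://api.groq.com/openai/v1",  # OpenAI-spec endpoint served by Groq
          api_key="YOUR_GROQ_API_KEY",
          model="mixtral-8x7b-32768",
      )

      print(llm.invoke("Say hello in one short sentence.").content)
      # llm.bind_tools([...]) can be tried from here; support varies by provider.
      ```
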
  • @alizhadigerov9599 · 11 months ago

    Is it better than OpenAI's Assistants API with the Retrieval tool? The OpenAI Assistants API did not work well for me.

    • @samwitteveenai · 10 months ago · +1

      Yes, in my testing I think it is better, based on the Coral UI they have.

  • @Luxcium · 10 months ago · +1

    It feels like someone could train a model with the help of this and then use their own model (I don’t know how it works or if people doing this could be caught or if it would be against the licensing agreement)…😮😮😮😮

    • @samwitteveenai · 10 months ago

      😀 Might run into some licensing issues. But in theory this idea can be replicated without them as well.

  • @123456crapface · 11 months ago

    🚨🚨🚨 NEW MODEL DROP ALERT 🚨🚨🚨

  • @ai_product_manager · 11 months ago

    This seems like the perfect use case for Atlassian; they have some AI stuff, but no RAG. Do you happen to know if they plan on doing this? I wonder why they are sleeping on this...

    • @samwitteveenai · 11 months ago

      Not sure, I haven't used Atlassian in quite a while.

  • @Dan-hw9iu · 10 months ago · +8

    Rather than dropping $20,000 on an NVIDIA card, just buy a MacBook Pro lol. My M2 laptop with 96GB of VRAM works great with 70b models. Save your five figures, or your time jockeying for GPU rental time. Almost all of us hobbyists just need an Apple laptop.

    • @peterwlodarczyk3987 · 10 months ago · +5

      By “works great” I assume you mean an inference speed of about 5 tokens per second (slightly under the 7 t/s one can achieve with an M3 Ultra)? In other words, at a context size of 10k tokens - easily reached with a RAG-infused context within three or four messages - a wait time of 30 minutes per message? If so, you should probably clarify that, because I suspect 99% of people will disagree with your classification of that as “great”.

    • @Dan-hw9iu · 10 months ago

      @peterwlodarczyk3987 Sure thing! For a 70B model, llama.cpp reports: `( 55.04 ms per token, 18.17 tokens per second)`. That's pretty typical, and _far_ exceeds my reading speed. Keep in mind:
      1. For memory bandwidth, M2 > M3. This may or may not affect results.
      2. The relationship between context length and inference speed is complex. It's highly sensitive to the hardware, model architecture, optimizations (e.g. KV cache scheme, quantization, etc.), and integrations (e.g. RAG systems). I've never seen anything remotely near "30 minutes / message" generation speeds, but I've also never exceeded 32k context window sizes. 🤷‍♀
      3. As I'm sure we all agree, my laptop inference of course won't surpass beastly setups that suck kilowatts through many thousands of dollars in dedicated multi-GPU hardware. But for most of us, that's okay! Apple silicon is an excellent alternative.
      Apropos of nothing: I saw someone run the new 120B DBRX model on their M2 Ultra at 14 t/s (using Apple's MLX framework), and I can't wait to try it myself! Yeah, the context size probably won't be great, but it's something I simply could not afford to do without Apple hardware, full stop. I just want people to know that it's a good option!
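
      For reference, a rough sketch of this kind of local run using the llama-cpp-python bindings on Apple silicon. The GGUF filename is a placeholder for whatever quantized model is on disk, and verbose=True prints per-token timings like the figures quoted above.

      ```python
      from llama_cpp import Llama

      llm = Llama(
          model_path="models/llama-70b-chat.Q4_K_M.gguf",  # placeholder path to a local GGUF
          n_gpu_layers=-1,  # offload all layers to Metal on Apple silicon
          n_ctx=4096,
          verbose=True,     # prints ms/token and tokens/second after each call
      )

      out = llm("Explain why unified memory helps run large models on laptops.", max_tokens=128)
      print(out["choices"][0]["text"])
      ```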

    • @hibou647 · 9 months ago

      I run different models on a 16-inch M3 Max Mac with 48GB of RAM via Ollama or LangChain. My preference goes to Mixtral Q4: speed-wise comparable to GPT-4, quality-wise to GPT-3.5. I haven't pushed the model on large context windows, but if you need to process a series of small to mid-size texts (reviews, emails, blog posts, short reports), the Mac laptops are a good option.
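
      A short sketch of the Ollama route mentioned above, via the `ollama` Python client. It assumes the Ollama app is running locally and the model has already been pulled (e.g. with `ollama pull mixtral`).

      ```python
      import ollama

      response = ollama.chat(
          model="mixtral",  # a quantized Mixtral build served locally by Ollama
          messages=[{
              "role": "user",
              "content": "Summarise this review in one sentence: great battery, mediocre screen.",
          }],
      )
      print(response["message"]["content"])
      ```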