Semantic-Text-Splitter - AI Based Text-Splitting with LangChain

  • Published Feb 29, 2024
  • In this video I will build upon my last video, where I introduced the semantic-text-splitter package. This time I will show you how to split texts with an LLM.

Comments • 46

  • @awakenwithoutcoffee · 1 month ago

    This is an amazing method, brother. I think the next step would be to train a local LLM to pre-process documents, extract images and tables (with metadata), etc.

  • @andydataguy · 4 months ago +1

    This is awesome! I'm working on a text splitter this weekend. Thank you for this video 🙌🏾

  • @Challseus · 4 months ago +3

    Very happy you continue to do advanced topics 💪🏾

  • @pragyantiwari3885 · 1 month ago +1

    See what I did: first I extracted the text from my PDF files and passed it to this class. It is working well.

  • @user-sw2se1xz6r · 4 months ago +3

    Thanks for the addition to the first video! This makes it all clear now! 👍

  • @swiftmindai · 4 months ago +1

    Looking forward to your upcoming Monday video. My two cents about this channel: honestly, the quality of the content here is far above par compared to many channels with multiple times the subscribers. I've personally shared it with many of my friends, and all of them have benefited from it. You deserve a lot more recognition. Keep up the good work; I wish you and your channel all the very best.

  • @andreypetrunin5702 · 4 months ago

    Thank you so much!!!

  • @say.xy_ · 4 months ago

    Finally I'm able to connect the three dots: LLMs perform better when references are given, every RAG system uses those references as chunks, and better-quality chunks will eventually lower hallucinations and increase the chance of more accurate outputs.

  • @nathank5140 · 4 months ago

    Love it. Thank you. I'd love to hear your ideas on how to preprocess a raw meeting transcript, say, for example, a Joe Rogan episode. In my case it would be a business meeting between a prospect and an onboarding agent. The goal is to process the meeting into something that would be useful later to a retriever. I've thought about writing an article about the meeting, then chunking that, with each chunk being enriched with metadata about who, what, where, and when. But I just can't find the right approach to make the chunks useful/valuable/dense enough without losing context.

    • @codingcrashcourses8533 · 4 months ago

      Depends on what you want to do with it. You can also summarize parts of it if not every detail is important.
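
The transcript idea discussed in this thread can be sketched in plain Python: group consecutive "Speaker: utterance" turns into chunks and attach speaker metadata, so a retriever can later filter on who said what. The `Chunk` type and `chunk_transcript` helper below are hypothetical illustrations, not code from the video:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_transcript(lines, max_turns=3):
    """Group consecutive 'Speaker: utterance' lines into chunks and
    record which speakers appear in each chunk."""
    chunks, buffer, speakers = [], [], set()
    for line in lines:
        speaker, _, _utterance = line.partition(": ")
        buffer.append(line)
        speakers.add(speaker)
        if len(buffer) >= max_turns:
            chunks.append(Chunk(" ".join(buffer), {"speakers": sorted(speakers)}))
            buffer, speakers = [], set()
    if buffer:  # flush any remaining turns
        chunks.append(Chunk(" ".join(buffer), {"speakers": sorted(speakers)}))
    return chunks

transcript = [
    "Agent: Welcome to the onboarding call.",
    "Prospect: Thanks, happy to be here.",
    "Agent: Let's go over the contract terms.",
    "Prospect: Sounds good.",
]
chunks = chunk_transcript(transcript, max_turns=2)
```

A real pipeline would enrich the metadata further (timestamps, topic labels from an LLM) before embedding each chunk.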

  • @Kevin-jm1go · 4 months ago +1

    The reason you don't see this in LangChain yet is probably that the text splitters there are meant to split text which is much larger than the context length of the model. In that case, you cannot just make a single LLM call to chunk your document. If you find a way to semantically chunk text much larger than the model's context length without making too many assumptions about the structure of the text, that would be really interesting.

    • @codingcrashcourses8533 · 4 months ago +2

      Yes, but you could probably build a multistep splitter which takes the size of the context window into consideration :).

    • @Kevin-jm1go · 4 months ago +1

      @codingcrashcourses8533 I like the idea and would love to see a video about this 😉 to get an idea of what kind of challenges this might bring and how they can be tackled.
      One of these challenges could be that when sliding a fixed-size window over your text, you might involuntarily split a coherent piece of text into two topics. Also, sometimes a specific topic is mentioned in distributed parts of the whole text. I guess you would want to merge those chunks, but the challenge lies in identifying them.
      It would also be interesting to see this with some other LLMs, e.g. Aleph Alpha.

    • @efexzium · 4 months ago +1

      The semantic text splitter can adjust the token size, so it can work with most models' context sizes.
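
The multistep idea from this thread can be sketched as follows: a coarse pre-split first keeps every window below the model's context limit, then an LLM re-chunks each window semantically. All function names here are hypothetical, and the LLM call is stubbed out with a simple sentence split:

```python
def split_into_windows(text: str, max_chars: int = 2000, overlap: int = 200):
    """Step 1: coarse, overlapping windows small enough for the context."""
    windows, start = [], 0
    step = max_chars - overlap  # overlap reduces the chance of cutting a topic
    while start < len(text):
        windows.append(text[start:start + max_chars])
        start += step
    return windows

def llm_semantic_chunks(window: str):
    """Step 2 (stub): a real version would prompt an LLM to return
    topic-coherent chunks; here we just split on sentence boundaries."""
    return [s.strip() + "." for s in window.split(".") if s.strip()]

def multistep_split(text: str, max_chars: int = 2000, overlap: int = 200):
    """Combine both steps: pre-split to fit the context, then re-chunk."""
    chunks = []
    for window in split_into_windows(text, max_chars, overlap):
        chunks.extend(llm_semantic_chunks(window))
    return chunks
```

Merging chunks about the same topic that end up in different windows, as raised above, would need an extra pass (e.g. comparing chunk embeddings) and is not handled in this sketch.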

  • @DzulkifleeTaib · 4 months ago +1

    Is it correct to say that the limitation is the context window size of the LLM that processes the inbound input?

    • @codingcrashcourses8533 · 4 months ago +1

      Yes and no. RAG is also about the amount of noise you send to an LLM. The more unrelated data you send to an LLM, the worse its performance becomes.

  • @maxiweisei · 4 months ago

    Looks cool! I'm thinking about implementing this with large, complex PDFs. What would be a suitable way to use this approach with PDFs while keeping the information about the page_number of each chunk?

    • @codingcrashcourses8533 · 4 months ago

      For PDFs, I made a video about multimodal RAG with gpt-4-vision.

  • @karthikarunr8505 · 2 months ago

    Hey, can you share the code/notebook for this? I have a similar use case, and it'd be great if I could take some excerpts from it.

  • @yinxing418 · 4 months ago +3

    I actually liked the semantic text splitter; too bad it's not so semantic. This way doesn't seem like it would scale to large amounts of documents, so I don't think I would use it.

    • @codingcrashcourses8533 · 4 months ago +1

      Yes, I made this video just as a brain teaser. As you can see, I did not make a repo for this :)

  • @anthonycadden120 · 4 months ago +2

    Have you tried doing this with a smaller model, either for speed or for a local instance?

    • @codingcrashcourses8533 · 4 months ago +1

      No, I rarely use local models due to my old computer ;-)

  • @mrchongnoi · 4 months ago

    Edited: Looks good. What about large documents? How would this work for tables?

    • @codingcrashcourses8533 · 4 months ago +1

      Tables are a different topic... they probably should not be embedded at all, but rather stored in a DB and queried via function calling. That's at least my experience with it.

  • @andreweducates · 3 months ago +1

    I'm attempting to do Q&A retrieval with a legal document, and the RecursiveCharacterSplitter hasn't been the best for me. Do you think chunking semantically as you've shown here would work well in my use case?
    Appreciate it! 🙏🏾

    • @codingcrashcourses8533 · 3 months ago

      I don't know your documents. Are they in PDF format? HTML? It heavily depends on what they look like.

    • @andreweducates · 3 months ago

      @codingcrashcourses8533 I'm extracting all the text out of the Microsoft Word document, so it's just a huge string of text. Then I'm calling the createDocuments function on it, followed by the recursive splitting.

    • @andreweducates · 3 months ago

      @codingcrashcourses8533 I'm getting the document as one huge chunk of text, basically. Do you have an email/LinkedIn/Twitter where I can reach out, brother? Thanks!

    • @awakenwithoutcoffee · 1 month ago

      Did you end up finding a solution? I think most devs are facing similar issues right now.

    • @codingcrashcourses8533 · 1 month ago

      @awakenwithoutcoffee We use chunking with GPT-4... it works by far the best.

  • @swiftmindai · 4 months ago +1

    Appreciate the update. Please share the GitHub link for the above codebase if you have time. Thanks.

    • @codingcrashcourses8533 · 4 months ago +1

      I did not create a repo for that since it's just very basic code. The important thing is the idea behind it :)

    • @swiftmindai · 4 months ago

      Yeah, thanks for the wake-up call. Sometimes people tend to get lazy. I've managed to get it working perfectly.

  • @yawboateng9904 · 13 days ago

    Is there a GitHub repo where we can see this code and walk through it?

    • @codingcrashcourses8533 · 13 days ago

      All of the projects I made videos about are available on GitHub. Everything ;)