LangChain: How to Properly Split your Chunks

  • Published on Aug 18, 2023
  • In this video, we take a deep dive into the RecursiveCharacterTextSplitter class in LangChain. How you split your data into chunks determines the quality of the answers you get when you chat with your documents using LLMs. Learn how to properly use the text splitter in LangChain.
    #llm #langchain #PDFchat
    LINKS: python.langchain.com/docs/mod...
  • Science & Technology

Comments • 77

  • @parisneto
    @parisneto 4 months ago

    Just found your channel and while I initially wanted to have you as a professor in a classroom ( maybe back in college 30yrs ago), I really think you are helping to create a better world for many with your content, careful explanation and examples and this is the true reason and mission for a teacher, congrats!

  • @CacoNonino
    @CacoNonino 11 months ago +18

    Please make more videos like this one! Many people got into AI without a coding background, and we are missing more detailed videos on these topics!

    • @AJJU_OZA
      @AJJU_OZA 11 months ago

      Answer me...
      Promote Engineering's videos are for Developer (appreciation) only ??

    • @CacoNonino
      @CacoNonino 11 months ago

      @@AJJU_OZA Well, if it was, I would not have been here for so long hahahahahahaha
      What I meant is that for those who don't have coding knowledge and want to do more than replicate GitHub repos, this hands-on type of video is phenomenal!
      In my case I am working on a text-based RPG game, and the basic concept of this video was one I had yet to grasp!
      Answered!

    • @CacoNonino
      @CacoNonino 11 months ago

      @@AJJU_OZA I mean, if the channel also had an LLM Python-focused course, I would be one of those paying for it.
      I bet there are a ton of people changing careers who also need in-depth explanation videos of basic concepts like this one!

    • @ml-techn
      @ml-techn 11 months ago

      @@CacoNonino what do you mean by LLM python focused course?

    • @CacoNonino
      @CacoNonino 11 months ago

      @@ml-techn I mean, I studied economics but changed careers midway to data engineering!
      Now I am building more and more things on top of LLMs.
      All my coding knowledge came from using ChatGPT in the last year, and I think it is the same for a lot of people, which is why the tutorial videos are so popular.
      Am I making sense? I mean, there are a thousand videos out there that mention splitting text into chunks, but not a lot explaining how that is specifically done the way he did here!

  • @asithakoralage628
    @asithakoralage628 11 months ago +1

    Thank you, you explain very clearly, and I have been watching your content. The videos are really good and honest. Please keep making these types of videos. Thanks a lot.

  • @RealEstate3D
    @RealEstate3D 11 months ago +1

    This is the first time I have seen content on optimal chunk lengths. It might also be interesting to cover how to integrate metadata, for example which page of a book, which URL, or which paragraph of a legal text a chunk comes from. This metadata will also take up space in the retrieval context.
    Good work. Definitely go this road.
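
The metadata idea raised in this comment can be sketched roughly as follows. This is a minimal illustration in plain Python, not the approach from the video: the `Chunk` class, `chunk_pages`, `render_for_prompt`, and the `book.pdf` source name are all hypothetical, and the fixed-size character windows are only a stand-in for a real splitter.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of text plus a record of where it came from (hypothetical type)."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_pages(pages, source, chunk_size=200):
    """Cut each page into fixed-size character windows and tag each window."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for start in range(0, len(page_text), chunk_size):
            chunks.append(Chunk(
                text=page_text[start:start + chunk_size],
                metadata={"source": source, "page": page_number, "start_char": start},
            ))
    return chunks


def render_for_prompt(chunk):
    """Prefix the chunk with its origin; the prefix consumes context space too."""
    header = f"[{chunk.metadata['source']}, p. {chunk.metadata['page']}] "
    return header + chunk.text


pages = ["a" * 450, "b" * 120]  # two fake pages
chunks = chunk_pages(pages, source="book.pdf", chunk_size=200)
```

As the comment notes, the rendered header adds characters to every retrieved chunk, so it is worth budgeting for when choosing a chunk size.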

  • @adnanrizve5551
    @adnanrizve5551 10 months ago

    Great work! Very simple but really detailed. Please create more videos for this series.

  • @yazanrisheh5127
    @yazanrisheh5127 11 months ago +1

    Finally understood this. I remember asking on Discord, and I think you also replied, but having an entire video on this made it much, much clearer. Thank you so much!
    Could you make a video about vector stores: which one to use, how to know what to use, and the code behind it? I have seen a couple like FAISS, ChromaDB, Deep Lake, etc. For my chatbot, it's pretty much the last thing I have left to do, but I still don't understand most of the vector stores.

  • @WinstonWalker-fc7ty
    @WinstonWalker-fc7ty 11 months ago +6

    I’d love to see videos on both embedding size and modifying the text splitter! I’m particularly interested in strategies that would enable inclusion of citations, e.g. a medical article that includes numbered citations at the end of each sentence with the reference list at the end of the document.

  • @e_hana_kakou
    @e_hana_kakou 11 months ago

    Appreciate all your content. I'd love to know more about chunking customization. Thanks! 🤙

  • @wassimsaioudi116
    @wassimsaioudi116 6 months ago

    Incredible ! Hope you'll provide more videos like this one !

  • @deepaksingh9318
    @deepaksingh9318 4 months ago +1

    I think nobody can explain concepts in an easier way than you do.
    I tried 10 different videos to check how the recursive splitter behaves when a paragraph is the chunk size, and you explained it :)
    I love how you cover each and every aspect from a learning point of view. Thanks again.

    • @engineerprompt
      @engineerprompt  4 months ago

      Glad it was helpful. Make sure to watch the next one :)

  • @SmashPhysical
    @SmashPhysical 11 months ago

    Great explanation, thanks, this will be super useful!

  • @darshan7673
    @darshan7673 8 months ago

    Great Video, Thanks for creating the video!

  • @user-ip6yq5tz1r
    @user-ip6yq5tz1r 7 months ago

    Great Video, Thanks for creating the video!😀

  • @hvbris_
    @hvbris_ 11 months ago

    Good video. For the dataset I am working with, I found that splitting by tokens produces better results, but it really depends on the data you're working with, tbh!

  • @SachinChavan13
    @SachinChavan13 11 months ago +1

    Please keep making more such videos. I found this video very helpful..

  • @AA_135
    @AA_135 11 months ago

    Great explanation !

  • @duncanprins9944
    @duncanprins9944 11 months ago

    Great! Much appreciated 😊

  • @izainonline
    @izainonline 10 months ago

    Great Video to understand chunks and textsplitter

  • @Zivafgin
    @Zivafgin 11 months ago

    Great content! Keep up please :)

  • @unshadowlabs
    @unshadowlabs 11 months ago

    I have seen a lot of videos on how to use these chunks with a vector database and have the LLM use RAG as a knowledge base. There seem to be very few videos on how to use the chunked data to fine-tune an LLM like Llama 2. I would love to see a video that covers using raw or chunked data to fine-tune an LLM without having to convert it into something like question-and-answer or instruct formatting.

  • @gerardorosiles8918
    @gerardorosiles8918 10 months ago

    Very nice video, I think anyone working on semantic search goes through the experience you described here. Have you seen a study that checks the performance of different embeddings with respect to the chunk size?
    Also, what are the different available models for embeddings? I have been using the faiss models, I have heard you mention another one. What would be a good strategy to pick one vs. another?

  • @gangs0846
    @gangs0846 6 months ago

    Thank you!

  • @Ken129100
    @Ken129100 11 months ago +1

    Thanks for the video! What if you want to chunk a large PDF of 300 pages? How do you determine the chunk size? I mean, in your example you can observe the length of each paragraph by observation but might be hard to do it for large file. I would appreciate it if you share your opinion.

  • @weber1209rafael
    @weber1209rafael 11 months ago

    Please create more content with in-depth information about how to store this information in a smart way. I'm currently building a domain-specific knowledge base to create an "AI expert" in a certain topic, and I am trying to find the right way to store all the knowledge.

  • @RichardGetzPhotography
    @RichardGetzPhotography 11 months ago

    Yes please do a video on Embedding settings. I am currently using these.
    Parameters
    ----------
    VECTOR_SIZE: int
        The size of the vector for the text embeddings (e.g., 300).
    WINDOW_SIZE: int
        The context window size for text embeddings, capturing larger contextual information (e.g., 20).
    MIN_COUNT: int
        The minimum frequency count for words to be considered in the text embeddings (e.g., 1).
    EPOCHS: int
        The number of training iterations for the Doc2Vec model (e.g., 500).

  • @ShaneHolloman
    @ShaneHolloman 11 months ago

    Excellent to have someone break these concepts down so clearly. Keep going, this is great!

  • @JourneyMindMap
    @JourneyMindMap 6 months ago

    thanks dude

  • @goncaavci1579
    @goncaavci1579 11 months ago

    please make a video about embedding size. you are awesome thank you for videos

  • @nirsarkar
    @nirsarkar 11 months ago

    Please do create one for custom splitting. I have a particular document where I would like to define chunks demarcated by a special sequence.
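
For the case described in this comment, where chunks are delimited by a known marker, a plain split on that marker may already be enough. A minimal sketch, assuming a made-up marker `<<<END>>>` and a hypothetical helper name `split_on_marker`:

```python
import re


def split_on_marker(text, marker="<<<END>>>"):
    """Split text wherever the literal marker occurs and drop empty pieces."""
    # re.escape makes sure the marker is matched literally, not as a regex.
    parts = re.split(re.escape(marker), text)
    return [part.strip() for part in parts if part.strip()]


document = "Section A text<<<END>>>\nSection B text<<<END>>>"
sections = split_on_marker(document)
```

If some sections still come out larger than the embedding model can handle, each one could then be passed through a size-based splitter as a second step.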

  • @hl236
    @hl236 5 months ago

    More videos on chunking and embedding please.

  • @TheCloudShepherd
    @TheCloudShepherd 8 months ago

    Damn, you explained that better in 3 mins than most other videos did in 30 mins

    • @engineerprompt
      @engineerprompt  8 months ago

      glad it was helpful.

  • @subhashinavolu1704
    @subhashinavolu1704 11 months ago

    What if the pdf has tables too? I see the pdf loader in langchain is not reading the table. How to solve that? In case it is solved how does the recursive text splitter work with such tabular data

  • @walidmaly3
    @walidmaly3 11 months ago

    Please continue making videos like this. Any chance u can share the code as well?

  • @kenchang3456
    @kenchang3456 6 months ago

    Excellent explanation, thank you. Just curious, why this video is the only video in your Demystifying LangChain playlist?

    • @engineerprompt
      @engineerprompt  6 months ago +1

      Thank you. There were just way too many things to cover, but I am now getting back to RAG. I will be making a lot more content on it.

  • @surajthakkar3420
    @surajthakkar3420 6 months ago

    Hello mate,
    Any chance you can make a video on Context aware chunking which can improve the quality of chunks/output drastically!

  • @VerdonTrigance
    @VerdonTrigance 5 months ago

    How do I define my own list of separators? Can I set multiple separators for paragraphs and multiple for sentences at the same time?
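
LangChain's RecursiveCharacterTextSplitter accepts a `separators` list, tried in order. The sketch below is not that implementation; it is a simplified pure-Python stand-in (the function name `recursive_split` and the example separator list are assumptions) that shows how one ordered list can mix several paragraph-level and several sentence-level separators.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; recurse with the remaining ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too big: try the finer-grained separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


# Paragraph-level separators first, then sentence-level, then words.
separators = ["\n\n", "\n---\n", ". ", "? ", "! ", " "]
text = "Para one. It has two sentences.\n\nPara two is here."
chunks = recursive_split(text, separators, chunk_size=20)
```

The order of the list is the design choice that matters: coarse separators go first so that paragraphs stay intact whenever they fit, and finer ones are only used on pieces that are still too long.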

  • @MattGoldenberg
    @MattGoldenberg 11 months ago

    Hmmm, curious why you're splitting by character count and not by token count? Our recursive splitter always bottoms out in token count based on the model we're using, as the model can't see character level data, and the token count is the limiting factor we actually care about when inferencing.
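
The point in this comment, measuring chunk size in tokens rather than characters, can be illustrated by making the length measure pluggable (LangChain's splitters expose a `length_function` argument for the same purpose). In this sketch `pack_words` is a hypothetical greedy packer and `toy_token_count` is only a stand-in for a real model tokenizer, so the counts are illustrative.

```python
import re


def toy_token_count(text):
    """Stand-in for a real tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def pack_words(text, max_len, length_function=len):
    """Greedily pack whole words into chunks whose measured length stays within max_len."""
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and length_function(candidate) > max_len:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Tokens, not characters, are what the model actually sees."
by_chars = pack_words(text, max_len=20)  # budget in characters
by_tokens = pack_words(text, max_len=5, length_function=toy_token_count)  # budget in tokens
```

With a real tokenizer swapped in for `toy_token_count`, the same packing loop would respect the model's actual context budget.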

  • @r0f115L4m
    @r0f115L4m 11 months ago +1

    Thank you for your video. What program are you using to create your diagrams?

  • @guanjwcn
    @guanjwcn 11 months ago

    please continue with these. they are useful.

  • @PerFeldvoss
    @PerFeldvoss 11 months ago

    What if you preprocess the texts and reorganise sentences by "key subject relationships"? That is, as a supplement to the original text, you could make chunks of text that summarise different key subjects. The AI would produce a (creative) list of these subjects and then use that list when running through the text again, so you can "make LangChain know" which sentences actually belong together!

  • @mdfarhananis8950
    @mdfarhananis8950 11 months ago +1

    Really useful
    Please continue making these

  • @texasfossilguy
    @texasfossilguy 11 months ago

    What about a dynamic chunk size as a potential future feature? How does this work for a large series of documents like textbooks and other pdfs like science articles, or legal documents? What is a "best guess" for the parameters?

    • @shivanshugautam1381
      @shivanshugautam1381 11 months ago

      Hi I am also having same problem. Do you have any idea how we can divide our document chunk efficiently.

  • @waelmashal7594
    @waelmashal7594 11 months ago

    If we check our docs, measure the length of each paragraph, and set the chunk size to the max length, would that help? Or maybe take the average length of all paragraphs? It depends on the splitter. What do you think?

    • @engineerprompt
      @engineerprompt  9 months ago

      This might be dated, but yes, that can be one approach. Another is to use regular expressions if there is a pattern in the data. There are now more advanced retrieval methods that can compress data in the documents to make them more relevant to the query. A lot is happening in this space.
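
The approach discussed in this thread, looking at paragraph lengths before picking a chunk size, can be sketched as below. It assumes paragraphs are separated by blank lines; `paragraph_length_stats` is a hypothetical helper, and the 90th-percentile figure is just one possible heuristic, not a recommendation from the video.

```python
import statistics


def paragraph_length_stats(text):
    """Summarise paragraph lengths (in characters) for a blank-line separated text."""
    lengths = sorted(len(p) for p in text.split("\n\n") if p.strip())
    if not lengths:
        raise ValueError("no paragraphs found")
    p90_index = min(len(lengths) - 1, int(0.9 * len(lengths)))
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "p90": lengths[p90_index],
        "max": lengths[-1],
    }


sample = "a" * 100 + "\n\n" + "b" * 300 + "\n\n" + "c" * 200
stats = paragraph_length_stats(sample)
```

Choosing the maximum keeps every paragraph whole but lets one outlier inflate all chunks, while the mean or median will split the longer paragraphs; the same function also works on a large file where reading every paragraph by eye is not practical.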

  • @arkodeepchatterjee
    @arkodeepchatterjee 11 months ago

    really useful
    please continue making videos like this

  • @jstormclouds
    @jstormclouds 11 months ago

    i feel i get the gist but interested in more on topic

  • @mikelugarte
    @mikelugarte 10 months ago

    I have a CSV file with product descriptions and IDs. I need to query the descriptions with the user input in order to get the product ID. I am using CharacterTextSplitter to split the full file into chunks, with 1 line per chunk. After that I want to do a similarity_search to get the lines of the CSV whose descriptions are similar to the user input. I'm using the "\n" separator to split the text by lines but, for whatever reason, it doesn't work sometimes. I'd love to see an example of CharacterTextSplitter in this kind of situation, or how to use RecursiveCharacterTextSplitter to do the same.

    • @user-im6cm7fr8p
      @user-im6cm7fr8p 10 months ago

      Same issue I am also facing. I have managed to write a generic code for chunking, however I am able to get results only for small data sets not for large data sets. Did you manage to solve it ?
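
One possible way around the line-splitting problem in this thread is to skip the text splitter and build one chunk per CSV record with a CSV parser, since a quoted description may itself contain a newline. This is a sketch under assumptions: the column names `id` and `description`, the helper `rows_to_chunks`, and the sample data are all made up, and whether this is the cause of the commenter's issue is not known.

```python
import csv
import io


def rows_to_chunks(csv_text, id_column="id", text_column="description"):
    """Turn each CSV record into one chunk, keeping the product ID as metadata."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return [
        {"text": row[text_column], "metadata": {"id": row[id_column]}}
        for row in reader
    ]


# The first description contains a newline inside quotes, which a plain
# split on "\n" would cut into two broken lines.
csv_text = 'id,description\n1,"Red mug,\nceramic"\n2,Blue plate\n'
chunks = rows_to_chunks(csv_text)
```

Each chunk's text can then be embedded for similarity search, and the matching product ID read back from its metadata.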

  • @AJJU_OZA
    @AJJU_OZA 11 months ago

    Sir Promote Engineering's videos are for Developer (appreciation)...???

  • @rutvikghori2410
    @rutvikghori2410 3 months ago

    Thank you! How I can resolve issues of splitting, suppose I have multiple files and I want to generate a summary individually

    • @engineerprompt
      @engineerprompt  3 months ago +1

      In that case, look into summarization-specific chains. Map-reduce will be a good start.

    • @rutvikghori2410
      @rutvikghori2410 3 months ago

      @@engineerprompt Suppose these are code files and I want to generate summary for all separately.
      What should I do?

  • @amol5146
    @amol5146 5 months ago

    Can you please explain how the chunk_overlap parameter works?

    • @engineerprompt
      @engineerprompt  5 months ago

      Let's say you define the chunk size to be 1000 characters with an overlap of 200. In this case, the first chunk will cover characters 1-1000 and the second chunk will cover 801-1800, because there is an overlap of 200. Hope this helps.

    • @amol5146
      @amol5146 5 months ago

      @@engineerprompt Thank you! Does chunk_overlap also follow the default list?
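
The overlap arithmetic from the reply above (chunk size 1000, overlap 200) can be checked with a small sliding-window sketch. `window_bounds` is a hypothetical helper that models an idealised fixed-width window; a separator-aware splitter such as the recursive one builds chunks from whole pieces of text, so its real boundaries will usually not land exactly on these offsets.

```python
def window_bounds(text_length, chunk_size, chunk_overlap):
    """Return 1-based inclusive (start, end) positions of each sliding window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    bounds, start = [], 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        bounds.append((start + 1, end))
        if end == text_length:
            break
        start += step
    return bounds


bounds = window_bounds(text_length=2600, chunk_size=1000, chunk_overlap=200)
```

Each window advances by `chunk_size - chunk_overlap` characters, which is why the second chunk begins at character 801 rather than 1001.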

  • @computerauditor
    @computerauditor 11 months ago

    🔥🔥🔥

  • @vertigoz
    @vertigoz 2 months ago

    The link no longer works

  • @fra8156
    @fra8156 11 months ago +1

    What about making a video using a very small LLM that every PC can handle, fine-tuning it for a very specific task, and showing every step from zero to hero so that we can work offline? That way everyone can get hands-on with this "lab" and learn by doing.

  • @MuhammadDanyalKhan
    @MuhammadDanyalKhan 5 months ago

    had a question on this video i.e. how to split chunks:
    th-cam.com/video/n0uPzvGTFI0/w-d-xo.html .... How I can find best chunk size for financial statements?

  • @frazuppi4897
    @frazuppi4897 11 months ago

    In real life you need to do way more, and all the tutorials basically just split some clean txt files, but this is a good introduction.

  • @CarlosIvanDonet
    @CarlosIvanDonet 11 months ago

    Does this work on a cpp local model? Like modelname-ggmlv1.q4_1.bin

    • @engineerprompt
      @engineerprompt  11 months ago +1

      Yes, it will work with any model