Semantic-Text-Splitter - Create meaningful chunks from documents

  • Published Jul 8, 2024
  • In this video I want to show you a package which uses BERT to create chunks - semantic-text-splitter
    Repo: pypi.org/project/semantic-tex...
    Code: github.com/Coding-Crashkurse/...
    Timestamps
    0:00 Introduction
    0:57 Code walkthrough

Comments • 39

  • @codingcrashcourses8533
    @codingcrashcourses8533  4 months ago

    I made a mistake in this video: this splitter does NOT accept a full model, it only accepts the tokenizer. Sorry for that. So I am still looking for a good way to create LLM-based chunks :( (see the usage sketch below this thread)

    • @nmstoker
      @nmstoker 4 months ago

      It's a shame, but I think the underlying idea of what you were after makes sense. It amuses me that people so often feed LLMs RAG outputs that even a typical human would struggle with!

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +5

      @@nmstoker I will release a video on how to make an LLM-based splitter next :). If nobody else wants to do it, let's do it ourselves :)

    • @nathank5140
      @nathank5140 4 months ago +1

      Following for that. I have many meeting transcripts with discussions between two or more participants. The conversations are often non-linear, with topics revisited multiple times. I'm trying to find a good way to embed the content. I'm thinking of writing one or more articles on each meeting and then chunking those. Not sure; I would appreciate any ideas.

    • @vibhavinayak8527
      @vibhavinayak8527 months ago

      @@codingcrashcourses8533 It looks like some people have implemented 'Advanced Agentic Chunking', which actually uses an LLM to do so! Maybe you should make a video about it?
      Thank you for your content, love your videos!

    • @codingcrashcourses8533
      @codingcrashcourses8533  months ago

      @@vibhavinayak8527 Currently learning LangGraph, but I still struggle with that.
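
    For reference, a minimal sketch of tokenizer-based chunking with semantic-text-splitter. This assumes the current TextSplitter API; older releases exposed a HuggingFaceTextSplitter class instead, and the exact signature may differ between versions:

        from semantic_text_splitter import TextSplitter
        from tokenizers import Tokenizer

        # As the pinned comment notes, the splitter takes a tokenizer, not a
        # full model: chunk boundaries come from token counts and text
        # structure, not from BERT embeddings.
        tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
        splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 1000)  # max tokens per chunk

        chunks = splitter.chunks("your long document text ...")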

  • @micbab-vg2mu
    @micbab-vg2mu 4 months ago +3

    Thank you for the video :) I agree, random chunking every 500 or 1000 tokens gives random results.

  • @henkhbit5748
    @henkhbit5748 3 months ago +1

    Yes, a much better chunking approach. Thanks for showing 👍

  • @kenj4136
    @kenj4136 4 months ago +3

    Your tutorials are gold, thanks!

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago

      Thanks so much! Honestly, I am quite surprised that so many people watch and like this video.

  • @MikewasG
    @MikewasG 4 months ago +1

    Thank you for your effort! The video is very helpful!

  • @andreypetrunin5702
    @andreypetrunin5702 4 months ago

    Thank you! Very useful!

  • @ashleymavericks
    @ashleymavericks 4 months ago

    This is a brilliant idea!

  • @Munk-tt6tz
    @Munk-tt6tz 2 months ago

    Exactly what I was looking for, thanks!

  • @pillaideepakb
    @pillaideepakb 4 months ago

    This is amazing

  • @znacibrateSANDU
    @znacibrateSANDU 4 months ago

    Thank you

  • @ahmadzaimhilmi
    @ahmadzaimhilmi 4 months ago

    Very intuitive approach to improving RAG performance. I wonder if the bar chart at the end would be better off substituted with a 2-dimensional representation and evaluated with KNN.

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +1

      Yes, I did a whole video on how to visualize embeddings.
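
    A minimal sketch of that kind of 2-D view, using PCA from scikit-learn (the embedding matrix here is random placeholder data; in practice it would come from embedding your chunks):

        import numpy as np
        import matplotlib.pyplot as plt
        from sklearn.decomposition import PCA

        # Placeholder: 40 fake chunk embeddings of dimension 384.
        rng = np.random.default_rng(0)
        embeddings = rng.normal(size=(40, 384))

        # Project to 2-D so chunks with similar embeddings land close together.
        coords = PCA(n_components=2).fit_transform(embeddings)
        plt.scatter(coords[:, 0], coords[:, 1])
        plt.title("Chunk embeddings, PCA projection")
        plt.show()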

  • @bertobertoberto3
    @bertobertoberto3 3 months ago

    Wow

  • @user-lg6dl7gr9e
    @user-lg6dl7gr9e 4 months ago +1

    We need a LangChain-in-production course, hope you consider it!!!

  • @maxlgemeinderat9202
    @maxlgemeinderat9202 4 months ago

    Interesting! Saw the LangChain implementation. Do you prefer this one, and could the tokenizer be any embedding model?

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +2

      There is a difference between an embedding model and a tokenizer; hope you are aware of that. If yes, I didn't understand the question.

  • @moonly3781
    @moonly3781 4 months ago

    Stumbled upon your amazing videos and want to thank you for the incredible tutorials. Truly amazing content!
    I'm developing a study-advisor chatbot that aims to answer students' questions based on detailed university course descriptions, each roughly the length of a PDF page. The challenge arises from descriptions that vary across universities for similar course names, and from the length of each description. Each document starts with the university name and the course description. I've tried adding the university name and course name before every significant point, which helped when chunking by regex to ensure all relevant information is contained within each chunk. Despite this, when asking university-specific questions, the correct course description for the queried university sometimes doesn't appear in the result chunks. Considering a description is about a page of text, do you have a better approach to this problem or any tips? Really sorry for the long question :) I would be very grateful for the help.

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +1

      It depends on your use case. I think embeddings for one whole PDF are quite trashy, but if you need the whole document you can have a look at a parent-child retriever: you embed very small documents but pass the larger, related document to the LLM. Not sure what to do with the noise part; LLMs can handle SOME noise :)
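
    A minimal sketch of that parent-child idea with LangChain's ParentDocumentRetriever (import paths vary across LangChain versions; Chroma and OpenAIEmbeddings are just example choices, and the sample document is made up):

        from langchain.retrievers import ParentDocumentRetriever
        from langchain.storage import InMemoryStore
        from langchain_community.vectorstores import Chroma
        from langchain_core.documents import Document
        from langchain_openai import OpenAIEmbeddings
        from langchain_text_splitters import RecursiveCharacterTextSplitter

        # One Document per full course description (the "parent").
        course_docs = [
            Document(
                page_content="University X - Machine Learning: covers ...",
                metadata={"university": "University X"},
            ),
        ]

        # Small child chunks get embedded; the full parent is what the LLM sees.
        retriever = ParentDocumentRetriever(
            vectorstore=Chroma(collection_name="courses", embedding_function=OpenAIEmbeddings()),
            docstore=InMemoryStore(),
            child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),
        )
        retriever.add_documents(course_docs)

        docs = retriever.invoke("What does the ML course at University X cover?")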

  • @raphauy
    @raphauy 4 months ago

    Thanks for the video. Is there a way to do this with TypeScript?

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago

      Yes, but someone probably has to create that npm package first :)

  • @thevadimb
    @thevadimb months ago

    Why didn't you like the LangChain implementation of the semantic splitter? What was the problem with it?

  • @user-sw2se1xz6r
    @user-sw2se1xz6r 4 months ago

    Is it theoretically possible to have a normal LLM like Llama 2 or Mistral do the splitting?
    The idea would be to have a completely local alternative running on top of Ollama.
    I see that semantic-text-splitter uses a tokenizer for that, and I understand the difference.
    I am just curious whether it would be possible.
    Thanks for the vids btw, learned a lot from them. ✌✌

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +1

      Sure, that is possible. You can treat it as a normal task for the LLM. I would add that the new chunks should contain delimiter characters an output parser can use to create multiple elements from them (see the sketch below this thread).

    • @nathank5140
      @nathank5140 4 months ago

      @@codingcrashcourses8533 What do you mean? Can you provide an example?

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago

      @@nathank5140 I will release a video on that topic on Friday! :)
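
    A hypothetical sketch of that approach with a local model via the ollama Python package (model choice, prompt, and delimiter are illustrative; this assumes a running Ollama server with the model already pulled):

        import ollama

        DELIM = "<<<CHUNK>>>"

        def llm_split(text: str) -> list[str]:
            # Ask the local model to reproduce the text verbatim, inserting the
            # delimiter at topic boundaries; the delimiter is exactly the kind
            # of character sequence an output parser can split on.
            response = ollama.chat(
                model="mistral",
                messages=[
                    {
                        "role": "system",
                        "content": f"Copy the user's text verbatim, inserting {DELIM} "
                                   "between semantically distinct sections.",
                    },
                    {"role": "user", "content": text},
                ],
            )
            raw = response["message"]["content"]
            return [c.strip() for c in raw.split(DELIM) if c.strip()]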

  • @mansidhingra4118
    @mansidhingra4118 months ago

    Hi, thanks for this brilliant video. Really thoughtful of you. Just one question: when I tried to import HuggingFaceTextSplitter, I received an ImportError: "ImportError: cannot import name 'HuggingFaceTextSplitter' from 'semantic_text_splitter'". Any idea how to make it work?

    • @codingcrashcourses8533
      @codingcrashcourses8533  months ago

      Currently not. Maybe they changed the import path. What version do you use?

    • @mansidhingra4118
      @mansidhingra4118 months ago

      @@codingcrashcourses8533 Thank you for your response. The version of semantic_text_splitter I'm currently using is 0.13.3.
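
    One possible cause (an assumption, not confirmed in this thread): newer releases of the package consolidated the splitter classes into a single TextSplitter, so on 0.13.x something like the following may work instead of the old import; check the package changelog for the exact signature:

        # Hypothetical replacement for the removed HuggingFaceTextSplitter import:
        from semantic_text_splitter import TextSplitter
        from tokenizers import Tokenizer

        splitter = TextSplitter.from_huggingface_tokenizer(
            Tokenizer.from_pretrained("bert-base-uncased"), 1000
        )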

  • @jasonsting
    @jasonsting 4 months ago

    Since this solution creates "meaningful" chunks, implying that there can be meaningless or less meaningful chunks, would that imply that these chunks affect the semantic quality of the embeddings/vector database? I was previously getting garbage out of a ChromaDB/FAISS test, and this would explain it.

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +3

      I would argue that there are two different kinds of "trash" chunks: 1. docs that just get cut off and lose their meaning, and 2. chunks that are too large and cover multiple topics, so the embeddings just don't mean anything.