📚 To try everything Brilliant has to offer, free, for a full 30 days, visit brilliant.org/AdamLucek/ You'll also get 20% off an annual premium subscription! 💡
This is probably the last guide on RAG chunking I'll ever need. So well done. Thank you for the walkthrough of the research!
Adam, I'm struggling to find the words to express how grateful I am for the content you share on your channel. Your ability to convey information clearly and without unnecessary speculation is truly brilliant. Thank you very much. If I were not broke, I would support you, but all I can support you with is thanking you.
The kind words are support enough! Thanks for watching!
I am trying to build an Agentic RAG framework with tool calling for Geographic Information System (GIS) workflows for my Master's thesis. I spent a lot of time trying to figure out the best chunking strategy, and this honestly humbled me. Semantic chunking was a very compute-intensive process, but it sort of made sense theoretically, so I went with that. I'm glad I was only prototyping, though, and since the dataset I have is huge, this is such a relief!
Thanks for covering this Adam! Your content has been a great help.
Sounds like a cool thesis! Glad I could help!
The LLM-based approach sounds interesting but also expensive. I haven't implemented any RAG yet, but this was great food for thought for helping me know where to start! Thanks!
That's a great and quite surprising overview! Thank you :)
Thanks for watching!
Just what I need, big thanks
What I would really like to see is what the code looks like to actually implement the two suggested "best" chunkers in simple examples.
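The comment doesn't name which chunkers the video ranked best, so here's only a minimal, dependency-free sketch of the recursive splitting idea that libraries like LangChain implement as `RecursiveCharacterTextSplitter`: try the coarsest natural separator first (paragraphs), and only fall back to finer ones (lines, sentences, words) when a piece still exceeds the size budget. Function name, separators, and chunk size are all illustrative, not the video's exact setup.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", ". ", " ")):
    """Toy recursive character splitter (illustrative, not a library API)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No natural separator left: hard-cut as a last resort.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            # Greedily merge pieces while they fit in one chunk.
            current = candidate
        elif len(piece) <= chunk_size:
            # Piece fits on its own: flush the running chunk, start fresh.
            if current:
                chunks.append(current)
            current = piece
        else:
            # Piece is itself too big: recurse with the finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks
```

Because splits happen at natural separators, each chunk stays a run of whole paragraphs/sentences/words under the size limit, which is the behavior the recursive approach is credited with in the thread above.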
Can Anthropic's contextual RAG also be considered a chunking strategy?
This is interesting to see, especially since multiple articles state that when using recursive chunking, chunk_overlap is an important parameter for preserving context between chunks, but Chroma suggests otherwise. What are your thoughts on this from your RAG experience?
Overlap can be a little redundant here and there. It can definitely help when relevant but not obviously connected context gets cut apart, which is roughly what the cluster semantic chunker here is trying to solve for. But if your chunk sizes are big enough and your retrieval mechanism is robust, the splitting on natural separators that recursive approaches do tends to keep relevant sections together on its own when working with text data, and that only improves when the semantic approaches bring cosine similarity comparisons into the mix.
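To make the cosine similarity point in that reply concrete, here is a toy sketch of how a semantic chunker can place boundaries where adjacent pieces stop being similar, instead of padding every chunk with overlap "just in case". The bag-of-words "embedding" and the threshold value are stand-ins for a real embedding model and a tuned cutoff, purely for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_boundaries(sentences, threshold=0.2):
    # Propose a chunk boundary wherever adjacent sentences are
    # dissimilar enough, i.e. where the topic appears to change.
    vecs = [embed(s) for s in sentences]
    return [i + 1 for i in range(len(vecs) - 1)
            if cosine(vecs[i], vecs[i + 1]) < threshold]
```

With sentences like `["the cat sat on the mat", "the cat likes the mat", "stock prices fell sharply"]`, the first pair scores high and stays in one chunk, while the topic shift before the third sentence scores near zero and becomes a boundary, so related context stays together without any overlap.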
🎉🎉🎉