LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

  • Published Jul 30, 2024
  • In this video, we're going to focus on preparing our text using LangChain data loaders, tokenization using the tiktoken tokenizers, chunking with LangChain text splitters, and storing data with Hugging Face datasets. Naturally, the focus here is on OpenAI embedding and completion models, but we can apply the same logic to other LLMs like those available via Hugging Face, Cohere, and so on.
    🔗 Notebook link:
    github.com/pinecone-io/exampl...
    🎙️ Support me on Patreon:
    / jamesbriggs
    🎨 AI Art:
    www.etsy.com/uk/shop/Intellig...
    🤖 70% Discount on the NLP With Transformers in Python course:
    bit.ly/3DFvvY5
    🎉 Subscribe for Article and Video Updates!
    / subscribe
    / membership
    👾 Discord:
    / discord
    00:00 Data preparation for LLMs
    00:45 Downloading the LangChain docs
    03:29 Using LangChain document loaders
    05:54 How much text can we fit in LLMs?
    11:57 Using tiktoken tokenizer to find length of text
    16:02 Initializing the recursive text splitter in LangChain
    17:25 Why we use chunk overlap
    20:23 Chunking with RecursiveCharacterTextSplitter
    21:37 Creating the dataset
    24:50 Saving and loading with JSONL file
    28:40 Data prep is important
  • Science & Technology
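The chunk-overlap idea the video covers (17:25) can be sketched in a few lines of plain Python. This is an illustrative stand-in, not LangChain's implementation: `toy_token_len`, a whitespace tokenizer, stands in for the video's tiktoken-based length function, and the sizes are arbitrary.

```python
def toy_token_len(text: str) -> int:
    """Stand-in length function: counts whitespace tokens (tiktoken in the video)."""
    return len(text.split())

def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping forward by
    chunk_size - chunk_overlap so neighbouring chunks share context."""
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=2)
print(chunks[:2])  # neighbouring chunks share two tokens of context
```

The overlap means information sitting on a chunk boundary appears whole in at least one chunk, at the cost of some duplicated storage.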

Comments • 99

  • @jamesbriggs
    @jamesbriggs  1 year ago +12

    LangChain docs have moved so the original wget command in this video will no longer download everything, now you need to use:
    !wget -r -A.html -P rtdocs python.langchain.com/en/latest/

    • @mohitagarwal9007
      @mohitagarwal9007 1 year ago +1

      This is not downloading everything either. Is there anything else we can use to get the necessary files?

    • @jamesbriggs
      @jamesbriggs  1 year ago +9

      @@mohitagarwal9007 yes I have created a copy of the docs on Hugging Face here huggingface.co/datasets/jamescalam/langchain-docs-23-06-27
      You can download by doing a `pip install datasets` followed by:
      ```
      from datasets import load_dataset
      data = load_dataset('jamescalam/langchain-docs-23-06-27', split='train')
      ```

    • @deniskrr
      @deniskrr 9 months ago +1

      @@mohitagarwal9007 just go to the above link and see where you're getting redirected to now. Then copy the link from the browser to the wget command and it should always work.

  • @videowatching9576
    @videowatching9576 1 year ago

    Fascinating channel, thanks! Remarkable to learn about LLMs, how to interact with LLMs, what can be built, and what could be possible over time. I look forward to more.

  • @redfield126
    @redfield126 1 year ago

    Thank you James for the in depth explanation of data prep. Learning a lot with your videos.

  • @ADHDOCD
    @ADHDOCD 1 year ago +4

    Great video. Finally somebody who goes in depth into data prep. I've always wondered about unnecessary (key, value) pairs in json files.

  • @harleenmann6280
    @harleenmann6280 8 months ago

    Great video series. Appreciate you sharing your thought process as we go - this is the part most online tech content creators miss. They cover the how, and more often than not miss the why. Thanks again. Enjoying all the videos in this playlist.

  • @dikshyakasaju7541
    @dikshyakasaju7541 1 year ago +1

    Thank you for sharing this informative video showcasing LangChain's powerful text chunking capabilities using the RecursiveCharacterTextSplitter. Previously, I had to write several functions to tokenize and split text while managing context overlap to avoid missing crucial information. However, accomplishing the same task now only requires a few lines of code. Very impressive.

  • @fgfanta
    @fgfanta 1 year ago

    I needed to chunk text for retrieval augmentation, did a search on YouTube, and found... James Briggs' video. I know I will find in it what I need. Nice!

  • @SnowyMango
    @SnowyMango 1 year ago

    This was great! I made the terrible mistake of chunking without considering this simple math and embedding and indexing into Pinecone at the larger size. Now I have to go redo them all after realizing that at their current sizes they aren't quite suitable for LangChain retrieval.

  • @grandplazaunited
    @grandplazaunited 1 year ago +1

    Thank you for sharing your knowledge. these are some of the best videos on LangChain.

  • @user-wy9fc5vi3j
    @user-wy9fc5vi3j 1 year ago +1

    Chunking is the most important idea and largely ignored. Thanks James, love your technical depth.

  • @alvinpinoy
    @alvinpinoy 1 year ago +1

    Very helpful and very well explained. Thanks for sharing your knowledge about this! LangChain really feels like the missing glue between the open web and all those new AI models popping up.

    • @jamesbriggs
      @jamesbriggs  1 year ago

      yeah it's really helpful

  • @temiwale88
    @temiwale88 1 year ago

    I'm @ 12:34 and this is an amazing explanation thus far. Thank you!

  • @eRiicBelleT
    @eRiicBelleT 1 year ago

    Uff the video that I was expecting! Thank youuu!

  • @fraternitas5117
    @fraternitas5117 1 year ago

    James dropping the great content as usual.

  • 1 year ago +1

    Great content once again, thanks for sharing. I wish I had this a couple weeks ago :D

  • @lf6190
    @lf6190 1 year ago

    Awesome I was just trying to figure out how to do this with the langchain docs so that I can learn it quicker!

  • @siamhasan288
    @siamhasan288 1 year ago

    Ayo, I was literally looking for how to prepare my data for the past hour. Thank you for making these.

    • @eRiicBelleT
      @eRiicBelleT 1 year ago +1

      In my case the last two weeks xD

  • @muhammadhammadkhan1289
    @muhammadhammadkhan1289 1 year ago +2

    You always know what I am looking for thanks for this 🙏

  • @AlexBego
    @AlexBego 1 year ago

    James, I should say Thank You a Lot for your interesting and so useful videos!

    • @jamesbriggs
      @jamesbriggs  1 year ago

      you're welcome, thanks for watching them!

  • @matheusrdgsf
    @matheusrdgsf 1 year ago

    James you are helping a lot in my activities. Thank you.

  • @codecritique
    @codecritique 3 months ago

    Thanks for the tutorial, really clear explanation!

  • @murphp151
    @murphp151 1 year ago

    these videos are pure class

  • @TomanswerAi
    @TomanswerAi 1 year ago

    Nice one James. Demystified that step for me there 👍 As you say if people get this part wrong everything else will underperform

    • @jamesbriggs
      @jamesbriggs  1 year ago

      yeah it's super important

  • @rodgerb2645
    @rodgerb2645 1 year ago

    Amazing James, I've learned so much from you!

  • @calebmoe9077
    @calebmoe9077 1 year ago

    Thank you James!

  • @BrianStDenis-pj1tq
    @BrianStDenis-pj1tq 9 months ago

    At first, it seemed like you switched from tiktoken len to char len of your chunks, when explaining RecursiveCharacterTextSplitter. That wasn't going to work, so I went back and found that you did show, maybe not so much explain, that the splitter is using the tiktoken len function. Makes sense now, thanks!

  • @rishniratnam
    @rishniratnam 1 year ago

    Nice video James.

  • @gunderhaven
    @gunderhaven 1 year ago

    Hi James, thanks for sharing your work. In this video, you briefly mention cleaning up the "messy bits" in the plain text page content and that it is not necessary in your estimation. Could you suggest an approach to clean up those messy bits to some degree? Thanks in advance.

  • @henkhbit5748
    @henkhbit5748 1 year ago

    Thanks James, for sharing this information.👍 I always thought that the 4k token limit for chatgpt-turbo was independent for the input and the output completion, not combined...

  • @ylazerson
    @ylazerson 1 year ago

    great video - super appreciated!

  • @SuperYoschii
    @SuperYoschii 1 year ago +2

    Thanks for the content James! I think they changed something when downloading the htmls with wget. When I run the colab, it only downloads a single index.html file

  • @Sergedable
    @Sergedable 1 year ago +1

    Nice job. Also, it would be great if you could make a video on how to combine, for example, multiple documents (doc1, doc2, doc3, etc.) and use LangChain and GPT-4 to analyze them.

  • @raypixelz
    @raypixelz 1 year ago

    Awesome. thank you!

  • @jamesbriggs
    @jamesbriggs  1 year ago +1

    if the code isn't loading for you from video links, try opening in Colab here:
    colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/xx-langchain-chunking.ipynb

  • @ketangote
    @ketangote 11 months ago

    Great Video

  • @paenget
    @paenget 1 year ago

    Amazing❤

  • @LucaMainieri68
    @LucaMainieri68 1 year ago

    Thank you for your amazing video and all the work you do. I was wondering how to use LangChain to perform data analysis on one or more datasets. Let's say I have leads, sales, and orders datasets. Can I use LangChain to perform some analysis, such as asking which customers placed the last order, or how sales were last month? Can you point me in the right direction? Thanks 🙏

  • @ewanp1396
    @ewanp1396 1 year ago

    Great video. What software are you using for the video (as in the notebook with blue background)?

  • @artchess0
    @artchess0 1 year ago

    Hi James, thank you very much for your videos. I have a question: what if we need to pass context to our LLM to translate from one language to another? Is it better to chunk in the smallest sizes or up to the model's token limit per request? I'm thinking of processing the chunks in parallel and then joining the translation results together, but I don't know what the best approach is. Thank you in advance.

  • @user-tk7os4dm5j
    @user-tk7os4dm5j 1 year ago

    James, this video (and all your postings) are excellent! Exactly what a long time developer, looking to expand into AI needs to get started! Do you do any lectures at conferences?

  • @dreamphoenix
    @dreamphoenix 1 year ago

    Thank you.

  • @generichuman_
    @generichuman_ 1 year ago

    I'm training a transformer model from scratch just to get a better intuition for how they work. I'm curious if you know the best way to set up the text dataset so that each text chunk is its own entity and won't bleed over into other chunks. For example, with a dataset of stories, when one ends and another begins, I don't want the next story to still have context from the last story. I'm using the Hugging Face tokenizer to implement BPE. I hope this makes sense and I would greatly appreciate any guidance!

  • @MaciekMorz
    @MaciekMorz 1 year ago +2

    I have seen a lot of materials on how to store embeddings in Pinecone vector db. But I haven't seen any tutorial yet on how to store vectorstores with different embeddings of different users in one index. I.e. how to retrieve embeddings depending on which user they belong to. What would be the best strategy for this, whether through metadata or something else?
    It would be great to see a tutorial on this especially using langchain although it seems to me that the current wrapper doesn't really allow this. BTW. The whole series with langchain is great!

  • @krisszostak4849
    @krisszostak4849 1 year ago

    Hi James, thanks for your amazing work!
    I've been playing with this lately and I'm not sure if I understand the connection between the tiktoken_len function and the chunk_size and length_function args in RecursiveCharacterTextSplitter. So the question is this:
    In the RecursiveCharacterTextSplitter - if "length_function=len" (the default), then "chunk_size" sets the max number of CHARACTERS in the chunk, but if "length_function=tiktoken_len" (or any other token counter), then "chunk_size" sets the max number of TOKENS? Is that correct?
    Thanks!
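The relationship asked about here - chunk_size being measured in whatever units the length function returns - can be shown with a toy greedy splitter in plain Python. This is not LangChain's code, and `toy_token_len` is a hypothetical whitespace stand-in for tiktoken_len:

```python
def toy_token_len(text: str) -> int:
    # hypothetical stand-in for tiktoken_len: one "token" per whitespace word
    return len(text.split())

def greedy_split(words, chunk_size, length_function):
    """Pack words into chunks whose measured length stays <= chunk_size,
    where "length" is whatever length_function returns."""
    chunks, current = [], []
    for w in words:
        candidate = " ".join(current + [w])
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))
            current = [w]
        else:
            current = current + [w]
    if current:
        chunks.append(" ".join(current))
    return chunks

words = "splitting text by characters or tokens changes chunk counts".split()
print(greedy_split(words, 12, len))            # chunk_size counted in characters
print(greedy_split(words, 3, toy_token_len))   # chunk_size counted in "tokens"
```

The same input yields different chunk boundaries under the two length functions, which is exactly the character-vs-token distinction in the question.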

  • @younginnovatorscenterofint8986
    @younginnovatorscenterofint8986 1 year ago

    Thanks for the content James. I am trying to build a document conversational assistant using LangChain and Hugging Face, but I have been getting this error: "Token indices sequence length is longer than the specified maximum sequence length for this model (2842 > 512). Running this sequence through the model will result in indexing errors."

  • @mohammadsunasra
    @mohammadsunasra 1 year ago

    So James, what you mean is it will first split based on the first separator, then check whether the number of tokens > chunk size, and if yes, split again on the next separator until the number of tokens < chunk size, right?

  • @RedCloudServices
    @RedCloudServices 1 year ago

    James, I hope I am asking this question correctly. Would it not be cheaper to fine-tune an existing GPT model with your entire custom corpus (i.e. your LangChain docs) and then have a chatbot using your finished fine-tuned LLM published via OpenAI?

  • @hashiromer7668
    @hashiromer7668 1 year ago +10

    Wouldn't chunking lose information about long term dependencies between passages? For example, if a term is defined in the start of document which is used in the last passage, this dependency won't be captured if we chunk documents.

    • @jamesbriggs
      @jamesbriggs  1 year ago +5

      yes, this is an issue with it, if you're lucky and using a vector db with returning 5 or so chunks, you might return both chunks and then the LLM sees both, but naturally there's no guarantee of this - I'm not aware of a better approach for tackling this problem with large datasets though

    • @bobjones7274
      @bobjones7274 1 year ago +4

      @@jamesbriggs Somebody on another video said the following, is it relevant here? "You could aggregate chunk togethers asking the LLM to summarize and group them in "meta chunks", you could repeat the process until all years are contained into a single max limit tokens batch. Then, with the meta Data, you'll be able to perform a much more powerful search over the corpus, providing much more context to your LLM with different level of aggregation."

    • @rodgerb2645
      @rodgerb2645 1 year ago

      @@bobjones7274 sounds interesting, do you remember the video? Can you provide the link? Tnx

    • @astro_roman
      @astro_roman 1 year ago

      @@bobjones7274 link, please, I beg you

    • @JOHNSMITH-ve3rq
      @JOHNSMITH-ve3rq 1 year ago

      I’ve seen this in many places but where has it been implemented??

  • @GrahamAndersonis
    @GrahamAndersonis 1 year ago

    Is there a best practice for chunking mixed documents that also include tables and images? Are you extracting tables/images (out of the chunk) and into a separate CSV/other file, and then providing some kind of ‘hey llm, the table for this chunk is located in this CSV file’ ? If so, how do you write the syntax for this note (within the chunk) to the LLM? Much appreciation in advance.

  • @jimjones26
    @jimjones26 1 year ago

    I have a question. I am working on loading documentation for several different technologies into one vector database. I want to use this as an AI development assistant for the tech stack I use to create web applications. I am assuming the way you are categorizing your chunks would be an appropriate way to have these different 'columns' of data within one vector db?

  • @kevon217
    @kevon217 1 year ago

    any tips for dealing with datasets that have missing values? doesn’t seem like the various transformer encoding classes have defaults for handling entirely empty strings/values. it’ll still spit out a vector which i assume is just padding tokens?

  • @jacobgoldenart
    @jacobgoldenart 1 year ago

    Thanks James! About chunking: what about when your documentation has a lot of code examples interspersed throughout the text? Is the recursive text splitter able to work with, say, Python code where retaining whitespace is important?

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      it won't distinguish any special difference between normal text and code unfortunately, so it will just split on newlines, whitespace, etc as per usual

  • @maximchuprynsky7472
    @maximchuprynsky7472 1 year ago

    Hello. I have a question/problem. I have a rather large prompt and it exceeds the token limit. Is there any possibility to split it as well as the basic information from the pdf file?

  • @alivecoding4995
    @alivecoding4995 1 year ago

    How do you work remotely in VSCode with the notebook on Colab?

  • @ChronicleContent
    @ChronicleContent 1 year ago

    I am kind of clueless and don't know much about any of this, but why are we doing this? Don't you think that in the future ChatGPT or others will use the live internet and have the information available? And also have bigger limits? I am trying to understand the vision behind this. Or is it just a way, for now, to "bypass" the limits and use it on updated material until they find a way to have a live-trained model? Sorry if it sounds totally clueless.

  • @ayushgautam9462
    @ayushgautam9462 1 year ago

    Are you using a Jupyter notebook or are you working in Google Colab? And how can I run this code in VS Code, if possible?

  • @sevilnatas
    @sevilnatas 3 months ago

    I am struggling with a chunking scenario involving PDFs that include a lot of columnar data in tables, and the primary questions users will ask of the PDF data are contained in those tables. Questions depend on finding a value in the first column and then retrieving the value on that row in a specified column. This means the chunked data needs to maintain the integrity of the table. Any suggestions?

  • @mrchongnoi
    @mrchongnoi 1 year ago

    You talk about adding context. Where can I get information on adding context? Sorry if it is a remedial question.

  • @mintakan003
    @mintakan003 1 year ago

    I played with this awhile ago in LangChain. My impression is in order to do Q&A on documents, one has to do a sequential scan. Every chunk has to be read in. Wouldn't this be prohibitively expensive for a large document set? I know there are vector databases (indices) which can do a pre-screen based on vector similarity. This would be an improvement. But it still involves a sequential scan, now at the vector level. Are there attempts to address this problem? Perhaps parallelism maybe one part of the solution (?)

    • @jamesbriggs
      @jamesbriggs  1 year ago +2

      it isn't a sequential scan with (most, maybe all) vector DBs, they use approximate search, so the answer is approximated and not everything is fully compared - a good vector db will make this approximation very accurate (like 99% accuracy)

  • @li_tsz_fung
    @li_tsz_fung 1 year ago

    Is LLaMA langchain a thing now? It makes sense to me that we should use open source stuff, so that we can run it locally soon.

    • @jamesbriggs
      @jamesbriggs  1 year ago

      I believe so, but haven't had the chance to check it out yet - for sure, will be focusing more on open source soon

  • @Sunghoon4life
    @Sunghoon4life 1 year ago

    Does the RecursiveCharacterTextSplitter split the text based on tokens or characters? As per the docs it seems to be based on characters, but in the video you said it's based on tokens. Could you please confirm?

    • @jamesbriggs
      @jamesbriggs  1 year ago

      it's splitting on characters (the "\n\n", "\n", " ", "" separators), but the length function is based on tokens, so it is kind of doing both, meaning it is identifying a satisfactory length based on tokens, but then the split itself is using characters

    • @rmehdi5871
      @rmehdi5871 9 months ago

      @@jamesbriggs does this splitting work on any text? My data, taken, I think, via xml format, has these tags: , and . Should I split on those rather than on "\n\n", "\n", " ", "", or do both, perhaps? What is your recommendation?

  • @jesusperdomo8388
    @jesusperdomo8388 1 year ago

    Please, is it possible for you to work through the code in Visual Studio Code?

  • @fraternitas5117
    @fraternitas5117 1 year ago

    Could you make content about Nvidia's NeMo?

  • @nazimtairov
    @nazimtairov 1 year ago

    Thanks for the tutorial. How can text be processed further into an LLMChain after splitting into chunks?
    I'm getting an error from the OpenAI API:
    chain = LLMChain(llm=llm, prompt=chat_prompt, verbose=True)
    chain_result = chain.run({
        'source_code': python_code,
        'target_tech': 'python',
        'source_tech': 'Go'
    })
    "This model's maximum context length is 8193 tokens. However, your messages resulted in 13448 tokens. Please reduce the length of the messages."
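One common workaround for that error is to split the over-long input into pieces that each fit the context window and run the chain once per piece. A minimal sketch, assuming a real token counter (such as the video's tiktoken_len) produced the token sequence, and with 6000 as an arbitrary budget below the 8193-token window:

```python
def split_for_context(tokens, max_tokens):
    """Cut a token sequence into consecutive pieces of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# 13448 stand-in "tokens", matching the count in the error message above
pieces = split_for_context(list(range(13448)), 6000)
print([len(p) for p in pieces])  # [6000, 6000, 1448]
```

Each piece can then be sent through the chain separately (leaving headroom in the budget for the prompt template and the completion), with the per-piece results combined afterwards.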

  • @yourmom-in4po
    @yourmom-in4po 1 year ago

    For some reason, when I try to download all the HTML files using wget, it only downloads the index.html file. Is there any reason for this? I used the provided Google Colaboratory notebook and nothing :(

    • @jamesbriggs
      @jamesbriggs  1 year ago

      I don't know why that would happen using the same command, may be a system difference I'm not sure - but maybe you can refer to this:
      www.linuxjournal.com/content/downloading-entire-web-site-wget
      and try modifying the command as per the info above?

    • @jamesbriggs
      @jamesbriggs  1 year ago +2

      sorry I realize this is because the webpage for the langchain docs moved, it's actually nothing to do with the command, try this:
      !wget -r -A.html -P rtdocs python.langchain.com/en/latest/

    • @yourmom-in4po
      @yourmom-in4po 1 year ago +1

      @@jamesbriggs Thank you so much!

  • @tadavid1999
    @tadavid1999 1 year ago

    Could anyone help me? I'm trying to use `!wget -r -A` but it is not recognised as a command. I do not understand where I am going wrong; as far as I know I have all modules installed as well as correct permissions. I have tried running this in the terminal of Microsoft Visual Studio Code, in PowerShell as admin (with ChatGPT to put it in a different format), and as a script importing os. It just is not working for me, and I am very interested in the practical applications of this. Great video by the way, I like how everything is explained step by step!

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      I think it should be recognized as a command, the issue may be that the webpage is outdated, could you try `!wget -r -A.html -P rtdocs python.langchain.com/en/latest/` - also another thought, if you're running in terminal drop the `!`, leaving you with `wget -r -A.html -P rtdocs python.langchain.com/en/latest/`

    • @tadavid1999
      @tadavid1999 1 year ago

      @@jamesbriggs I've just figured this out. It's because I'm not Linux-based. This video helped me fix the issue for anyone wanting to follow along: th-cam.com/video/gCrF8Zx13wg/w-d-xo.html

    • @tadavid1999
      @tadavid1999 1 year ago

      @@jamesbriggs I'm trying to use the wget command to download my own website for context but it keeps downloading only the first page. Any tips on how I can get it to fetch the rest?

  • @mohammedsaheer4700
    @mohammedsaheer4700 1 year ago

    Can we pass more than 10000 tokens into LangChain using chunking?

    • @jamesbriggs
      @jamesbriggs  1 year ago

      Yes you can pass in as many as you like, billions even

  • @rafaelprudencioleite7291
    @rafaelprudencioleite7291 1 year ago

    Thanks so much for the video. When I use
    !wget -r -A.html -P rtdocs link...
    it downloads only the index.html page. I tried in the terminal and it won't work either. Is there a way to handle that?

    • @Clubcloudcomputing
      @Clubcloudcomputing 1 year ago +2

      Looks like the website changed, and does a redirect to a different domain. Hence you get only 1 file. Instead, index the domain that it redirects to.

  • @TheCloudShepherd
    @TheCloudShepherd 8 months ago