LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

  • Published Jul 30, 2024
  • In this video, we're going to focus on preparing our text using LangChain data loaders, tokenization using the tiktoken tokenizers, chunking with LangChain text splitters, and storing data with Hugging Face datasets. Naturally, the focus here is on OpenAI embedding and completion models, but we can apply the same logic to other LLMs like those available via Hugging Face, Cohere, and so on.
    🔗 Notebook link:
    github.com/pinecone-io/exampl...
    🎙️ Support me on Patreon:
    / jamesbriggs
    🎨 AI Art:
    www.etsy.com/uk/shop/Intellig...
    🤖 70% Discount on the NLP With Transformers in Python course:
    bit.ly/3DFvvY5
    🎉 Subscribe for Article and Video Updates!
    / subscribe
    / membership
    👾 Discord:
    / discord
    00:00 Data preparation for LLMs
    00:45 Downloading the LangChain docs
    03:29 Using LangChain document loaders
    05:54 How much text can we fit in LLMs?
    11:57 Using tiktoken tokenizer to find length of text
    16:02 Initializing the recursive text splitter in LangChain
    17:25 Why we use chunk overlap
    20:23 Chunking with RecursiveCharacterTextSplitter
    21:37 Creating the dataset
    24:50 Saving and loading with JSONL file
    28:40 Data prep is important
  • Science & Technology
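The chunk-overlap idea the video covers (17:25) can be sketched in a few lines of plain Python. This is an illustrative stand-in, not LangChain's implementation: `toy_token_len`, a whitespace tokenizer, stands in for the video's tiktoken-based length function, and the sizes are arbitrary.

```python
def toy_token_len(text: str) -> int:
    """Stand-in length function: counts whitespace tokens (tiktoken in the video)."""
    return len(text.split())

def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping forward by
    chunk_size - chunk_overlap so neighbouring chunks share context."""
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=2)
print(chunks[:2])  # neighbouring chunks share two tokens of context
```

The overlap means information sitting on a chunk boundary appears whole in at least one chunk, at the cost of some duplicated storage.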

Comments • 99

  • @jamesbriggs
    @jamesbriggs  1 year ago +12

    LangChain docs have moved so the original wget command in this video will no longer download everything, now you need to use:
    !wget -r -A.html -P rtdocs python.langchain.com/en/latest/

    • @mohitagarwal9007
      @mohitagarwal9007 1 year ago +1

      This is not downloading everything either. Is there anything else we can use to get the necessary files?

    • @jamesbriggs
      @jamesbriggs  1 year ago +9

      @@mohitagarwal9007 yes I have created a copy of the docs on Hugging Face here huggingface.co/datasets/jamescalam/langchain-docs-23-06-27
      You can download by doing a `pip install datasets` followed by:
      ```
      from datasets import load_dataset
      data = load_dataset('jamescalam/langchain-docs-23-06-27', split='train')
      ```

    • @deniskrr
      @deniskrr 9 months ago +1

      @@mohitagarwal9007 just go to the above link and see where you're getting redirected to now. Then copy the link from the browser to the wget command and it should always work.

  • @videowatching9576
    @videowatching9576 1 year ago

    Fascinating channel, thanks! Remarkable to learn about LLMs, how to interact with LLMs, what can be built, and what could be possible over time. I look forward to more.

  • @redfield126
    @redfield126 1 year ago

    Thank you James for the in depth explanation of data prep. Learning a lot with your videos.

  • @ADHDOCD
    @ADHDOCD 1 year ago +4

    Great video. Finally somebody who goes in depth into data prep. I've always wondered about unnecessary (key, value) pairs in json files.

  • @harleenmann6280
    @harleenmann6280 8 months ago

    Great video series. Appreciate you sharing your thought process as we go - this is the part most online tech content creators miss. They cover the how, and more often than not miss the why. Thanks again. Enjoying all the videos in this playlist.

  • @dikshyakasaju7541
    @dikshyakasaju7541 1 year ago +1

    Thank you for sharing this informative video showcasing LangChain's powerful text chunking capabilities using the RecursiveCharacterTextSplitter. Previously, I had to write several functions to tokenize and split text while managing context overlap to avoid missing crucial information. However, accomplishing the same task now only requires a few lines of code. Very impressive.

  • @fgfanta
    @fgfanta 1 year ago

    I needed to chunk text for retrieval augmentation, did a search on YouTube, and found... James Briggs' video. I know I will find in it what I need. Nice!

  • @SnowyMango
    @SnowyMango 1 year ago

    This was great! I made the terrible mistake of chunking without considering this simple math and embedding and indexing into Pinecone at the larger size. Now I have to go redo them all after realizing that at their current sizes they aren't quite suitable for LangChain retrieval.

  • @grandplazaunited
    @grandplazaunited 1 year ago +1

    Thank you for sharing your knowledge. these are some of the best videos on LangChain.

  • @user-wy9fc5vi3j
    @user-wy9fc5vi3j 1 year ago +1

    Chunking is the most important idea and largely ignored. Thanks James, love your technical depth.

  • @alvinpinoy
    @alvinpinoy 1 year ago +1

    Very helpful and very well explained. Thanks for sharing your knowledge about this! LangChain really feels like the missing glue between the open web and all those new AI models popping up.

    • @jamesbriggs
      @jamesbriggs  1 year ago

      yeah it's really helpful

  • @temiwale88
    @temiwale88 1 year ago

    I'm @ 12:34 and this is an amazing explanation thus far. Thank you!

  • @eRiicBelleT
    @eRiicBelleT 1 year ago

    Uff the video that I was expecting! Thank youuu!

  • @fraternitas5117
    @fraternitas5117 1 year ago

    James dropping the great content as usual.

  • 1 year ago +1

    Great content once again, thanks for sharing. I wish I had this a couple weeks ago :D

  • @lf6190
    @lf6190 1 year ago

    Awesome I was just trying to figure out how to do this with the langchain docs so that I can learn it quicker!

  • @siamhasan288
    @siamhasan288 1 year ago

    Ayo, I was literally looking for how to prepare my data for the past hour. Thank you for making these.

    • @eRiicBelleT
      @eRiicBelleT 1 year ago +1

      In my case the last two weeks xD

  • @muhammadhammadkhan1289
    @muhammadhammadkhan1289 1 year ago +2

    You always know what I am looking for thanks for this 🙏

  • @AlexBego
    @AlexBego 1 year ago

    James, I should say Thank You a Lot for your interesting and so useful videos!

    • @jamesbriggs
      @jamesbriggs  1 year ago

      you're welcome, thanks for watching them!

  • @matheusrdgsf
    @matheusrdgsf 1 year ago

    James you are helping a lot in my activities. Thank you.

  • @codecritique
    @codecritique 3 months ago

    Thanks for the tutorial, really clear explanation!

  • @murphp151
    @murphp151 1 year ago

    these videos are pure class

  • @TomanswerAi
    @TomanswerAi 1 year ago

    Nice one James. Demystified that step for me there 👍 As you say if people get this part wrong everything else will underperform

    • @jamesbriggs
      @jamesbriggs  1 year ago

      yeah it's super important

  • @rodgerb2645
    @rodgerb2645 1 year ago

    Amazing James, I've learned so much from you!

  • @calebmoe9077
    @calebmoe9077 1 year ago

    Thank you James!

  • @BrianStDenis-pj1tq
    @BrianStDenis-pj1tq 9 months ago

    At first, it seemed like you switched from tiktoken len to char len of your chunks, when explaining RecursiveCharacterTextSplitter. That wasn't going to work, so I went back and found that you did show, maybe not so much explain, that the splitter is using the tiktoken len function. Makes sense now, thanks!

  • @rishniratnam
    @rishniratnam 1 year ago

    Nice video James.

  • @gunderhaven
    @gunderhaven 1 year ago

    Hi James, thanks for sharing your work. In this video, you briefly mention cleaning up the "messy bits" in the plain text page content and that it is not necessary in your estimation. Could you suggest an approach to clean up those messy bits to some degree? Thanks in advance.

  • @henkhbit5748
    @henkhbit5748 1 year ago

    Thanks James, for sharing this information.👍 I always thought that the 4k token limit for chatgpt-turbo was independent for the input and the output completion, not combined...

  • @ylazerson
    @ylazerson 1 year ago

    great video - super appreciated!

  • @SuperYoschii
    @SuperYoschii 1 year ago +2

    Thanks for the content James! I think they changed something when downloading the htmls with wget. When I run the colab, it only downloads a single index.html file

  • @Sergedable
    @Sergedable 1 year ago +1

    Nice job. Also, it would be great if you could make a video on how to combine, for example, multiple documents (doc1, doc2, doc3, etc.) and use LangChain and GPT-4 to analyze them.

  • @raypixelz
    @raypixelz 1 year ago

    Awesome. thank you!

  • @jamesbriggs
    @jamesbriggs  1 year ago +1

    if the code isn't loading for you from video links, try opening in Colab here:
    colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/xx-langchain-chunking.ipynb

  • @ketangote
    @ketangote 11 months ago

    Great Video

  • @paenget
    @paenget 1 year ago

    Amazing❤

  • @LucaMainieri68
    @LucaMainieri68 1 year ago

    Thank you for your amazing video and all the work you do. I was wondering how to use LangChain to perform data analysis on one or more datasets. Let's say I have leads, sales, and orders datasets. Can I use LangChain to perform some analysis, such as asking which customers placed the last order, or how sales were last month? Can you point me in the right direction? Thanks 🙏

  • @ewanp1396
    @ewanp1396 1 year ago

    Great video. What software are you using for the video (as in the notebook with blue background)?

  • @artchess0
    @artchess0 1 year ago

    Hi James, thank you very much for your videos. I have a question: what if we need to pass context to our LLM to translate from one language to another? Is it better to chunk in the smallest sizes or up to the model's token limit per request? I'm thinking of processing the chunks in parallel and then joining the translation results together, but I don't know what the best approach is. Thank you in advance.

  • @user-tk7os4dm5j
    @user-tk7os4dm5j 1 year ago

    James, this video (and all your postings) are excellent! Exactly what a long time developer, looking to expand into AI needs to get started! Do you do any lectures at conferences?

  • @dreamphoenix
    @dreamphoenix 1 year ago

    Thank you.

  • @generichuman_
    @generichuman_ 1 year ago

    I'm training a transformer model from scratch just to get a better intuition for how they work. I'm curious if you know the best way to set up the text dataset so that each text chunk is its own entity and won't bleed over into other chunks. For example, with a dataset of stories, when one ends and another begins, I don't want the next story to still have context from the last story. I'm using the Hugging Face tokenizer to implement BPE. I hope this makes sense and I would greatly appreciate any guidance!

  • @MaciekMorz
    @MaciekMorz 1 year ago +2

    I have seen a lot of materials on how to store embeddings in Pinecone vector db. But I haven't seen any tutorial yet on how to store vectorstores with different embeddings of different users in one index. I.e. how to retrieve embeddings depending on which user they belong to. What would be the best strategy for this, whether through metadata or something else?
    It would be great to see a tutorial on this especially using langchain although it seems to me that the current wrapper doesn't really allow this. BTW. The whole series with langchain is great!

  • @krisszostak4849
    @krisszostak4849 1 year ago

    Hi James, thanks for your amazing work!
    I've been playing with this lately and I'm not sure if I understand the connection between the tiktoken_len function and the chunk_size and length_function args in RecursiveCharacterTextSplitter. So the question is this:
    In the RecursiveCharacterTextSplitter - if "length_function=len" (the default), then "chunk_size" sets the max number of CHARACTERS in the chunk, but if "length_function=tiktoken_len" (or any other token counter), then "chunk_size" sets the max number of TOKENS? Is that correct?
    Thanks!
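The relationship asked about here - chunk_size being measured in whatever units the length function returns - can be shown with a toy greedy splitter in plain Python. This is not LangChain's code, and `toy_token_len` is a hypothetical whitespace stand-in for tiktoken_len:

```python
def toy_token_len(text: str) -> int:
    # hypothetical stand-in for tiktoken_len: one "token" per whitespace word
    return len(text.split())

def greedy_split(words, chunk_size, length_function):
    """Pack words into chunks whose measured length stays <= chunk_size,
    where "length" is whatever length_function returns."""
    chunks, current = [], []
    for w in words:
        candidate = " ".join(current + [w])
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))
            current = [w]
        else:
            current = current + [w]
    if current:
        chunks.append(" ".join(current))
    return chunks

words = "splitting text by characters or tokens changes chunk counts".split()
print(greedy_split(words, 12, len))            # chunk_size counted in characters
print(greedy_split(words, 3, toy_token_len))   # chunk_size counted in "tokens"
```

The same input yields different chunk boundaries under the two length functions, which is exactly the character-vs-token distinction in the question.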

  • @younginnovatorscenterofint8986
    @younginnovatorscenterofint8986 1 year ago

    Thanks for the content James. I am trying to build a document conversational assistant using LangChain and Hugging Face, but I have been getting this error: "Token indices sequence length is longer than the specified maximum sequence length for this model (2842 > 512). Running this sequence through the model will result in indexing errors."

  • @mohammadsunasra
    @mohammadsunasra 1 year ago

    So James, what you mean is it will first split based on the first separator, then check whether the number of tokens > chunk size, and if yes, split again on the next separator until the number of tokens < chunk size, right?

  • @RedCloudServices
    @RedCloudServices 1 year ago

    James, I hope I am asking this question correctly. Would it not be cheaper to fine-tune an existing GPT model with your entire custom corpus (i.e. your LangChain docs) and then have a chatbot using your finished fine-tuned LLM published via OpenAI?

  • @hashiromer7668
    @hashiromer7668 1 year ago +10

    Wouldn't chunking lose information about long term dependencies between passages? For example, if a term is defined in the start of document which is used in the last passage, this dependency won't be captured if we chunk documents.

    • @jamesbriggs
      @jamesbriggs  1 year ago +5

      yes, this is an issue with it, if you're lucky and using a vector db with returning 5 or so chunks, you might return both chunks and then the LLM sees both, but naturally there's no guarantee of this - I'm not aware of a better approach for tackling this problem with large datasets though

    • @bobjones7274
      @bobjones7274 1 year ago +4

      @@jamesbriggs Somebody on another video said the following, is it relevant here? "You could aggregate chunk togethers asking the LLM to summarize and group them in "meta chunks", you could repeat the process until all years are contained into a single max limit tokens batch. Then, with the meta Data, you'll be able to perform a much more powerful search over the corpus, providing much more context to your LLM with different level of aggregation."

    • @rodgerb2645
      @rodgerb2645 1 year ago

      @@bobjones7274 sounds interesting, do you remember the video? Can you provide the link? Tnx

    • @astro_roman
      @astro_roman 1 year ago

      @@bobjones7274 link, please, I beg you

    • @JOHNSMITH-ve3rq
      @JOHNSMITH-ve3rq 1 year ago

      I’ve seen this in many places but where has it been implemented??

  • @GrahamAndersonis
    @GrahamAndersonis 1 year ago

    Is there a best practice for chunking mixed documents that also include tables and images? Are you extracting tables/images (out of the chunk) and into a separate CSV/other file, and then providing some kind of ‘hey llm, the table for this chunk is located in this CSV file’ ? If so, how do you write the syntax for this note (within the chunk) to the LLM? Much appreciation in advance.

  • @jimjones26
    @jimjones26 1 year ago

    I have a question. I am working on loading documentation for several different technologies into one vector database. I want to use this as an AI development assistant for the tech stack I use to create web applications. I am assuming the way you are categorizing your chunks would be an appropriate way to have these different 'columns' of data within one vector db?

  • @kevon217
    @kevon217 1 year ago

    any tips for dealing with datasets that have missing values? doesn’t seem like the various transformer encoding classes have defaults for handling entirely empty strings/values. it’ll still spit out a vector which i assume is just padding tokens?

  • @jacobgoldenart
    @jacobgoldenart 1 year ago

    Thanks James! About chunking: what about when your documentation has a lot of code examples interspersed throughout the text? Is the recursive text splitter able to work with, say, Python code where retaining whitespace is important?

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      it won't distinguish any special difference between normal text and code unfortunately, so it will just split on newlines, whitespace, etc as per usual

  • @maximchuprynsky7472
    @maximchuprynsky7472 1 year ago

    Hello. I have a question/problem. I have a rather large prompt and it exceeds the token limit. Is there any possibility to split it as well as the basic information from the pdf file?

  • @alivecoding4995
    @alivecoding4995 1 year ago

    How do you work remotely in VSCode with the notebook on Colab?

  • @ChronicleContent
    @ChronicleContent 1 year ago

    I am kind of clueless and don't know much about any of this, but why are we doing this? Don't you think that in the future ChatGPT or others will use the live internet and have the information available? And also have bigger limits? I am trying to understand the vision behind this. Or is it just a way, for now, to "bypass" the limits and use it on updated material until they find a way to have a live-trained model? Sorry if it sounds totally clueless.

  • @ayushgautam9462
    @ayushgautam9462 1 year ago

    Are you using a Jupyter notebook or are you working in Google Colab? And how can I run this code in VS Code, if possible?

  • @sevilnatas
    @sevilnatas 3 months ago

    I am struggling with a chunking scenario involving PDFs that include a lot of columnar data in tables, and the primary questions users will ask of the PDF data are contained in those tables. Questions depend on finding a value in the first column and then retrieving the value on that row in a specified column. This means the chunked data needs to maintain the integrity of the table. Any suggestions?

  • @mrchongnoi
    @mrchongnoi 1 year ago

    You talk about adding context. Where can I get information on adding context? Sorry if it is a remedial question.

  • @mintakan003
    @mintakan003 1 year ago

    I played with this awhile ago in LangChain. My impression is in order to do Q&A on documents, one has to do a sequential scan. Every chunk has to be read in. Wouldn't this be prohibitively expensive for a large document set? I know there are vector databases (indices) which can do a pre-screen based on vector similarity. This would be an improvement. But it still involves a sequential scan, now at the vector level. Are there attempts to address this problem? Perhaps parallelism maybe one part of the solution (?)

    • @jamesbriggs
      @jamesbriggs  1 year ago +2

      it isn't a sequential scan with (most, maybe all) vector DBs, they use approximate search, so the answer is approximated and not everything is fully compared - a good vector db will make this approximation very accurate (like 99% accuracy)

  • @li_tsz_fung
    @li_tsz_fung 1 year ago

    Is LLaMA langchain a thing now? It makes sense to me that we should use open source stuff, so that we can run it locally soon.

    • @jamesbriggs
      @jamesbriggs  1 year ago

      I believe so, but haven't had the chance to check it out yet - for sure, will be focusing more on open source soon

  • @Sunghoon4life
    @Sunghoon4life 1 year ago

    Does the RecursiveCharacterTextSplitter split the text based on tokens or characters? As per the docs it seems to be based on characters, but in the video you said it's based on tokens. Could you please confirm?

    • @jamesbriggs
      @jamesbriggs  1 year ago

      it's splitting on characters (the "\n\n", "\n", " ", "" separators), but the length function is based on tokens, so it is kind of doing both, meaning it is identifying a satisfactory length based on tokens, but then the split itself is using characters

    • @rmehdi5871
      @rmehdi5871 9 months ago

      @@jamesbriggs does this splitting work on any text? My data, taken, I think, via xml format, has these tags: , and . Should I split on those rather than on "\n\n", "\n", " ", "", or do both, perhaps? What is your recommendation?

  • @jesusperdomo8388
    @jesusperdomo8388 1 year ago

    Please, is it possible for you to work through the code in Visual Studio Code?

  • @fraternitas5117
    @fraternitas5117 1 year ago

    Could you make content about Nvidia's NeMo?

  • @nazimtairov
    @nazimtairov 1 year ago

    Thanks for the tutorial. How can text be processed further into an LLMChain after splitting into chunks?
    I'm getting an error from the OpenAI API:
    chain = LLMChain(llm=llm, prompt=chat_prompt, verbose=True)
    chain_result = chain.run({
        'source_code': python_code,
        'target_tech': 'python',
        'source_tech': 'Go'
    })
    "This model's maximum context length is 8193 tokens. However, your messages resulted in 13448 tokens. Please reduce the length of the messages."
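One common workaround for that error is to split the over-long input into pieces that each fit the context window and run the chain once per piece. A minimal sketch, assuming a real token counter (such as the video's tiktoken_len) produced the token sequence, and with 6000 as an arbitrary budget below the 8193-token window:

```python
def split_for_context(tokens, max_tokens):
    """Cut a token sequence into consecutive pieces of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# 13448 stand-in "tokens", matching the count in the error message above
pieces = split_for_context(list(range(13448)), 6000)
print([len(p) for p in pieces])  # [6000, 6000, 1448]
```

Each piece can then be sent through the chain separately (leaving headroom in the budget for the prompt template and the completion), with the per-piece results combined afterwards.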

  • @yourmom-in4po
    @yourmom-in4po 1 year ago

    For some reason, when I try to download all the HTML files using wget, it only downloads the index.html file. Is there any reason for this? I used the provided Google Colaboratory notebook and nothing :(

    • @jamesbriggs
      @jamesbriggs  1 year ago

      I don't know why that would happen using the same command, may be a system difference I'm not sure - but maybe you can refer to this:
      www.linuxjournal.com/content/downloading-entire-web-site-wget
      and try modifying the command as per the info above?

    • @jamesbriggs
      @jamesbriggs  1 year ago +2

      sorry I realize this is because the webpage for the langchain docs moved, it's actually nothing to do with the command, try this:
      !wget -r -A.html -P rtdocs python.langchain.com/en/latest/

    • @yourmom-in4po
      @yourmom-in4po 1 year ago +1

      @@jamesbriggs Thank you so much!

  • @tadavid1999
    @tadavid1999 1 year ago

    Could anyone help me? I'm trying to use `!wget -r -A` but it is not recognised as a command. I do not understand where I am going wrong; as far as I know I have all modules installed as well as correct permissions. I have tried running this in the terminal of Microsoft Visual Studio Code, in PowerShell as admin (with ChatGPT to put it in a different format), and as a script importing os. It just is not working for me, and I am very interested in the practical applications of this. Great video by the way, I like how everything is explained step by step!

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      I think it should be recognized as a command, the issue may be that the webpage is outdated, could you try `!wget -r -A.html -P rtdocs python.langchain.com/en/latest/` - also another thought, if you're running in terminal drop the `!`, leaving you with `wget -r -A.html -P rtdocs python.langchain.com/en/latest/`

    • @tadavid1999
      @tadavid1999 1 year ago

      @@jamesbriggs I've just figured this out. It's because I'm not Linux-based. This video helped me fix the issue for anyone wanting to follow along: th-cam.com/video/gCrF8Zx13wg/w-d-xo.html

    • @tadavid1999
      @tadavid1999 1 year ago

      @@jamesbriggs I'm trying to use the wget command to download my own website for context but it keeps downloading only the first page. Any tips on how I can get it to fetch the rest?

  • @mohammedsaheer4700
    @mohammedsaheer4700 1 year ago

    Can we pass more than 10000 tokens into LangChain using chunking?

    • @jamesbriggs
      @jamesbriggs  1 year ago

      Yes you can pass in as many as you like, billions even

  • @rafaelprudencioleite7291
    @rafaelprudencioleite7291 1 year ago

    Thanks so much for the video. When I use
    !wget -r -A.html -P rtdocs link...
    it downloads only the index.html page. I tried in the terminal and it won't work either. Is there a way to handle that?

    • @Clubcloudcomputing
      @Clubcloudcomputing 1 year ago +2

      Looks like the website changed, and does a redirect to a different domain. Hence you get only 1 file. Instead, index the domain that it redirects to.

  • @TheCloudShepherd
    @TheCloudShepherd 8 months ago