AI Summarize HUGE Documents Locally! (Langchain + Ollama + Python)

  • Published Feb 8, 2025
  • Today we are looking at a way to efficiently summarize huge PDF (or any other text) documents using clustering method with HuggingFace embeddings, Langchain Python framework and Ollama Llama 3.1 model.
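The clustering idea from the video can be sketched minimally. This is a toy illustration, not the video's actual code: scikit-learn's KMeans stands in for Langchain's EmbeddingsClusteringFilter, and hand-made 2-D vectors stand in for real HuggingFace embeddings. The principle is the same: cluster the chunk embeddings, keep only the chunk nearest each centroid, and summarize just those representatives with the LLM.

```python
# Toy sketch of k-means document filtering: cluster chunk embeddings,
# then keep only the chunk closest to each cluster centroid.
import numpy as np
from sklearn.cluster import KMeans

def closest_chunks(embeddings: np.ndarray, num_clusters: int) -> list:
    """Return indices of the chunk nearest each k-means centroid."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(embeddings)
    picked = set()
    for center in km.cluster_centers_:
        picked.add(int(np.argmin(np.linalg.norm(embeddings - center, axis=1))))
    return sorted(picked)

# Six "chunk embeddings" forming two obvious groups; one representative
# chunk per group survives and would be passed to the LLM for summary.
vecs = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                 [5.0, 5.0], [5.1, 5.0], [5.2, 5.0]])
print(closest_chunks(vecs, num_clusters=2))
```

With real embeddings the groups correspond to topics in the document, so the surviving chunks cover each topic once instead of feeding every page to the model.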

Comments • 24

  • @yashnarang3014
    @yashnarang3014 2 days ago

    I can't tell you how grateful I am that you made this video. It saved me so much effort; I had been trying to solve this problem for the past 2 days. Thank you!!!

  • @jakubzakowski7422
    @jakubzakowski7422 1 month ago +1

    One of the best videos I have ever seen. I just want to say thank you and good job.

  • @DebugVerseTutorials
    @DebugVerseTutorials  5 months ago +6

    Source code
    github.com/debugverse/debugverse-youtube/tree/main/summarize_huge_documents_kmeans

  • @yashnarang3014
    @yashnarang3014 2 days ago

    Damn man, great video! Would you mind if I use this in my own project and make a video about it? I'll be sure to give you credit for it!!!!

  • @srivenkateswaraswamy3403
    @srivenkateswaraswamy3403 3 months ago +5

    What if there are images of tables and equations in the document? What happens in that case?

  • @jonm691
    @jonm691 21 days ago

    Nice video, thanks for sharing that.

  • @drakouzdrowiciel9237
    @drakouzdrowiciel9237 1 day ago

    thx

  • @ajays6393
    @ajays6393 1 month ago

    Thank you, very informative!

  • @mikew2883
    @mikew2883 24 days ago

    Very cool! Do you mind providing an example of how to filter the data as you mention in closing?

    • @jonm691
      @jonm691 21 days ago +2

      I looked at this. Basically, you use the results to provide your source pages, and then use those as the context. For example:
      filter = EmbeddingsClusteringFilter(embeddings=embeddings, num_clusters=10, num_closest=3)
      result = filter.transform_documents(documents=texts)
      # combine the selected pages into a single text blob to use as context
      context = ""
      for doc in result:
          context += f"{doc.page_content}\n"
      prompt = f"Ask your question here... use the context within triple backticks ```{context}```"
      response = llm.invoke(prompt)
      print(response)
      However, this is not a replacement for RAG: remember that much of the document has been discarded, so you're unlikely to find your answer. k-means basically just collates similar pages, not necessarily the one with the unique information you need. K-means is therefore great for summarisation, but not necessarily good for specific questions. So, if your specific question relates to something summary-like, it should be more relevant.
      Maybe I've missed something here, but that's my conclusion from playing with it.

  • @mightyboessu
    @mightyboessu 3 months ago +5

    Why do you use the HuggingFaceBgeEmbeddings and not OllamaEmbeddings?

    • @MITdork
      @MITdork 1 month ago +1

      😎

  • @thingX1x
    @thingX1x 1 month ago

    Will this work for a procedurally generated file containing a conversation? Or should I look at another method?

  • @danila8823
    @danila8823 1 month ago

    Using gemini vision to describe the video?? Nice technique

  • @meereslicht
    @meereslicht 4 months ago

    Excellent, thank you! A very clever strategy for large documents. However, I am a little at a loss in the search for a good embedding model for texts in Spanish. I am not sure whether the BGE models are the best option for these. Can you suggest one that could be integrated seamlessly within your code?

    • @DebugVerseTutorials
      @DebugVerseTutorials  4 months ago +2

      Hi, for Spanish take a look at jinaai/jina-embeddings-v2-base-es. In your code simply replace the model_name variable and everything should work.

    • @meereslicht
      @meereslicht 4 months ago +1

      @@DebugVerseTutorials Thank you very much for your kind answer. I'll do that 😊🤗🤗

    • @igorcastilhos
      @igorcastilhos 3 months ago

      @@DebugVerseTutorials Hi, if I wanted to use an Ollama model, how can I find the exact name to put in model_name?

    • @mukeshkund4465
      @mukeshkund4465 1 month ago

      @@igorcastilhos Run ollama list to see the available models and copy the name.

    • @allok501
      @allok501 1 month ago

      You can use the latest jina embeddings v3, as it is multilingual.

  • @RedCloudServices
    @RedCloudServices 25 days ago

    I think the latest vision models will make RAG obsolete

  • @chulung3190
    @chulung3190 19 days ago

    Hi, I am working on a company project. Can this help me extract the required data from a PDF?
    I receive a monthly PDF that includes all our company clients' monthly statements. I need to extract the 'Brought Forward' and 'Realized Loss/Profit Amount' from the PDF, which is nearly a thousand pages long. I will need to perform this process monthly.

    • @DebugVerseTutorials
      @DebugVerseTutorials  18 days ago

      I have worked on a similar task with both a vision LLM and pdfminer, so I would recommend those tools.
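A minimal sketch of the statement-extraction idea discussed above: extract each page's text (e.g. with pdfminer.six's extract_text, as the reply suggests), then pull the two figures with a regex. The label spellings and the number format below are assumptions about the statement layout and will need adjusting to the real PDF.

```python
# Hedged sketch: pull "Brought Forward" and "Realized Loss/Profit"
# amounts out of a statement page's extracted text with a regex.
# With pdfminer.six installed the text would come from, e.g.:
#   from pdfminer.high_level import extract_text
#   page_text = extract_text("statements.pdf")
import re

def pull_amounts(page_text: str) -> dict:
    """Find the labelled amounts (assumed format: 1,234.56) in one page."""
    out = {}
    for label in ("Brought Forward", "Realized Loss/Profit"):
        m = re.search(re.escape(label) + r"\s*:?\s*(-?[\d,]+\.\d{2})", page_text)
        if m:
            out[label] = m.group(1)
    return out

sample = "Brought Forward: 12,340.50\nRealized Loss/Profit: -1,200.00"
print(pull_amounts(sample))
```

For a thousand-page monthly run, looping page by page with a deterministic regex like this is cheaper and more auditable than asking an LLM for every statement; the LLM route helps when layouts vary too much for a fixed pattern.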