Extracting Structured Data From PDFs | Full Python AI project for beginners (ft Docker)

  • Published on 20 Nov 2024

Comments • 84

  • @awesomeowwww 2 months ago +38

    I started my Data Science journey two years ago, and now I'm building projects like AI assistants and tkinter desktop apps with ollama integration that can summarize the content of different files (pdf, docx, images). You are a big part of my development, since your passion and love for this rubbed off on me :))

    • @Thuvu5 2 months ago +6

      Oh congratulations on your projects! So glad you found inspiration from my vids 🤗☺️

  • @rayzorr 1 month ago +12

    Wow, that was probably the best tutorial I have watched ... and I have watched a lot! Perfectly pitched and well thought out and delivered. Congrats on a great job!

    • @Thuvu5 1 month ago

      Aw you’re so kind! I’m so glad to hear that 🙌

  • @ThirumalaRaoJuvvisetti 1 hour ago

    You are speaking with clarity and confidence. Thank you

  • @luisalbertocodes 12 days ago

    Just started my data science studies at university and this is awesome, I see my initial linear algebra classes paying off

  • @oksanastrelnikova6970 1 month ago +2

    Absolutely amazing content. I am only a beginner and I do not think I will be able to do it by myself (too frightened), but I could understand every single step you were doing!!! (Also considering that English is not my first language.) Thank you a lot for all your work!!! Your tutorials are super professional and extremely useful!!!

    • @Thuvu5 1 day ago

      Aw you're too kind! I'm really happy to hear it was helpful!

  • @aireescreates 2 months ago +5

    Thanks for this, Thu Vu. I have followed you since your first video, when I was just starting my DS journey, and your videos have helped me a lot along the way. I kinda missed you and I'm just glad that you posted again. This is super helpful and you explained all the concepts very clearly. I am currently building a web app that extracts sales data from PDF files and uses an LLM to generate insights, analysis, recommendations and data viz. Your explanation of Docker is a treasure as I'm building my app! Thank you so much!

    • @Thuvu5 2 months ago +1

      I'm so glad to hear! 🙌

  • @marktahu2932 1 month ago +1

    Thank you, Thu Vu, for a very straightforward step-by-step guide to creating a RAG project. I have needed something like this for a while to understand how to implement it. Many thanks!!

  • @cerealport2726 2 months ago +1

    This is super interesting. Just like your other projects, you make it easy to see how the general process could be adapted for other purposes. Thanks very much!

  • @sk3ffingtonai 1 month ago +1

    Thank you so much for creating this comprehensive tutorial. I have been and am working hard on my AI Certification and this content is gold.

  • @RohithS-ig4hl 8 days ago

    Thank you so much for this! You explained it really, really well. Kindly post many more videos like this one, on other topics too.

  • @ZakinAbdul 1 month ago

    Thank you for the video, Thu Vu. I recently completed a project using LLMs to interact with PDF data as a chatbot. Your code has been invaluable in helping me handle errors with ChromaDB and create a well-structured project directory. I was curious about potential improvements or alternative approaches that could enhance my project. Converting unstructured PDF data into a structured format with LLMs was a new concept for me, as my project focused solely on chatbot interactions with the data. Your approach has opened my eyes to new possibilities and I'm eager to explore similar techniques in my future work.

  • @jemiranhunter 2 months ago +4

    Great content. Very informative. Thanks for sharing.

  • @kenchang3456 2 months ago +2

    Excellent tutorial. Thank you very much.

  • @cybetica 2 months ago +4

    You might want to renew your API key, as you showed it in plain text at 10:09 and scrolled. Nice vid!

    • @Thuvu5 2 months ago +5

      Oh thanks, good eyes! 😄 Yep I've revoked the key :)

  • @whatsbetter8457 2 months ago +5

    Instead of only being able to use OpenAI, you could use the “instructor” or “ollama-instructor” library in Python to get structured and validated outputs from an LLM (Ollama, OpenAI, Gemini, Groq, etc.). It was already there before OpenAI came up with its feature :-)

    • @Thuvu5 2 months ago

      Thanks for sharing this! Yeah indeed, instructor seems to be more flexible if we want to try different LLMs in the same project

  • @SumithRajagopalan 1 month ago +1

    Amazing explanation and video 👍

  • @georgejetson9801 2 months ago +5

    this would have been amazing for my phd studies

    • @Thuvu5 2 months ago +5

      Maybe for a second PhD? 🤣

  • @rickrandall3174 2 months ago +1

    Thu Vu, you are wonderful. 🙂

  • @perpl1618 1 month ago

    This was an amazing video, thank you Thu San. Would you consider making a video for advanced users, with all of the small details and edge options?

  • @CapybaraLifeStyle 1 month ago

    Absolutely fantastic! ❤

  • @nguyenhai.truongan 1 month ago +1

    Hi Thu. I started following you a few years ago, and your videos are very well made. Recently I've seen you posting data analysis videos that use AI. As an AI application developer, I'd like to understand what a data analyst's workflows, tasks and needs look like, so that I can build a complete application for the data analysis field. I hope you'll have a few suggestions for me. Thank you, Thu.

  • @MrGbruges 2 months ago +1

    THANX THU VU, VERY INTERESTING!!!!

  • @robertbutscher6824 1 month ago +1

    Great video, thank you so much for the valuable inspiration

  • @eulerthegreatestofall147 1 month ago

    Great Video as always!!!, quick question, how did you create the requirements.txt file??

  • @ravikumarsingh9766 29 days ago

    Very nicely explained... Really love the content. Way to go!!! I wanted to ask: if I have multiple PDF files, say 10, how can I create embeddings for all of them and then run the rest of the query? Whenever you have time, please do suggest. I'll wait for your reply!!!

  • @gviacava 1 month ago

    What a great tutorial!!! Thank you!

  • @istifanusbulus1214 19 days ago

    Wow, one of the best tutorials. I want to learn how to extract info from sales invoices and vendor invoices and convert it into a datagram to match against the general ledger. Could you please do a video about it? Thanks in advance.

  • @ahmadzaimhilmi 2 months ago

    I prefer to use Cohere's command-r instead of OpenAI for RAG tasks. The api response can pinpoint the exact sentences from where the information is retrieved given the chunks that we feed in. Good for retrieving answers with citations.

  • @Aaron-it5il 1 month ago +1

    Thanks for sharing!

  • @quangvu20780 1 month ago

    Wonderful, that's a great video..

  • @readas1 1 month ago

    Hello, I found your video very informative since I am working on a similar project. Question for you: what would you do if the program was not returning good chunks? By that I mean, I uploaded a 90-page pricing document and asked for the title of the document, and none of the chunks included the first page of the document, so the LLM could not correctly answer the question.

  • @VinhNguyen-zg7lu 1 month ago

    So good, sis ❤❤❤

  • @dannyrene 22 days ago

    Ngl you’re one smart cookie

    • @dannyrene 22 days ago

      I’m not finished watching, but doesn’t each embedding vector need to have the same number of dimensions to calculate their Euclidean distance? Which would imply that all vectors have the same number of dimensions, right? If that’s the case, what is the limiting variable on the number of dimensions? Processing power? Wouldn’t more dimensions also give you a smarter model?
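
On the first part of that question: a distance between two vectors is only defined when both have the same number of dimensions, and that number is fixed by whichever embedding model produced them. A minimal sketch using only Python's standard library; the vectors are made-up toy values rather than real embeddings, and `euclidean` is a hypothetical helper name:

```python
import math

# Toy 3-dimensional "embeddings" (made-up values, for illustration only).
query = [0.1, 0.9, 0.2]
chunk_a = [0.1, 0.8, 0.3]
chunk_b = [0.9, 0.1, 0.7]


def euclidean(u, v):
    """Euclidean distance; both vectors must have the same number of dimensions."""
    if len(u) != len(v):
        raise ValueError(f"dimension mismatch: {len(u)} vs {len(v)}")
    return math.dist(u, v)


# The chunk with the smallest distance to the query is the closest match.
closest = min([chunk_a, chunk_b], key=lambda chunk: euclidean(query, chunk))
print(closest)
```

Comparing a 2-dimensional vector with a 3-dimensional one raises `ValueError` here, which is why every chunk and every query should be embedded with the same model.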

  • @ruanvieira9082 13 hours ago

    thank you friend

  • @MichealAngeloArts 2 months ago

    Thanks for the awesome project. What is the amount of code change required if I'll be using a Gemini LLM via Vertex AI on GCP instead of GPT4 / OpenAI (in particular, the LangChain-related code) to replicate this project?

  • @agape13 1 month ago

    With that said, there are going to be big waves of layoffs.
    One can already see translator positions being significantly reduced.
    The need for analysts will change in the future as well.

  • @freedman1405 2 months ago +1

    Hi Thu Vu, what's your take on privacy issues with ChatGPT? Wouldn't companies risk their confidential data if they implement this system and use their APIs?

    • @Thuvu5 2 months ago

      Good question! In my experience companies typically use an enterprise subscription to a cloud service like Microsoft Azure that integrates access to these LLMs. Here’s an example: learn.microsoft.com/en-us/azure/ai-services/openai/

  • @petersheldrick1851 1 month ago

    Great content, so well explained. I am doing an AI course at the moment and I am stuck on my project task, so see if you can guide me! The requirement is to use AI or even deep learning to predict a person's shoe size based on a photo of the sole of their foot, without shoes and socks. We are not allowed to use other items in the photo as a reference point, for example a centimetre ruler or something of a known size. We have to learn from known images and their respective shoe sizes. I am struggling with where to start!

  • @kamilherbik 1 month ago

    Thanks

  • @sayfasayfa3500 1 month ago

    Pls can u tell which IDE u use? I am a complete beginner and I wanna do this for my main project

  • @SkySesshomaru 2 months ago +1

    incredible

  • @nnamdiodozi7713 1 month ago

    Did you use a Linux environment for this video? I’m asking cos I keep seeing bin in the file paths.

  • @GoogleUser-tk3mb 1 month ago

    You're really taking my interest in data to the next level! It popped up in my YouTube recommendations, and this is truly a hidden gem. Keep it up, sis.
    +1 Subscribe! I'm sure this channel will blow up soon 🎉
    Anyway, I was wondering how you do that code thing in VSCode without having to type everything? It's amazing!
    And now I'm totally lost!
    FullStack? FrontEnd? BackEnd? Data Analytics?...
    FOMO is killing me! 🔥😭
    But, the worldwide jobs market is stable for data roles, right? 🤔

  • @trungvan2154 1 month ago

    Does this code work well for other languages such as Vietnamese, with a lang parameter vn for example? Thanks

  • @junaidamin 1 month ago

    To get structured data in our answer, can we also use metadata?

  • @iantotan4229 2 months ago +2

    New video! Finally!

  • @ahmadzaimhilmi 1 month ago

    I have a question about the structured output. I've been trying to find a workaround for dynamic attributes. The ones that you showed as example are hardcoded. I want to pass in a dictionary of field name and its explanations and get a resulting dictionary back in return. So far I couldn't think of a way.
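
One possible direction for the dynamic-fields question, sketched with the standard library only: build a JSON Schema at runtime from the dictionary of field names and explanations. The helper name `build_schema` is hypothetical, the sample explanations are invented, and whether a particular LLM client accepts such a schema for structured output is an assumption to check against that client's documentation.

```python
import json


def build_schema(fields: dict, title: str = "ExtractedFields") -> dict:
    """Turn {field_name: explanation} into a JSON Schema describing one object."""
    return {
        "title": title,
        "type": "object",
        "properties": {
            # Every field is treated as a string here, which is a simplification.
            name: {"type": "string", "description": explanation}
            for name, explanation in fields.items()
        },
        "required": list(fields),
        "additionalProperties": False,
    }


# Field names borrowed from this thread; the explanations are made up.
fields = {
    "paper_title": "Title of the research paper",
    "publication_year": "Year the paper was published",
}

schema = build_schema(fields)
print(json.dumps(schema, indent=2))
```

Because the schema is plain data, the same function works for any set of fields passed in at runtime. Pydantic's `create_model` is another route if a model class is needed instead of a dictionary.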

  • @coldbelowfroze 2 months ago +2

    I missed you so much!

    • @Thuvu5 2 months ago +2

      Aww, thank you 🥹

  • @_Around_The_Globe_ 1 month ago

    i get a NotImplementedError when using the with_structured_output, using gpt-4o-mini, can someone help plz?

  • @joseduarte1240 1 month ago

    Can we create a local environment that can read all the files in one folder, whether they are Excel files, PDFs, everything?

  • @hongmeixie409 1 month ago

    Can you show what it looks like in Docker?

  • @CyberHorrorHunter 2 months ago +1

    I am new to this journey, but how did you get your VS Code to output so many lines? I have been tinkering with the notebook settings and can't seem to get it to output the larger amount of data without it going far off the right of the screen.

    • @Thuvu5 2 months ago

      Good question, I believe it’s a notebook setting. Check this: stackoverflow.com/questions/67855498/how-to-display-all-output-in-jupyter-notebook-within-visual-studio-code

  • @datagus 2 months ago

    Is the extracting outcome from the PDF good? Often the extraction process produces text that is all messed up, which can have negative consequences in the chunking process.

    • @heritage1834 2 months ago +1

      I believe it depends on the formatting of the PDF files and also on the method used for the extraction. A project article I read suggested that image-to-text (OCR) usually produces better results than parsing PDF documents, especially when the PDF is badly formatted

  • @rodeondurotan6142 2 months ago

    I hope you can make a video on unstructured pdf data.

  • @Rationalview4915 2 months ago +1

    It was really helpful
    Thank you for this video❤

  • @dushimiyimanathaulin7930 2 months ago +2

    Very informative

  • @CyberHorrorHunter 1 month ago

    Additionally, I have found this does not output tables correctly (any idea how to remedy that?). Also, this seems to be affected by real text vs PNG, jpg images of the original pdf text that was then embedded in a pdf.

  • @nyanlynn-450 2 months ago

    Cool👍💯

  • @s3m3sta 2 months ago +1

    thanks a bunch Thu Vu

  • @jeffkidder5282 11 days ago

    Anything that even looks/feels too good to be true usually is. All this wonderful advancement screams of disaster just waiting to happen.

  • @RipulKumar-g2d 1 month ago

    Hi, not sure if you reply or not. I tried to follow your video but I am stuck at 22:22 and not able to move further. I get an error when I execute the same code

  • @FauziFayyad 2 months ago +1

    Yay Thu vuu !

  • @supertab365 1 month ago

    Damn that's beginner level? I am f--d

  • @readas1 28 days ago

    Have you refined this project at all? I built your version with 0 edits, and it gets everything wrong every time I test a paper of any length. The program works, but it does not actually interpret the documents well at all. Of the 10 or so I have tested, I don't think it has gotten a title correct once, and it usually gets 0/4 correct.

  • @ANAND02120 6 days ago

    Hey, I tried to play with your code, but I can see the model is hallucinating. The answer it's giving is not correct. How can we fix it?
    `          | paper_title | paper_summary | publication_year | paper_authors
    answer    | Title of the Research Paper | This research paper discusses the impact of cl... | 2023 | John Doe, Jane Smith
    source    | The title is clearly stated at the beginning o... | The summary of the paper outlines the key find... | The publication date is mentioned in the heade... | The authors' names are listed right below the ...
    reasoning | The title of a research paper is typically fou... | A summary usually encapsulates the main points... | Publication years are typically indicated in t... | Authors are usually prominently displayed alon...`

  • @d.d.z. 2 months ago

    Very complete

  • @DrB934 2 months ago

    You may have just killed QSR NVivo...

  • @hoangsang2471 2 months ago

    Are you Vietnamese? Your name seems like a Vietnamese name

  • @sifhatshams-s1j 1 month ago

    If you were my sister I wouldn't have to worry about any problem :))
    Why weren't you born as my sister :((