How to scrape the web for LLM in 2024: Jina AI (Reader API), Mendable (firecrawl) and Scrapegraph-ai

  • Published on 16 May 2024
  • 👨‍💻 Code: github.com/trancethehuman/ai-...
    (if there are issues with viewing the code, just fork and clone the repository. It's just a current problem with GitHub's way of displaying Jupyter notebooks - nbconvert)
    Tools mentioned:
    Jina AI: jina.ai/reader
    Mendable's Firecrawl: www.firecrawl.dev/
    Scrapegraph-ai: github.com/VinciGit00/Scrapeg...
    🚌 Sign up for my upcoming AI engineering course: tally.so/r/n9daQ1
    Follow me on Twitter: / haithehuman
    Find me on LinkedIn: / haiphunghiem
    (Consulting) If you want me to work with you: tally.so/r/n9djRQ
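
A minimal sketch of calling the Jina AI Reader listed above. It assumes the Reader is reached by prefixing the target URL with `https://r.jina.ai/` and that an optional API key is sent as a Bearer token; both details are assumptions to check against jina.ai/reader before relying on them. The function names and the User-Agent string are hypothetical.

```python
from typing import Optional
from urllib.request import Request, urlopen

# Assumed Reader endpoint prefix (verify against jina.ai/reader).
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Build the Reader URL for an absolute http(s) page URL."""
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("target_url must be an absolute http(s) URL")
    return READER_PREFIX + target_url


def fetch_as_text(target_url: str, api_key: Optional[str] = None,
                  timeout: float = 30.0) -> str:
    """Fetch the page through the Reader and return the response body."""
    headers = {"User-Agent": "reader-sketch/0.1"}  # hypothetical UA string
    if api_key:
        # Assumption: the paid tier takes the key as a Bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    request = Request(reader_url(target_url), headers=headers)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would look like `fetch_as_text("https://example.com/pricing")`, which performs a network request and returns whatever text the service sends back.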

Comments • 109

  • @jarad4621
    @jarad4621 several months ago +119

    Dammit stop telling everybody about Jina my secret weapon, just stop, it's my advantage, everybody ignore it it's horrible I swear

    • @devlearnllm
      @devlearnllm several months ago +19

      TOO LATE

    • @jarad4621
      @jarad4621 several months ago

      @devlearnllm 😉 This was one of the most valuable vids I've seen in the past few weeks, considering the contents: the top 3 special scrapers I searched hard for, all mentioned together in one good video. Nice. Add a good, cheap open-source LLM like Llama 3 and it = $$$ if you know how. Data is valuable; things that were not possible or affordable for most people before are now. I can do stuff for $12 now that some would pay thousands for. It's a wonderful new world!
      Just finished something awesome with Python, Jina and OpenRouter Llama 3 in 2 days that's gonna double my revenue or more, and I don't even know how to code lol, thanks GPT. Jina does have a paid API key on the API page btw: 1M tokens free, which worked out to 580 pages or so. But the pricing is so low it's insane: 500M tokens, or about 280,000 pages, for $10. That destroys Firecrawl's pricing (which is also good and has its place, but is much more costly). I think Scrapegraph uses an LLM to parse, so it's gonna be expensive on tokens, right, sending raw websites to LLMs? I've asked them like you did.
      I only wish Jina showed menus and internal links, then it would be perfect. Those hold valuable data themselves and identify more valuable pages to visit, like pricing. I'll ask if there is a way, but I guess I can add something cheap to the workflow for that. Any suggestions? Prob some Python library, I'll ask Perplexity lol. I'm actually new to the tech side, but I see the business value as a marketer, so I'm learning as fast as I can! It's the new gold rush.
      Great video, subbed, looking forward to more. Cheers
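
On the question of a cheap add-on step for menus and internal links: one possible sketch, using only the Python standard library, collects same-host links from a page's raw HTML so that pages such as pricing can be queued for a later visit. The class and function names are hypothetical, and the sketch assumes the HTML has already been downloaded by some other step.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links such as "/pricing" against the page URL.
            self.links.append(urljoin(self.base_url, href))


def internal_links(html, base_url):
    """Return unique links on the same host as base_url, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    host = urlparse(base_url).netloc
    unique = []
    for link in collector.links:
        link = link.split("#", 1)[0]  # drop in-page fragments
        if urlparse(link).netloc == host and link not in unique:
            unique.append(link)
    return unique
```

This only sees links present in the served HTML; menus that are built by JavaScript after load would need a browser-based fetch instead.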

    • @TheBrighamhall
      @TheBrighamhall several months ago +13

      @devlearnllm thought you said Jira and was so confused..

    • @devlearnllm
      @devlearnllm 29 days ago +1

      @TheBrighamhall imagine lol

    • @ShinyTechThings
      @ShinyTechThings 29 days ago +1

      🤣🤣🤣🤣

  • @kylelau1329
    @kylelau1329 several months ago +17

    Thank you for introducing all the latest technology for web scraping!

  • @matten_zero
    @matten_zero several months ago +6

    The reader API tip is so clutch. Thank You!

  • @antoniuskonovalov
    @antoniuskonovalov several months ago +4

    Just started wondering about web scraping and here you are.
    Thank you.

  • @alonsoalarconaguilar7113
    @alonsoalarconaguilar7113 27 days ago +17

    The YouTube algorithm is just insanely good at what it does. This is exactly the content I needed, and I think I have found what I want to dedicate my life to as a professional.
    Thank you for the video, I will buy your course as soon as I have saved up the money.

  • @kuhltime
    @kuhltime several months ago +3

    Came at the perfect time. Very good video. Thx 😊

  • @ariG23498
    @ariG23498 14 days ago +3

    How did I not get your content sooner? Love it!

  • @forgotmyoldSN
    @forgotmyoldSN 14 days ago +2

    Thanks for adding a new project to my to do list!

  • @st.3m906
    @st.3m906 several months ago +1

    Amazing video, thank you

  • @markt4565
    @markt4565 23 days ago +1

    keep up the good work! - this is an awesome presentation!

  • @moafro6524
    @moafro6524 17 days ago +1

    Underrated, glad I found this

  • @NikhilSwamiExperimental
    @NikhilSwamiExperimental 13 days ago +4

    chigga dropping bomb content. Meanwhile I made a comment analyzer for highly detailed videos which have 100+ comments, since I didn't have time to go through them all. Man, sometimes you don't need to build an Iron Man suit to do simple shet.

    • @devlearnllm
      @devlearnllm 13 days ago

      Printing this comment out and putting on my wall

    • @aarushsaboo1194
      @aarushsaboo1194 6 days ago

      Bro, did you build a comment analyzer for all youtube videos in which all you need to do is post a youtube link? That's a nice project!

  • @shivam_in
    @shivam_in 28 days ago +7

    If I'm going to scrape millions of pages regularly, no way in hell AI would come anywhere close in accuracy and efficiency to a plain HTTP request or browser load and Jsoup parsing.

  • @nickk6575
    @nickk6575 several months ago +1

    Great video! The open source tool looks great!
    As an aside, I use instructor and pydantic classes to get the LLMs to provide the JSON as I expect it. In my limited experience, dspy wasn't as explicit as I wanted.

    • @devlearnllm
      @devlearnllm several months ago

      Good idea

    • @jarad4621
      @jarad4621 several months ago +2

      Are you using those two libraries with the Agency Swarm agentic framework? It uses them as well to ensure performance/quality. If not, it may be something you'd be interested in: a proper production-capable agentic framework. That, with its automation and decision-making capability, plus Jina + LLMs = profit for so many use cases.

    • @catchychazz
      @catchychazz several months ago

      Are you referring to DSPy assertions?

  • @terrytan1827
    @terrytan1827 17 days ago +1

    16:05 Worth trying out GPT-4, I find it more accurate at following instruction.

  • @sitedev
    @sitedev several months ago +1

    Gold!

  • @uwepleban3784
    @uwepleban3784 several months ago +6

    The transcript at 1:39 states that you are using large sandwich models. This must be a brand new type of model - mouth watering indeed. 😂

    • @devlearnllm
      @devlearnllm several months ago

      Heck yeah 🥪

  • @khemchay
    @khemchay 19 days ago +1

    Jina love it...

  • @user-xj5gz7ln3q
    @user-xj5gz7ln3q several months ago +2

    GPT 4o can do this now. Just tested and it's awesome.

  • @BlueBearOne
    @BlueBearOne 27 days ago +2

    Thank you. I'll be "away" for a while while I conquer the...I mean save the world!

  • @RenkoGSL
    @RenkoGSL several months ago +1

    lol that's awesome!

  • @devlearnllm
    @devlearnllm several months ago +2

    If anyone’s having issues viewing the notebook on GitHub, it’s GitHub’s fault. Feel free to clone it (the code is there, GH just couldn’t display it recently): stackoverflow.com/questions/78501731/error-nbformat-when-uploading-to-github-from-google-colab

  • @augmentos
    @augmentos several months ago

    Can anyone speak to the architecture or other tools for preventing detection when using Beautiful Soup, as he mentioned? What would be the best process to avoid detection, and with what tools? I wish you had elaborated there, considering it’s in large part the subject of the video.

  • @roberthuff3122
    @roberthuff3122 several months ago +21

    🎯 Key Takeaways for quick navigation:
    00:00 *🚀 Introduction to web scraping for LLMs in 2024*
    - Overview of startups pivoting to web scraping.
    - Mention of Mendable and its Firecrawl tool for scraping the web using large language models.
    02:06 *🔍 Scraping competitors' pricing pages*
    - The process of scraping competitors' pricing for market research.
    - Introduction to tools used for scraping: Jina AI, Mendable, and Scrapegraph-ai.
    03:01 *🧠 Understanding tiktoken and its application*
    - Explanation of tokenization and encoding in web scraping.
    - Discussion on the cost implications based on tokenization.
    05:17 *🛠️ Setting up scrapers with Beautiful Soup and other tools*
    - Description of different scraping tools and their setup.
    - Comparisons among Beautiful Soup, Jina AI, and Mendable based on ease of use and output.
    07:32 *📊 Running scrapers and analyzing outputs*
    - Execution of web scraping and evaluation of the output from different tools.
    - Analysis of readability and format of the scraped data.
    09:37 *💰 Cost comparison and effectiveness of scraping tools*
    - Comparison of costs associated with various scraping tools.
    - Evaluation of which tool provides the most value for money.
    12:53 *🤖 Extracting pricing information using OpenAI*
    - Utilization of OpenAI for extracting specific data points.
    - Challenges and strategies in obtaining clean and useful information.
    17:20 *🌐 Overview of Scrapegraph for advanced web scraping*
    - Introduction to Scrapegraph as an open-source project.
    - Examples of complex data extraction and its accuracy.
    Made with HARPA AI
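
The tokenization and cost points in the takeaways above can be sketched roughly as follows. This is not the notebook's code: it uses a crude "about 4 characters per token" heuristic as a stand-in for a real tokenizer such as tiktoken, and the price passed in is a placeholder the caller supplies. It only illustrates why stripping markup before sending a page to an LLM lowers the token bill.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the page's visible text joined by single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def rough_tokens(text, chars_per_token=4.0):
    """Heuristic token estimate; a real tokenizer gives exact counts."""
    return round(len(text) / chars_per_token)


def estimated_cost_usd(text, usd_per_million_tokens):
    """Estimated input cost for sending `text` at the given price."""
    return rough_tokens(text) * usd_per_million_tokens / 1_000_000
```

Comparing `rough_tokens(raw_html)` with `rough_tokens(visible_text(raw_html))` for the same page gives a quick feel for how much of the cost is markup rather than content.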

    • @thethree60five
      @thethree60five several months ago +1

      ...The best in-browser AI automation system.

    • @thedoctor5478
      @thedoctor5478 several months ago

      This Jina thing is cool. The BeautifulSoup scraper is obviously not a solution. Most web pages (especially articles, media, etc.) have Google schema ld+json ready to be extracted, though. There are some good Python libs for getting the metadata. There are many scraping APIs, and most of them are not worth the cost IMO. phantomjscloud is probably one exception, depending on volume. Otherwise, one must find a good proxy provider and send a bunch of fancy HTTP headers to bypass anti-bot, like you said. Blackhatworld is a great resource for proxies and all manner of other accounts. The whole scraping thing is a giant rabbit hole. Jina is for sure keeping all that data. It's not a bad plan, actually. I think I may do the same.
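
A rough sketch of the ld+json idea from the comment above, using only the Python standard library rather than any of the dedicated metadata libraries the commenter alludes to. It pulls every `<script type="application/ld+json">` block out of already-downloaded HTML and parses it as JSON; the class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect parsed JSON from <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the page


def extract_json_ld(html):
    """Return a list with one parsed object per valid ld+json block."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return collector.items
```

When a page carries this metadata, fields such as the headline or price are available without sending anything to an LLM at all.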

  • @planplay5921
    @planplay5921 several months ago +23

    But the first problem that all crawls need to face is how to avoid being blocked.

    • @PracticalAI_
      @PracticalAI_ 19 days ago +1

      there are ways, maybe I will do a video about that ... but that is a dark art :)

    • @planplay5921
      @planplay5921 19 days ago

      @PracticalAI_ I'm really looking forward to it!😊

    • @Van-Helssen
      @Van-Helssen 6 days ago

      Rotation of proxies and query randomly dude, easy task

    • @PracticalAI_
      @PracticalAI_ 6 days ago

      @Van-Helssen lol it's not 2014. Proxies are recognised by most providers, and they will immediately invalidate the user (if you are scraping while logged in). There are other ways, using regular IPs

    • @Van-Helssen
      @Van-Helssen 6 days ago

      @PracticalAI_ *residential proxies as you would probably know….

  • @marthasamuel
    @marthasamuel 26 days ago +1

    Would these work for a dynamic website?

  • @stanTrX
    @stanTrX several months ago

    What are some good, easy-to-use tools to use with LangChain? An LLM is not very useful without such tools; it doesn't even know today's date.

  • @jetlime08
    @jetlime08 20 days ago +6

    Is the LLM community really not aware of 40 year old Natural Language Pre-processing methods developed for data mining and NLP?

    • @erickcampos50
      @erickcampos50 16 days ago +1

      Could you explain it better? I can't see how to connect what you said with this subject

    • @josefaguilar2955
      @josefaguilar2955 8 days ago

      I don't know if the community is aware that this has been a problem to solve for quite some time.

  • @TranKiet-pj9mw
    @TranKiet-pj9mw 19 days ago +2

    YouTube really knows what I'm looking for :V With Python, crawling a website with an LLM is simple, just a few lines of code. 8 years ago I used a Python tool to do the same thing with much more effort. Right now I'm trying to mix data from websites/databases with a knowledge map to get an overview, so I can find the shortest path through it. That takes less time than reading an entire book in the field: I can focus on just some topics and still get the result. Anyway, you introduced the method with LLMs. Thanks

    • @devlearnllm
      @devlearnllm 12 days ago

      Awesome. Thanks for sharing

  • @supriyosarkar1806
    @supriyosarkar1806 19 days ago +4

    I feel really sad that you publicly talked about Jina. I used to feel special knowing very few people are aware of it lol

  • @danielcave9606
    @danielcave9606 13 days ago

    How well does Jina do with bigger sites with anti-bot protection?

  • @stevefox7469
    @stevefox7469 several months ago +4

    How do these tools cope with Cloudflare operating on the target site, which attempts to block scraping?

    • @svenvanwier7196
      @svenvanwier7196 27 days ago

      Can't stop the bots. I know about SeleniumBase for Python... takes some research but... hey

  • @thingX1x
    @thingX1x 16 days ago +1

    Using jina now hehe. Does anyone know if you can get better results from amazon?

  • @AtharvDharmadhikari-vc9fk
    @AtharvDharmadhikari-vc9fk several months ago +3

    I used Scrapegraph-ai and was also stuck on getting the cost, but then I got it by making some changes inside the scrapegraphai library: internally the library uses LangChain and LangSmith, so it was already calculating the cost.

    • @devlearnllm
      @devlearnllm 29 days ago

      That's awesome. How do you get it to work with LangSmith?

  • @artmadiar
    @artmadiar several months ago +1

    Great presentation! I'm surprised that Jina AI's free scraper doesn't require an API key?!! I guess it might be shut down for public access soon

    • @jarad4621
      @jarad4621 several months ago +1

      There is a paid version that's worth it. Check the API page: the key at the bottom auto-generates a unique one somehow. You get 1M tokens free, then $10 for 500M tokens, which is like 280k pages. That's insanely low and basically free anyway. Crazy valuable tool
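
The figures quoted in the comment above can be turned into per-page numbers with a little arithmetic. The sketch below takes the commenter's numbers at face value (they are not checked against current Jina pricing), and the function name is hypothetical.

```python
def page_economics(price_usd, tokens, pages, free_tokens):
    """Derive per-page and per-token figures from a quoted bundle price."""
    tokens_per_page = tokens / pages
    return {
        "tokens_per_page": tokens_per_page,
        "usd_per_page": price_usd / pages,
        "usd_per_million_tokens": price_usd / (tokens / 1_000_000),
        # Pages the free allowance covers at the same tokens-per-page rate.
        "free_pages": free_tokens / tokens_per_page,
    }


# Figures as quoted in the comment: $10 for 500M tokens, about 280k pages,
# with 1M tokens free.
quoted = page_economics(price_usd=10.0, tokens=500_000_000,
                        pages=280_000, free_tokens=1_000_000)
```

With these inputs the average page comes to roughly 1,786 tokens, the price to $0.02 per million tokens (about $0.000036 per page), and the free 1M tokens to about 560 pages.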

    • @artmadiar
      @artmadiar several months ago

      @jarad4621 oh wow! it's amazing! thanks for the clarification

  • @nzt29
    @nzt29 13 days ago

    Haven’t watched it fully yet, but I’m really curious to see how it handles the looming threat of model collapse.
    edit: Yeah it didn’t talk about it. It’s going to be hellish when the internet becomes increasingly flooded with LLM output

  • @prashantbhardwaj6322
    @prashantbhardwaj6322 25 days ago +2

    Can you please fix the camera? I'm already feeling dizzy within 60 seconds due to the constant camera movement!

    • @devlearnllm
      @devlearnllm 25 days ago +1

      Working on it. Just need to find the setting in DJI Pocket 3 to slow down the tracking speed

  • @GeoffY2020
    @GeoffY2020 several months ago

    I tried to read or download Web_scraping_for_LLM_in_2024.ipynb but it's not readable. Can you replace it?

    • @GeoffY2020
      @GeoffY2020 several months ago +1

      OK, I can read it in Colab

  • @PedroIvo-iz5sv
    @PedroIvo-iz5sv 26 days ago

    Does it work in Portuguese?

  • @PineState77
    @PineState77 24 days ago

    What’s the best way to get in touch?

    • @devlearnllm
      @devlearnllm 24 days ago

      Details in the video’s description

  • @denisblack9897
    @denisblack9897 several months ago +2

    Damn, bro get ready for heavy lifting) baldness is coming
    Been there, you’ll look much much better!

    • @devlearnllm
      @devlearnllm several months ago

      Lmao thanks brother

  • @bastabey2652
    @bastabey2652 15 days ago

    These scraping tools are impressive... but they are not ready for scraping a full website with 100s of webpages. Unfortunately, there is still significant room for manual scraping.

  • @jarg7
    @jarg7 several months ago

    broken link to github

    • @devlearnllm
      @devlearnllm several months ago

      Yeah there’s something weird with GitHub not displaying the notebook right. The link is the same.

  • @MMABeijing
    @MMABeijing 29 days ago +1

    That's basic stuff. I feel like it's 2023, and I was late to the party too

  • @flor.7797
    @flor.7797 several months ago

    none of these seem better than Trafilatura?

    • @flor.7797
      @flor.7797 several months ago +1

      scrapegraph looks cool though

    • @devlearnllm
      @devlearnllm several months ago

      @flor.7797 How's your experience using Trafilatura? I haven't tried that yet

    • @flor.7797
      @flor.7797 several months ago

      @devlearnllm I’m more into main content extraction and boilerplate removal. There isn’t one size fits all unfortunately

  • @PaulFidika
    @PaulFidika 6 days ago +1

    "The entire internet hates him for this one simple trick"

    • @devlearnllm
      @devlearnllm 6 days ago +1

      9/10 prompt engineers recommend this

  • @eyoo369
    @eyoo369 6 days ago +1

    Jina is almost perfect.. too bad it's not smart enough to scrape content from "accordions" where you first click to make the content visible. I feel a smart AI scraper should be able to grab that text and determine based on CSS class that it's probably valuable text.. just hidden at the time

    • @devlearnllm
      @devlearnllm 5 days ago

      That's too bad. What's the alternative?

  • @rwz
    @rwz 9 days ago +2

    Please do not move the camera all the time

    • @haganlife
      @haganlife 5 days ago

      Definitely loosen up the tracking to center. OSBTail?

    • @devlearnllm
      @devlearnllm 3 days ago

      It's actually built into the DJI Pocket 3 camera. I've only had it for a few weeks. Just need to find the settings for it.

    • @tleee99
      @tleee99 8 hours ago

      @devlearnllm Change the follow speed to slow instead of fast.

  • @ryana2952
    @ryana2952 several months ago +6

    Fix your camera, that's annoying AF

    • @devlearnllm
      @devlearnllm several months ago +2

      Sounds like you don’t like the swiveling on it

  • @kungfooman
    @kungfooman 19 days ago +1

    "how to block these fuckin idiots AWS servers to protect your website" next

  • @kevinlukejr.8996
    @kevinlukejr.8996 14 days ago

    Firecrawl is too expensive

  • @mrRambleGamble
    @mrRambleGamble 7 days ago

    The camera moves too much

    • @devlearnllm
      @devlearnllm 7 days ago

      It's the worst

    • @mrRambleGamble
      @mrRambleGamble 7 days ago

      @devlearnllm Aside from that, great video.

  • @JohnMcclaned
    @JohnMcclaned 9 days ago

    such an inefficient and unreliable way to scrape the web

  • @pcebro
    @pcebro 18 days ago

    You should definitely wear pants.

  • @chetanesque158
    @chetanesque158 17 days ago

    Interesting! Although I was distracted by your attire... Seriously, I was not born 30 years ago, man, but can we dress a bit better for a presentation?!

    • @devlearnllm
      @devlearnllm 16 days ago +2

      Lol what's wrong with my wardrobe