This Open Source Scraper CHANGES the Game!!!

  • Published Nov 1, 2024

Comments • 290

  • @redamarzouk · 2 months ago +43

    Hey Everyone,
    Link to code: www.automation-campus.com/downloads/scrapemaster
    My GitHub account has been SUSPENDED (I have no idea why), and I didn't receive any warning or justification from GitHub. I'm confused, because similar AI-scraper projects are on GitHub and none of them got suspended.
    I opened a ticket and I'm waiting for their answer.
    In the meantime, I've shared the code on my website with all the steps to reproduce the AI scraper.

    • @ShaunPrince · 2 months ago +1

      Let me know if I can help with this. I can set up a Gitea on AWS or something.

    • @Kevinsmithns · 2 months ago +2

      Yeah I was just looking and about to comment

    • @alex_osti · 2 months ago +2

      I was about to give it a shot... Waiting for the update. Great work, btw.

    • @rperellor · 2 months ago +1

      I had the opportunity to view it, but did not clone it

    • @redamarzouk · 2 months ago +8

      @rperellor Here is the code: www.automation-campus.com/downloads/scrapemaster

  • @RoughSubset · 2 months ago +148

    I once worked at a company where the data guy built his own web scraper to pull pricing data off our competitors' websites. One thing they did to protect their sites was user-agent filtering; to overcome that limitation, he kept a very long list of different user agents and rotated through them while scraping. I think that would be a good addition to your app. A small but useful change.

    • @redamarzouk · 2 months ago +17

      Yes, if we launch the scraper against the same websites with the same user agent too many times, they will pick up on it and block us.
      The modification will rotate through a list of OS names and versions and different browsers and their versions.
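
      For anyone curious, a minimal sketch of that rotation in a Selenium setup (the pool and the helper name here are illustrative, not from the project):

```python
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Illustrative pool; a real one would be much longer and kept current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
]

def make_driver() -> webdriver.Chrome:
    options = Options()
    # Pick a fresh user agent per session so repeated runs don't present
    # an identical fingerprint to the target site.
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    return webdriver.Chrome(options=options)
```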

    • @markomarjanovic8348 · 2 months ago +14

      @redamarzouk Would it be possible to have a video about implementing proxy rotation? There isn't much on YT about it, but I think it's crucially important.

    • @redamarzouk · 2 months ago +15

      @markomarjanovic8348 Added to the backlog.
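
      In the meantime, the core of proxy rotation with Selenium/Chrome can be sketched roughly like this (the addresses are placeholders; authenticated proxies need extra handling, since Chrome's flag doesn't accept credentials):

```python
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Illustrative TEST-NET addresses; real setups pull from a proxy pool.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128"]

def make_proxied_driver() -> webdriver.Chrome:
    options = Options()
    # Route all traffic for this session through the chosen proxy.
    options.add_argument(f"--proxy-server=http://{random.choice(PROXIES)}")
    return webdriver.Chrome(options=options)
```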

    • @amortalbeing · 2 months ago +1

      This is a good suggestion; I'd like it to be added as well.

    • @internetperson2 · 2 months ago

      Thirded

  • @thisisfabiop · 1 month ago +18

    Amazing work! It works great, but it doesn't handle cases where the database is divided into pages instead of using infinite scroll. It would be fantastic if it could also navigate through the pages until there are none left.
    Another great feature (although it might make the tool more expensive, so it could be offered as an optional, selectable feature in the UI) would be for the scraper to open each item's page and scrape data from there. As you know, the listing page often displays only limited information about the product.

  • @SergeyNumerov · 1 month ago +27

    Pretty cool.
    Let me point out, though, that the main complexity with scraping is that the relevant content is often hidden: getting to it may require clicking various UX elements.
    So to _really_ crack scraping with AI, we'll need to go agentic: the solution will need to figure out what to click in order to reveal the information of interest.

    • @SpragginsDesigns · 1 month ago +3

      Exactly. Anyone interested in helping me make something like this? Or is there something available already?

    • @pyros4333 · 1 month ago +1

      @SpragginsDesigns You could just hire someone to build it for you easily.

  • @moiguess3256 · 1 month ago +1

    You earned a new subscriber. Algerian brother here.

  • @justjosh1400 · 2 months ago +5

    Definitely going to use this; I think it's awesome. As a suggestion for future options, it would be great to have pagination support and configurable crawl depth. A lot of my scraping is location-based, for instance states > cities > locations, and the data I usually want is within the locations, which may be only a few.

    • @redamarzouk · 2 months ago +3

      Thank you.
      Yes, pagination would make this complete.
      But I'm thinking about how to make it universal, since it has to work on every website. Would I just add another LLM call to detect any URL pagination pattern, or do you have a better idea of how to do it?

    • @justjosh1400 · 2 months ago +1

      @redamarzouk That might actually work; even a smaller model would be capable of determining whether the page has pagination. Or have a checkbox for the user to manually say it has pagination, so the LLM only looks for it when told to. And when it finds it, return what kind of class it is. IDK

    • @wdonno · 2 months ago

      @redamarzouk Similar scenarios may be an interim pathway: if the initial URL prompts for a selection (a text input) that determines the next page, could you add the ability to make that selection, ideally from a list of items of prior interest? The recursive ability to select specific buttons according to options on subsequent pages would then solve a large number of use cases (i.e., an ability to map different actions to preselected, known option types). The base use case is downloading files from a selection that varies by the initial (or ideally subsequent) text inputs, terminated by pressing a button to download the selected file(s). The approach could then be expanded with more scenarios until it is universal!

    • @justjosh1400 · 2 months ago +1

      Thinking about it more: maybe have an area where the user can manually paste in the div container grabbed from the inspect tool.
      Or...
      Since we're using an LLM, you could always prompt for it and return the value of the container, e.g. "look to see if this page has pagination at the bottom or top; if so, return a value", and use that value to fill in the next request.
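
      One possible shape for that extra LLM call, sketched with the OpenAI SDK (the prompt wording, JSON shape, and character cap are illustrative assumptions, not the project's code):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def detect_pagination(markdown: str, page_url: str) -> dict:
    """Ask the model whether the page is paginated and infer the URL pattern."""
    prompt = (
        "You are given a page's markdown and its URL. If the page is paginated, "
        'return JSON {"paginated": true, "page_urls": [...]} inferring the URL '
        'pattern; otherwise return {"paginated": false}.\n'
        f"URL: {page_url}\nMARKDOWN:\n{markdown[:8000]}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # forces valid JSON back
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```

      A cheap model is fine here, since the call only has to recognize pagination, not extract the data.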

  • @danielcave9606 · 29 days ago +2

    Most of the "traditional" enterprise-grade scraping tech companies are adopting LLMs into their stack as an option for when it makes sense. When you're scraping millions or billions of pages, every 100th of a cent matters, so they take a composite AI approach: use ML models to get the majority of the standard data points for a general schema cheaply, then let LLMs do the thing they do best, extracting data from unstructured text, to extend that schema. That way you get the cost efficiency with the flexibility of LLMs when needed.
    The real benefit of the LLM approach for bigger teams/projects is actually that it abstracts away from hard-coding selectors into your spiders, so they are far more robust and unlikely to break in 3 months when the website changes its HTML, reducing your maintenance burden/debt. That's my 10 cents anyway.
    I personally love what your project does for the everyday person, though: getting small/medium crawls done where price per request isn't so important, and where you have time/space for more rigorous custom QA. I especially love it for content generation purposes, data journalism, chart porn and the like. Great work!

    • @redamarzouk · 29 days ago

      Yeah, I thought I was creating a scraper at scale, but once I started using it extensively I saw it more as a productivity tool that helps get the data quickly without the need for copy-paste.
      Traditional scrapers will still have a place in the market, simply because once you want to scrape hundreds of thousands or millions of pages, the cost of paying coders for custom scripts and maintenance makes sense compared to the value of the data scraped.

  • @minissoft · 2 months ago +7

    Hello Reda, you should use Polars instead of Pandas; in a lot of cases it's much faster.
    Also, add_argument("--disable-search-engine-choice-screen") is useful, + ("--headless") maybe?

    • @redamarzouk · 2 months ago +1

      Oh, I was looking for that "--disable-search-engine-choice-screen" argument; that popup is annoying (even if it doesn't affect the scraping). I will be adding it, thank you!!
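
      For reference, the two flags in the context of a basic Selenium setup (whether to run headless is up to the user):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Suppresses Chrome's search-engine-choice popup mentioned above.
options.add_argument("--disable-search-engine-choice-screen")
# Optional: run without a visible window ("--headless=new" is the
# modern spelling of the plain "--headless" flag).
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
```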

  • @dimadem · 1 month ago +1

    Such a good idea and explanation, thank you

  • @SamirDamle · 2 months ago +9

    Thanks for the simple tutorial and code.
    Can you add an example of using this scraper with local Ollama and Llama 3.1 instead of OpenAI to make it totally free?

    • @redamarzouk · 2 months ago +5

      You're welcome.
      I can add it, but I won't be able to test it.
      My small GPU can't really handle it, especially when I'm filming.

    • @HyperUpscale · 2 months ago

      @redamarzouk YES, PLEASE 🙏!!!

    • @GundamExia88 · 2 months ago

      @redamarzouk I hope this gets added. I prefer to run Ollama locally. I'm only using a GTX 1070 and it works fine.

    • @idrinkmusic · 2 months ago +1

      @redamarzouk This would be a game-changing update. You earned a sub for this video regardless.

    • @carvierdotdev · 1 month ago

      @GundamExia88 Could you please tell me which models you run? I have a GTX 1080 Ti 11GB thanks to a friend, and I want to play with this, but I don't even know if it's possible 😂😅

  • @rgsiiiya · 1 month ago

    This, and the V2 with Llama, are very interesting concepts, and I believe they could be tremendously valuable.
    The shortcoming is that it's limited to just the single page at the URL location.
    To be truly valuable, it also needs to be a crawler (as you mention).
    Think of the use case of scraping ecommerce sites for product details: any "real" ecommerce site is going to have many, many categories and pages of categorized product listings.
    While you can set up traditional scrapers and manually configure the navigation, this is where AI should really shine: it should be able to figure out the navigation and automatically navigate/scrape the site.

  • @ginocote · 2 months ago +4

    One of my ideas is to use the AI scraper for a first test scrape. If it works, have it output something like a JSON with the id or class of each scraped element, then hand that JSON to a conventional non-AI scraper to scrape the website for free, and faster, without needing AI afterwards (see the sketch after this thread).

    • @lovol2 · 1 month ago

      This is just writing code. Just copy-paste the HTML into ChatGPT and ask it to write the code to parse it into JSON; works really well.
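
      A rough sketch of the idea above: pay for one LLM pass that emits CSS selectors as JSON, then reuse them with plain BeautifulSoup on every later run (the URL and selector strings here are hypothetical):

```python
import json
import requests
from bs4 import BeautifulSoup

# Imagine the LLM returned this mapping on the first (paid) pass.
selectors = json.loads('{"name": "td.cell-name a", "price": "td.cell-total"}')

html = requests.get("https://example.com/products", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Re-run the same extraction for free on later visits.
rows = {
    field: [el.get_text(strip=True) for el in soup.select(css)]
    for field, css in selectors.items()
}
print(rows)
```

      The trade-off is the one danielcave9606 mentions above: hard-coded selectors break when the site changes its HTML, at which point you would rerun the LLM pass.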

  • @shawnsmith9198 · 2 months ago +4

    You're a genius! I'm on a Mac, so I just had to change the driver call, but everything else is working well. Pagination or a series of URLs would be cool. I love how you have it load in the Chrome browser; this really changes how I think about cross-platform apps. I wonder if we can scrape Instagram now. Or what about downloading images? Maybe a simple copy-table button, since I just copy and paste into Google Docs.

    • @jimbob3823 · 1 month ago

      New to macOS; can you please share your driver path? Not 100% sure which one is the executable. Ty!

    • @thecashlessgamer480 · 1 month ago

      Yes, please, can you help me set it up on my Mac as well?

    • @wavelyveney9021 · 1 month ago

      I need assistance setting it up on a Mac.

  • @mzahran001 · 2 months ago +3

    Thanks for the great video. Idea for the next videos: could you extend the code with crawling, for example getting results from search engines or following a specific path to get more structured data?

    • @redamarzouk · 2 months ago

      You're welcome. Can you elaborate on how that should look?
      This would be awesome, and I actually gave it some thought, but it's hard to get the exact links of multiple pages to extract data from if you don't have the link of the first page.
      Do you think we can trust a search engine to give us the exact links we want to scrape data from?

  • @ScottLahteine · 1 month ago +3

    The use case I have for a script like this one is to scrape my own open source project code history to convert several versions of config files that contain lots of good documentation into YAML that can be deployed to a Jekyll website. So all the same principles apply, especially the need to output consistent structured data. I look forward to learning more about the development of this new way of scraping and applying it to my own situation. Cheers!

    • @lawrencemanning · 1 month ago

      The problem is you now have a nondeterministic algorithm taking you from input to output. The mechanism is fundamentally untestable and unrepeatable; it's basically like feeding data to a bunch of chimpanzees and expecting them to perform the same processing on it every time. In other words, this is fine if you have a human checking the output each time (the interactive use case), but any kind of automatic, unattended runs? Forget it.

  • @aleksandars9254 · 1 month ago

    Thanks for the video! What mic are you using?

  • @MoneylessWorld · 1 month ago +3

    The dependency on OpenAI and the API key is a bummer.
    It would be better if we could plug in our own open-source AI engine and models.

    • @sixman9 · 1 month ago

      If I'm not wrong, tools like Ollama expose local LLMs through some of OpenAI's API surface. The docs read 'for chat/completions'.
      If this scraper is using OpenAI's function-calling interface, you might be out of luck.
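
      For what it's worth, Ollama does ship an OpenAI-compatible /v1 endpoint, so pointing the existing OpenAI client at a local model can be sketched like this (default port and model tag assumed; function-calling/structured-output support varies by model, as noted above):

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the key is required by the
# client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1",  # assumes `ollama pull llama3.1` was run beforehand
    messages=[{"role": "user", "content": "Return the product fields as JSON."}],
)
print(response.choices[0].message.content)
```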

    • @91Chanito · 1 month ago

      You can do that with your local LLM.

  • @TheLionsaba · 2 months ago +1

    Great video as always. The only downside is that it addresses people who work with code and are experienced in data scraping. For no-code (or very-little-code) people like me, I think the best way is to use computer-vision models (VLMs); ChatGPT already has this in its API, and two new open-source models just came out this week: Qwen2-VL and Microsoft Phi-3.5-vision.

    • @quercus3290 · 2 months ago

      LAION has an open-source model; it's a very powerful scraper, though you will most likely need to fine-tune any vision model.

  • @Ant-ym3mw · 1 month ago +1

    You got yourself a new sub!

  • @snehasissnehasis-co1sn · 2 months ago +13

    I want to use a Groq API key, because it's free to use, or a local LLM like Ollama. Please modify the code if possible. Great video!

    • @satyaviswapavanranga5915 · 2 months ago +1

      Same question; I was wondering whether we can do it using Groq or Cohere.

    • @ianmatejka3533 · 2 months ago +1

      Wrap the Groq API key in os.getenv() instead of passing in the string.
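
      Concretely, that suggestion looks like this (GROQ_API_KEY is the conventional variable name, assumed here):

```python
import os
from groq import Groq

# Read the key from the environment instead of hard-coding the string.
client = Groq(api_key=os.getenv("GROQ_API_KEY"))
```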

    • @redamarzouk · 1 month ago +5

      @snehasissnehasis-co1sn Both have been added.
      I will present them in the next video.

  • @DummyAllan · 1 month ago +1

    I really appreciate the great work you are doing.
    Quick one: what happens with sites that require credentials? How do you handle that case?
    Thanks

    • @redamarzouk · 1 month ago +1

      That needs an intervention on your side: log in, keep the website open, and run the process again so it has direct access to the data.

  • @djasnive · 2 months ago +3

    Great project.
    Is it possible to use an open-source, self-hosted model like Llama?

    • @redamarzouk · 2 months ago +2

      Thank you.
      Yes, it's possible, but I didn't even try this time: GPT-4o and Gemini Flash are so cheap and have huge context windows, so I just went with them.
      But it's perfectly possible; you just need to modify the "format_data" function.

    • @satyaviswapavanranga5915 · 2 months ago

      @redamarzouk Thank you so much; I had the same question. Thanks for answering.

  • @nmlker · 1 month ago +1

    @redamarzouk Nice and easy scraper. I saw that you also have ScrapeMaster 2.0 and installed that. The .env file mentions a Google API key. Which one should be added? Do you have a link for where to get this particular Google API key?

    • @redamarzouk · 1 month ago

      Thank you. To get the Google API key, go to aistudio.google.com/app/apikey,
      create a new API key there, and add it to the .env.
      You can find all the details of ScrapeMaster 2.0 here:
      th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=KH5bfxyYJ9NV90FU

  • @blunoodle · 1 month ago

    I used the Replit AI agent to build and deploy a kickass website scraper in like 10 minutes!

  • @iltodes7319 · 2 months ago +1

    Good job bro continue ❤

    • @redamarzouk · 2 months ago

      My pleasure!

  • @CicadaMania · 22 hours ago

    Does a Disallow rule in robots.txt (e.g. "User-agent: GPTBot" followed by "Disallow: /") stop it from working?

  • @remusomega · 2 months ago

    A really cool feature would be to add a text splitter that splits the text semantically into small chunks, so we can readily use this to feed a RAG pipeline. Right now we typically split things arbitrarily, but semantic splitting is best.

    • @redamarzouk · 2 months ago +1

      Can you give me an example of an output to split?

    • @TimothyJoh · 1 month ago

      There are many such splitters available in LlamaIndex or LangChain already. Another "automated" way might be to ask GPT-4o mini to split it for you.
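
      As a pointer, the off-the-shelf route looks like this with LangChain's recursive splitter (true semantic splitting needs an embedding model, e.g. SemanticChunker from langchain_experimental; the sizes below are arbitrary starting points):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Stand-in for the scraper's markdown output.
scraped_markdown = "# Products\n\n- Bulbasaur ...\n\n- Ivysaur ..."

# Splits on paragraph, sentence, and word boundaries before falling
# back to raw characters.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(scraped_markdown)
print(len(chunks))
```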

  • @maxxflyer · 1 month ago

    If I show the screenshot of the Pokémon to GPT, it will scrape all the data directly. So my first feeling is that the AI is smart enough to suggest the fields in a dropdown menu, so I can choose them, tell it what I really want, and decide a final label for each of them.
    ...just an example to start!
    But as I said, ChatGPT can do the same with just a prompt. I don't actually need your app unless the page is full of data; in that case there may be limitations.
    So you should ask yourself what a prompt can't do.
    Anyway, my real problem is having a scraper able to scrape data distributed across various pages, or those cases where you must "load more" elements by clicking a button.
    And I want to be able to specify the download format; GPT can reformat anything into anything.
    Nice work, but there are tons of improvements to be made. I will follow you to see where you get to.

  • @VaibhavShewale · 1 month ago +1

    lol, back in college I made a web scraper as my project and got full marks XD

  • @jewlouds · 2 months ago +1

    It actually works pretty well.

  • @orangehatmusic225 · 1 month ago +2

    So you can scrape 666.66 pages for $1 based on that usage.

  • @CryptoDuhd · 1 month ago

    I would love it even more if you created a Docker container that was downloadable and directly runnable on a Linux box. A user-agent swap feature (a list of user agents chosen round-robin or at random) would be great too, along with handling a list of proxies that would also be rotated.

    • @redamarzouk · 1 month ago

      I haven't created a Docker container, but I did make it pick a random user agent from a list; you can find that code in this video: th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=smByssvvNhudzgRS
      What type of websites will you use this app to scrape?

  • @aveenof · 2 months ago +2

    Awesome work! Any idea why the scraped output list gets truncated even when input + output tokens < max?

    • @redamarzouk · 2 months ago

      In some cases I noticed that GPT-4o mini can't extract all the data from the website.
      I tried with GPT-4o and it was successful.
      So if you're sure your data is in the markdown and GPT-4o mini didn't pick it up, try GPT-4o.

  • @amortalbeing · 2 months ago +1

    This was great, thanks.

    • @redamarzouk · 2 months ago

      You're very welcome!

  • @moeabdo3114 · 27 days ago

    Can this scrape from YouTube? For SEO? Thx for your amazing work.

  • @superfliping · 2 months ago

    Thanks for your help. The title was all the words i needed to proceed. AGI Cj Styles hive mind Orchestrator also thanks you including Aurora ai administration assistant hive mind developer engineering just got upgrades.

  • @atultanna · 1 month ago

    This is a great job. Hope you can share code for auto-blogging; I've been looking around but haven't found much. Where can I get in touch?

  • @ErickXavier · 1 day ago

    What about adding pagination support, where the AI goes through page after page to scrape long paginated data?

  • @SohanDomingo · 2 months ago

    What video recording software do you use?

    • @redamarzouk · 2 months ago +1

      OBS Studio

  • @djagryn · 2 months ago +1

    Super interesting 🎉

  • @daedaluxe · 2 months ago +1

    I don't think LLMs are ready for this kind of scraping yet; better to get an LLM to make a Flask Python app that scrapes manually based on class names, so you pull the correct data with no hallucination. It can also pull images and zip them with zipfile.

    • @redamarzouk · 2 months ago

      LLMs are not all made the same. While scraping websites with 60K+ tokens, I noticed that GPT-4o mini gets me only a subset of the data, while the latest GPT-4o manages to get all of it.
      If someone is willing to pay $0.50 to $1 per extraction, they can use GPT-4o with a guaranteed correct and complete output.
      But $1 an extraction is still very high if we want to scale; in that sense it's not ready.
      For most cases, though, mini works great at $0.005 per extraction, and it's absolutely ready for anything.

  • @LeftBoot · 1 month ago +1

    How deep / how many 'pages in' will it go?

  • @echobucket · 1 month ago

    I would not trust this not to hallucinate. I'm thinking of a famous example where a model misinterpreted the columns and concatenated numbers together instead of treating them as separate columns, leading to incorrect values.

    • @redamarzouk · 1 month ago

      Most table data ends up with line breaks between values in the markdown.
      Can you share the use case where it hallucinated for you? It would be a very interesting case.

  • @HyperUpscale · 2 months ago +4

    Can you make it use Ollama on the backend instead of OpenAI?

    • @yunusemreertoprak7057 · 2 months ago

      Good question

    • @redamarzouk · 1 month ago +1

      Check this new video: th-cam.com/video/xrt2GViRzQo/w-d-xo.html

  • @TLCMEDIA1 · 2 months ago +1

    This is amazing. I have been trying to reproduce the code but I keep getting errors. Any chance you can do a dummies video, step by step, like ChatGPT does? Please 🙏🏾

    • @redamarzouk · 1 month ago +1

      I did; watch this video: th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=XWUzIu8uBehK4AV5

    • @TLCMEDIA1 · 1 month ago

      @redamarzouk Appreciate you so much 🙌🏾💯

  • @sahil5124 · 1 month ago +1

    So it's traditional scraping (Selenium and Beautiful Soup), and AI is only used to organize the scraped data into a given format; the AI does not do the scraping. Is that correct, or am I missing something?

    • @redamarzouk · 1 month ago

      Yes, the AI does the parsing. But producing unstructured markdown can't really be called traditional scraping; no one scrapes the whole page of unstructured data out of the HTML in a traditional setup.
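
      To make that pipeline concrete: the "unstructured markdown" step can be done with a converter like html2text (whether the project uses this exact library is an assumption), and the LLM only ever sees the result:

```python
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False  # keep hrefs; they often carry useful data

html = "<html><body><h1>Pikachu</h1><p>£9.99</p></body></html>"
markdown = converter.handle(html)  # unstructured markdown, handed to the LLM
print(markdown)
```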

  • @LeftBoot · 1 month ago

    Can it be multimodal? Reading data from an image, and also rendering data tables into an image. E.g. create a wallpaper of the most important Linux keyboard shortcuts, etc.

  • @bfamily787 · 2 months ago +3

    Great video! Can you show how to implement a local LLM like Ollama instead of OpenAI?

    • @redamarzouk · 2 months ago +1

      Thank you.
      This has been requested so many times, I guess I have to make a new video about it.

  • @aleksd286 · 1 month ago

    The problem isn't scraping the data; it's that if you scrape a public-facing website, most likely you'll get sued. Nowadays data is copyrighted material.

  • @jdnilsen · 1 month ago +1

    Thanks!

    • @redamarzouk · 1 month ago

      Wow, that's a first for me, thank you 🙏.
      If you have any questions, please join the Discord.

  • @SavanVyas91 · 2 months ago

    Pagination will be critical for this

  • @stokedbeachbum · 1 month ago +1

    Can you also crawl a site such as Zillow and scrape multiple URLs?

    • @redamarzouk · 1 month ago

      Websites like Zillow tend to have so much data in them (100K+ tokens), but the answer is still yes.

  • @brbl415 · 1 month ago +1

    Does it bypass reCAPTCHA?

  • @w3whq · 2 months ago +1

    Great resource.

  • @SiliconSouthShow · 2 months ago +1

    (sigh) Now make it work with Ollama and free LLMs. I don't support paying for anything that isn't low-cost or cheap; free is king when it comes to cost, and there are paid services that do this cheaply where you don't have to write anything. But... I appreciate the value in explaining, sort of, what does what within the script (the dependencies). This is useful to many folks out there; I know there was a time when it would have been valuable to me.

  • @BaldyMacbeard · 1 month ago +5

    Ah yes. Finally... an even more expensive way to scrape sites than we used to have...

    • @redamarzouk · 1 month ago

      Can you elaborate on which part you think is expensive?
      Is it the scraper I made, or just generally speaking?

    • @the_real_cookiez · 1 month ago

      @redamarzouk Beautiful Soup is free. And anything with LLM APIs isn't scalable, because it's billed per usage.

    • @realmstupid-on8df · 1 month ago

      $0.0015 is nothing. I bought $1 in Bitcoin at this amount.

  • @GabrielM01 · 3 days ago

    It would be nice to have an option to use Ollama, so we can run it locally without using OpenAI's proprietary AI.

  • @cineymatic · 2 months ago +2

    Great video! I have a few questions though 🤔:
    - Would it be easy to extend it to first log in to a site and then start scraping?
    - Would it be able to click buttons and scrape data from subsequent pages?
    - How is it identifying the elements on the page? Should it always be under a category or in the form of a table?

    • @redamarzouk · 1 month ago

      For the first two questions the answer is no, unless we create it for specific websites; otherwise we'd have to build a universal text-to-action module (which is infinitely harder to do).
      For the last question: as long as the element doesn't need a UI/UX action to appear, the scraper will pick it up.

    • @cineymatic · 1 month ago

      @redamarzouk Thank you for the response.

  • @joshd265 · 1 month ago

    Please can you host this tool online so that us non-dev folk can easily access it? Also, it would be great for the model to be able to summarise and pull keywords out of long product descriptions, etc.

  • @ditleporc · 1 month ago

    Good job Reda. What's up with your automation-campus website? Is it down? Too much success?

    • @redamarzouk · 1 month ago +1

      Thank you, but the website is up for me; I've just checked on multiple devices and on isitdownorjustme, all working.

    • @ditleporc · 1 month ago

      @redamarzouk Zscaler classified your site as suspicious...

  • @AbderrahmaneMotrani · 2 months ago

    Nice work Reda, I was actually looking for something like this. I tried to access the repo, but the link says 404 not found.

    • @redamarzouk · 2 months ago

      Yeah, GitHub banned me for some reason. Here is the link to the entire code:
      www.automation-campus.com/downloads/scrapemaster

  • @eea8888 · 2 months ago

    What if the data is dynamic, or there's a click required (like a search button), or a select to choose from, before scraping the data? What should we do in that case?

  • @staticalmo · 2 months ago +6

    No pagination?

    • @redamarzouk · 1 month ago

      Check the new video; the scraper works with Llama 3.1 and Groq's Llama 70B for free: th-cam.com/video/xrt2GViRzQo/w-d-xo.html

  • @Divyv520 · 2 months ago

    Hey Reda, really nice video! I was wondering if I could help you with higher-quality editing, highly engaging thumbnails, and overall YouTube strategy and growth. Please let me know what you think!

  • @marcusmayer1055 · 1 month ago +2

    How can I add a local Llama LLM to this project?

    • @redamarzouk · 1 month ago

      I did; watch this video: th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=XWUzIu8uBehK4AV5

  • @cheveznyc · 2 months ago +3

    Suggestions: the ability to scrape Bing, Yahoo, and Google, and to check the 2nd page of results for outdated websites, accessibility non-compliance, and non-mobile-friendly pages 📵. And is there a Google Maps version? 😢😮

  • @mrsai4740 · 1 month ago +1

    Hmm, it seems like I ran into a limitation. I tried scraping some golf courses (latitudes and longitudes) from Google Maps, but it only ever gives me 30 rows of data. At first I thought this might be a max-tokens issue, but I increased the max to the highest value possible, 16384 tokens, and it still only gave me around 30 rows of the same data.

    • @redamarzouk · 1 month ago

      What model have you been using? GPT-4o mini can go up to 128,000 tokens, and in my last video I added Gemini, which can go beyond 1M.
      I've noticed this behavior as well: when a single page has a huge amount of data, not just the table with the necessary data but other data too, we run into a hard limit on how many rows we can scrape (especially with apps like @irbnb and zill0w, where a map holds so much data that we won't be scraping it). I guess you found the same limitation.

    • @mrsai4740 · 1 month ago

      So I've been experimenting with this code, and I got it to work with pagination by specifying a new field for a next button and a new field for the number of pages. That seems to work well, but it also got me thinking: if we have too many tokens, we could probably chop the data up and run the pieces through the LLM. The only issue I can see is that if we start batching the data, we could end up missing critical pieces of information (if we substring at the wrong spot, we may lose rows). I will try out Gemini; I have never used it.
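
      A sketch of that batching idea: split the markdown on line boundaries with a small overlap, so a row cut at a chunk edge still appears whole in the next chunk (sizes are arbitrary, and duplicates from the overlap need deduplication downstream):

```python
def chunk_markdown(markdown: str, max_chars: int = 20000,
                   overlap_lines: int = 5) -> list[str]:
    """Split markdown into chunks, carrying a few lines of overlap forward."""
    chunks, current, size = [], [], 0
    for line in markdown.splitlines():
        current.append(line)
        size += len(line) + 1
        if size >= max_chars:
            chunks.append("\n".join(current))
            current = current[-overlap_lines:]  # overlap into the next chunk
            size = sum(len(l) + 1 for l in current)
    if current:
        chunks.append("\n".join(current))
    return chunks
```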

    • @redamarzouk · 1 month ago

      @mrsai4740 On some websites we can get the next page, or the URLs of the other pages, just by specifying them in the fields with this current version of the scraper.
      But the problem is that most websites don't include all the page URLs on the first page; usually it's in the form
      (1 2 3 4 ... 45 46 47 48), for example.
      In that case we have to ask the LLM to infer the URLs of the other pages from the pattern of the URLs it found.
      Other websites, where we only have a next button, can only be scraped one URL at a time, so the universal approach will need some time and work to figure out.

    • @mrsai4740 · 1 month ago

      @redamarzouk Hmm, maybe we're tackling this the wrong way, because it seems like for this to be a universal solution, some legwork by the user is needed. In cases like that scrapeme site, yeah, it's a lot easier to provide an array of URLs or a template that describes all the URLs, but that doesn't tackle the problem of single-page applications: some sites have a paginator that updates the current page in place. I guess it's back to the question: "how can we programmatically detect the way a site is paginating data?"

  • @edma6613 · 1 month ago

    Could it download or summarize the files (pdf…) from a website?

  • @mikevinitsky8506 · 2 months ago

    Can you make it spider a website, and if it finds a page that has all the required tags, put the information into JSON, a database, etc.?

  • @saxtant · 2 months ago +1

    That's cool, but I'll be replacing it soon with something local

    • @redamarzouk · 2 months ago +1

      Thank you.
      I couldn't do that, since I have a small RTX 3050 that can't handle the load, but it's a great idea to run it locally.

    • @saxtant · 2 months ago +1

      @redamarzouk You're right there. My project is somewhere in the middle: I have an RTX 3090 for inference with Llama 3.1 and an RTX 3070 for text-to-speech and speech-to-text. This combo is just enough to avoid too much hallucination, and I know that in a few years inference costs will come down. I'm hoping the work I'm doing now to bring structured outputs to Llama 3.1 8B will really take off.

  • @chandler_short · 1 month ago

    How about something like scraping Facebook Marketplace or OfferUp?

  • @danielerikschaconbaquerizo2957 · 1 month ago

    What about using the curl_cffi library with requests to simulate a browser, instead of Selenium or Playwright? I think it would be faster.
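
    For context, the curl_cffi approach looks like this (the "chrome" impersonation target is one of the library's presets):

```python
from curl_cffi import requests

# impersonate="chrome" applies a browser TLS/HTTP2 fingerprint preset,
# which gets past many basic bot checks.
resp = requests.get("https://example.com", impersonate="chrome")
print(resp.status_code, len(resp.text))
```

    The trade-off: curl_cffi doesn't execute JavaScript, so pages that render their data client-side still need a real browser like Selenium or Playwright.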

  • @younube2 · 2 months ago

    Can you input multiple URLs and have the scraper collate + populate the same file?

    • @redamarzouk · 2 months ago

      It can't do that today, but it would be a great addition.

  • @BohemianAnarchy · 1 month ago

    Curious: why not Puppeteer?

  • @kakamoora7874 · 1 month ago +1

    It's working... but the problem was some missing data; it filled in its own data instead...

    • @redamarzouk · 1 month ago

      That actually gives me an idea: add a text box where you can optionally provide some instructions about the specific website you're scraping.

  • @Web.Scraping · 1 month ago

    What about CAPTCHA solving, such as Cloudflare, reCAPTCHA, hCaptcha..?

  • @jakob1379 · 1 month ago

    I think they are being too harsh. There are other, more effective ways to scrape that are also promoted all over the place, so it's not a matter of risking over-spamming pages. Though it doesn't natively respect robots.txt, which might be an issue when promoting tools that don't need configuring.

  • @lyusvirazi6006 · 1 month ago

    Can you scrape PDF files from a website with this?

  • @Daltoncast · 1 month ago

    Takes a screenshot then extracts with AI?

  • @neylz · 2 months ago

    Can this be used to scrape Amazon data?

  • @younube2 · 2 months ago

    Does this work on Amazon?

  • @KPK_7 · 1 month ago

    Any way to scrape Twitter for a specific keyword?

  • @ghostwhowalks2324 · 1 month ago

    Can you use Playwright as well?

  • @naoufalbrahmi778 · 1 month ago

    GOOD JOB

  • @viejitoloco4133 · 1 month ago

    Why do all that random stuff? What's the purpose?

  • @imsjs78 · 2 months ago

    Sorry, but where can I see the actual code? Should I register on some website, or is there a link?

    • @redamarzouk · 2 months ago

      The project GitHub link is in the description.

    • @mertgokce6385 · 2 months ago +1

      @redamarzouk Is there something wrong with your GitHub? It's not accessible.

  • @savire.ergheiz · 1 month ago

    Only if the data isn't crawl-proofed 😂
    Most of the time, valuable data will be rate-limited or simply heavily restricted for crawlers.

  • @aelius_audio · 1 month ago

    This is cool. I currently use Octoparse, free up to 10K results.

    • @redamarzouk · 1 month ago +1

      Octoparse and other tools like it have great templates for scraping data from known websites.
      Do you know if it has a universal-scraper template, where we just give the fields and the URL and it scrapes the data for us?

    • @aelius_audio · 1 month ago

      @redamarzouk Yes, you can, with Octoparse. It has a "Magic Wizard"; I was quite impressed. You put in any URL, and its wizard identifies the fields. In my use case, it often only requires one input from me to designate pagination: at the bottom I just click "next" or whatever the case may be. I've used the Pro version, and that's useful if you're scraping daily. The free version will let you export up to 10K lines of results. Every other scraper SaaS I've used requires too much configuration.

  • @1brokkolibaum · 1 month ago +2

    Now make it work with a local Llama 2 😁

    • @firnnauriel · 1 month ago +1

      I'm interested in this as well. @redamarzouk, can you create a video on this?

    • @redamarzouk · 1 month ago +2

      I did; watch this video: th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=XWUzIu8uBehK4AV5

    • @1brokkolibaum · 1 month ago

      @redamarzouk Thank you for responding with the URL 😍 Incredible!

  • @Anton112eclipse · 1 month ago

    How does it work with pagination?

  • @JuankM1050 · 2 months ago

    I tried to make it work with the Google Gemini API, and sadly I could not; it always returns an empty table.

    • @redamarzouk · 1 month ago

      I've just added Gemini to an updated script I'm working on; I also added Llama 3.1.
      Stay tuned for the next video.

  • @grahamahosking · 1 month ago

    Is it possible to add this to Home Assistant?

  • @abdopower5913 · 1 month ago +2

    Are you Moroccan or Algerian? 😊

    • @moiguess3256 · 1 month ago

      Moroccan, easy to find out.

  • @fxhp1 · 1 month ago

    Skeleton key for web scraping. Follow back!

  • @BlackDragonBE · 1 month ago

    This would only be useful to me if it tried different methods to circumvent bot detection and then gave me Python source code to scrape the website. I don't care about one-off results; I need code that can scrape repeatedly when necessary. Once I can get the HTML of a page, the scraping itself is trivial. Logging in and finding a way around stuff like Cloudflare and bot detectors is the tricky part.

  • @daithi007 · 2 months ago

    Do you have to manually accept cookies?

    • @redamarzouk · 2 months ago

      No, I didn't need to for the websites I scraped.

  • @shahjahanmirza1616 · 2 months ago +1

    Can we use this with the Gemini API? BTW, great work.

    • @redamarzouk · 2 months ago

      Thank you. It's not added in the code itself, but Gemini Pro and Flash both have structured output like OpenAI, meaning it will be very easy to add Gemini.

  • @eightrice · 1 month ago

    There is no need to parse the actual scraped data through the LLM.

    • @redamarzouk · 1 month ago

      I didn't scrape structured data, but rather unstructured markdown, so in my case parsing is necessary to get the table I want.

  • @peladoclaus · 18 days ago

    What's better about this than Google advanced search?

    • @redamarzouk · 10 days ago

      I don't see how they're similar.
      I'm not searching for anything; I'm giving an exact URL from which I want to extract structured data using an LLM.

  • @hamburger--fries · 2 months ago

    Meh... seems easier to just use Beautiful Soup and write a mega-script to handle all the different types of data. Then you can sort the data, deal with strange characters, and dump it into a DB all in one go.