This Open Source Scraper CHANGES the Game!!!

  • Published 24 Dec 2024
  • Hello Everyone,
    Here is the link to the whole code on my website:
    www.automation...
    My GitHub account has been SUSPENDED (I have no idea why), and I didn't receive any warning or explanation from GitHub justifying the suspension. I'm confused because similar AI scraper projects are on GitHub and none of them got suspended.
    Also check out the 2.0 version here:
    • Yeah but can it RUN LO...
    www.automation...
    _______ 👇 Links 👇 _______
    🤝 Discord: / discord
    💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: / reda-marzouk-rpa
    📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: / redamarzouk.rpa
    🤖 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: / @redamarzouk
    Website: www.automation...
    _______ 👇 Content👇 _______

Comments • 297

  • @redamarzouk
    @redamarzouk  3 months ago +43

    Hey Everyone,
    Link to code: www.automation-campus.com/downloads/scrapemaster
    My GitHub account has been SUSPENDED (I have no idea why), and I didn't receive any warning or explanation from GitHub justifying the suspension. I'm confused because similar AI scraper projects are on GitHub and none of them got suspended.
    I opened a ticket and I'm waiting for their answer.
    In the meantime I shared the code on my website with all the steps to reproduce the AI scraper.

    • @ShaunPrince
      @ShaunPrince 3 months ago +1

      Let me know if I can help with this. I can set up a Gitea on AWS or something.

    • @Kevinsmithns
      @Kevinsmithns 3 months ago +2

      Yeah I was just looking and about to comment

    • @alex_osti
      @alex_osti 3 months ago +2

      I was about to give it a shot.. Waiting for the update. Great work btw

    • @rperellor
      @rperellor 3 months ago +1

      I had the opportunity to view it, but did not clone it

    • @redamarzouk
      @redamarzouk  3 months ago +8

      @@rperellor here is the code www.automation-campus.com/downloads/scrapemaster

  • @RoughSubset
    @RoughSubset 3 months ago +163

    So I worked at a company once where the data guy built his own web scraper to pull pricing data off our competitors' websites. One thing they did to protect their site from scraping was user-agent filtering; to overcome that limitation, he kept a very long list of different user agents and rotated them while scraping. I think that would be a good addition to your app. A small but useful change.

    • @redamarzouk
      @redamarzouk  3 months ago +18

      Yes, if we launch the scraper with the same user agent against the same websites too many times, they will pick up on it and block us.
      The modification will keep a list of operating systems and browsers, with their versions, to build varied user agents (a rough sketch is below).
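      A minimal sketch of that rotation, assuming the project's Selenium/Chrome setup; the USER_AGENTS list and the setup_driver name are illustrative, not code from the video:

      import random
      from selenium import webdriver
      from selenium.webdriver.chrome.options import Options

      USER_AGENTS = [
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
          "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
      ]

      def setup_driver() -> webdriver.Chrome:
          options = Options()
          # Pick a different user agent on every run so repeated scrapes look less uniform.
          options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
          return webdriver.Chrome(options=options)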

    • @markomarjanovic8348
      @markomarjanovic8348 3 months ago +19

      @@redamarzouk Would it be possible to have a video about implementing proxy rotation? There is not much of it on YT, but I think it's crucially important.

    • @redamarzouk
      @redamarzouk  3 months ago +17

      @@markomarjanovic8348 Added to the backlog

    • @amortalbeing
      @amortalbeing 3 months ago +2

      This is a good suggestion; I'd like this to be added as well.

    • @internetperson2
      @internetperson2 3 months ago +1

      Thirded

  • @thisisfabiop
    @thisisfabiop 3 months ago +24

    Amazing work! It works great, but it doesn't handle cases where the results are divided into pages instead of using infinite scroll. It would be fantastic if it could also navigate through the pages until there are no more left.
    Another great feature (although it might make the tool more expensive, so it could be offered as an optional, selectable setting in the UI) would be for the scraper to open each item's page and scrape data from there. As you know, the initial page often only displays limited information about the product.

  • @jdnilsen
    @jdnilsen 3 months ago +2

    Thanks!

  • @moiguess3256
    @moiguess3256 2 months ago +1

    You earned a new subscriber. Algerian brother here.

  • @SergeyNumerov
    @SergeyNumerov 3 months ago +32

    Pretty cool.
    Let me point out, though, that the main complexity with scraping is that the relevant content is often hidden: getting to it may require clicking various UX elements.
    So to _really_ crack scraping with AI, we'll need to go agentic: the solution will need to figure out what to click in order to reveal the information of interest.

    • @SpragginsDesigns
      @SpragginsDesigns 3 months ago +3

      Exactly. Anyone interested in helping me make something like this? Or is there something available already?

    • @pyros4333
      @pyros4333 3 months ago +1

      @@SpragginsDesigns You could just hire someone to build it for you easily.

  • @justjosh1400
    @justjosh1400 3 months ago +6

    Definitely going to use this, I think this is awesome. As a suggestion for future options, it would be great to have pagination support and depth levels. A lot of my scraping is location-based, for instance states > cities > locations, and the data I usually want is within the locations, which may only be a few.

    • @redamarzouk
      @redamarzouk  3 months ago +3

      Thank you.
      Yes, pagination will make this complete.
      But I'm thinking about how to make it universal, because it has to work on every website. Would I just add another LLM call to detect any URL pagination pattern, or do you have a better idea of how to do it?

    • @justjosh1400
      @justjosh1400 3 months ago +1

      @@redamarzouk That might actually work; even a smaller model would be capable of determining whether the page has pagination. Or have a checkbox for the user to manually say it has pagination, so the LLM only looks for it when asked. That way it's not always looking for it. And when it finds it, return what kind of class it is. IDK

    • @wdonno
      @wdonno 3 months ago

      @@redamarzouk Similar scenarios may be an interim pathway: if the initial URL prompts for a selection (or text input) that determines the next page, can you add the ability to make that selection, ideally from a list of items of prior interest? The recursive ability to select specific buttons according to options on following pages would then solve a large number of use cases (i.e. an ability to map different actions to preselected, known option types). The base use case is downloading files from a selection which varies by initial (or ideally subsequent) text inputs, terminated by pressing a button to download a file or selected files. The approach could then be expanded to add more scenarios, until it is universal!

    • @justjosh1400
      @justjosh1400 3 months ago +1

      Thinking about it more: maybe have an area to manually paste the div container that the user can grab from the inspect tool.
      Or...
      Since we're using an LLM, you could always prompt for it and return the value of the container: ask it to check whether the page has pagination at the bottom or top, and if so return a value and use that value to fill in the next URLs (a rough sketch of this extra LLM call follows below).
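      A rough sketch of the "extra LLM call to detect pagination" idea discussed above, assuming the OpenAI Python client (>=1.0) and a markdown string produced by the scraper; the prompt and function name are illustrative, not the project's actual code:

      import json
      from openai import OpenAI

      client = OpenAI()

      def detect_pagination_urls(markdown: str, base_url: str) -> list[str]:
          prompt = (
              "You are given the markdown of a web page. Find pagination links "
              "(next page, numbered pages) and return ONLY a JSON array of absolute URLs. "
              f"Base URL: {base_url}\n\n{markdown[:20000]}"
          )
          response = client.chat.completions.create(
              model="gpt-4o-mini",
              messages=[{"role": "user", "content": prompt}],
          )
          try:
              return json.loads(response.choices[0].message.content)
          except json.JSONDecodeError:
              return []  # model did not return clean JSON; fall back to single-page scraping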

  • @danielcave9606
    @danielcave9606 2 months ago +2

    Most of the "traditional" enterprise-grade scraping tech companies are adopting LLMs into their stack as an option for when it makes sense. When you're scraping millions or billions of pages, every 100th of a cent matters, so they take a composite AI approach: use ML models to get the majority of the standard data points for a general schema cheaply, then let LLMs do the thing they do best, extracting data from unstructured text, to extend that schema. That way you get the cost efficiency with the flexibility of LLMs when needed.
    The real benefit of the LLM approach for bigger teams/projects is actually that it abstracts away hard-coding selectors into your spiders, so they are far more robust and unlikely to break in 3 months when the website changes its HTML, reducing your maintenance burden/debt. That's my 10 cents anyway.
    I personally love what your project does for the everyday person, though: getting small/medium crawls done where price per request isn't so important, and where you have time/space for more rigorous custom QA. I especially love it for content generation purposes, data journalism, chart porn and the like. Great work!

    • @redamarzouk
      @redamarzouk  2 months ago

      Yeah, I thought I was creating a scraper at scale, but once I started using it extensively I see it more as a productivity tool to help get the data quickly without the need for copy-paste.
      Traditional scrapers will still have a place in the market simply because once you want to scrape hundreds of thousands or millions of pages, the cost of paying coders for custom scripts and maintenance will make sense compared to the value of the data scraped.

  • @MoneylessWorld
    @MoneylessWorld 3 months ago +5

    The dependency on OpenAI and the API key is a bummer.
    It would be better if we could plug in our own open-source AI engine and models.

    • @sixman9
      @sixman9 3 months ago

      If I'm not wrong, tools like Ollama expose local LLMs through part of OpenAI's API surface. The docs read 'for chat/completions'.
      If this scraper is using OpenAI's function calling interface, you might be out of luck.
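      A minimal sketch of that idea: pointing the OpenAI client at Ollama's OpenAI-compatible endpoint (http://localhost:11434/v1). Whether structured outputs or function calling work depends on the local model; plain JSON-in-the-prompt is the safer fallback. The model name is illustrative:

      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # the key is ignored by Ollama

      response = client.chat.completions.create(
          model="llama3.1",
          messages=[{"role": "user", "content": "Return the listings below as a JSON array: ..."}],
      )
      print(response.choices[0].message.content)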

    • @91Chanito
      @91Chanito 3 months ago +1

      You can do that with your local LLM.

  • @orangehatmusic225
    @orangehatmusic225 3 months ago +3

    So you can scrape 666.66 pages for $1 based on that usage.

  • @mrsai4740
    @mrsai4740 3 months ago +1

    Hmm, it seems like I ran into a limitation. I tried scraping some golf courses (latitudes and longitudes) from Google Maps, but it only ever gives me 30 rows of data. At first I thought this might be an issue with max tokens, but I increased it to the highest value possible, 16384 tokens, and it still only gave me around 30 rows with the same data.

    • @redamarzouk
      @redamarzouk  3 months ago

      What model have you been using? gpt-4o-mini can go up to 128,000 tokens, and in my last video I've added Gemini, which can go up to more than 1M.
      I've noticed this behavior as well: when a single page has so much data, not just the table with the necessary data but other data too, we run into a hard limit on how many rows we can scrape (especially with apps like @irbnb and zill0w, where there is a map holding so much data we won't be scraping). I guess you found the same limitation.

    • @mrsai4740
      @mrsai4740 3 months ago

      So I have been experimenting with this code, and I got it to work with pagination by specifying a new field for a next button and a new field for the number of pages. This seems to work well, but it also got me thinking: if we have too many tokens, we could probably chop the data up and then run the pieces through the LLM. The only issue I can see is that if we start batching the data, we could end up missing critical pieces of information (if we substring at the wrong spot, we may end up missing rows). I will try out Gemini, I have never used it.
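      A rough sketch of that chunking idea: split the markdown on line boundaries with some overlap, so a row that straddles a chunk edge still appears whole in one chunk (deduplicating repeated rows is left to the caller). The chunk sizes are illustrative:

      def chunk_markdown(markdown: str, max_chars: int = 40_000, overlap_lines: int = 20) -> list[str]:
          lines = markdown.splitlines()
          chunks, start = [], 0
          while start < len(lines):
              size, end = 0, start
              # Grow the chunk line by line until it would exceed max_chars.
              while end < len(lines) and size + len(lines[end]) + 1 <= max_chars:
                  size += len(lines[end]) + 1
                  end += 1
              if end == start:          # a single oversize line: keep it whole rather than dropping it
                  end = start + 1
              chunks.append("\n".join(lines[start:end]))
              if end >= len(lines):
                  break
              start = max(end - overlap_lines, start + 1)  # step back a little for overlap
          return chunks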

    • @redamarzouk
      @redamarzouk  3 months ago

      @@mrsai4740 On some websites we can get either the next page or the URLs of the other pages just by specifying it in the fields, using this current version of the scraper.
      But the problem is that most websites don't include all the page URLs on the first page; usually it's in the form
      (1 2 3 4 ....45 46 47 48), for example.
      In this case we have to ask the LLM to infer the URLs of the other pages from the pattern of the URLs it found.
      Other websites, where we only have a next button, can only be scraped one URL at a time, so the universal approach will need some time and work to figure out.

    • @mrsai4740
      @mrsai4740 3 months ago

      @@redamarzouk Hmm, maybe we are tackling this the wrong way, because it seems like for this to be a universal solution, some legwork by the user needs to be done. In cases like that scrapeme site, it is a lot easier to provide an array of URLs or a template that describes all the URLs, but that doesn't tackle the problem of single-page applications. Some sites have a paginator that modifies the current page with updated information. I guess it's back to the question: "how can we programmatically detect the way a site is paginating data?"

  • @rgsiiiya
    @rgsiiiya 3 months ago

    This, and the V2 with Llama, are very interesting concepts, and I believe they could be tremendously valuable.
    The shortcoming is that it is limited to just the single page at the URL location.
    To be truly valuable, it needs to also be a crawler (as you mention).
    Think of the use case of scraping ecommerce sites for product details: any "real" ecommerce site is going to have many, many categories and pages of categorized product listings.
    While you can set up traditional scrapers and manually configure the navigation, this is where AI should really shine. It should be able to figure out the navigation and automatically navigate and scrape the site.

  • @HyperUpscale
    @HyperUpscale 3 months ago +4

    Can you make it use Ollama on the backend instead of OpenAI?

    • @yunusemreertoprak7057
      @yunusemreertoprak7057 3 months ago

      good question

    • @redamarzouk
      @redamarzouk  3 months ago +1

      Check this new video: th-cam.com/video/xrt2GViRzQo/w-d-xo.html

  • @ginocote
    @ginocote 3 months ago +4

    One of my ideas is to use an AI scraper for a first test scrape. If it works, output something like a JSON holding the id or class of each scraped element, then give that JSON to a conventional non-AI scraper to scrape the website for free and faster, without needing AI afterwards.
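    A sketch of that idea: have the LLM emit CSS selectors once, then reuse them with a plain BeautifulSoup scraper so later runs need no AI calls. The selector values and URL are illustrative:

    import json
    import requests
    from bs4 import BeautifulSoup

    # Pretend the LLM returned this mapping of field name -> CSS selector on the first AI-assisted run.
    selectors = json.loads('{"name": "li.product h2", "price": "li.product .price"}')

    html = requests.get("https://scrapeme.live/shop/", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    names = [el.get_text(strip=True) for el in soup.select(selectors["name"])]
    prices = [el.get_text(strip=True) for el in soup.select(selectors["price"])]
    print(list(zip(names, prices))[:5])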

    • @lovol2
      @lovol2 3 months ago

      This is just writing code. Just copy-paste the HTML into ChatGPT and ask it to write the code to parse it into JSON. Works really well.

  • @SamirDamle
    @SamirDamle 3 months ago +9

    Thanks for the simple tutorial and code.
    Can you add an example of using this scraper with local Ollama and Llama 3.1 instead of OpenAI to make it totally free?

    • @redamarzouk
      @redamarzouk  3 months ago +5

      You’re welcome.
      I can add it, but I won’t be able to test it.
      My small GPU can’t really handle it, especially when I’m filming.

    • @HyperUpscale
      @HyperUpscale 3 months ago

      @@redamarzouk YES, PLEASE 🙏!!!

    • @GundamExia88
      @GundamExia88 3 months ago

      @@redamarzouk I hope this gets added. I prefer to run Ollama locally. I'm only using a GTX 1070 and it works fine.

    • @idrinkmusic
      @idrinkmusic 3 months ago +1

      @@redamarzouk this would be a game-changing update. You earned a sub for this video regardless.

    • @carvierdotdev
      @carvierdotdev 3 months ago

      @@GundamExia88 Could you please tell me what models you run? I have a GTX 1080 Ti 11GB, thanks to a friend, and I want to play with that but I don't even know if it's possible 😂😅

  • @ErickXavier
    @ErickXavier 1 month ago

    What about adding pagination support, where the AI will go through pages and pages to scrape long paginated data?

  • @LeftBoot
    @LeftBoot 3 months ago +1

    How deep / how many 'pages in' will it go?

  • @dimadem
    @dimadem 3 months ago +1

    Such a good idea and explanation, thank you.

  • @Alphamaan
    @Alphamaan 1 month ago

    Can this app click on a car's page to scrape the details, then go back and click on another car's page to scrape the details again?

  • @shawnsmith9198
    @shawnsmith9198 3 months ago +4

    You are a genius! I am on a Mac, so I just had to change the driver call, but everything else is working well. Pagination or a series of URLs would be cool. I love how you have it load in the Chrome browser; this really changes how I think about cross-platform apps. I wonder if we can scrape Instagram now. Or what about downloading images? Maybe a simple copy-table button, since I just copy and paste into Google Docs.
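    For the macOS driver questions in the replies below, a minimal cross-platform sketch, assuming Selenium 4.6+: Selenium Manager downloads and resolves the matching ChromeDriver automatically, so no hard-coded driver path is needed on macOS, Windows, or Linux.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    driver = webdriver.Chrome(options=options)  # no Service(executable_path=...) required
    driver.get("https://example.com")
    print(driver.title)
    driver.quit()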

    • @jimbob3823
      @jimbob3823 3 months ago

      New to macOS, can you please share your driver path? Not 100% sure which is the executable. Ty!

    • @thecashlessgamer480
      @thecashlessgamer480 3 months ago

      Yes please can you help me set it up on my mac as well?

    • @wavelyveney9021
      @wavelyveney9021 2 months ago

      I need assistance setting it up on a Mac.

  • @Ant-ym3mw
    @Ant-ym3mw 3 months ago +1

    You got yourself a new sub!

  • @TheLionsaba
    @TheLionsaba 3 months ago +1

    Great video as always. The only downside is that it is addressing people who work with code and are experienced in data scraping. For no-code or very-little-code people like me, I think the best way is to use computer vision models (VLMs); ChatGPT already has it in its API, but we also have two new open-source models that just came out this week, Qwen2-VL and Microsoft Phi-3.5-vision.

    • @quercus3290
      @quercus3290 3 months ago

      LAION has an open-source model; it is a very powerful scraper. You will most likely need to fine-tune any vision models.

  • @marcusmayer1055
    @marcusmayer1055 3 months ago +2

    How do I add a local LLM like Llama to this project?

    • @redamarzouk
      @redamarzouk  3 months ago

      I did; watch this video: th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=XWUzIu8uBehK4AV5

  • @aveenof
    @aveenof 3 months ago +2

    Awesome work! Any idea why the scraped output list gets truncated even when input + output tokens < max?

    • @redamarzouk
      @redamarzouk  3 months ago

      In some cases I noticed that gpt-4o-mini can't extract all the data from the website.
      I tried with gpt-4o and it was successful.
      So if you're sure your data is in the markdown and gpt-4o-mini didn't pick it up, try with gpt-4o.

  • @CicadaMania
    @CicadaMania 1 month ago

    Does a Disallow rule in robots.txt, like "User-agent: GPTBot / Disallow: /", stop it from working?

  • @daedaluxe
    @daedaluxe 3 months ago +1

    I don't think LLMs are ready for this kind of scraping yet. Better to get an LLM to write a Flask Python app that scrapes manually based on class names, so you pull correct data with no hallucination; it can also pull images and zip them with zipfile.

    • @redamarzouk
      @redamarzouk  3 months ago

      LLMs are not all made the same: while scraping websites with 60K+ tokens I noticed that gpt-4o-mini gets me only a subset of the data, while the latest gpt-4o manages to get me all the data.
      If someone is willing to pay $0.5 to $1 per extraction, they can use gpt-4o and get a correct and complete output.
      But $1 per extraction is still very high if we want to scale; in that sense it's not ready.
      For most cases, though, mini works great at around $0.005 per extraction, and it's absolutely ready for that.

  • @brbl415
    @brbl415 3 months ago +1

    Does it bypass reCAPTCHA?

  • @snehasissnehasis-co1sn
    @snehasissnehasis-co1sn 3 months ago +13

    I want to use a Groq API key because it's free to use, or a local LLM like Ollama. Please modify this code if possible. Great video!

    • @satyaviswapavanranga5915
      @satyaviswapavanranga5915 3 months ago +1

      Same question. I was wondering, can we do it using Groq or Cohere?

    • @ianmatejka3533
      @ianmatejka3533 3 months ago +1

      Load the Groq API key with os.getenv() instead of passing the string in directly.
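      A small sketch of that suggestion: read the key from the environment (optionally via a .env file) instead of hard-coding it. Assumes python-dotenv; the variable name is illustrative.

      import os
      from dotenv import load_dotenv

      load_dotenv()                              # pulls GROQ_API_KEY from a local .env file, if present
      groq_api_key = os.getenv("GROQ_API_KEY")   # returns None if the variable is not set
      if groq_api_key is None:
          raise RuntimeError("Set GROQ_API_KEY in your environment or .env file")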

    • @redamarzouk
      @redamarzouk  3 months ago +5

      @@snehasissnehasis-co1sn Both have been added.
      I'll present them in the next video.

  • @sahil5124
    @sahil5124 3 months ago +1

    So it's traditional scraping (Selenium and Beautiful Soup), and the AI is only used to organize the scraped data into a given format. The AI does not do the scraping. Is that correct, or am I missing something?

    • @redamarzouk
      @redamarzouk  3 months ago

      Yes, the AI does the parsing. But creating unstructured markdown can't really be called traditional scraping; no one scrapes the whole unstructured content from the HTML in a traditional setup.

  • @staticalmo
    @staticalmo 3 months ago +6

    No pagination?

    • @redamarzouk
      @redamarzouk  3 months ago

      Check the new video; the scraper works with Llama 3.1 and the Groq-hosted Llama 70B for free: th-cam.com/video/xrt2GViRzQo/w-d-xo.html

  • @minissoft
    @minissoft 3 months ago +7

    Hello Reda, you should use Polars instead of Pandas; in a lot of cases it's much faster.
    Also, add_argument("--disable-search-engine-choice-screen") is useful, plus ("--headless") maybe?

    • @redamarzouk
      @redamarzouk  3 months ago +1

      Oh, I was looking for that argument, "--disable-search-engine-choice-screen"; that popup is annoying (even if it doesn't affect the scraping). I will be adding that, thank you!! (See the snippet below.)
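      A short sketch of the Chrome options mentioned above; both flags are standard Chromium switches, and whether headless mode is desirable depends on the target sites:

      from selenium import webdriver
      from selenium.webdriver.chrome.options import Options

      options = Options()
      options.add_argument("--disable-search-engine-choice-screen")  # skips the search-engine picker popup
      options.add_argument("--headless=new")                         # run without a visible browser window (optional)
      driver = webdriver.Chrome(options=options)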

  • @djasnive
    @djasnive 3 months ago +3

    Great project.
    Is it possible to use an open-source, self-hosted model like Llama?

    • @redamarzouk
      @redamarzouk  3 months ago +2

      Thank you.
      Yes, it's possible, but I didn't even try this time because gpt-4o and Gemini Flash are so cheap and have a huge context window, so I just went with them.
      But it's perfectly possible; you just need to modify the "format_data" function.
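      A guess at what a local-model variant of that format_data function could look like. Only the function name comes from the reply above; the endpoint, model name, and prompt are illustrative, and the local model must be able to return valid JSON:

      import json
      from openai import OpenAI

      local_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # e.g. an Ollama server

      def format_data(markdown: str, fields: list[str]) -> list[dict]:
          prompt = (
              f"Extract every item from the page below as a JSON array of objects "
              f"with exactly these keys: {fields}. Return only JSON.\n\n{markdown}"
          )
          response = local_client.chat.completions.create(
              model="llama3.1",
              messages=[{"role": "user", "content": prompt}],
          )
          return json.loads(response.choices[0].message.content)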

    • @satyaviswapavanranga5915
      @satyaviswapavanranga5915 3 months ago

      @@redamarzouk Thank you so much, I had the same question, Thanks for answering.

  • @stokedbeachbum
    @stokedbeachbum 3 months ago +1

    Can you also crawl a site such as Zillow and scrape multiple URLs?

    • @redamarzouk
      @redamarzouk  3 months ago

      Websites like Zillow tend to have so much data inside them (100K+ tokens), but the answer is still yes.

  • @CryptoDuhd
    @CryptoDuhd 3 months ago

    I would love it even more if you created a Docker container that was downloadable and installable directly on a Linux box. A user-agent swap feature (a list of user agents chosen round-robin or randomized) would be great too, along with handling a list of proxies that would also be swapped.

    • @redamarzouk
      @redamarzouk  3 months ago

      I haven't created a Docker container, but I did add a random user-agent pick from a list. You can find the code for that in this video: th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=smByssvvNhudzgRS
      What type of websites will you use this app to scrape?

  • @JordanCrawfordSF
    @JordanCrawfordSF 1 month ago

    0:36 - dude got possessed by ChatGPT and his eyes went bananas.

  • @obey24com
    @obey24com 3 months ago +1

    What about websites with Cloudflare security etc.?

    • @TheLionsaba
      @TheLionsaba 3 months ago

      Very important question.

  • @GabrielM01
    @GabrielM01 1 month ago

    It would be nice to have an option to use Ollama so we can run it locally without using OpenAI's proprietary AI.

  • @blunoodle
    @blunoodle 3 months ago

    I used the Replit AI agent to build and deploy a kickass website scraper in like 10 minutes!

  • @remusomega
    @remusomega 3 months ago

    A really cool feature would be to add a text splitter that splits the text semantically into small chunks, so we can readily use this to feed a RAG pipeline. Right now we typically split things arbitrarily, but semantic splitting is best.

    • @redamarzouk
      @redamarzouk  3 months ago +1

      can you give me an example of an output to split?

    • @TimothyJoh
      @TimothyJoh 3 months ago

      There are many such splitters available in LlamaIndex or LangChain already. Another “automated” way might be to ask GPT-4o mini to split it for you.
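      A small sketch of that suggestion using LangChain's recursive splitter (which splits on paragraph/sentence boundaries rather than true embedding-based semantics). The import path assumes the langchain-text-splitters package, and the file path is hypothetical:

      from pathlib import Path
      from langchain_text_splitters import RecursiveCharacterTextSplitter

      markdown = Path("scraped_page.md").read_text()  # hypothetical file holding the scraper's markdown output
      splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
      chunks = splitter.split_text(markdown)
      print(len(chunks), "chunks ready for embedding / RAG")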

  • @aleksandars9254
    @aleksandars9254 3 months ago

    Thanks for the video! What mic are you using?

  • @nmlker
    @nmlker 3 months ago +1

    @redamarzouk Nice and easy scraper. I saw that you also have ScrapeMaster 2.0 and installed that. The .env file mentions a Google API key. Which one should be added? Do you have a link for where to get this particular Google API key?

    • @redamarzouk
      @redamarzouk  3 months ago

      Thank you. To get the Google API key, go to aistudio.google.com/app/apikey,
      create a new API key there, and add it to the .env.
      You can find all the details of ScrapeMaster 2.0 here:
      th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=KH5bfxyYJ9NV90FU

  • @eea8888
    @eea8888 3 months ago

    What if the data is dynamic, or there is a click required (like a search button), or a select to choose from before the data appears? What should we do in that case?

  • @KPK_7
    @KPK_7 3 months ago

    Any way to scrape Twitter for a specific keyword?

  • @maxxflyer
    @maxxflyer 3 months ago

    If I show the screenshot of the Pokémon page to GPT, it will directly scrape all the data. So my first feeling is that the AI is smart enough to suggest the fields in a dropdown menu, so I can choose them, tell it what I really want, and decide a final label for each one of them.
    ...just an example to start!
    But as I said, ChatGPT can do the same with just a prompt. I don't actually need your app unless the page is full of data; in that case there may be limitations.
    So you should ask yourself what a prompt can't do.
    Anyway, my real problem is to have a scraper able to scrape data that is distributed across various pages, or those cases where you must click a "load more" button.
    And I want to be able to specify the download format. GPT can reformat anything into anything.
    Nice work, but there are tons of improvements to be made. I will follow you to see where you get to.

  • @mzahran001
    @mzahran001 3 months ago +3

    Thanks for the great video. Idea for the next videos: could you extend the code with crawling, for example getting results from search engines or following a specific path to get more structured data?

    • @redamarzouk
      @redamarzouk  3 months ago

      You're welcome. Can you elaborate on how it should look?
      Because this would be awesome, and I actually gave it some thought, but it's hard to get the exact links of multiple pages from which you want to extract data if you don't have the link to the first page.
      Do you think we can trust a search engine to give us the exact links we want to scrape data from?

  • @danielerikschaconbaquerizo2957
    @danielerikschaconbaquerizo2957 3 months ago

    What about using the curl_cffi library with requests to simulate a browser instead of Selenium or Playwright? I think it would be faster.

  • @LeftBoot
    @LeftBoot 3 months ago

    Can it be multimodal? Viewing data in an image, and also turning data tables into an image, e.g. create a wallpaper of the most important Linux keyboard shortcuts, etc.

  • @aleksd286
    @aleksd286 3 months ago

    The problem isn't scraping the data; it's that if you scrape a public-facing website, you'll most likely get sued. Nowadays data is copyrighted material.

  • @chandler_short
    @chandler_short 2 months ago

    How about something like scraping Facebook Marketplace or OfferUp?

  • @JuankM1050
    @JuankM1050 3 months ago

    Then I tried to make it work with the Google Gemini API, and sadly I could not; it always returns an empty table.

    • @redamarzouk
      @redamarzouk  3 months ago

      I've just added Gemini to an updated script I'm working on, and I also added Llama 3.1.
      Stay tuned for the next video.

  • @Web.Scraping
    @Web.Scraping 3 months ago

    What about captcha solving, such as Cloudflare, reCAPTCHA, hCaptcha?

  • @echobucket
    @echobucket 3 months ago

    I would not trust this to not hallucinate. I think of a famous example where it misinterpreted the column and concatenated some numbers together instead of treating them as separate columns, leading to incorrect values.

    • @redamarzouk
      @redamarzouk  3 months ago

      Most data in tables ends up with line breaks between values in the markdown.
      Can you share the use case where it hallucinated for you? It would be a very interesting case.

  • @mikevinitsky8506
    @mikevinitsky8506 3 months ago

    Can you make it spider a website, and if it finds a page that has all the required tags, put the information into JSON, a database, etc.?

  • @TLCMEDIA1
    @TLCMEDIA1 3 months ago +1

    This is amazing. I have been trying to reproduce the code but I keep getting errors. Any chance you can do a beginner video, step by step, the way ChatGPT explains things? Please 🙏🏾

    • @redamarzouk
      @redamarzouk  3 months ago +1

      I did; watch this video: th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=XWUzIu8uBehK4AV5

    • @TLCMEDIA1
      @TLCMEDIA1 3 months ago

      @@redamarzouk appreciate you so much 🙌🏾💯

  • @iltodes7319
    @iltodes7319 3 months ago +1

    Good job bro continue ❤

    • @redamarzouk
      @redamarzouk  3 months ago

      My pleasure!

  • @Anton112eclipse
    @Anton112eclipse 2 months ago

    How does it work with pagination?

  • @DummyAllan
    @DummyAllan 3 months ago +1

    I really appreciate the great work you are doing.
    Quick one: what happens with sites that require credentials? How do you handle that case?
    Thanks

    • @redamarzouk
      @redamarzouk  3 months ago +1

      That will need an intervention on your side: keep the website open and run the process again so it has direct access to the data.

  • @brianzvc
    @brianzvc 1 month ago

    Does this scrape dynamic data?

  • @edma6613
    @edma6613 3 months ago

    Could it download or summarize files (PDFs, etc.) from a website?

  • @moeabdo3114
    @moeabdo3114 2 months ago

    Can this scrape from YouTube? For SEO? Thanks for your amazing work.

  • @SohanDomingo
    @SohanDomingo 3 months ago

    What video recording software do you use?

    • @redamarzouk
      @redamarzouk  3 months ago +1

      OBS Studio

  • @BohemianAnarchy
    @BohemianAnarchy 3 months ago

    Curious: why not Puppeteer?

  • @Cygx
    @Cygx 3 months ago

    Why do I need to use an LLM for scraping the data?

    • @redamarzouk
      @redamarzouk  3 months ago

      Yeah, for one or two websites it doesn't make sense, but being able to scrape any website with a single app is pretty useful.
      Would you still prefer the traditional option even if you had to create a script every time?

  • @joshd265
    @joshd265 3 months ago

    Please can you host this tool online so that non-dev folks can easily access it? Also, it would be great for the model to be able to summarise and pull keywords out of long product descriptions etc.

  • @lyusvirazi6006
    @lyusvirazi6006 3 months ago

    Can you scrape a PDF file from a website with this?

  • @peladoclaus
    @peladoclaus 2 months ago

    What's better about this than Google advanced search?

    • @redamarzouk
      @redamarzouk  2 months ago

      I don't see how they're similar.
      I'm not searching for anything; I'm giving it an exact URL from which I want to extract structured data using an LLM.

  • @SoSoInfinite
    @SoSoInfinite 3 months ago

    Can this scrape the eBay API?

  • @ditleporc
    @ditleporc 3 months ago

    Good job Reda. What's up with your automation-campus website? Is it down? Too much success?

    • @redamarzouk
      @redamarzouk  3 months ago +1

      Thank you, but the website is up for me. I've just checked on multiple devices and on isitdownorjustme; all working.

    • @ditleporc
      @ditleporc 3 months ago

      @@redamarzouk Zscaler classified your site as suspicious....

  • @eightrice
    @eightrice 3 months ago

    There is no need to run the actual scraped data through the LLM.

    • @redamarzouk
      @redamarzouk  3 months ago

      I didn't scrape structured data, but rather unstructured markdown. So parsing is necessary in my case to get the table I want.

  • @younube2
    @younube2 3 months ago

    Can you input multiple URLs and have the scraper collate + populate the same file?

    • @redamarzouk
      @redamarzouk  3 months ago

      It can't do that today, but it will be a great addition.

  • @atultanna
    @atultanna 3 months ago

    This is a great job. Hope you could share code for auto-blogging; I've been looking around but not able to find much. Where can I get in touch?

  • @ScottLahteine
    @ScottLahteine 3 months ago +3

    The use case I have for a script like this one is to scrape my own open source project code history to convert several versions of config files that contain lots of good documentation into YAML that can be deployed to a Jekyll website. So all the same principles apply, especially the need to output consistent structured data. I look forward to learning more about the development of this new way of scraping and applying it to my own situation. Cheers!

    • @lawrencemanning
      @lawrencemanning 3 months ago

      The problem is you will now have an indeterminate algorithm taking you from input to output. In other words, the mechanism will be fundamentally untestable and unrepeatable. It’s basically the same as feeding data to a bunch of chimpanzees and expecting them to perform the same processing on it every time. This is fine if you have a human to check the output each time (the interactive use case), but any kind of automatic, unattended runs? Forget it.

  • @grahamahosking
    @grahamahosking 3 months ago

    Is it possible to add this to Home Assistant?

  • @ghostwhowalks2324
    @ghostwhowalks2324 3 months ago

    Can you use Playwright as well?

  • @amortalbeing
    @amortalbeing 3 months ago +1

    This was great, thanks.

    • @redamarzouk
      @redamarzouk  3 months ago

      You're very welcome!

  • @Daltoncast
    @Daltoncast 3 months ago

    Does it take a screenshot and then extract with AI?

  • @jewlouds
    @jewlouds 3 months ago +1

    It actually works pretty well.

  • @neylz
    @neylz 3 months ago

    Can this be used to scrape Amazon data?

  • @younube2
    @younube2 3 months ago

    Does this work on Amazon?

  • @menachem-145
    @menachem-145 3 months ago

    How can I work with this on a Mac?

  • @cineymatic
    @cineymatic 3 months ago +2

    Great video! I have a few questions though 🤔:
    - Would it be easy to extend it to first log in to a site and then start scraping?
    - Would it be able to click buttons and scrape data from subsequent pages?
    - How is it identifying the elements on the page? Should it always be under a category or in the form of a table?

    • @redamarzouk
      @redamarzouk  3 months ago

      For the first two questions the answer is no, unless we're creating it for specific websites; otherwise we'd have to create a universal text-to-action module with it (which is infinitely harder to do).
      For the last question, as long as the element doesn't need a UI/UX action to show, the scraper will pick up on it.

    • @cineymatic
      @cineymatic 3 months ago

      @@redamarzouk Thank you for the response.

  • @kakamoora7874
    @kakamoora7874 3 months ago +1

    It's working… but the problem was some missing data; it gave its own data instead…

    • @redamarzouk
      @redamarzouk  3 months ago

      That actually gives me an idea of adding a text box where you can optionally add some instructions about the specific website you're scraping.

  • @imsjs78
    @imsjs78 3 months ago

    Sorry, but where can I see the actual code? Should I register on some website,
    or is there a link?

    • @redamarzouk
      @redamarzouk  3 months ago

      The project GitHub link is in the description.

    • @mertgokce6385
      @mertgokce6385 3 months ago +1

      @@redamarzouk Is there something wrong with your GitHub? Because it is not accessible.

  • @daithi007
    @daithi007 3 months ago

    Do you have to manually accept cookies?

    • @redamarzouk
      @redamarzouk  3 months ago

      No, I didn't need to for the websites I scraped.

  • @viejitoloco4133
    @viejitoloco4133 3 months ago

    Why do all that random stuff? What's the purpose?

  • @aijokker
    @aijokker 3 months ago

    Any way to use it with a free model?

    • @redamarzouk
      @redamarzouk  3 months ago

      Yes, the only function that needs to be modified is format_data.
      Make sure the open-source model supports structured output.

  • @djagryn
    @djagryn 3 months ago +1

    Super interesting 🎉

  • @VaibhavShewale
    @VaibhavShewale 3 months ago +1

    lol, back in college I made a web scraper as my project and got full marks XD

  • @SiliconSouthShow
    @SiliconSouthShow 3 months ago +1

    (sigh) Now make it work with Ollama and free LLMs. I don't support paying for anything that isn't low-cost or cheap; free is king when it comes to cost, and there are paid services that do this cheaply where you don't have to write anything. But I appreciate the value in explaining, sort of, what does what within the script (the dependencies). This is useful to many folks out there; I know at a certain time it was valuable to me.

  • @bfamily787
    @bfamily787 3 months ago +3

    Great video. Can you show how to implement a local LLM like Ollama instead of OpenAI?

    • @redamarzouk
      @redamarzouk  3 months ago +1

      Thank you.
      This has been requested so many times, I guess I have to make a new video about it.

  • @AbderrahmaneMotrani
    @AbderrahmaneMotrani 3 months ago

    Nice work Reda, I was actually looking for something like this. I tried to access the repo but the link says 404 not found.

    • @redamarzouk
      @redamarzouk  3 months ago

      Yeah, GitHub banned me for some reason. Here is the link to the entire code:
      www.automation-campus.com/downloads/scrapemaster

  • @hendrikvanbrantegem7526
    @hendrikvanbrantegem7526 3 months ago

    Can you do bulk URLs?

    • @redamarzouk
      @redamarzouk  3 months ago +1

      The Streamlit application is mainly for interactive scraping, but the scraper.py file can be used to launch the scraping on a list of URLs.

  • @w3whq
    @w3whq 3 months ago +1

    Great resource.

  • @mockcrackers7636
    @mockcrackers7636 1 month ago

    Can it scrape LinkedIn?

    • @redamarzouk
      @redamarzouk  23 days ago

      I've tried it and it did scrape it.

  • @ld-yt.
    @ld-yt. 3 months ago

    Why take down the repo?

    • @redamarzouk
      @redamarzouk  3 months ago +1

      My GitHub got suspended. Here is a backup link:
      www.automation-campus.com/downloads/scrapemaster

  • @anianait
    @anianait 3 months ago

    Or in Chrome, use the menu "Save web page as .... "

  • @MrTestingchannel1
    @MrTestingchannel1 3 months ago

    Repo deleted or hidden, why?

    • @redamarzouk
      @redamarzouk  3 months ago

      GitHub suspended my account.
      I’ve shared the whole code, link in the description.

  • @cameronyking
    @cameronyking 3 months ago

    Can this be an API?

  • @CarlvanEijk
    @CarlvanEijk 3 months ago

    404 on your Git repo? What's going on?

    • @redamarzouk
      @redamarzouk  3 months ago

      GitHub suspended my whole account (without warning). I've shared the code, follow the link in my description.

  • @abdopower5913
    @abdopower5913 3 months ago +2

    Are you Moroccan or Algerian? 😊

    • @moiguess3256
      @moiguess3256 2 months ago

      Moroccan, easy to find out.