This Open Source Scraper CHANGES the Game!!!
ฝัง
- เผยแพร่เมื่อ 24 ธ.ค. 2024
- Hello Everyone,
Here is the link with the whole code in my website :
www.automation...
My GITHUB account has been SUSPENDED (I have no idea why) and I didn't receive any warning or anything from Github justifying the suspension. I'm so confused because similar project of AI Scrapers are on Github and none of them got suspended.
Also check out the 2.0 version here:
• Yeah but can it RUN LO...
www.automation...
_______ 👇 Links 👇 _______
🤝 Discord: / discord
💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: / reda-marzouk-rpa
📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: / redamarzouk.rpa
🤖 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: / @redamarzouk
Website: www.automation...
_______ 👇 Content👇 _______
Hey Everyone,
LInk to code: www.automation-campus.com/downloads/scrapemaster
My GITHUB account has been SUSPENDED (I have no idea why) and I didn't receive any warning or anything from Github justifying the suspension. I'm so confused because similar project of AI Scrapers are on github and none of them got suspended.
I opened a ticket and I'm waiting for their answer.
in the meantime I shared the code on my website with all the steps to reproduce the ai scraper.
Let me know if I can help with this. I can setup a Gittea on AWS or something.
Yeah I was just looking and about to comment
I was about to give it a shot.. Waiting for the update. Great work btw
I had the opportunity to view it, but did not clone it
@@rperellor here is the code www.automation-campus.com/downloads/scrapemaster
So I worked at a company once where the data guy built his own web scrapper to scrape data off of our competitors website for pricing etc. One thing that they did to protect their website from scrapping was user-agent filtering, in order for him to overcome this limitation was to have a very long list of different user-agents and rotate them while scrapping the website. I think that will be a good addition to add into your app. A small but useful change.
Yes if we launch the scraper with the same user agent for the same websites so many times they will pick up on it and block us.
the modification will have a list of OS credentials with their versions and different browsers and their versions.
@@redamarzouk Would it be possible to have a video about proxy rotation implementation? There is not much of it on YT but i think its crucially important.
@@markomarjanovic8348 Added to the backlog
this is a good suggestion, would like this to be added as well.
Thirded
Amazing work! It works great, but it doesn't handle cases where the database is divided into pages instead of using infinite scroll. It would be fantastic if it could also navigate through the pages until there are no more left.
Another great feature-although it might make the tool more expensive, so it could be offered as an optional, selectable feature in the UI-would be for the scraper to open each item's page and scrape data from there. As you know, the initial page often only displays limited information about the product.
Thanks!
You earned a new subscriber. Algerian brother here.
Pretty cool.
Let me point out, though, that the main complexity with scraping is that often times the relevant content is hidden: that is, getting to it may require clicking various UX elements.
So to _really_ crack Scraping with AI, we'll need to go agentic: the solution will need to figure out what to click in order to reveal information of interest.
Exactly. Anyone interested in helping me make something like this? Or is there something available already?
@@SpragginsDesignsyou could just hire someone to build it for you easily
Definitely going to use this, I think this is awesome. As a suggestion for future options it would be great to have pagination support and levels deep. Has a lot of my scraping his location-based, for instance States-cities-locations. And the data I usually want is within the locations which may only be a few.
Thank you.
Yes Pagination will make this complete.
But I’m thinking how can I make it universal, cause it has to work on every website, so would I just add another llm call to detect any url pagination pattern or do you have a better idea on how to do it ?
@@redamarzouk that might actually work using a lower model would be capable of determining if the page has pagination. Or have a checkbox for user to manually say it has pagination so the LLM will be looking for it. That way it's not always looking for it. And when it finds it return what kind of class it is. IDK
@@redamarzouksimilar scenarios may be an interim pathway: if the initial url prompts for a selection of (text input) that determines next page, can you add the ability to make that selection, ideally from a list of items of prior interest? The recursive ability to select specific buttons to push according to options on following pages would then solve a large number of use cases (ie an ability to map different actions according to a preselected known option types)? The base use case is to download files from a selection post which varies by initial (or ideally subsequent) text inputs, terminated by pressing a button to download a file or selected files). The approach can then be expanded to add more scenarios, until it is universal!
Thinking about and just thought maybe have an area to manually put in div container that the user can grab from the inspect tool.
Or..
Since we're using a LLM you could always prompt for it and return the value of the container. Such as look to see if this page has pagination at the bottom or top if so return a value perhaps and use that value to fill in
Most of the "traditional" Enteprise grade scraping tech companies are adopting LLMs into their stack as an option for when it makes sense. When you're scraping millions/billions of pages every 100th of a cent matters, so taking a composite AI approach, using ML models to get the majority of the standard data points for a general schema cheaply, and then allowing LLMs to the thing they do best at extracting data from unstructured text to extend that schema, that way you get eh cost efficiency with the flexibility of LLMs when needed.
The real benefit of the LLM approach for bigger teams/projects is actually that is abstracts away from hard coding selectors into your spiders, so they are far more robust and unlikely to break in 3 months when the website changes its HTML, reducing your maintenance burden/debt. Thats my 10 cents anyway.
I personally love what your project does for the everyday person though, getting small/medium crawls done where price per request isn't so important, and where you will have time/space for more rigorous custom QA. I especially love it for content generation purposes, data journalism, chart porn and the like. Great work!
Yeah I thought I was creating a scraper at scale, but once started using it extensively I see it more as a productivity tool to help get the data quickly without the need for copy paste.
Traditional scrapers will still have a place in the market simply because once you want to scrape hundreds of thousands or millions of pages, the cost of paying coders for custom scripts and maintenance will make sense compared to the value of the data scraped.
The dependency on OpenAI and the API key is a bummer.
It would be better if we insert our own open-source AI engine and models.
If I'm not wrong, tools like Ollama use some of OpenAI's API surface to expose local LLMs. The docs read 'for chat/completions'.
if this scraper is using OpenAI's function calling interface, you might be out of luck.
You can do that with your local llm.
So you can scrape 666.66 pages for $1 based on that usage.
Hmm It seems like i ran into a limitation. I tried scrapping some golf course (lattitudes and longitudes) from google maps, but It only seems to ever give me 30 rows of data. At first i thought this might be an issue with max tokens, but i increased the max to the highest value possible: "16384" tokens, but this still only gave me around 30 rows with the same data
What model have you been using because gpt4omini can go up to 128000 tokens, and in my last video I've added gemini which can go up to more than 1M+.
I've noticed this behavior as well, when a single page has sooooo much data, not just the table with the necessary data but other data, we run into a hard limit on how many rows we can scrape (Especially with apps like @irbnb and zill0w where there is a map that have so much data we won't be scraping), I guess you found the same limitation.
so i have been experimenting with this code and I got it to work with pagination by specifying a new field for a next button and a new field for number of pages. This seems to work well, but it also got me thinking: If we have too many tokens, we can probably try to chop the data up and then run the peices through the llm. The only thing i can see, is that if we start batching the data, we could end up missing critical peices of imformation (if we substring ot the worng spot, we may end up missing rows). I will try out gemini, i have never used it
@@mrsai4740 on some websites we can get either the next page or the new the url of the pages just by specifying it in the fields using this current version of the scrapper.
But the problem is that most websites don't include all the url of the pages in the first page, usually it's under the form
(1 2 3 4 ....45 46 47 48) For example.
In this case we have to ask the LLM to conclude the url of the other pages using the pattern from the urls that it found.
Other websites where we only have the next button can only be scraped one url at a time, so the universal approach will need some time and work to be figured out.
@@redamarzouk hmmm maybe we are tackling this in the wrong way, cause it seems like for this to be a universal solution, some legwork by the user needs to be done. In cases like that scrapeme site, yeah it is allot easier to provide an array of urls or a template that describes all the urls, but this doesn't tackle the problems of single page applications. Some sites have a paginator that modifies the current page with updated information. I guess it's back to the question: "how can we programmatically detect the way a site is paginating data?"
This, and the V2 with Llama, are very interesting concepts, and I believe could be tremendously valuable.
The shortcome is that it is very limited to just the single page at the URL location.
To be truly valuable, it needs to also be a scraper (as you mention).
Think of the use case to scrape ecommerce sites for product details. any "real' ecommerce site is going to have many many categories and pages of categorized product listings.
While you can set up traditional scrapers and manually configure the navigation, this should be where AI should really shine. It should be able to figure out the navigation and automatically navigate/scrape the site.
Can you make it to use ollama on the back instead of OpenAI?
good question
Check this new video: th-cam.com/video/xrt2GViRzQo/w-d-xo.html
One of my idea is to create or use a AI scraper to get the first scrape test. If it work you do output somethine like a json that will get the id or class of the scraper element, tant you give this json to your conventional no AI scraper to scrape the website for free and faster without the need of AI afterware.
This is just writing code. Just copy paste the html into chatgpt and say write the code to parse into JSON.. works really well.
Thanks for the simple tutorial and code.
Can you add an example of using this scraper with local Ollama and Llama 3.1 instead of OpenAI to make it totally free?
You’re welcome.
I can add it but I won’t be able to test it.
My small gpu can’t really handle it especially when I’m filming.
@@redamarzouk YES, PLEASE 🙏!!!
@@redamarzouk I hope this get added. I prefer to run Ollama locally. I'm only using a GTX 1070, it works fine.
@@redamarzouk this would be a game-changing update. You earned a sub for this video regardless.
@@GundamExia88 could you please tell me what models you run? I have the GTX 1080 Ti 11GB, thanks to a friend, and I want to play with that but I don't even know it's possible 😂😅
What about adding Pagination Support? Where the A.I. will go through pagrs and pages to scrape long paginated data?
How deep / how many 'pages in' will it go?
so good idea and explanation, thank you
Can this app click on a car's page to scrap the details and go back to click on another car's page to scrap the details again?
you are genius! I am on a mac, so I just had to change the driver call, but everything else is working well. pagination or series of urls would be cool. i love how you have it load in the chrome browser. this really changes how i think about cross platform apps. i wonder if we can scrape instagram now. or what about downloading images? maybe a simple copy table button, since I just copy and paste into google docs.
New to macos can you please share your driver path? Not 100% which is the executable. Ty!
Yes please can you help me set it up on my mac as well?
I need assistance in setting up on a mac
You got yourself a new sub!
Great video as always , only downside is that it is adressing people who work with code and experienced in data scraping , but for no code or very little code like me , i think the best way is to use computer vision models , Vllm , chatgpt already have it in their api , but also we have 2 new open source models that just got ou this week , Qwen 2 VL , and microsoft phi 3.5 vision.
LAION have a model in open source, it is a very powerful scraper, you will most likely need to fine tune any vision models.
How to Add local llm llama for this projekt?
I did, watch this video th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=XWUzIu8uBehK4AV5
Awesome work! Any idea why scraped output list gets truncated even if input+output tokens < max?
in some cases I noticed that gpt4o mini can't extract all the data from the website.
I tried with gpt4o and it was successful.
So if you're sure your data is in the markdowns and gpt4o mini didn't pick it up, try with gpt4o.
Does a Disallow statement in the robots.txt like Disallow: User-agent: GPTBot stop it from working?
I don't think llms are ready for this scraping yet, better to get an llm to make a flask python app and make it manually scrape based on class names so you pull correct data with no hallucination, can also pull images and zip the images with zipfile
LLMs are not made the same, while I was scraping websites with 60K+ tokens I noticed that gpt4o mini gets me only a subset of the data while gpt4o latest manages to get me all the data.
If someone is willing to pay 0.5 to 1$ per extraction, they can use gpt4o with a guaranteed correct and complete output.
But 1$ an extraction is still very high if we want to scale it, in that sense it’s not ready.
But for most cases mini works great with 0.005$ per extraction and it’s absolutely ready for anything.
does it bypass re-captcha?
I want to use groq api key bcoz it's free to use or local llm like ollama..... Please modify this code if possible......Great video.....
same question, I was wondering can we do it using groq or cohere?
Wrap the groq api key by os.getenv() instead of passing in the string
@@snehasissnehasis-co1sn both has been added.
Will present them in the next video.
So its traditional scraping (selenium and beautiful soup) and AI is only used to organize the scraped data in a given format. The AI does not do the scraping. Is it correct or am I missing something?
Yes the AI does the parsing. but creating unstructured markdowns can't really be called traditional scraping, no one will scrape the whole unstructured data from the html in a traditional setup.
No pagination?
Check the new video, the scraper works with Llama3.1 and Qroq model Llama 70B for free: th-cam.com/video/xrt2GViRzQo/w-d-xo.html
Hello Reda, you should use Polars instead of Pandas, in a lot of cases is much faster than Pandas
Also add_argument("--disable-search-engine-choice-screen") is useful + ("--headless") maybe?
Oh I was looking for that argument "-disable-search-engine-choice-screen" that pop up is annoying ( even if it doesn't affect the scraping). I will be adding that, thank you!!
Great Project.
Is it possible to use OpenSource and Self Hosted model like Llama ?
Thank you.
Yes it's possible, but I didn't even try this time because gpt4o and Gemini flash are so cheap and have a huge context window and I just went with them.
But it's perfectly possible, you just need to modify the "format_data" function.
@@redamarzouk Thank you so much, I had the same question, Thanks for answering.
Can you also crawl a site such as Zillow and scrape multiple URLs?
websites like zillow tend to have sooo much data inside of them 100K+ tokens, but the answer is still yes.
I would love it even more if you created a docker container that was just downloadable and thereby installable directly on a Linux site. A user agent swap feature (like a list of user agents that could be chosen like round robin algorithm, or randomized) would be great too and handling a list of proxies that would also be swapped.
I haven't created a docket container, but I made a random user agent pick from a list. you can find the code to that in this video th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=smByssvvNhudzgRS
What type of websites you will use this app to scrape from?
0:36 - dude got possessed by ChatGPT and his eyes went bananas.
What about websites with cloudfare security etc.?
Very important question.
Would be nice to have a option to use ollama so we can run it locally without using openais proprietary ai
I used replit Ai agent to build + deploy a Kickass website scraper in like 10 mins!
a really cool feature would to add a text-splitter where it splits the text semantically into small chunks so we can readily use this to feed a RAG. Right now we typically splice things arbitrarily, but semantic splitting is the best.
can you give me an example of an output to split?
There are many such splitters available in llamaindex or langchain already. Another “automated” way might be to ask GPT 4o mini to split for you
Thanks dor the video! What mic are you using?
@redamarzouk Nice and easy scraper. I saw that you also have Scrapemaster 2.0 and installed that. The Env file mentions a Google API key. Which one should be added? Have a link where to get this particular Google API key?
Thank you, to use the google API Key go to aistudio.google.com/app/apikey
and from there create a new api key and add it to the .env.
You can find all the details of the scarpeMaster 2.0 from here
th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=KH5bfxyYJ9NV90FU
What if the data should be dynamic or there will be some click like search button, or their is select to choose from, and after that, scrap the data? What should we do in that case ?
yeah I have the same question
Any way to scrape Twitter specific keyword
if I show the screenshot of the pokemons to gpt it will directly scrape all the data. so basically my first feeling is the AI is enough smart to suggest the fields in a dropdown menu. so I can choose them and tell what I really want. And decide a final label for each one of them.
...just an example to start!
but as I said chatgpt can do the same just with a prompt. I don't actually need your app unless the page is full of data. in that case there may be limitations.
so you should ask your self what a prompt can't do
anyway my real problem is to have a scraper able to scrape data that are distributed around various pages. or for those cases where you must "load more" elements clicking a button.
and I want to be able to specify the download format. gpt can reformat anything to anything.
nice work but there are tons of improvements to be made. I will follow you to see where you get to.
Thanks for the great video. Idea for nest videos: Could you extend the code with crawling, for example, getting results from search engines or following a specific path to get more structured data?
You're welcome, can you elaborate more on how it should look like ?
Because this will be awesome and I actually gave it some thought, but it's hard to get the exact link of multiple pages from which you want to extract data if you don't have the link for the first page.
you think we can trust a search engine to give us the exact links we want to scrape data from?
whay about using library curl_cffi with requets to simulate a browser instead of selenium or playwright instead of selenium ? i think it would be faster.
Can it be multimodal? Viewing data in an image, also creating data tables into an image. Eg. Create a wallpaper of the most important LINUX keyboard shortcuts. etc
Problem isn’t to scrape the data, it’s if you have a public facing website most likely you’ll get sued. Nowadays data is a copyrighted material
How about something like scraping facebook marketplace or offerup?
then i tried to make it work with the google gemini api, and sadly i could not. it always returns the empty table.
I've just added gemini to an updated script I'm working on, I also added Llama 3.1.
stay tuned for the next video.
What about captcha solving, such as cloudflare, recaptcha, hcaptcha..
I would not trust this to not hallucinate. I think of a famous example where it misinterpreted the column and concatenated some numbers together instead of treating them as separate columns, leading to incorrect values.
most data in tables results in line breaks between values in markdowns.
can you share the use case where it has hallucinated for you, it will be very interesting use case?
can you make it for it to spider a website and if it finds a page that has all the required tags it puts the information in json, database, etc?
This is amazing, I have been trying to reproduce the code but I keep getting errors. Any chance you can do a dummy video . Step by step as chat gpt does ? Please 🙏🏾
I did watch this video th-cam.com/video/xrt2GViRzQo/w-d-xo.htmlsi=XWUzIu8uBehK4AV5
@@redamarzouk appreciate you so much 🙌🏾💯
Good job bro continue ❤
My pleasure!
how does it work with pagination?
I really appreciate the great work your are doing.
Quick one, what happens to sites that require credentials? How do you handle that case?
Thanks
That will need an intervention for your side, keep the website open and run the process again so it has access directly to the data.
does this scrape dynamic data?
Could it download or summarize the files (pdf…) from a website?
Can this scrape from youtube ? For seo ? Thx for your amazing work
What video recording software you use?
OBS Studio
Curious Why not puppeteer?
why do I need to use a llm for scraping the data?
Yeah for 1 of 2 websites it's doesn't make sense, but to scrape any website with 1 single app is pretty useful.
Will you still prefer the traditional option even if you have to create a script every time ?
Please can you host this tool online so that us non dev folk can easily access it. Also, would be great to have the ability for the model to be able to summarise and pull keywords out of long product descriptions etc.
Can you scrape PDF file from a website with this?
Whats better about this than google advanced search?
I don't see how they're similar.
I'm not searching for anything, i'm giving an exact url from which I want to extract structured data using an LLM.
Can this scrape eBay api?
Good job Reda, what'sup with your we automation-campus website ? is it down ? too much success ?
Thank you. but the website is up for me I've just checked on multiple devices and on isitdownorjustme, all working.
@@redamarzouk Zscaller classified your site as suspicious....
there is no need to parse the actual scraped data through the LLM
I didn't scrape the structured data, but rather unstructured markdowns. So parsing is necessary in my case to get the table I want.
Can you input multiple URLs and have the scraper collate + populate the same file?
It can't do that today, but it will be a great addition.
This a great job Hope you could share a code for auto blogging Looking around but not able to find much Where to get in touch
The use case I have for a script like this one is to scrape my own open source project code history to convert several versions of config files that contain lots of good documentation into YAML that can be deployed to a Jekyll website. So all the same principles apply, especially the need to output consistent structured data. I look forward to learning more about the development of this new way of scraping and applying it to my own situation. Cheers!
The problem is you now will have an indeterminate algorithm taking you from input to output. In other words the mechanism will be fundamentally untestable and unrepeatable. It’s basically the same as feeding data to a bunch of chimpanzees and expecting them to perform the same processing on it. In other words this is fine if you have a human to check the output each time (the interactive use case) but any kind of automatic, unattended runs? Forget it.
Is it possible to add this to Home Assistant?
can you use playwright as well ?
This was great thanks.
You're very welcome!
Takes a screenshot then extracts with AI?
it actually works pretty good.
can this be used to scrape amazon data?
Does this work on Amazon?
how can i work with this on mac?
Great video! I have a few questions though 🤔:
- Would it be easy to extend it to first log in to a site and then start scraping?
- Would it be able to click buttons and scrape data from subsequent pages?
- How is it identifying the elements on the page? Should it always be under a category or in the form of a table?
for the first 2 questions the answer is no, unless we're creating it for specific websites, otherwise we have to create a universal text-2-action module with it (which is infinitely harder to do )
For the last question, as far as the element doesn't need a ui/ux action to show, the scraper will pick up on it.
@@redamarzouk Thank you for the response.
It’s working…. But problem was some missing data… it’s given the own data…
That actually gives me an idea of adding a text box where you can optionally add some instructions about the specific website you're scraping.
sorry but where can I see the actual code? should I register any website?
or is there any link?
The project GitHub link is in the description.
@@redamarzouk Is there sth wrong with your github ? Because it is not accessible.
Do you have to manually accept cookies?
No I didn't need to do so for the websites I scraped
why do all that random stuff? what's the purpose?
Any way to use it with free model?
Yes the only function that needs to be modified is format data.
Make sure the open source model supports structured output.
Super intéressant 🎉
Merci.
lol, in college time i made a web scraper as my project and got full marks XD
(sigh) now, make it work with ollama with free llm's, so...I don't support cost f anything not low or cheap, free is king, when it comes to cost, these are things you can do paying services for cheapo and low cost.. And don't have to write anything. But.....I appreciate the value in explain, sorta what does what within the script (the dependencies). This is useful to many folks out there, I know when I was in a certain times it was valuable to me.
Great video, can you show how to implement local LLM like Ollama instead of openAI?
Thank you ,
This has been demanded so many times I guess I have to make a new video about it.
Nice work Reda, I was actually for something like this. I tried to access the repo but the link says 404 not found.
yeah github banned me for some reason, here is the link to the entire code:
www.automation-campus.com/downloads/scrapemaster
can u do bulk url?
The streamlit application is mainly for interactive scraping. but the scraper.py file can be used to launch the scraping on a list of URLs.
Great resource.
Can it scrap linkedin ?
I've tried it and it did scrape it.
Why take down the repo ?
My GitHub got suspended, here is a back up link:
www.automation-campus.com/downloads/scrapemaster
Or in Chrome, use the menu "Save web page as .... "
Repo deleted or hidden, why?
GitHub suspended my account.
I’ve shared the whole code, link in the description.
Can this be an API?
404 on your git? what's going on?
GitHub suspended my whole account (without warning). I've shared the code, follow the link in my description.
Are u Moroccan or Algerian ?😊
Moroccan, easy to find out.