SCRAPE ANY WEBSITE with (Llama3.1, Groq, Gemini) (source code included!)

Reda Marzouk

Views: 15,442

Published on 18 Sep 2024
Hello Everyone,
A lot of you asked about adding a local Llama model to the universal scraper.
In this video we'll see how to use a local free model to scrape any website from the internet.
_______ 👇 Links 👇 _______
🤝 Discord: / discord
💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: / reda-marzouk-rpa
📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: / redamarzouk.rpa
🤖 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: / @redamarzouk
Website: www.automation...
Here is the link with the whole code.
www.automation...
_______ 👇 Content👇 _______

Comments • 107

@bluetheredpanda 11 days ago ⁺⁸
Re: pagination, there would be a few ways to tackle this.
The simplest to implement would be to have the user specify the CSS selector of the next button. Then, the script could retrieve a page, wait a few seconds, launch a click on that selector (i.e. via a JS function), and loop.
Now, if you wanted to make it automatic, I think the simplest way to tackle this would be to write a simple algorithm with a decision tree that is triggered when the DOM of the first page is returned. The algo would go over the DOM and look for specific signs of pagination: a next button, numbers in a tag, links with a valid href attribute (not just a #) that contain a number, etc.
Or if you really want to cover all bases, you could have both: the algo would make an auto-detection attempt, but there would be an input as a fallback in case it fails, or in case the user wants to modify what has been detected.
But I wouldn't necessarily use a model for this, as the cost and duration are going to skyrocket compared to running a simple JS or Python algo as traditional scrapers do.
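The auto-detection idea above can be sketched with nothing but the Python standard library. This is a minimal illustration, not the project's code: the class name, the `find_pagination_links` helper, the list of "next" words and the URL patterns are all assumptions chosen for the example, and real sites will need more signals.

```python
import re
from html.parser import HTMLParser


class PaginationFinder(HTMLParser):
    """Collect <a> tags that look like pagination links (heuristic only)."""

    # Assumed signals: link text such as "Next", or hrefs like ?page=3 or /page/3
    NEXT_WORDS = re.compile(r"^\s*(next|older|more|›|»|>|>>)\s*$", re.IGNORECASE)
    PAGE_HREF = re.compile(r"(?:[?&]page=|/page/)\d+")

    def __init__(self):
        super().__init__()
        self.candidates = []  # (reason, href) pairs, in document order
        self._href = None     # href of the <a> currently open, if usable
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if not href or href.startswith("#"):
            self._href = None  # "#" links are not real navigation targets
            return
        self._href = href
        self._text = []
        if "next" in (attrs.get("rel") or "").split():
            self.candidates.append(("rel=next", href))
        elif self.PAGE_HREF.search(href):
            self.candidates.append(("numbered href", href))

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self.NEXT_WORDS.match("".join(self._text)):
                self.candidates.append(("next-like text", self._href))
            self._href = None


def find_pagination_links(html):
    """Return a list of (reason, href) pairs for likely pagination links."""
    finder = PaginationFinder()
    finder.feed(html)
    return finder.candidates
```

If the list comes back empty, the user-supplied selector described in the comment remains the fallback; if it is not empty, the candidates can be shown to the user for confirmation before any clicking starts.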
@mrsai4740 11 days ago
I personally haven't been able to get the Ollama model working; I only got gpt-4o-mini working. That said, is cost a factor now that we can scrape with Llama?
@bluetheredpanda 11 days ago
@@mrsai4740 Well, I was thinking of cases where, for one reason or another, a local model couldn't be used.
But aside from that, from my perspective a simple algorithm seems more viable, as we can roughly know what to expect from the DOM in that regard. There aren't 10,000 possible implementations of pagination that are valid HTML, and with parsing and regex it should be fairly easy to develop.
@mrsai4740 11 days ago ⁺¹
@@bluetheredpanda I agree, there should be a way to detect pagination in a page or an item detail view of a page. However, I feel like capturing every single way one may paginate is going to be a challenge. Looking at the DOM itself on some sites, I've seen people use tags to put the pagination in, I've also seen people use to put a pagination in, and these two setups had no specific CSS that made it obvious this was a paginator. What can be done, maybe, is to have Selenium truly crawl a site by clicking every element it can, and try to parse and combine the results. It sounds doable but the performance will probably be ass.
@bluetheredpanda 11 days ago ⁺¹
@@mrsai4740 table wouldn't be valid HTML, it's definitely going to be an outlier (that's actually why I mentioned the input field, so users can enter their own targeting).
I wouldn't click every element at random (even though he did mention turning this into a crawler, so you never know), but Beautiful Soup already returns the entire DOM. We can scan it for links and filter those based on the value of the href attribute, which would achieve the same thing while being a billion times faster.
Re: CSS - modern CSS selectors would 100% allow you to target a link even if no specific class is used, e.g. a[href$="/page/2/"], which means "a link which points to a page whose URL ends in /page/2/". CSS selectors only match literal strings, so to cover any number rather than just page 2, combine that with a regex applied to the href values, such as /page/\d+/$. There's definitely something in there.
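CSS attribute selectors such as `[href$="..."]` compare against literal strings and have no regex support, so the "any page number" match has to happen in code on the collected href values. The following is a minimal standard-library sketch of that filtering step; `LinkCollector`, `page_links` and the `/page/<n>/` URL shape are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Assumed URL shape: the link ends in /page/<number>/ (trailing slash optional)
PAGE_URL = re.compile(r"/page/(\d+)/?$")


class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag in the document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def page_links(html):
    """Return {page_number: href} for links whose URL ends in /page/<n>/."""
    collector = LinkCollector()
    collector.feed(html)
    pages = {}
    for href in collector.hrefs:
        match = PAGE_URL.search(href)
        if match:
            # Keep the first href seen for each page number
            pages.setdefault(int(match.group(1)), href)
    return dict(sorted(pages.items()))
```

Sorting by page number gives the crawl order directly, and the highest key is a reasonable guess at the last page that is visible from the current one.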
@idrinkmusic 9 days ago ⁺⁴
He listened and he provided
@EmilioGagliardi 10 days ago ⁺²
Check if the website has a sitemap. The links in there are usually the content-related ones. Plus, for SEO purposes, most business-related websites will use meaningful keywords in the URLs, which you can use with regex to filter/sort/prioritize. The issue you're trying to articulate is how you allow the user to specify which content is scraped and paged. In your example, if you're on a shop site, you obviously want shop-related links; you don't really care about the privacy policy or the returns policy. So in the same way you provide tags for data extraction, you could also provide a limited set of content-type tags the user could select to guide which links are followed. Use the AI to make a best guess about the nature of the site and then provide some helpful tags. Say the AI detects an online store and a blog: do you want to scrape both, the shop only, images only, blog only? If the user selects shop data only, then you can get pretty far in finding the links to follow.
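The sitemap suggestion can be sketched as follows, assuming the site publishes a standard sitemaps.org XML file whose text has already been downloaded. The function names and the keyword list are assumptions for illustration; a sitemap index file (a sitemap of sitemaps) would need one more level of fetching that is not shown here.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def filter_urls(urls, keywords):
    """Keep only URLs containing at least one keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [url for url in urls if pattern.search(url)]
```

For the shop example in the comment, something like `filter_urls(urls, ["shop", "product"])` would drop the privacy and returns pages before any page is sent to a model.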
@andretaylor2646 10 days ago ⁺³
Thank you Thank you Thank you. They can not stop you.
@lowbudgetgamer2439 9 days ago ⁺²
Thanks Buddy. I feel sorry for your account suspension. I hope they will remove suspension soon.
@wasserbesser 12 days ago ⁺⁵
Great work!
I think it should also be an advanced spider, which checks the full site structure and then uses the parts that are most needed.
@redamarzouk 11 days ago
Thank you!
Yeah some prompting to detect the structure of the page will get us the crawler for pagination we want.
@DESX312 11 days ago
I'm literally doing this right now on my own scraper. So far, promising early results.
I'm scraping part of the html structure. Sending to Claude to parse, and then guiding the scraper based on the initial structure it gets. It's also integrating brightdata scraping browser functionality.
@yazanrisheh5127 9 days ago ⁺¹
I can't wait for the pagination part!
@schongut9030 5 days ago ⁺¹
A lot of modern websites don't use pagination but load as you scroll. You have to be able to handle that.
@redamarzouk 4 days ago
That also will be challenging yeah!
Do you have an example of a website in mind?
@ld-yt. 11 days ago ⁺¹
Here is how I would implement automatic processing of Paginated and/or Nested data :
The first time the user runs the script, they should get the option of web scraping with pagination and/or nested data. We can prompt the LLM accordingly depending on what they chose and have it store the nested page URLs per listing in the appropriate object as well as extract the selectors for next/previous page elements (or at least one pagination link from the DOM, but that is probably tricky to implement).
The user could have the option at the start to decide whether to process all pages and if not, decide how many pages he wants starting from the current page that was linked. The same goes for Nested Pages, the user can choose to attempt extraction of additional data found in "detail page" for each listing.
The important part is that for either, it must process the pages one step at a time while saving the progress. If anything goes wrong or some pages are missing, there should be clear UI letting the user know that, so he tries to web scrape those pages again. We could also offer the user an input to give us the pagination selector if all else fails.
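The "one step at a time while saving the progress" requirement can be sketched as a small checkpoint loop. Everything named here is an assumption for illustration: `scrape_page` stands in for whatever function fetches and extracts one page, and the JSON progress file layout is invented for the example.

```python
import json
from pathlib import Path


def scrape_with_checkpoint(urls, scrape_page, progress_file):
    """Scrape each URL once, saving progress to disk after every page.

    `scrape_page` is a caller-supplied function (hypothetical here) that
    takes a URL and returns JSON-serializable data, or raises on failure.
    """
    path = Path(progress_file)
    if path.exists():
        progress = json.loads(path.read_text())
    else:
        progress = {"done": {}, "failed": []}
    progress["failed"] = []  # failures from a previous run get retried

    for url in urls:
        if url in progress["done"]:
            continue  # already scraped in an earlier run
        try:
            progress["done"][url] = scrape_page(url)
        except Exception as exc:
            progress["failed"].append({"url": url, "error": str(exc)})
        # Write after every page so a crash loses at most one page of work
        path.write_text(json.dumps(progress, indent=2))
    return progress
```

The `failed` list is what the UI would surface to the user; running the same call again skips everything in `done` and retries only the missing pages.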
@IdPreferNot1 12 days ago ⁺¹
Thx. Definitely following this project!
@adsgsd8205 10 days ago ⁺¹
What about saving the data in a database and running it for a while? That would be amazing... that would be a really useful tool for so many people.
@unisol111 10 days ago ⁺²
A tutorial on how to dockerize this and launch it on your own Linux server, to access it from any device and from anywhere, would be great. Thanks for the app and code!
@redamarzouk 4 days ago ⁺¹
You're most welcome, and the docker part will be coming stay tuned!
@Cairthebest 3 days ago
An idea for the pagination: scrape the source code of the URL, send it to the LLM to recognize the structure, and apply the matching scraping code selected from a library of different approaches?
@mrsai4740 11 days ago ⁺¹
Idea: What if we don't run headless but instead scrape as you are navigating. At least this way we can try to capture multiple pages and item/detail style websites.
@redamarzouk 11 days ago
True, I've been avoiding using headless for a while, but apparently the new --headless=new option runs as if you're opening the page (I'm not so convinced).
Anyway, if you don't like the headless option, go to assets.py and simply remove it from the list of options; you don't need to touch the code itself.
@MrRossss1 11 days ago ⁺¹
Thanks, great project. Regarding the pagination: you could have the user put a placeholder in the URL to identify the pagination parameter, e.g. ?product=12345&page= for a site that uses page=, and separately specify a start and an end page number. The script would then open the different URLs with each number inserted for that parameter.
@MrRossss1 11 days ago
or just have the user specify the page= parameter and the start and end pages
@MrRossss1 11 days ago
Or have the user just provide the URL for, say, the second page with the page= parameter in it and ask the model to determine it. Not sure how reliable it would be in every case, though, if the pagination param wasn't obvious.
@redamarzouk 10 days ago ⁺¹
do you think this will not burden the user with more steps, shouldn't we give a suggestion first and let them validate or modify it?
@MrRossss1 9 days ago
@@redamarzouk Yes probably better to give a suggestion from the url that they can change.
@rayhon1014 6 days ago
Perhaps you could let users input a URL pattern with [1-X] at the end, so that your code can turn it into page URLs and run once for each.
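The URL-based ideas in this thread (a page parameter with start and end values, a suggested parameter the user can override, and a `[1-X]` range pattern) can be sketched with the standard library. The function names and the candidate parameter names are assumptions chosen for the example, not part of the project.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def guess_page_param(url, candidates=("page", "p", "pg", "pagenum")):
    """Suggest the pagination parameter of a URL, or None if nothing fits."""
    for key, value in parse_qsl(urlsplit(url).query):
        if key.lower() in candidates and value.isdigit():
            return key
    return None


def paged_urls(url, param="page", start=1, end=1):
    """Build one URL per page number by setting `param` in the query string."""
    parts = urlsplit(url)
    # Drop any existing (possibly empty) value of the pagination parameter
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != param
    ]
    return [
        urlunsplit(parts._replace(query=urlencode(query + [(param, str(n))])))
        for n in range(start, end + 1)
    ]


def expand_range(pattern):
    """Expand 'https://site/page/[1-3]' into one URL per number in the range."""
    match = RANGE.search(pattern)
    if not match:
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    return [
        pattern[:match.start()] + str(n) + pattern[match.end():]
        for n in range(start, end + 1)
    ]
```

`guess_page_param` covers the "give a suggestion first and let them validate or modify it" flow: the guess pre-fills the input, and the user only types something when the guess is missing or wrong.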
@guerra_dos_bichos 11 days ago
I love this content, but I'd like to see your takes on the good old regular scraping, like scrapy with proxies
@redamarzouk 10 days ago
The market will take time to embrace these new methods of scraping with AI, even with very cheap or free options like Groq and Gemini. So regular scraping is still the dominant way to scrape in general, and I see the established tools adding this type of AI integration alongside what they already have, but it will take time.
@solporcima 8 days ago ⁺²
Hi Marzouk, first, thanks so much.
Can you show us which files I need to change to use this on Linux?
Are you planning to dockerize this app in the future?
@redamarzouk 4 days ago
You're welcome.
About Docker, I've gotten this request so many times now that I will have to create one. Stay tuned for that.
To make it work on Linux, you'll need to change the path of the ChromeDriver because it's different on Linux, and of course the commands to create the virtual env are different, but everything else should be the same.
@py_coder_fpv 12 days ago ⁺¹²
why did they suspend the github account?
@xXWillyxWonkaXx 12 days ago ⁺⁵
It's working now. Check the link in the description. I was going to replicate the entire thing using cursor ai lol thank god he updated it. Thanks Reda
@guerra_dos_bichos 12 days ago ⁺²
Probably tripped some automatic security measure
@MaxwellHay 12 days ago
Still unavailable
@DmitriZaitsev 11 days ago ⁺²
Also curious, never heard of someone having the github account suspended.
@redamarzouk 11 days ago ⁺¹
Only God knows. I didn't receive any explanation, and my request to reinstate my account can take days, weeks or months (according to their forum).
I'm not alone, this is happening to a lot of people.
The odd part is that similar AI scraper projects are still up.
@remusomega 6 days ago
Have you considered making this more scalable by using an LLM to discover the exact settings to use for beautiful soup for each website?
@redamarzouk 6 days ago
Actually, that is a good idea because it can make the whole process faster.
So the idea is to give the page structure to the LLM first and let it decide where the data will be located, after that I should create the markdowns only from inside that tag reducing the number of tokens and making the call faster.
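The second half of that idea, building the text only from inside the container the model picked, can be sketched with the standard library. This is an illustration under assumptions: the container is identified by an `id` attribute (the model's answer is simply passed in as `target_id`), the HTML closes its tags properly, and `extract_text` is a made-up helper name. With Beautiful Soup, which is discussed elsewhere in this thread, `soup.select_one(...)` would likely be the more robust route.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class SubtreeText(HTMLParser):
    """Collect text that sits inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, target_id):
    """Return the text inside the element whose id is `target_id`."""
    parser = SubtreeText(target_id)
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Everything outside the chosen container (navigation, footer, scripts) never reaches the model, which is where the token saving would come from.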
I thought about this approach before but was too lazy to recreate the app around it.
@abd-elrahmanmahmoud3167 7 days ago
the pagination thing you can do by checking the URL, I usually do this
@redamarzouk 4 days ago
How do you do it in case you have numbers of pages with no next button versus times where there is only a next button?
@thisisfabiop 12 days ago ⁺²
Pagination is so important, but also going one level deeper to scrape data from each item would be a great add!
In addition to pagination with numbers, a complementary approach (perhaps an additional toggle?) involves pages with a Next button. The forward button is an anchor element. The script navigates to the next page and grabs the link, repeating until it reaches the last page, where the forward button's href value is #.
Perhaps a similar approach can be used to scrape data from each item's page: list all item links present on every page, and then scrape them?
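The Next-button loop described above can be sketched as follows. The `fetch_next_href` callable is a hypothetical stand-in for "load this page and return the href of its forward button"; how it does that (Selenium, Beautiful Soup, or something else) is left open, and `max_pages` is an assumed safety limit.

```python
from urllib.parse import urljoin


def crawl_next_links(start_url, fetch_next_href, max_pages=50):
    """Follow 'next' links until the href is missing, is '#', or repeats.

    `fetch_next_href(url)` is supplied by the caller and returns the href
    of the forward button on that page (or None if there is none).
    """
    urls, seen = [], set()
    url = start_url
    while url and url not in seen and len(urls) < max_pages:
        seen.add(url)
        urls.append(url)
        href = fetch_next_href(url)
        if not href or href.startswith("#"):
            break  # last page: the forward button no longer points anywhere
        url = urljoin(url, href)  # resolve relative links against the current page
    return urls
```

The `seen` set guards against sites whose last page links back to itself, and the returned list can be handed to the scraper as a plain queue of page URLs.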
@mrsai4740 11 days ago
Yes, for websites that have a list/detail view, it would be nice if we could make Selenium go into each one and scrape data. I think the problem with this is the amount of tokens it will end up producing.
@redamarzouk 10 days ago
so would you suggest having another way of launching the application only to crawl the URLs of the pages and put them somewhere to then launch the scraping on those urls ?
@iltodes7319 11 days ago
Really hard to handle all paginations automatically for any website while at the same time caring about token cost when scraping a mass of pages. But it's a great challenge 😅
@MrMiguelChaves 11 days ago ⁺¹
We should create a scraper to scrape your site to get the code of the scraper.
@redamarzouk 10 days ago ⁺¹
Good idea 😄😄
@dirkpostma77 10 days ago
If a human can paginate, AI should be able to paginate. True Universal would be a mechanism with image recognition. Feed AI a screenshot of website and “which element does pagination?”
This doesn’t yet cover infinite scroll. You could try to scroll the page and detect if more data is loaded. If so, there you have your pagination.
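The scroll-and-detect idea can be sketched as a loop that stops once scrolling no longer produces new items. The two callables are assumptions that keep the sketch independent of any browser library: with Selenium, `scroll_down` could wrap `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")` and `count_items` could return `len(driver.find_elements(By.CSS_SELECTOR, ...))` for whatever selector matches one result.

```python
import time


def scroll_until_stable(count_items, scroll_down, max_rounds=20, pause=2.0, patience=2):
    """Scroll until the item count stops growing; return the final count.

    count_items: callable returning how many items are currently loaded.
    scroll_down: callable that scrolls the page to the bottom.
    patience:    how many scrolls in a row may load nothing before stopping.
    """
    last = count_items()
    idle = 0
    for _ in range(max_rounds):
        scroll_down()
        time.sleep(pause)  # give the page time to load more content
        current = count_items()
        if current > last:
            last, idle = current, 0  # more data arrived: this is the "pagination"
        else:
            idle += 1
            if idle >= patience:
                break
    return last
```

If the count never grows after the first scroll, the page is probably not using infinite scroll and one of the link-based strategies applies instead.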
@Jaridb 3 days ago
what if I wanted to use this scraper to reference a csv file to do skip tracing?
@jbssfl 9 days ago
Can this do dynamic JavaScript related scraping for something that’s not tied to an actual page/route?
@punktkommastrich007 11 days ago
Is it able to scrape prices from variable products? I need this for promotional gifts (printed with your logo), so for example 100 pcs printed with 1 color, 2 colors, 3 colors, and the same for other quantities?
@onlyms4693 11 days ago
Is it possible to use a vision model to scrape a website that blocks or flags scrapers, by setting up a virtual environment where the LLM can control a browser and open the website, scroll, check things and even press buttons, so it can fully scrape the site? So, a normal LLM with an agent plus vision.
That's just what's on my mind; I don't know if it's too complex or doesn't make sense in terms of efficiency, especially with the compute power it would need.
@kamleshbhandari8680 11 days ago ⁺¹
This only fetches the first page. What can we do to scrape all pages?
@familiea.2515 10 days ago
+1
@VicKipyego 8 days ago
NoSuchDriverException: Message: Unable to obtain driver for chrome; For documentation on this error, please visit:
how can I go about this error?
@LibertyRecordsFree 10 days ago
I was dreaming of this... Have you included a way to anonymize headers or set up a VPN to avoid being banned?
@redamarzouk 10 days ago
I have a random user agent chosen every time the app launches a new process. It's in the assets file, and you can easily add more there without touching the code in the other files!
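A minimal sketch of that pattern, assuming a list of user agent strings kept in one place and turned into Chrome command-line arguments. The list contents and both function names are invented for illustration and are not the contents of the project's assets file; `--user-agent=` and `--headless=new` are existing Chrome switches.

```python
import random

# Example strings only; the real list lives in the project's assets file
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
]


def pick_user_agent():
    """Choose one user agent at random for this browser session."""
    return random.choice(USER_AGENTS)


def chrome_arguments(headless=True):
    """Build the Chrome arguments for one scraping run."""
    args = [f"--user-agent={pick_user_agent()}"]
    if headless:
        args.append("--headless=new")
    return args
```

Each returned string would be passed to the browser options one by one (with Selenium, through `options.add_argument(...)`), so extending the list is a data change rather than a code change.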
@JuankM1050 10 days ago
Can you add an option to send the sorted file back to the LLM, to add more headers to the table or modify the table in general, without having to run the scrape again?
@redamarzouk 4 days ago
I always save the markdown from every scrape inside a folder called output in the project.
You only need to tweak the code a bit to run on the same markdowns again with the fields that you need.
But yeah, adding this feature would be nice. Can you tell me when you have needed this?
@s6yx 10 days ago
would rather use playwright instead of selenium
@nuno2032 11 days ago
Very nice job, but you are not facing the real problem. Does it work with JS-heavy sites? For a basic site with no JS you can make a web scraping tool in minutes with Claude. Plus, pagination in many specific cases is really complicated. In my opinion you should use a playground that gives the user the ability to create a pattern specific to that website (clicking around to record what to do to get the info) and then proceed with the auto-scraping part.
@nguyenduyta7136 10 days ago
I think we can just code and scrape, but we need a Chrome extension. So not useful.
@na1du 11 days ago
Hello, is there a docker container for unraid?
@EmeraldTablets-info 11 days ago
Can it collect emails? On Dun & Bradstreet?
@satyaviswapavanranga5915 7 days ago
what if it has a login page?
@shankar9063 4 days ago
How fast is the local Llama 3.1?
@redamarzouk 4 days ago
It will really depend on each machine, better GPU+Smaller Models = faster inference and vice versa.
In my case I had a generation of 15 tokens per second which feels a bit on the slower side but it's not very bad.
@mrinalraj4801 11 days ago
Why did your github account get suspended? Which policy did you violate?
@redamarzouk 11 days ago
I have no idea, I haven't received any emails from them (checked my spam folder and everywhere).
I thought it was about the "useragent" that I used, but I searched for that exact line of code and I found it in 20k+ repositories, so truly I don't know.
@natassawakBros 3 days ago
Thanks Reda! That's really interesting. I tried to scrape data from websites like Amazon by scraping only a single product page with Llama 3.1. However, I hit the token limit even though I have a powerful MacBook M3 with 38 GB. The same page works well with Gemini 1.5 and GPT. Do you have an explanation, please?
@redamarzouk 3 days ago
You're most welcome.
Yeah, with Llama 3.1 8B in LM Studio you'll have to increase your context length to more than 100K tokens (you can do this in the advanced configuration).
The smaller models, as good as they are for their size, really can't keep track of the whole completion when extracting from pages whose markdown exceeds 60K tokens, and Amazon is one of them.
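A quick pre-check can tell whether a page's markdown is likely to fit before it is sent to a local model. The sketch below relies on a loudly stated assumption: roughly four characters per token, which is only a rule of thumb for English text and varies by tokenizer and content. The function names and the output reserve are also invented for the example; an exact count would need the model's own tokenizer.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; the 4 chars/token ratio is an approximation."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(markdown, context_length, reserve_for_output=4096):
    """Check whether the prompt plus room for the answer fits the context window."""
    return estimate_tokens(markdown) + reserve_for_output <= context_length
```

When the check fails, the options mentioned in this thread apply: raise the context length in the local runtime, trim the page before conversion, or switch to a hosted model with a larger window.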
@kritikusi-666 11 days ago
dafuq did github suspend your account for? Did you mark it as "Educational" or
@miguelgargallo 7 days ago
Will I get banned if I upload it privately on my GitHub?
@redamarzouk 7 days ago
I can't guarantee anything, so it's really up to you, I got banned, hopefully no one else will.
@MrN00N3_ 12 days ago
Does it work on Google Maps for scraping leads?
@mrsai4740 11 days ago ⁺¹
I tried this with Google Maps (searching for golf courses) and it seemed to work, but it only works on a small set of data since Google only loads about 6 results unless you scroll. I got around this by copying the element with all the items I was interested in after I scrolled to the end of the list, then I shoved that into an HTML file and used the local file as the URL. The results seemed promising at first, but it looks like the model gives up and doesn't capture everything.
@redamarzouk 11 days ago ⁺¹
You're right.
When given really long HTML files, smaller models (and even worse, small local models at 8B or 7B) seem to give up on getting everything.
So going with gpt-4o or Gemini 1.5 Pro (they give 50 free requests per day) will be your best option in this case of 100K+ tokens.
@edma6613 11 days ago
Can you scrape PDF files from a website?
@redamarzouk 11 days ago
No the scraper captures data that exists in the html of the website.
@hatemjaber 12 days ago
Are you using firecrawl?
@redamarzouk 11 days ago
Not anymore, I've used it in the first video of this series.
@hatemjaber 11 days ago
@@redamarzouk are you using puppeteer, playwright, or selenium? I'm curious about bot detection...
@eduardmart1237 10 days ago
Is it free?
@redamarzouk 10 days ago ⁺¹
when you use Llama 3.1 8b and Groq and Gemini (up to 1500 requests per day), yes it's free.
@lalamax3d 11 days ago ⁺¹
Hey, just wanted to say thanks. Please, in the next version:
1- I want my Llama 3.1 8B to work via Ollama (installation); is there anything special I need to do?
2- Pagination (this setting is critical).
3- For the fields we define (i.e. name / price), how does the LLM map a name to a class type like title? (Slightly confusing.)
@redamarzouk 11 days ago
Noted for the first 2 points.
For the third, it creates everything as a string, but the llm is smart enough to understand how to format other types like numbers for example. So no loss in format.
@VicKipyego 8 days ago
err
@acharafranklyn5167 11 days ago
Not that I am a bad guy, but this project has been implemented by someone else... all you could have done is give credit.
@redamarzouk 11 days ago ⁺⁸
What?
This code is 100% mine and I first created this project 4 months ago. Go back to my channel and watch a whole 20-minute video about it.
You could say that the idea has been around for some time, and I've seen big repos trying to do website-to-structured-data (Firecrawl, Jina AI, ScrapeGraphAI, etc.), and from there I tried to create a simple, fully open-source version because people in the comments asked me to.
So yeah, credit to those libraries, but saying I've taken this from someone else is delusional.
@centurion7722 9 days ago
I think that without significant user input, it's unlikely to work. First, the script needs to capture the JS elements' code from the site. Then, the user must provide what they are looking for, such as specific pages or other elements. Finally, the script + LLM should automatically extract the relevant CSS selector or XPath.
