Yeah but can it RUN LOCALLY?

  • Published Dec 2, 2024

Comments • 128

  • @bluetheredpanda · 2 months ago +8

    Re: pagination, there would be a few ways to tackle this.
    The simplest to implement would be to have the user specify the CSS selector of the next button. Then the script could retrieve a page, wait a few seconds, trigger a click on that selector (e.g. via a JS function), and loop.
    Now, if you wanted to make it automatic, I think the simplest way to tackle this would be to write a simple algorithm with a decision tree, triggered when the DOM of the first page is returned. The algo would go over the DOM and look for specific signs of pagination: a next button, numbers in a tag, links with a valid href attribute (not just a #) that contain a number, etc.
    Or if you really want to cover all bases, you could have both: the algo would perform an auto-detection attempt, but there would be an input as a fallback in case it fails, or in case the user wants to modify what has been detected.
    But I wouldn't necessarily use a model for this, as the cost and duration are going to skyrocket compared to running a simple JS or Python algo the way traditional scrapers do.
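
    A minimal sketch of that decision-tree idea, using only the standard library (the project itself uses BeautifulSoup; the helper name and the exact heuristics here are illustrative assumptions, not code from the repo):

    ```python
    import re

    # Each heuristic mirrors one branch of the decision tree described above:
    # a rel=next link, a pagination-ish class name, a numbered page link,
    # or a visible "next" button.
    PAGINATION_HINTS = [
        (r'rel=["\']next["\']', "rel=next link"),
        (r'class=["\'][^"\']*pag(?:ination|er)[^"\']*["\']', "pagination class"),
        (r'<a[^>]+href=["\'][^"\'#]*(?:[?&]page=|/page/)\d+', "numbered page link"),
        (r'>\s*(?:next|›|»)\s*<', "next button text"),
    ]

    def detect_pagination(html: str) -> list:
        """Return the list of pagination signs found in the DOM string."""
        found = []
        for pattern, label in PAGINATION_HINTS:
            if re.search(pattern, html, re.IGNORECASE):
                found.append(label)
        return found
    ```

    If the returned list is empty, that is exactly the point where the fallback input field would kick in.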

    • @mrsai4740 · 2 months ago

      I personally haven't been able to get the Ollama model working; I only got gpt-4o-mini working. But that said, is cost a factor now that we can scrape with Llama?

    • @bluetheredpanda · 2 months ago

      @@mrsai4740 Well, I was thinking of cases where, for one reason or another, a local model couldn't be used.
      But aside from that, from my perspective, using a simple algorithm seems more viable, as we roughly know what to expect from the DOM in that regard. There aren't 10,000 valid-HTML ways to implement pagination, and with parsing and regex it should be fairly easy to handle.

    • @mrsai4740 · 2 months ago +1

      @@bluetheredpanda I agree, there should be a way to detect pagination in a page or in an item detail view of a page. However, I feel like capturing every single way one may paginate is going to be a challenge. Looking at the DOM itself on some sites, I've seen people use certain tags to put the pagination in, and I've also seen people use other elements for it, and these two setups had no specific CSS that made it obvious this was a paginator. What could be done, maybe, is have Selenium truly crawl a site by clicking every element it can, and try to parse and combine the results. It sounds doable, but the performance will probably be ass.

    • @bluetheredpanda · 2 months ago +1

      @@mrsai4740 A table wouldn't be valid HTML for pagination; it's definitely going to be an outlier (that's actually why I mentioned the input field, so users can enter their own targeting).
      I wouldn't click every element at random (even though he did mention turning this into a crawler, so you never know), but Beautiful Soup already returns the entire DOM. We can scan it for links and filter those based on the value of the href attribute, which would achieve the same thing while being a billion times faster.
      Re: CSS - modern CSS declarations would 100% allow you to target even if no specific class is used, e.g. a[href$="/page/2/"], which means "a link pointing to a page whose URL ends in /page/2/". Combine that with regex and we get a[href$="/page/\d+/"], which works for any number, not just page 2 (though CSS selectors don't actually accept regex, so that last step has to happen in code). There's definitely something in there.
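
    Since CSS attribute selectors only match literal suffixes (a[href$="/page/2/"] is valid, the \d+ variant is not), the "any page number" step has to move into code. A stdlib sketch of that href filter (with BeautifulSoup the same pattern could plausibly be passed as `soup.find_all("a", href=re.compile(...))`):

    ```python
    import re

    def find_page_links(hrefs: list) -> list:
        """Keep only links whose URL ends in /page/<number>/ - the regex
        equivalent of the a[href$="/page/N/"] selector discussed above."""
        page_re = re.compile(r"/page/\d+/$")
        return [h for h in hrefs if page_re.search(h)]
    ```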

  • @yazanrisheh5127 · 2 months ago +3

    I can't wait for the pagination part!

  • @thisisfabiop · 2 months ago +1

    Really looking forward to a follow-up video to this!

    • @redamarzouk · 2 months ago +1

      I'm working hard on that pagination. Whenever I feel like I have a universal approach and test it on a couple of websites, I still find it needs improvement. But the next video will be out soon.

  • @IanHobday · 2 months ago +1

    From a performance standpoint, I think it's better to use the LLM to analyze the source page layout and have it write a scrapy (or similar) scraper, and then to use that to scrape the data. Using the LLM to process all the data is fine for one or two pages, but if you need to do a big scrape of 1000s of pages, the performance is going to be very poor compared to writing a dedicated scraper with the LLM and using that.

  • @EmilioGagliardi · 2 months ago +2

    Check if the website has a sitemap; the links in there are usually the content-related ones. Plus, for SEO purposes, most business-related websites use meaningful keywords in the URLs, which you can regex to filter/sort/prioritize. The issue you're trying to articulate is how you let the user specify which content is scraped+paged. In your example, if you're on a shop site, you obviously want shop-related links; you don't really care about the privacy policy or the returns policy. So in the same way you provide tags for data extraction, you could also provide a limited set of content-type tags the user selects to guide which links are followed. Use the AI to make a best guess about the nature of the site and then provide some helpful tags: the AI detects an online store and a blog, so do you want to scrape both, the shop only, images only, or the blog only? If the user selects shop data only, you can get pretty far in finding the links to follow.
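
    The keyword idea above can be sketched in a few lines (the include/exclude word lists are illustrative assumptions; in practice they would come from the content-type tags the user selected):

    ```python
    def filter_links(urls,
                     include=("shop", "product", "item"),
                     exclude=("privacy", "returns", "terms")):
        """Keep sitemap URLs whose path mentions the content type the user
        asked for, and drop policy pages."""
        keep = []
        for url in urls:
            low = url.lower()
            if any(word in low for word in exclude):
                continue  # policy/legal page - not content
            if any(word in low for word in include):
                keep.append(url)
        return keep
    ```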

  • @lowbudgetgamer2439 · 2 months ago +4

    Thanks, buddy. Sorry about your account suspension; I hope they lift it soon.

  • @andretaylor2646 · 2 months ago +3

    Thank you, thank you, thank you. They cannot stop you.

  • @IdPreferNot1 · 2 months ago +1

    Thx. Definitely following this project!

  • @wasserbesser · 2 months ago +5

    Great work!
    I think it should also be an advanced spider, which checks the full site structure and then uses the parts it needs most.

    • @redamarzouk · 2 months ago

      Thank you!
      Yeah, some prompting to detect the structure of the page will get us the pagination crawler we want.

    • @DESX312 · 2 months ago +1

      I'm literally doing this right now on my own scraper. So far, promising early results.
      I'm scraping part of the HTML structure, sending it to Claude to parse, and then guiding the scraper based on the initial structure it gets. It also integrates Bright Data's scraping-browser functionality.

  • @idrinkmusic · 2 months ago +4

    He listened and he provided

  • @adsgsd8205 · 2 months ago +1

    What about saving the data in a database and running it for a while? That would be amazing... that would be a really useful tool for so many people.

  • @MrRossss1 · 2 months ago +1

    Thanks, great project. Regarding the pagination: you could have the user specify a placeholder in the URL to identify the pagination parameter, e.g. page= for the second page of ?product=12345&page=, then also specify start and end page numbers separately, and have it open the different URLs with each number inserted for that parameter.
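
    The placeholder idea above takes only a few lines to sketch (a hypothetical helper, not code from the repo; it assumes the user supplies the URL up to and including the page= parameter):

    ```python
    def build_page_urls(url_template: str, start: int, end: int) -> list:
        """Expand a URL ending in its pagination parameter into one concrete
        URL per page, e.g. '...&page=' -> '...&page=1', '...&page=2', ..."""
        return [f"{url_template}{n}" for n in range(start, end + 1)]
    ```

    The scraper could then simply run its existing single-page pipeline over each generated URL.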

    • @MrRossss1 · 2 months ago

      Or just have the user specify the page= parameter and the start and end pages.

    • @MrRossss1 · 2 months ago

      Or have the user just provide the URL for, say, the second page with the page= parameter in it, and ask the model to determine it. Not sure how reliable that would be in every case, though, if the pagination param wasn't obvious.

    • @redamarzouk · 2 months ago +1

      Don't you think this will burden the user with more steps? Shouldn't we give a suggestion first and let them validate or modify it?

    • @MrRossss1 · 2 months ago

      @@redamarzouk Yes, probably better to give a suggestion from the URL that they can change.

  • @unisol111 · 2 months ago +2

    A tutorial on how to dockerize this and launch it on your own Linux server, so you can access it from any device anywhere, would be great. Thanks for the app and code!

    • @redamarzouk · 2 months ago +1

      You're most welcome, and the Docker part will be coming, stay tuned!

  • @Cairthebest · 2 months ago

    An idea for the pagination: scrape the source code of the URL, send it to the LLM to recognize the structure, and apply the appropriate scraping code selected from a library of different approaches?

  • @solporcima · 2 months ago +2

    Hi Marzouk, first, thanks so much.
    Can you show us which files I need to change to use this on Linux?
    Are you planning to dockerize this app in the future?

    • @redamarzouk · 2 months ago

      You're welcome.
      About Docker, I've gotten this request so many times now that I'll have to create one. Stay tuned for that.
      To make it work on Linux, you'll need to change the path of the chromedriver because it's different on Linux, and of course the commands to create the virtual env are different, but everything else should be the same.
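
    One way to avoid editing the path by hand is to pick it per OS; a hedged sketch (the paths below are illustrative defaults, not the repo's actual values):

    ```python
    import platform

    def chromedriver_path() -> str:
        """Choose a chromedriver location based on the current OS instead of
        hard-coding one platform's path."""
        system = platform.system()
        if system == "Windows":
            return r"C:\tools\chromedriver.exe"
        if system == "Darwin":  # macOS
            return "/usr/local/bin/chromedriver"
        return "/usr/bin/chromedriver"  # typical Linux install location
    ```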

  • @ld-yt. · 2 months ago +1

    Here is how I would implement automatic processing of paginated and/or nested data:
    The first time the user runs the script, they should get the option of web scraping with pagination and/or nested data. We can prompt the LLM accordingly depending on what they chose, and have it store the nested page URLs per listing in the appropriate object, as well as extract the selectors for next/previous page elements (or at least one pagination link from the DOM, though that is probably tricky to implement).
    The user could have the option at the start to decide whether to process all pages and, if not, how many pages they want starting from the linked page. The same goes for nested pages: the user can choose to attempt extraction of additional data found in the "detail page" for each listing.
    The important part is that, for either, it must process the pages one step at a time while saving progress. If anything goes wrong or some pages are missing, there should be clear UI letting the user know, so they can try to scrape those pages again. We could also offer the user an input to give us the pagination selector if all else fails.
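
    The "one step at a time while saving progress" requirement above can be sketched with a JSON checkpoint file (the `scrape_page` callable and the file name are hypothetical; real code would also dedupe the failed list across resumed runs):

    ```python
    import json
    from pathlib import Path

    def scrape_with_checkpoint(urls, scrape_page, state_file="progress.json"):
        """Process pages one at a time, persisting progress after each page,
        so a crashed run can resume and the UI can report which pages failed."""
        state = {"done": {}, "failed": []}
        path = Path(state_file)
        if path.exists():
            state = json.loads(path.read_text())  # resume a previous run
        for url in urls:
            if url in state["done"]:
                continue  # already scraped earlier
            try:
                state["done"][url] = scrape_page(url)
            except Exception:
                state["failed"].append(url)  # surface these in the UI
            path.write_text(json.dumps(state))  # checkpoint after every page
        return state
    ```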

  • @py_coder_fpv · 2 months ago +14

    Why did they suspend the GitHub account?

    • @xXWillyxWonkaXx · 2 months ago +5

      It's working now. Check the link in the description. I was going to replicate the entire thing using Cursor AI, lol. Thank god he updated it. Thanks Reda

    • @guerra_dos_bichos · 2 months ago +2

      Probably tripped some automatic security measure

    • @MaxwellHay · 2 months ago

      Still unavailable

    • @DmitriZaitsev · 2 months ago +2

      Also curious; I've never heard of someone having their GitHub account suspended.

    • @redamarzouk · 2 months ago +1

      Only god knows. I didn't receive any explanation, and my request to reinstate my account can take days, weeks, or months (according to their forum).
      I'm not alone; this is happening to a lot of people.
      The odd thing is that similar AI scraper projects are still up.

  • @andyshaw-v2p · 1 month ago

    They ban you because, unlike others I will not mention, you are giving value without the financial lure of a subscription fee. This is bad for the business models of many others, so they will always try to shut you down. Keep being a rebel and giving REAL value; this is the only way someone with nothing can ever have a chance. Trust me, I used this to get a new income when I was at rock bottom, so thank you!!! Keep giving, IT WILL GIVE BACK :D

  • @Alimehdimalpara · 1 month ago

    Can you say exactly which Google API key to get? There are a lot of options out there; it's getting confusing.

  • @schongut9030 · 2 months ago +1

    A lot of modern websites don't use pagination but load as you scroll. You have to be able to handle that.

    • @redamarzouk · 2 months ago

      That will also be challenging, yeah!
      Do you have an example of a website in mind?

  • @rayhon1014 · 2 months ago

    Perhaps allow users to input a URL pattern with [1-X] at the end, so your code can turn it into page URLs and run on each one.

  • @remusomega · 2 months ago

    Have you considered making this more scalable by using an LLM to discover the exact settings to use with Beautiful Soup for each website?

    • @redamarzouk · 2 months ago +1

      Actually, that is a good idea, because it can make the whole process faster.
      So the idea is to give the page structure to the LLM first and let it decide where the data is located; after that, I'd create the markdown only from inside that tag, reducing the number of tokens and making the call faster.
      I thought about this approach before but was too lazy to rebuild the app around it.
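
    A naive sketch of the token-saving step described above: once the LLM has answered "the listings live in the element with id X", cut the DOM down to that element before converting to markdown. (Real code would use BeautifulSoup; this regex version only handles non-nested containers and is just to show the flow.)

    ```python
    import re

    def extract_container(html: str, tag_id: str) -> str:
        """Return only the element carrying the given id, so far fewer tokens
        reach the model on the actual extraction call. Falls back to the
        whole page if the id isn't found."""
        match = re.search(
            rf'<(\w+)[^>]*id="{re.escape(tag_id)}"[^>]*>(.*?)</\1>',
            html, re.DOTALL)
        return match.group(0) if match else html
    ```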

  • @SAVONASOTTERRANEASEGRETA · 2 months ago

    Very beautiful and interesting, but as a beginner I'd need to be shown how to configure the LM Studio server in Python ✅✅😊😊

  • @abdopower5913 · 2 months ago

    Can this tool scrape Google Maps? 🤔

  • @mrsai4740 · 2 months ago +1

    Idea: what if we don't run headless, but instead scrape as you navigate? At least this way we can try to capture multiple pages and item/detail-style websites.

    • @redamarzouk · 2 months ago

      True, I've been avoiding headless for a while, but apparently the new --headless=new attribute runs as if you're opening the page normally (I'm not so convinced).
      Anyway, if you don't like the headless option, go to assets.py and simply remove it from the list of options; you don't need to touch the code itself.
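
    A sketch of the options flag mentioned above, written so that turning headless off is a one-argument change (the function is hypothetical; `add_argument` and `--headless=new` are Selenium's real Chrome-options API):

    ```python
    def apply_browser_options(options, headless=True, user_agent=None):
        """Add arguments to a Selenium-style ChromeOptions object.
        Pass headless=False to watch the browser navigate."""
        if headless:
            options.add_argument("--headless=new")  # the new headless mode
        if user_agent:
            options.add_argument(f"user-agent={user_agent}")
        return options
    ```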

  • @guerra_dos_bichos · 2 months ago

    I love this content, but I'd like to see your take on good old regular scraping, like Scrapy with proxies.

    • @redamarzouk · 2 months ago

      The market will take time to embrace these new methods of scraping with AI, even with very cheap or free options like Groq and Gemini, so regular scraping is still the dominant way to do it in general. I see those tools adding this type of AI integration alongside what they already have, but it will take time.

  • @punktkommastrich007 · 2 months ago

    Is it able to scrape prices from variable products? I need this for promotional gifts (printed with your logo), e.g. 100 pcs printed with 1 color, 2 colors, or 3 colors, and the same for other quantities.

  • @dirkpostma77 · 2 months ago

    If a human can paginate, AI should be able to paginate. Truly universal would be a mechanism with image recognition: feed the AI a screenshot of the website and ask "which element does pagination?"
    This doesn't yet cover infinite scroll. You could try to scroll the page and detect whether more data is loaded. If so, there you have your pagination.
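
    The scroll-and-detect idea above can be sketched with Selenium: keep scrolling, and stop once document.body.scrollHeight stops growing. (The driver-facing function is an untested assumption about a regular webdriver instance; `execute_script` is Selenium's real API.)

    ```python
    def more_content_loaded(heights):
        """True if the last scroll grew the page height, i.e. infinite
        scroll is still feeding data and we should scroll again."""
        return len(heights) >= 2 and heights[-1] > heights[-2]

    def scroll_until_exhausted(driver, pause=2.0, max_rounds=50):
        """Scroll to the bottom, wait for the lazy load, and repeat until
        the page height stops growing."""
        import time
        heights = [driver.execute_script("return document.body.scrollHeight")]
        for _ in range(max_rounds):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(pause)
            heights.append(driver.execute_script("return document.body.scrollHeight"))
            if not more_content_loaded(heights):
                break  # nothing new was loaded: that's the "last page"
        return heights
    ```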

  • @Status-romance · 2 months ago

    Hi, I want to scrape devices like laptops, mobiles, and tablets that support USB-C charging up to 45 W. I want to scrape Amazon, and also automate it so that it scrapes again every day and updates the database, and if a device isn't present, it creates it in the database. I'm a MERN-stack developer, so how can I do that?

  • @jbssfl · 2 months ago

    Can this do dynamic JavaScript-related scraping for something that's not tied to an actual page/route?

  • @Jaridb · 2 months ago

    What if I wanted to use this scraper to reference a CSV file to do skip tracing?

  • @iltodes7319 · 2 months ago

    Really hard to handle all pagination automatically for any website while also caring about token cost across a mass of scraped pages. But it's a great challenge 😅

  • @onlyms4693 · 2 months ago

    Is it possible to use a vision model to scrape a website that blocks or flags scrapers, by setting up a virtual environment where the LLM can control and open the website, scrolling, checking, and even pressing buttons, so it can fully scrape the site? So a normal LLM plus an agent, with vision added.
    That's just my thought; I don't know if it's too complex or doesn't make sense in terms of efficiency, especially with the compute power it would need.

  • @adeolaojo · 1 month ago

    Why was your GitHub account suspended?

  • @tirsoan · 2 months ago

    How can I make it return more than 10 products?

  • @na1du · 2 months ago

    Hello, is there a Docker container for Unraid?

  • @dwaynemcpherson5379 · 2 months ago

    Hello, I can't get a display after it scraped.

  • @Chiren · 2 months ago

    I'd love a Docker container for this.

  • @abd-elrahmanmahmoud3167 · 2 months ago

    The pagination thing you can do by checking the URL; I usually do this.

    • @redamarzouk · 2 months ago

      How do you do it when you have page numbers with no next button, versus when there's only a next button?

  • @JuankM1050 · 2 months ago

    Can you add an option to send the sorted file back to the LLM to add more headers to the table, or in general to modify the table, without having to do the scrape again?

    • @redamarzouk · 2 months ago

      I always save the markdown for every scrape inside a folder called output in the project.
      You only need to tweak the code a bit to run on the same markdowns again with the fields that you need.
      But yeah, adding this feature would be nice. Can you tell me when you've needed it?

  • @EmeraldTablets-info · 2 months ago

    Can it collect emails? On Dun & Bradstreet?

  • @kamleshbhandari8680 · 2 months ago +1

    This only fetches the first page. What can we do to scrape all pages?

  • @LibertyRecordsFree · 2 months ago

    I was dreaming of this... Have you included a way to anonymize the headers / set up a VPN to avoid being banned?

    • @redamarzouk · 2 months ago

      I have a random user agent chosen every time the app launches a new process. It's in the assets file; you can add more there easily without touching the code in the other files!
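
    A sketch of the assets-file idea above (the pool and the helper are illustrative, not the repo's actual values; with Selenium the chosen string would be passed via `options.add_argument(f"user-agent={ua}")`):

    ```python
    import random

    # Pool of user agents; one is picked at random for each new scraping
    # process. Add more entries here without touching the rest of the code.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    def pick_user_agent() -> str:
        return random.choice(USER_AGENTS)
    ```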

  • @satyaviswapavanranga5915 · 2 months ago

    What if it has a login page?

  • @lalamax3d · 2 months ago +1

    Hey, just wanted to say thanks. Please, in the next version:
    1- I want my Llama 3.1 8B to work via Ollama; is there anything special I need to do for installation?
    2- Pagination (the setting is critical).
    3- For the fields we define (i.e. name / price), how the LLM maps "name" to a class like "title" (slightly confusing)...

    • @redamarzouk · 2 months ago

      Noted for the first 2 points.
      For the third, it creates everything as a string, but the LLM is smart enough to understand how to format other types, numbers for example. So no loss in format.

  • @thisisfabiop · 2 months ago +2

    Pagination is so important, but going one level deeper to scrape data from each item would also be a great add!
    In addition to pagination with numbers, a complementary approach (perhaps an additional toggle?) involves pages with a Next button. The forward button is an anchor element: navigate to the next page and grab the link, until you reach the last page, where the forward button's href value is #.
    Perhaps a similar approach can be used to scrape data from each item's page: list all the item links present on every page, and then scrape?
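
    The Next-button loop described above boils down to a few lines once the page-fetching part is abstracted away (`get_next_href` is a hypothetical callable that fetches a page and returns the href of its forward anchor, or None if there isn't one):

    ```python
    def follow_next_links(start_url, get_next_href, max_pages=100):
        """Collect page URLs by repeatedly reading the Next anchor's href,
        stopping when it degrades to '#' on the last page."""
        urls = [start_url]
        while len(urls) < max_pages:  # safety cap against loops
            nxt = get_next_href(urls[-1])
            if not nxt or nxt == "#":
                break  # last page: the Next button no longer points anywhere
            urls.append(nxt)
        return urls
    ```

    The same collected-URL list could then feed the per-item "detail page" pass.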

    • @mrsai4740 · 2 months ago

      Yes, for websites that have a list/detail view, it would be nice if we could make Selenium go into each one and scrape data. I think the problem with this is the number of tokens it will end up producing.

    • @redamarzouk · 2 months ago +1

      So would you suggest having another way of launching the application, only to crawl the URLs of the pages and put them somewhere, and then launch the scraping on those URLs?

    • @thisisfabiop · 2 months ago

      @@mrsai4740 Right, but some applications may make it worth it; let's say analyzing competitors. With the mini models, costs are quite competitive, and if you use local computation it's free!

    • @thisisfabiop · 2 months ago

      @@redamarzouk That would be amazing

  • @user-wo3ym4bj8e · 2 months ago

    I created a Google API key, but on running it says the API key is invalid... even if it's free, must one insert card details to use it? I also noticed it extracts all the raw data. If I give a tag, wouldn't it be better to search only for that, in order to optimize tokens per minute? On Groq, for example, I get this problem: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.1-70b-versatile` in organization `org_0.......` on tokens per minute (TPM): Limit 20000, Used 0, Requested 46892.

  • @nuno2032 · 2 months ago

    Very nice job, but you are not facing the real problem. Does it work with JS-heavy sites? For a basic site with no JS, you can make a web-scraping tool in minutes with Claude. Plus, pagination in many specific cases is really complicated. In my opinion, you should use Playwright to give the user the ability to create a pattern specific to that website (clicking around to memorize what to do to get the info) and then proceed with the auto-scraping part.

  • @UberLinny · 2 months ago

    Is anyone able to answer a support question on Discord, by chance? It's pretty quiet there.

  • @VicKipyego · 2 months ago

    NoSuchDriverException: Message: Unable to obtain driver for chrome; For documentation on this error, please visit:
    How can I get around this error?

    • @BradleyGreen-q8u · 1 month ago

      Me too! Did you find a fix?

  • @MrMiguelChaves · 2 months ago +1

    We should create a scraper to scrape your site to get the code of the scraper.

    • @redamarzouk · 2 months ago +1

      Good idea 😄😄

  • @shankar9063 · 2 months ago

    How fast is the local Llama 3.1?

    • @redamarzouk · 2 months ago

      It really depends on the machine: better GPU + smaller model = faster inference, and vice versa.
      In my case I got about 15 tokens per second, which feels a bit on the slow side, but it's not very bad.

  • @kritikusi-666 · 2 months ago

    Dafuq did GitHub suspend your account for? Did you mark it as "Educational" or

  • @hatemjaber · 2 months ago

    Are you using Firecrawl?

    • @redamarzouk · 2 months ago +1

      Not anymore; I used it in the first video of this series.

    • @hatemjaber · 2 months ago

      @@redamarzouk Are you using Puppeteer, Playwright, or Selenium? I'm curious about bot detection...

  • @MrN00N3_ · 2 months ago

    Does it work on Google Maps for scraping leads?

    • @mrsai4740 · 2 months ago +1

      I tried this with Google Maps (searching golf courses) and it seemed to work, but it only works on a small set of data, since Google only loads about 6 results unless you scroll. I got around this by copying the element with all the items I was interested in after scrolling to the end of the list, then shoving that into an HTML file and using the local file as the URL. The results seemed promising at first, but it looks like the model gives up and doesn't capture everything.

    • @redamarzouk · 2 months ago +1

      You're right.
      When given really long HTML files, smaller models (and even worse, small local 7B or 8B models) seem to give up on getting everything.
      So going with GPT-4o or Gemini 1.5 Pro (they give 50 free requests per day) will be your best option in this 100K+ token case.

  • @mrinalraj4801 · 2 months ago

    Why did your GitHub account get suspended? Which policy did you violate?

    • @redamarzouk · 2 months ago +1

      I have no idea. I haven't received any emails from them (I checked my spam folder and everywhere else).
      I thought it was about the user agent I used, but I searched for that exact line of code and found it in 20k+ repositories, so I truly don't know.

  • @miguelgargallo · 2 months ago

    Will I get banned if I upload it privately on my GitHub?

    • @redamarzouk · 2 months ago

      I can't guarantee anything, so it's really up to you. I got banned; hopefully no one else will.

  • @eduardmart1237 · 2 months ago +1

    Is it free?

    • @redamarzouk · 2 months ago +1

      When you use Llama 3.1 8B, Groq, or Gemini (up to 1,500 requests per day), yes, it's free.

  • @GodX36999 · 2 months ago

    I think we can just code and scrape, but we'd need a Chrome extension. So, not useful.

  • @edma6613 · 2 months ago

    Can you scrape PDF files from a website?

    • @redamarzouk · 2 months ago

      No, the scraper captures data that exists in the HTML of the website.

  • @natassawakBros · 2 months ago

    Thanks Reda! That's really interesting. I tried to scrape data from websites like Amazon, scraping only a single product page with Llama 3.1. However, I hit the token-limit issue, even though I have a powerful MacBook M3 with 38 GB. The same page works well with Gemini 1.5 and GPT. Do you have an explanation, please?

    • @redamarzouk · 2 months ago

      You're most welcome.
      Yeah, with Llama 3.1 8B in LM Studio you'll have to increase your context length to more than 100K tokens (you can do this in the advanced configurations).
      The smaller models, as good as they are for their size, really can't keep track of the whole completion when extracting from pages whose markdown exceeds 60K tokens, and Amazon's pages are among them.

  • @s6yx · 2 months ago

    I'd rather use Playwright instead of Selenium.

  • @acharafranklyn5167 · 2 months ago

    Not that I'm a bad guy, but this project has been implemented by someone else... all you could have done is give credit.

    • @redamarzouk · 2 months ago +8

      What?
      This code is 100% mine, and I first created this project 4 months ago; go back to my channel and watch a whole 20-minute video about it.
      You could say the idea has been around for some time, and I've seen big repos trying to turn websites into structured data (Firecrawl, Jina AI, ScrapeGraph AI, etc.); from there I tried to create a simple, fully open-source version because people in the comments asked me to.
      So yeah, credit to those libraries, but saying I've taken this from someone else is delusional.

  • @VicKipyego · 2 months ago

    err

  • @centurion7722 · 2 months ago

    I think that without significant user input, it's unlikely to work. First, the script needs to capture the JS element code from the site. Then the user must provide what they are looking for, such as specific pages or other elements. Finally, the script + LLM should automatically extract the relevant CSS selector or XPath.