🧠 Turn Websites into Powerful Chatbots with LangChain And Chroma

  • Published Aug 25, 2024

Comments • 53

  • @y.pproduction812
    @y.pproduction812 1 year ago +3

    That's freaking awesome man, you're feeding the tech enthusiasts community with fresh and quality content ;)

    • @bitswired
      @bitswired  1 year ago +1

      My man YP! Thanks 😁
      Let’s keep going together 🚀

  • @pooriaarab
    @pooriaarab 1 year ago +1

    How can we create the embeddings for a website (1,800+ webpages) once, check whether pages have been updated, and build a full chatbot (not just a single query)?

    • @bitswired
      @bitswired  1 year ago

      This is definitely doable, it just requires a bit more work.
      When we create the embeddings, we can save them in the vector database, and if a page changes, re-compute the embedding for that page only.
      That keeps your chatbot up to date.
      If you need help (advice, ideas), or are looking for someone to do it for you as a job, let me know and we can chat 😁
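A minimal sketch of that change-detection step in plain Python, assuming you store a content hash alongside each page's embedding (for example in Chroma document metadata) at indexing time; the function and argument names here are hypothetical:

```python
import hashlib

def pages_to_reembed(pages, stored_hashes):
    """Return URLs whose content changed since the last indexing run.

    `pages` maps url -> current page text; `stored_hashes` maps
    url -> the sha256 hex digest recorded when the embedding was
    created. New pages (no stored hash) are also returned.
    """
    changed = []
    for url, text in pages.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(url) != digest:
            changed.append(url)
    return changed
```

Only the URLs this returns need to be re-scraped and re-embedded; everything else in the vector store stays untouched.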

  • @_charleshoang
    @_charleshoang 5 months ago +1

    Thanks for a great tutorial. Would I need GPT-4 to get this to work? I'm trying it out with GPT-3.5 and it's returning '0 URLs extracted'.

    • @bitswired
      @bitswired  12 days ago

      No problem!
      Are you sure the sitemap is in the proper format?
      Also, is the sitemap accessible? (The request could be blocked by the website, for instance.)

  • @natevaub
    @natevaub 1 year ago +2

    Great job bro, this is a very interesting project! Keep it up! :D

    • @bitswired
      @bitswired  1 year ago +1

      Thanks bro ❤️ let’s gooooooo 🚀

  • @mariof.1941
    @mariof.1941 1 year ago +1

    If anyone else struggles with AzureOpenAI: you need to include model=embedding-name, chunk_size=1, without this it won't work.
    But then enjoy RateLimitError ;-)

    • @bitswired
      @bitswired  1 year ago +1

      Thanks for letting us know 👍🏽
      What happens with the rate limit though? Is AzureOpenAI more limited on usage?

  • @magicalPDF
    @magicalPDF 1 year ago +1

    amazing

    • @bitswired
      @bitswired  10 months ago

      💚

  • @konstantinrebrov675
    @konstantinrebrov675 1 year ago +1

    If I have a website with many, many pages, such as online documentation, how can I use wget or some other tool to crawl the website and get a list of pages to be indexed?

    • @bitswired
      @bitswired  1 year ago

      There are 2 main ways to build the sitemap yourself:
      1. You have access to the backend:
      If you know how the website is built (from a database, files, or any other config), you could potentially build the sitemap from this information alone.
      2. You don't have specific information:
      In this case you need to do it like Google would. First you get a page, then you parse all the links on that page, then you visit each link and repeat …
      To do so I would not use wget but a language like Python, which is more flexible than bash and has libraries to easily parse HTML (like BeautifulSoup)
      What is your situation?
      I would be happy to help you if I can 😁
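The crawling approach in option 2 can be sketched with just the standard library (html.parser instead of BeautifulSoup, to keep it dependency-free). Treat it as an illustration rather than a production crawler: no politeness delays, no robots.txt handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def build_sitemap(start_url, max_pages=100):
    """Breadth-first crawl that stays on the start URL's domain;
    returns the sorted list of pages visited."""
    domain = urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0).split("#")[0]  # drop fragment anchors
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except OSError:
            continue  # unreachable page: skip it
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == domain:  # stay on-site
                queue.append(absolute)
    return sorted(seen)
```

The resulting URL list can then be fed to the same loading and embedding pipeline the video builds from the sitemap.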

  • @picklenickil
    @picklenickil 1 year ago +1

    Be sure to read the terms of use BEFORE scraping! Websites will flag your IP. Also, just use a proxy.

    • @bitswired
      @bitswired  1 year ago +1

      You’re right 👍🏽
      Yes, a proxy helps. You can also use Selenium's undetected_chromedriver, which does a good job of hiding itself apart from the IP (user agent, flags, …)

  • @T1221T
    @T1221T 1 year ago +1

    Thanks for the video, it's very helpful. I'm wondering about the GPT model you used: was it GPT-4? Is it possible to recreate this using GPT-3.5-turbo?

    • @bitswired
      @bitswired  1 year ago +1

      I’m very glad you liked it! Thanks :)
      Actually it was GPT-3.5.
      In the code, at some point we call this class: ChatOpenAI()
      It uses GPT-3.5 by default.
      To use GPT-4 you can do something like: ChatOpenAI(model_name="gpt-4")

  • @SubhashPalsule
    @SubhashPalsule 1 year ago +3

    Awesome! 😍🤩

    • @bitswired
      @bitswired  1 year ago +1

      Thank you! Much appreciated 😁

  • @marcojacome
    @marcojacome 1 year ago +2

    Great work

    • @bitswired
      @bitswired  1 year ago +1

      Thanks man 🙌🏽

  • @akash_chaudhary_
    @akash_chaudhary_ 1 year ago +1

    Hi, I tried to integrate with Azure OpenAI, but the problem I'm facing is that when I run it, after "Loading URLs content ..." it throws an exception that libmagic is unavailable.

    • @bitswired
      @bitswired  1 year ago

      Hi :)
      Can you try to install libmagic on your computer?
      What OS are you using?

    • @bitswired
      @bitswired  1 year ago

      @@akash_chaudhary_ No problem :)
      Well done 👍🏽

  • @frazuppi4897
    @frazuppi4897 1 year ago +2

    Amazing man!

    • @bitswired
      @bitswired  1 year ago +1

      Thanks for the support man!
      It means a lot :)

  • @leewsimpson1
    @leewsimpson1 1 year ago +2

    What if there is no sitemap?

    • @bitswired
      @bitswired  1 year ago +4

      If the website has no sitemap, you still need a way to scrape the content.
      One way is to do as web crawlers do: you start from a web page, then visit all the links present on the page. Then for each page you visit, you again visit all the links it contains … and so on.
      After some time you should have visited the majority of the website, and you can then build the chatbot.
      Does it make sense to you?
      Let me know if it’s not clear :)

    • @leewsimpson1
      @leewsimpson1 1 year ago +2

      @@bitswired thanks - super clear. have you used any frameworks that may help?

    • @bitswired
      @bitswired  1 year ago +3

      No problem :)
      Do you know Scrapy? It’s a convenient Python package for scraping data and crawling websites 📊

  • @mariegautier3765
    @mariegautier3765 1 year ago +2

    🔥🔥🔥

    • @bitswired
      @bitswired  1 year ago +1

      Mims 🫶🏽

  • @ChiragDubey-is4ro
    @ChiragDubey-is4ro 1 year ago

    Exception: Invalid file. The FileType.UNK file type is not supported in partition. This exception keeps coming up. How do I solve it?

    • @bitswired
      @bitswired  12 days ago

      Hey :)
      Can you point out where the error occurs?
      Do you have the full error trace?

  • @winglight2008
    @winglight2008 1 year ago +1

    I wrote a docs-based QA app the same way as yours yesterday, but I'm stuck on a problem: "exceeds token limit of model". I planned to solve it by setting the number of similar documents the vector store search returns, but the chain method accepts no such argument. Do you have any idea for this? Any hint will be appreciated.

    • @bitswired
      @bitswired  1 year ago +1

      Hey! Sure, I would gladly help :)
      What chain are you using?
      If you use the RetrievalQA chain, you need to provide a vector store, like the Chroma database in the video.
      You also need to first split your documents into chunks that fit in the model context. You can use the character splitter to do so.
      Also, there are multiple types of chains: in the example we use Stuff, which uses all the text retrieved in the similarity search as input.
      But other chain types like MapReduce process the retrieved documents iteratively by cutting them into smaller pieces. That can also help overcome the token limit.
      Can you give me more details about your issue?
      We could also discuss on my Discord tomorrow if you want
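The chunking step mentioned above is what LangChain's character splitter handles; the idea reduces to a sliding window, sketched here in plain Python. Note the sizes are in characters, not tokens, so pick a chunk_size well under the model's context limit:

```python
def split_text(text, chunk_size=1000, overlap=100):
    """Cut `text` into fixed-size chunks with some overlap, so each
    chunk fits the model context while keeping continuity at the
    seams. A rough stand-in for LangChain's CharacterTextSplitter."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reached the end of the text
    return chunks
```

Each chunk is then embedded separately, and only the top-matching chunks are stuffed into the prompt, which is how the retrieval chain stays under the token limit.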

    • @winglight2008
      @winglight2008 1 year ago +1

      @@bitswired Thanks a lot. I got it halfway working the way you said, with the whole process including splitting, similarity search, and sending the result to the LLM for the query. I need to do QA with docs from Milvus, so I use the Stuff chain to compose Milvus and the LLM, but there's no parameter to limit the number of documents the vector search returns, which defaults to 4 in the similarity_search method. I searched the entire LangChain online docs and found nothing about this case. I'll join your Discord after work.

    • @bitswired
      @bitswired  1 year ago

      No problem :)
      So you want to control the number of relevant documents returned by the vector database, right?
      I've found an example.
      When you call the as_retriever() function (like we do in the example with the Chroma database), we can specify the parameter k we want to use:
      retriever = db.as_retriever(search_kwargs={"k": 1})
      The example is from this web page: python.langchain.com/en/latest/modules/indexes/retrievers/examples/vectorstore-retriever.html
      Let me know if you still have problems ;)
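What that k controls under the hood: the retriever ranks the stored vectors by similarity to the query embedding and keeps only the top k. A toy version with cosine similarity, in pure Python with hypothetical vectors:

```python
import math

def top_k_indices(query_vec, doc_vecs, k=1):
    """Indices of the k document vectors most similar to the query,
    ranked by cosine similarity. This is the quantity that
    search_kwargs={"k": k} asks the vector store for."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

A smaller k means fewer retrieved chunks in the prompt, which is exactly why lowering it helps with "exceeds token limit" errors.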

    • @winglight2008
      @winglight2008 1 year ago +1

      @@bitswired Thanks a lot. I read the example you mentioned; however, I have no idea how to set this parameter in the chain function shown in your video.

    • @bitswired
      @bitswired  1 year ago +1

      No problem :)
      In my code example, at some point after creating the Chroma database I call the as_retriever() function on line 78:
      self.chain = RetrievalQAWithSourcesChain.from_chain_type(
          ChatOpenAI(),
          chain_type="map_reduce",
          retriever=docsearch.as_retriever(),
      )
      Here, instead of the plain as_retriever(), add the parameter:
      retriever=docsearch.as_retriever(search_kwargs={"k": 1})
      I don’t know if that answers your question.
      Otherwise, do you have a GitHub repository with your code, or could you share a code sample of what you are currently doing?

  • @kashifraza3339
    @kashifraza3339 1 year ago +2

    Some websites do not give access to a sitemap URL. How can we search those?

    • @bitswired
      @bitswired  1 year ago

      Hi :)
      Do you have some programming knowledge?
      If so, you could build a web crawler to generate a sitemap:
      - Start from any page on the website
      - Get the web page content and extract all the links
      - Visit each link and, from each new page, proceed recursively
      You could use Python to do so, with libraries like BeautifulSoup and Scrapy.
      Let me know if you would need more details 😁.

    • @srijananand1319
      @srijananand1319 1 year ago +1

      @@bitswired I have made a .xml sitemap file locally on my system; how can I incorporate that into the program shown in the video? I'll be really grateful if you help me. Thanks!

    • @bitswired
      @bitswired  10 months ago

      Did you find a solution? Sorry for the late answer, I missed your comment.
      We could have a chat so I can explain.