Hope this video helps improve your multimodal AI systems! The code, dataset, and model links are available in the description :)
Thanks a lot🎉 Best CLIP tuning video I've seen.
Glad it was helpful!
@shaw thanks for sharing the video. Really appreciate this.
I am working on a task where I want to create a group that contains all the variations of a product. Let's say the variations can happen in colour.
I am thinking of creating a single embedding (text and image) for every product, then applying some sort of similarity on them. This way, I will capture both modalities for the product. I expect all the variations of a product to have a small cosine angle and thus to be groupable together.
1. Can I use CLIP for this task? What is the best way to merge text and image?
2. Are there any better ways to solve this problem, if not CLIP?
You can definitely use CLIP here. Why do you want to merge the text and image embeddings for each product?
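In case it helps, here's a minimal sketch of the merge-then-compare idea using Hugging Face's CLIP. The model name, the simple averaging strategy, and the product titles/files are all assumptions for illustration, not a tested recipe:

```python
# Minimal sketch: merge CLIP text + image embeddings per product, then
# compare products by cosine similarity. Averaging the two modalities
# is one simple merge strategy, assumed here for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def product_embedding(title: str, image_path: str) -> torch.Tensor:
    """One vector per product: L2-normalized average of text and image embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[title], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    merged = (text_emb + image_emb) / 2
    return (merged / merged.norm(dim=-1, keepdim=True)).squeeze(0)

# Variations of the same product should score high (hypothetical files).
emb_a = product_embedding("Acme T-Shirt (Red)", "red_shirt.jpg")
emb_b = product_embedding("Acme T-Shirt (Blue)", "blue_shirt.jpg")
print(f"cosine similarity: {torch.dot(emb_a, emb_b).item():.3f}")
```

From there, any standard clustering (e.g. agglomerative clustering with a cosine-distance threshold) can turn pairwise similarities into variation groups.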
This is exactly what I needed to solve my problem. Do you happen to have some references or links to research papers on this?
Great to hear! This page from sbert might be helpful: www.sbert.net/examples/applications/image-search/README.html
Hi Shaw. I really like your channel. What do you think about creating a RAG system starting from a database with thousands of PDFs that can contain information in figures, flow charts, tables, and images? We certainly can't manually preprocess all the PDFs. What embedding method do you suggest?
I recently did an interview with Jason Liu on this topic which I think would be helpful: th-cam.com/video/WLCbHuRr0_0/w-d-xo.htmlsi=qNrSxMnDV5YWA548
The short answer is to build an initial (simple) version of your system (e.g. only consider text from PDFs to start), then iteratively improve it based on the queries users type into it. A minimal text-only starting point might look like the sketch below.
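To make that "simple first version" concrete, here's a minimal sketch of text-only PDF retrieval. The library choices (pypdf, sentence-transformers), the embedding model, the file names, and chunking by page are all assumptions for illustration:

```python
# Minimal sketch: text-only PDF retrieval as a first RAG iteration.
# pypdf for extraction and all-MiniLM-L6-v2 for embeddings are
# illustrative choices, not specific recommendations.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1) Extract raw text, one chunk per page (crude, but fine to start).
chunks = []
for path in ["report_a.pdf", "report_b.pdf"]:  # hypothetical files
    for page in PdfReader(path).pages:
        text = page.extract_text() or ""
        if text.strip():
            chunks.append({"source": path, "text": text})

# 2) Embed all chunks once, up front.
chunk_embs = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

# 3) Retrieve top matches for a user query by cosine similarity.
query = "What does the onboarding flow chart describe?"
query_emb = model.encode(query, normalize_embeddings=True)
scores = util.cos_sim(query_emb, chunk_embs)[0]
for idx in scores.argsort(descending=True)[:3]:
    print(chunks[int(idx)]["source"], f"score={float(scores[idx]):.3f}")
```

Once real queries come in, the failure cases tell you where to invest next: table extraction, image captioning, or a multimodal embedding model.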
@ShawhinTalebi Thank you!
Shaw, this was great! Question: suppose you were to instead use e-commerce data that has search terms, product titles, images, and clicks. What would you use as anchor/positive/negative? One could use search click data or only title/image, and I feel there are arguments for either.
Great question! It depends on your use case. For example, if your goal is to improve search you might be able to get by without fine-tuning and just extract key search terms from images using a vision model.
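As a concrete sketch of that no-fine-tuning route, an off-the-shelf captioning model can pull searchable terms from product images. The model choice, the naive keyword filter, and the image file are assumptions for illustration:

```python
# Minimal sketch: extract candidate search terms from a product image
# with an off-the-shelf vision model (BLIP captioning here; the model
# and the simple stopword filter are illustrative assumptions).
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def search_terms(image_path: str) -> list[str]:
    """Caption the image, then keep content words as candidate search terms."""
    caption = captioner(image_path)[0]["generated_text"]
    stopwords = {"a", "an", "the", "of", "on", "in", "with", "and"}
    return [w for w in caption.lower().split() if w not in stopwords]

print(search_terms("red_sneaker.jpg"))  # hypothetical image file
# e.g. ['red', 'sneaker', 'white', 'sole'] -> index these alongside the title
```

If fine-tuning does turn out to be necessary, the click data maps naturally onto triplets: the search query as anchor, the clicked product as positive, and a shown-but-not-clicked product as negative.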