Segment Anything Paper Explained: New Foundation Model From Meta AI Is Impressive!

  • Published on Jan 5, 2025

Comments

  • @botsknowbest  1 year ago +6

    Thank you for watching! Feel free to ask any questions about SAM, the paper, or how to run it locally.
    UPDATE (4/21): Hugging Face 🤗 just added SAM into their library: huggingface.co/docs/transformers/main/en/model_doc/sam

  • @UnchartedExperience  4 months ago +2

    I usually never like any video but you made me click the like button so HARD! u r good man

  • @aditya-lo6jy  1 year ago +1

    Great Explanation on SAM!

  • @kelvinpraises  1 year ago +2

    Very in-depth thanks

  • @yoavsnake  11 months ago +2

    An academic paper in the thumbnail always lets me know that the video is likely well researched. Nice!

  • @PA-eo7fs  1 year ago +1

    As someone in technology, I know this channel will gain followers; you go into detail.

  • @ColorfullHD  1 year ago +1

    Great explanation. What I can't get my head around is how the training data for SAM is generated by a model in itself. Wouldn't you get a transfer of bias (e.g. the bias in the training set generating model is represented in what SAM learns)?
    I mean, if that bias is low, it can work, but conceptually that's a fairly odd thing to do in the field, right?

    • @botsknowbest  1 year ago +1

      @ColorfullHD That's a great question!
      So Data Engine annotated the training images in three stages. Human annotators were involved in the first two stages, so the model essentially learns from human annotations. The third stage is fully automatic; Data Engine annotates 11M images, producing 1.1B masks, and SAM is then trained on this data. So we have a mask generation model (Data Engine) that generates training instances for another similarly-structured mask generation model (SAM). This approach can seem unintuitive, but bear in mind a few things:
      First, you are generating your training data only once, so you can spend far more resources on getting it right. For example, the paper mentions that they used a special version of SAM for the fully automatic stage that sacrifices inference speed for improved mask generation properties. Second, you can post-process the data to improve quality and robustness. It seems like they applied some heavy post-processing after the masks were generated. They removed low-quality masks, identified and selected only confident and stable masks, applied some filtering tools to remove duplicates, and increased the quality of small masks. With this post-processed dataset, you can then train models that are optimized for your end task rather than for high-quality data generation.
      Another thing is that the quality of a synthetic dataset is almost always worse than that of a human-annotated dataset, and that's the price you are paying for scaling to billions of data points. And this trade-off is usually worth it. For example, in neural machine translation, using low-quality synthetic training pairs produced by back-translation (i.e., you use a different model that translates in the opposite direction to get new training instances) helps models to generalize better and improve translation quality.
      I hope this helps - let me know if you have more questions! 🙂
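
      The post-processing described above (keep only confident and stable masks, then drop duplicates) can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the `Candidate` class, the `filter_masks` function, the threshold values, and the use of mask IoU for duplicate removal are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one generated mask."""
    pixels: frozenset      # (row, col) pixels covered by the mask
    predicted_iou: float   # the model's own quality estimate for the mask
    stability: float       # how little the mask changes when its threshold is jittered


def mask_iou(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_masks(cands, min_iou=0.88, min_stability=0.95, dup_iou=0.7):
    """Keep confident, stable masks and greedily drop near-duplicates.

    All three thresholds are illustrative assumptions.
    """
    confident = [
        c for c in cands
        if c.predicted_iou >= min_iou and c.stability >= min_stability
    ]
    kept = []
    # Visit the most confident masks first so that they win over duplicates.
    for c in sorted(confident, key=lambda c: c.predicted_iou, reverse=True):
        if all(mask_iou(c.pixels, k.pixels) <= dup_iou for k in kept):
            kept.append(c)
    return kept
```

      Visiting candidates in order of confidence means that when two masks overlap heavily, the one the model trusts more is the one that survives.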

  • @kobic8  1 year ago +1

    Thanks for the wonderful vid! I am interested in *labeled* masks. Have you seen the work on the hybrid of Grounding DINO + SAM? I'm curious how I can use a labeled dataset I have (of sea objects) to teach the model not only to detect a boat/ship but also to identify the name of the marine vessel.

    • @botsknowbest  1 year ago +2

      Thanks!!
      Do you mean this one? github.com/IDEA-Research/Grounded-Segment-Anything
      Yeah, that's a great project and definitely addresses your problem. SAM will probably work fine for you out of the box, but you will have to fine-tune the Grounding DINO model on your dataset. Have you tried any fine-tuning? If so, how did it go?
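
      One small piece of glue in a detector-plus-SAM pipeline like this is converting the detector's boxes into box prompts. A minimal sketch, assuming the detector returns boxes as normalized (center x, center y, width, height) and the segmenter expects pixel (x0, y0, x1, y1) corners; check both conventions against the versions you actually use. The function name is hypothetical.

```python
def cxcywh_norm_to_xyxy_pixels(box, image_width, image_height):
    """Convert one normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * image_width
    y0 = (cy - h / 2) * image_height
    x1 = (cx + w / 2) * image_width
    y1 = (cy + h / 2) * image_height
    return (x0, y0, x1, y1)


# Example: a box centered in a 640x480 image, half as wide and a quarter as tall.
print(cxcywh_norm_to_xyxy_pixels((0.5, 0.5, 0.5, 0.25), 640, 480))
```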

    • @kobic8  1 year ago +1

      @botsknowbest I still haven't tried, since I have no clue how to start fine-tuning the Grounding DINO weights :( I have opened an issue in their repo. I was wondering if maybe you have any clue how it could be done.

    • @botsknowbest  1 year ago

      @kobic8 You can start with these two notebooks to see how Grounding DINO is used and what the input/output data look like:
      github.com/IDEA-Research/Grounded-Segment-Anything/blob/main/grounded_sam.ipynb
      colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb
      This is the Hugging Face repository: ShilongLiu/GroundingDINO
      And then you can fine-tune it as any other pre-trained model: huggingface.co/docs/transformers/training

  • @ashwiniyadav464  1 year ago +1

    Very good explanation... Can SAM work on medical images?

    • @botsknowbest  1 year ago +1

      Thank you! It should work well if you fine-tune it to your dataset. Out of curiosity, I just tried to segment some MRI brain images, and it seems fine for zero-shot: www.dropbox.com/s/h538j2oeymps84v/mri_sam.png

  • @lifewithG-bengs  1 year ago +1

    Wow, nice! Is it open source?

    • @botsknowbest  1 year ago

      Yes, it was released under the Apache 2.0 license, so it can be used both commercially and non-commercially. 🙂

  • @berkertaskiran  1 year ago

    Can these models key/roto video hair strands as well as a human compositor?
    Take your video as an example. It is more or less acceptable for YT. But it is unacceptable even for a short film. You can see the despilled edges. You probably kept those because you wanted to preserve edge details. If you wanted to get rid of those you would lose detail. To do it both at the same time you need to use more advanced keying techniques pro vfx artists use than just picking a color, playing with balance and blur.
    If an AI model isn't as good as that, it can be used in social media and for people to have fun. But if you want to use it in movies to actually make it believable, to allow more people to make movies more easily, to really take advantage of it, and to save a ton of time and money, that will require some precision.
    In films you can't really tell if a scene had green/blue screen even if you zoomed 400x. It will have perfect transition of edges that even if it was shown side by side, you really can't tell it. I would love to see an example where this is achieved with AI.
    Now, all of this is chroma keying (green/blue screen). Rotoscoping, which doesn't rely on a single color to key, depends fully on precision. VFX artists can also do that perfectly, but it is a much harder task. And doing it on every frame, at 24 frames per second, seamlessly and without any flickering or shifting edges is even harder.
    I would love to see an example where this is achieved.

    • @botsknowbest  1 year ago +2

      So for the tasks you mentioned, you could use SAM to identify all foreground objects and pass them down through your ML pipeline for background removal. Maybe SAM's zero-shot capabilities could be useful in identifying some unique objects that would otherwise be problematic for other models.
      The current state of AI probably still needs to catch up to VFX artists, but lots of exciting research has been done in this area, e.g., in background matting. That technique uses deep learning to estimate a per-pixel alpha matte, i.e., a soft mask that says how much of each pixel belongs to the foreground rather than the background. The state-of-the-art models perform quite well, and the approach is well suited to real-time applications like video conferencing (Zoom).
      Right now, I can think of these two background matting papers:
      Sengupta, Soumyadip, et al. "Background matting: The world is your green screen." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
      Lin, Shanchuan, et al. "Robust high-resolution video matting with temporal guidance." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022.
      As for some non-academic examples, I think RunwayML does Automatic Rotoscoping with AI (runwayml.com/green-screen/).
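
      Whatever model produces the matte, the last step is standard alpha compositing: each output pixel is alpha * foreground + (1 - alpha) * background. A minimal per-pixel sketch of that formula follows; the function name and the RGB-tuple representation are assumptions for illustration and are not taken from either paper.

```python
def composite_pixel(foreground, background, alpha):
    """Blend one RGB pixel: alpha * foreground + (1 - alpha) * background.

    alpha is in [0, 1]; fractional values are what let fine details such as
    hair strands blend smoothly instead of ending in a hard edge.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return tuple(alpha * f + (1.0 - alpha) * b
                 for f, b in zip(foreground, background))


# A pixel that is one quarter foreground (red) over a blue background.
print(composite_pixel((255, 0, 0), (0, 0, 255), 0.25))
```

      A hard segmentation mask corresponds to alpha being exactly 0 or 1 at every pixel, which is one reason segmentation output alone tends to look cut out along edges.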