Explaining the Segment Anything Model - Network architecture, Dataset, Training

  • Published on Dec 3, 2024

Comments • 59

  • @avb_fj
    @avb_fj  a year ago

    Here's me from the future posting a detailed analysis of Neural Attention:
    th-cam.com/video/frosrL1CEhw/w-d-xo.html

  • @DatuxGames
    @DatuxGames a year ago +4

    Your videos just keep getting better and better! Editing is on point with this one. Also great topic and really valuable to have you break things down like this.

    • @avb_fj
      @avb_fj  a year ago

      Thank you so much! I’m learning things as I go, so I really appreciate feedback like this!

    • @rmayer4086
      @rmayer4086 a year ago

      @avb_fj I agree with him. Your pacing is excellent and you're giving a perfect level of detail.

  • @man9mj
    @man9mj a year ago +5

    I am flabbergasted by the quality of this content. Thank you for the effort. I just subscribed to your channel. Keep up the good work, brother! We look forward to more :)

  • @SlashDL
    @SlashDL 3 months ago +1

    Some more information at 10:25: in the token-to-image attention, the queries come from the prompt + output tokens, and the keys and values come from the image. In the image-to-token attention, the queries come from the image embedding, and the keys and values come from the prompt + output tokens.
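
The two attention directions described in this comment can be sketched in a few lines of NumPy. This is a simplified illustration, not SAM's decoder: it uses a single head with no learned projections, and the token count (7), grid size (8x8), and embedding width (16) are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head attention without learned projections: each query row
    becomes a weighted average of the key/value rows."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values

rng = np.random.default_rng(0)
tokens = rng.normal(size=(7, 16))   # prompt + output tokens (count is arbitrary)
image = rng.normal(size=(64, 16))   # 8x8 image embedding flattened to 64 positions

# Token-to-image: tokens are the queries, the image supplies keys and values.
tokens_updated = tokens + cross_attention(tokens, image)
# Image-to-token: the image is the query, the tokens supply keys and values.
image_updated = image + cross_attention(image, tokens_updated)
```

Each direction keeps the shape of its query side, which is why both the tokens and the image embedding can be updated in place by the same block.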

  • @anacaznok872
    @anacaznok872 a year ago +1

    The best video on the subject. Thank you! I'll keep watching your videos

    • @avb_fj
      @avb_fj  a year ago

      Awesome! Welcome to the channel and I’m glad you liked the video!

  • @jorgeabraham3414
    @jorgeabraham3414 a year ago +1

    This video will have tens of thousands of views in the coming days.

  • @gingerderidder8665
    @gingerderidder8665 a year ago +1

    So happy I got recommended this video. Great quality content!

    • @avb_fj
      @avb_fj  a year ago

      Nice! Glad you enjoyed it!

  • @aprilaustin5569
    @aprilaustin5569 2 months ago

    Very good and clear explanation!

  • @SlashDL
    @SlashDL 3 months ago +1

    At 10:03, 4 new tokens are added to the sparse embeddings: 1 representing the IoU score and the other 3 representing the masks. Just a minor correction.
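
As a rough picture of what "adding tokens to the sparse embeddings" means, the sketch below concatenates one IoU token and three mask tokens in front of two prompt embeddings. The token counts follow the comment above; the embedding width of 256, the two point prompts, and the random values (standing in for learned embeddings) are assumptions for illustration only.

```python
import numpy as np

embed_dim = 256                      # assumed embedding width
rng = np.random.default_rng(0)

iou_token = rng.normal(size=(1, embed_dim))       # stands in for a learned IoU token
mask_tokens = rng.normal(size=(3, embed_dim))     # stands in for three learned mask tokens
sparse_prompts = rng.normal(size=(2, embed_dim))  # e.g. two encoded point prompts

# The decoder sees the output tokens and the prompt tokens as one sequence.
decoder_tokens = np.concatenate([iou_token, mask_tokens, sparse_prompts], axis=0)
```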

  • @Sciencehub-oq5go
    @Sciencehub-oq5go a year ago +1

    I really like this explanation. Thanks a lot!

  • @hinchengchen3153
    @hinchengchen3153 6 months ago +1

    easy and short but splendid!!

  • @keneth4
    @keneth4 a year ago +2

    Awesome explanation 👏🏼

  • @mohamedkarim-p7j
    @mohamedkarim-p7j 6 months ago +1

    Thanks for sharing 👍

  • @victorbjorklund
    @victorbjorklund a year ago +1

    Good quality video. You got a subscriber.

  • @billy.n2813
    @billy.n2813 a year ago +1

    Thank you for this!

  • @ItalianPizza64
    @ItalianPizza64 a year ago +1

    Excellent video, thank you very much! After watching this, there's no doubt in my mind that transformer-based architectures will take over AI for computer vision

    • @avb_fj
      @avb_fj  a year ago +1

      Very true. Vision Transformers are definitely here to stay. The generalization power of transformers/attention is so surprising sometimes… decades of computer vision research suggested that CNNs are best for images because they can encode spatial information about the image… it’s just counterintuitive and mind-boggling that ViTs can still learn from images even though flattening individual patches loses spatial structure.
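
To make "flattening individual patches" concrete, here is a minimal NumPy sketch that cuts an image into non-overlapping patches and flattens each one into a vector. The function name `patchify` and the tiny 4x4 image are made up for illustration. Note that ViTs typically add positional embeddings to these vectors, so position information is not discarded entirely.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch squares
    and flatten each square into one row vector."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)        # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)  # one row per patch

image = np.arange(16).reshape(4, 4, 1)  # pixel value = 4 * row + column
tokens = patchify(image, 2)             # 4 patches, each flattened to 4 values
```

The top-left patch becomes the row `[0, 1, 4, 5]`: pixels that were vertical neighbours in the image end up two positions apart in the flattened vector.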

  • @willikappler1401
    @willikappler1401 a year ago

    Wonderful, I really like the way you present complex topics!

  • @davidyu2372
    @davidyu2372 5 months ago

    great video!

  • @wkgates
    @wkgates a year ago

    Great explanation!

  • @SofieSimp
    @SofieSimp a year ago +2

    Nice video explaining the interactive training. I have one question: during interactive training, is the loss calculated at each step or only at the end?
    To be more clear:
    Step 1: I sample a point at the middle of the ground truth mask
    Step 2: Feed the point as a prompt into the model
    Step 3: Get the best mask from the model
    Step 4: From the best mask, calculate the error region and sample another positive OR negative point in the error region
    Step 5: Loop from step 2 until the maximum number of iterations is reached
    Do I have to calculate the loss between step 3 and step 4, update the model, and then move on to step 4, or do I calculate the loss at the end, after step 5?

    • @avb_fj
      @avb_fj  a year ago +1

      That's a great question. We should definitely calculate the loss between step 3 and step 4. Every iteration is regarded as an isolated training example, so for each training example we input an image and a prompt (with a dense mask), the model outputs predicted mask(s), and we then apply the loss between our prediction and the ground truth.
      As far as "updating the model" is concerned, it is largely a design choice I think. It's not incorrect to update the weights between each iteration, but it's probably better to do gradient accumulation (basically aggregating the losses over multiple iterations) before updating the weights to get a more stable training curve.
      Hope that helps!
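
The loop discussed in this thread can be sketched as follows. It is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in that merely corrects the clicked pixels, binary cross-entropy replaces SAM's actual mask loss, and the final average stands in for gradient accumulation (a real setup would backpropagate the accumulated loss before one optimizer step).

```python
import numpy as np

def sample_correction_point(gt, pred, rng):
    """Pick a random pixel where pred disagrees with gt.
    Returns (y, x, label): label 1 is a positive click on missed foreground,
    label 0 is a negative click on a false positive. None if nothing is wrong."""
    ys, xs = np.nonzero(gt != pred)
    if len(ys) == 0:
        return None
    i = rng.integers(len(ys))
    y, x = int(ys[i]), int(xs[i])
    return y, x, int(gt[y, x])

def bce(gt, prob, eps=1e-7):
    """Mean binary cross-entropy between a 0/1 mask and predicted probabilities."""
    prob = np.clip(prob, eps, 1 - eps)
    return float(-np.mean(gt * np.log(prob) + (1 - gt) * np.log(1 - prob)))

def toy_model(prob, clicks):
    """Hypothetical stand-in for the network: it only corrects the clicked pixels."""
    out = prob.copy()
    for y, x, label in clicks:
        out[y, x] = 0.9 if label else 0.1
    return out

rng = np.random.default_rng(0)
gt = np.zeros((6, 6), dtype=int)
gt[2:5, 2:5] = 1
prob = np.full(gt.shape, 0.4)   # starts out predicting background everywhere
clicks = [(3, 3, 1)]            # step 1: first click in the middle of the GT mask
losses = []
for _ in range(4):              # step 5: fixed number of iterations
    prob = toy_model(prob, clicks)            # steps 2-3: prompt in, mask out
    losses.append(bce(gt, prob))              # loss computed at every iteration
    point = sample_correction_point(gt, (prob > 0.5).astype(int), rng)  # step 4
    if point is None:
        break
    clicks.append(point)

accumulated = sum(losses) / len(losses)  # aggregated loss for a single weight update
```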

    • @SofieSimp
      @SofieSimp a year ago +1

      @avb_fj Thanks a lot! Really great explanation!

    • @avb_fj
      @avb_fj  a year ago

      @SofieSimp 🙌🙌

  • @VictorVelazquezEspitia
    @VictorVelazquezEspitia 10 months ago

    Hey man, congrats on the great video. Right now I am doing my thesis on SAM and this was a big help. May I ask which camera you used?

    • @avb_fj
      @avb_fj  9 months ago

      Good old iPhone. Good luck on your thesis man!

  • @meehai_
    @meehai_ 2 days ago

    Out of curiosity: what tool are you using to design the diagrams?

    • @avb_fj
      @avb_fj  2 days ago

      PowerPoint! In my later videos I’ve also used other tools… Cavalry, Manim with Python, and some editing tricks in DaVinci Resolve that I use in other videos on my channel.

  • @miyutube1
    @miyutube1 9 months ago

    Very good. I am using SAM and want to understand it better so I can tune the parameters, so here I am, struggling to understand your video (one of the few that actually try to explain the concepts...). What is p_t in the focal loss definition?

    • @avb_fj
      @avb_fj  9 months ago

      Could you add a timestamp?

    • @miyutube1
      @miyutube1 9 months ago

      2:54

    • @avb_fj
      @avb_fj  9 months ago

      @miyutube1 I see. “p” here is simply the probability output by the model for the classification task. You can find more info on page 3 of this paper:
      arxiv.org/pdf/1708.02002.pdf
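
For reference, the linked focal loss paper defines p_t as p when the label is 1 and 1 - p otherwise, i.e. the probability the model assigns to the true class. A minimal per-prediction sketch, using the paper's commonly cited defaults of gamma = 2 and alpha = 0.25 (the function name is made up for illustration):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.
    p: predicted probability of the positive class, y: label (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p               # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

The factor (1 - p_t) ** gamma shrinks the loss of examples the model already classifies confidently, so hard examples dominate training. With gamma = 0 the expression reduces to alpha-weighted cross-entropy.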

  • @EkShunya
    @EkShunya a year ago +1

    I like your energy.
    Can you help the community by sharing the resources you refer to and the channels/people you follow?

    • @avb_fj
      @avb_fj  a year ago

      Thanks for the comment! That’s great feedback, I’ll try to share more in the upcoming videos!

  • @nitinsurya1991
    @nitinsurya1991 a year ago

    - What could be the intuition for having an MLP for the IoU scores with an MSE loss on top?
    - From their repository, I don't see any interface for text-prompt usage. Are any examples available?

    • @avb_fj
      @avb_fj  a year ago

      - Predicting the IoU scores helps during inference to determine which of the three predicted masks is most likely to be the "correct" mask to show the user. The three IoU predictions are simply treated as “confidence scores” for the three predicted masks. This allows them to rank the mask outputs by how high the IoU prediction is. They are basically asking the network to output how confident it is in each of its three predictions.
      For example, if one of the masks has a predicted IoU of 0.99, it suggests that the network thinks it strongly overlaps the queried object.
      Again, note that this is all an inference-time thing. During training, we have the ground truth mask and use an MSE loss to train the network to output the correct IoU scores. During inference, we just have the three predicted masks and the network's own confidence score, expressed as an IoU, for each of them. Hope that helps.
      - Regarding the text-prompt usage, they did not release it in their web app. They just documented it in their paper. I don't know if there are plans to release it in the future, or if there are other ways to access it.
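
The IoU part of this answer can be illustrated with a small NumPy sketch. The masks and the `predicted_iou` values below are made up; in the real model the predicted scores would come from the IoU head.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter / union) if union else 0.0

gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True                        # 2x2 ground truth object

masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, 1:3, 1:3] = True                  # exact match        -> IoU 1.0
masks[1, 1:3, 1:2] = True                  # half of the object -> IoU 0.5
masks[2, :, :] = True                      # whole image        -> IoU 0.25
predicted_iou = np.array([0.9, 0.6, 0.2])  # made-up outputs of the IoU head

# Training: the ground truth is available, so the true IoU is the regression target.
true_iou = np.array([iou(m, gt) for m in masks])
mse = float(np.mean((predicted_iou - true_iou) ** 2))

# Inference: no ground truth, so the masks are ranked by predicted IoU alone.
best = int(np.argmax(predicted_iou))
```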

    • @nitinsurya1991
      @nitinsurya1991 a year ago

      @avb_fj Thanks. Have you tried whether CLIP text embeddings would do the trick? Essentially, they mentioned that they did train the model with that input.

  • @timanb2491
    @timanb2491 11 months ago

    May I ask you a question? One type of prompt is a segmentation mask. If we already have a segmentation mask as a prompt, why should we use SAM? We already have a binary segmentation mask.

    • @avb_fj
      @avb_fj  11 months ago

      Check out the part about interactive training at around 5:00.
      Basically, the dense mask prompt is used during the training phase to iteratively improve the network’s segmentations. Kinda like asking the model, “Hey, you gave me this mask last time, but here’s another internal/external point prompt, give me an updated mask”.
      During inference, we can pass an image full of zeros as the dense mask into the model (meaning we have no idea where the segmentation should be) and ask the model to update it. Once it gives an initial estimate, we might recursively pass the network’s output logits (not binary, but the probability distribution) back into it as a dense prompt to iteratively improve the predicted mask.
      In other words, don’t assume that the mask we pass in as a prompt needs to be the “correct one”… it will be incorrect, a rough estimate of the correct mask, and the network’s job is to iteratively improve it until it converges somewhere.
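
The inference-time refinement described above can be sketched as a simple feedback loop. Everything here is illustrative: `predict_fn` is a hypothetical callable standing in for the network, `toy_predict` is a dummy that just nudges logits towards the clicked pixel, and shapes are simplified (the dense prompt is given the same resolution as the image).

```python
import numpy as np

def refine(predict_fn, image, points, labels, steps=3):
    """Start from an all-zero dense prompt and feed the previous
    logits back into the model on every step."""
    h, w = image.shape[:2]
    logits = np.zeros((h, w))   # "no idea where the segmentation should be"
    for _ in range(steps):
        logits = predict_fn(image, points, labels, logits)
    return logits > 0           # threshold only at the very end

def toy_predict(image, points, labels, prev_logits):
    """Dummy network: moves logits halfway towards +1 at positive clicks
    and halfway towards -1 everywhere else."""
    target = -np.ones_like(prev_logits)
    for (y, x), label in zip(points, labels):
        if label == 1:
            target[y, x] = 1.0
    return 0.5 * prev_logits + 0.5 * target

image = np.zeros((5, 5, 3))
mask = refine(toy_predict, image, points=[(2, 2)], labels=[1])
```

The loop passes raw logits rather than a thresholded mask, matching the point in the reply that the dense prompt carries the network's soft estimate, not a binary answer.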

  • @prafulmathur4567
    @prafulmathur4567 a year ago

    Hi, the explanation is awesome. But it's still not clear how SAM handles nested annotations. The GT annotations have no hierarchy defined; each part is annotated independently. So how does SAM learn whole, part, and subpart for each object?

    • @avb_fj
      @avb_fj  a year ago

      That’s a great question… you are right, the network does not explicitly output the part/subpart/whole labels! The annotators are just asked to annotate in that way. I believe what ends up happening is that the network automatically learns to map each output head to one of the three labeled GT masks. Because each output head also has its own unique output token embedding, the heads technically learn to associate with different output masks as well. They leave it up to backpropagation and gradient descent to handle the rest…
      Note that neither during inference nor during training do they provide labels for whole/part/subpart to the network… and during prediction/inference, the network doesn’t explicitly output those labels either. It just returns 3 distinct masks and their confidence scores…
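
One related detail, as far as the SAM paper describes its handling of ambiguous prompts: during training, only the lowest loss among the predicted masks is backpropagated, which lets each output specialise without any whole/part/subpart labels. A minimal sketch of that selection step (the function name and the loss values in the usage line are made up):

```python
def min_loss_over_masks(per_mask_losses):
    """Return (index, loss) of the predicted mask that best matches the
    ground truth; only this loss would be backpropagated."""
    best = min(range(len(per_mask_losses)), key=per_mask_losses.__getitem__)
    return best, per_mask_losses[best]

best_index, best_loss = min_loss_over_masks([0.8, 0.3, 0.5])
```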

  • @barbaraz5363
    @barbaraz5363 a year ago

    Hi, thanks for your video! It was explained so well! I have just one question about the predicted IoU score: how is it calculated during inference? In the paper they just said that it's calculated between the predicted mask and the object it covers. I wonder how they get the area of the object it covers (because basically we don't have the GT).

    • @avb_fj
      @avb_fj  a year ago +1

      Yeah, it’s kinda tricky to understand. During training, they have the GT, so they can calculate the IoU with the 3 predicted masks and train the model to predict IoU scores.
      During inference, the three IoU predictions are simply treated as “confidence scores” for the three predicted masks. This allows them to rank the mask outputs by how good the IoU prediction is.
      For example, if one of the masks has a predicted IoU of 0.99, it suggests that the network thinks it strongly overlaps the queried object. That said, this is still the network’s own confidence in its own prediction, and there is no way of knowing the correct IoU score because we won’t have the GT during inference. It’s an additional helpful output that tells us the network’s confidence in each of the masks.
      Hope that helps!

    • @barbaraz5363
      @barbaraz5363 a year ago

      @avb_fj Thanks for your quick reply! If I understand correctly, the so-called "confidence scores" during inference are in fact calculated by an MLP head with input (3, embedding_dim) and some hidden layers of 256 neurons, which in the end outputs a tensor of shape (3, 1) representing the probability (0, 1) of each mask after passing through a sigmoid activation?

    • @avb_fj
      @avb_fj  a year ago

      @barbaraz5363 Sorry for the late response, I must've missed the notification. Fwiw, what you said makes perfect sense to me.

  • @Ye1324
    @Ye1324 a year ago

    In what format is the mask data saved? Is it in tensors or NumPy arrays?

    • @avb_fj
      @avb_fj  a year ago +1

      It is mostly a design/implementation choice. Generally people would save the mask data in an image format (like png) or in any uint8 format (including numpy arrays)… during training though we would need to load/convert them to tensors for easy gradient calculations…
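
As one possible illustration of that choice, the sketch below stores a binary mask as a uint8 NumPy array and converts it to float32 when it is loaded for training. The in-memory buffer stands in for a file on disk, and the 4x4 mask is made up.

```python
import io
import numpy as np

mask = np.zeros((4, 4), dtype=np.uint8)  # 0 = background, 1 = object
mask[1:3, 1:3] = 1

buffer = io.BytesIO()                    # stands in for a .npy file on disk
np.save(buffer, mask)
buffer.seek(0)
loaded = np.load(buffer)

# Losses need floating-point targets, so convert at load time.
target = loaded.astype(np.float32)
# With PyTorch installed, torch.from_numpy(target) would wrap this array as a tensor.
```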

  • @scifaipy9301
    @scifaipy9301 a month ago

    Just a quick suggestion, don't use background music. I mostly avoid videos with background music, it distracts from the informative explanations. Besides that, thanks for making videos that focus on AI research papers. Your English is very clear.

  • @Grenoble7
    @Grenoble7 a year ago +2

    Hello. Great, dense video. One suggestion: you are a bit too fast for me, and I have to pause on every slide to read it. Usually I watch videos at 2x speed, but yours is the only one on YouTube where I've had to do the opposite! Maybe you could describe each slide in more detail to give us time to understand it? Just an idea.

  • @Alice-yq6yy
    @Alice-yq6yy 4 months ago

    How does SAM guess the IoU for new images when there is no ground truth available?

    • @avb_fj
      @avb_fj  4 months ago

      During training, the ground truth masks are available, so the true IoU scores can be computed and we can train the SAM network to predict them using supervised training. During inference, the network predicts the segmentation masks and also estimates of their IoU scores.

  • @egonvanpraet
    @egonvanpraet a year ago +1

    Your content is very underrated in the algorithm. Keep making videos, they are great :) Would be great if you could explain MusicLM from Google.

    • @avb_fj
      @avb_fj  a year ago

      Thanks for the suggestion. I’ll add it to my bucket list for next month!