Explaining the Segment Anything Model - Network architecture, Dataset, Training

  • Published on 28 Jun 2024
  • In this video, I dive deep into the technical details and architecture behind the Segment Anything Model, also known as SAM. SAM is the world's first foundation model for image segmentation and is an amazing tool that can segment any image provided to it at multiple nested levels of granularity at interactive latency.
    #deeplearning #computervision #machinelearning
    To support the channel and access the Word documents/slides used in this video, consider JOINING the channel on YouTube or Patreon. Members get access to scripts, slides, animations, and illustrations for most of the videos on my channel!
    Join and support the channel - www.youtube.com/@avb_fj/join
    Patreon - / neuralbreakdownwithavb
    Project page: segment-anything.com/
    Give the paper a read: arxiv.org/pdf/2304.02643.pdf
    0:00 - Intro
    1:29 - Architecture
    4:50 - Interactive Training
    6:30 - Dataset
    7:27 - Model Architecture
    12:30 - Outro
    Other papers cited:
    Focal Loss for Dense Object Detection: arxiv.org/pdf/1708.02002.pdf
    CLIP: arxiv.org/pdf/2103.00020.pdf
    Masked Autoencoders Are Scalable Vision Learners: arxiv.org/pdf/2111.06377.pdf
    Songs:
    Sunny Days - Anno Domini Beats
    Wellington Coffee Shop - Dyalla
    No 3 Morning Folk Song - Esther Abrami

Comments • 51

  • @avb_fj
    @avb_fj  7 months ago

    Here's me from the future posting a detailed analysis of Neural Attention:
    th-cam.com/video/frosrL1CEhw/w-d-xo.html

  • @willikappler1401
    @willikappler1401 1 year ago

    Wonderful, I really like the way you present complex topics!

  • @man9mj
    @man9mj 7 months ago +3

    I am flabbergasted by the quality of this content. Thank you for the effort. I just subscribed to your channel. Keep up the good work brother! We look forward to more :)

  • @DatuxGames
    @DatuxGames 1 year ago +4

    Your videos just keep getting better and better! Editing is on point with this one. Also great topic and really valuable to have you break things down like this.

    • @avb_fj
      @avb_fj  1 year ago

      Thank you so much! I’m learning things as I go, so I really appreciate feedback like this!

    • @rmayer4086
      @rmayer4086 1 year ago

      ​@@avb_fj I agree with him. Your pacing is excellent and you're giving a perfect level of detail.

  • @gingerderidder8665
    @gingerderidder8665 1 year ago +1

    So happy I got recommended this video. Great quality content!

    • @avb_fj
      @avb_fj  1 year ago

      Nice! Glad you enjoyed it!

  • @user-xv8dn4nm5k
    @user-xv8dn4nm5k 1 month ago +1

    Thanks for sharing 👍

  • @keneth4
    @keneth4 1 year ago +2

    Awesome explanation 👏🏼

  • @anacaznok872
    @anacaznok872 9 months ago +1

    The best video on the subject. Thank you! I'll keep watching your videos

    • @avb_fj
      @avb_fj  9 months ago

      Awesome! Welcome to the channel and I’m glad you liked the video!

  • @Sciencehub-oq5go
    @Sciencehub-oq5go 9 months ago +1

    I really like this explanation. Thanks a lot!

  • @billy.n2813
    @billy.n2813 1 year ago +1

    Thank you for this!

  • @wkgates
    @wkgates 1 year ago

    Great explanation!

  • @davidyu2372
    @davidyu2372 15 days ago

    great video!

  • @victorbjorklund
    @victorbjorklund 9 months ago +1

    Good quality video. You got a subscriber.

  • @hinchengchen3153
    @hinchengchen3153 1 month ago +1

    easy and short but splendid!!

  • @jorgeabraham3414
    @jorgeabraham3414 1 year ago +1

    this video will have tens of thousands of views in the upcoming days

  • @ItalianPizza64
    @ItalianPizza64 1 year ago +1

    Excellent video, thank you very much! After watching this, there's no doubt in my mind that transformer-based architectures will take over AI for computer vision

    • @avb_fj
      @avb_fj  1 year ago +1

      Very true. Vision Transformers are definitely here to stay. The generalization power of transformers/attention is so surprising sometimes… decades of computer vision research suggested that CNNs are best for images because they can encode spatial information about the image… it’s just counterintuitive and mind boggling that ViTs can still learn from images by flattening individual patches, losing the explicit spatial structure in the process.

  • @turboxxx8
    @turboxxx8 9 months ago +1

    Amazing video! Could you please explain what exactly the "output tokens" are and how they are obtained?

    • @avb_fj
      @avb_fj  9 months ago +1

      Someone else had the same question, copy-pasting that reply here…
      So the output token is kind of a common trick that people use in Transformer based models to "aggregate information" about an input sequence. If you are familiar with the Next Sentence Prediction task in BERT models, they also use a similar concept with the [CLS] token.
      Basically, the concept goes as follows:
      step 1> the output token is a dummy token you append onto the input sequence (say at the very end of the input seq)
      step 2> pass it through the transformer/attention layers
      step 3> the attention layers generate a sequence of contextual embeddings, one for each token in the input sequence
      step 4> you extract the embedding at the index corresponding to the dummy output token (i.e. the last embedding, coz that's where you put the output token in the input sequence in step 1)
      step 5> this embedding now encapsulates or aggregates the entire context of the input sequence and can be used for downstream tasks like classification, etc.
      Hope that helped. The literature may be a bit thin on output token embeddings in the segmentation space, but I'd strongly recommend reading about the [CLS] token in the BERT paper for Next Sentence Prediction to get a better understanding.
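
      For anyone who wants to see the trick in code, here is a minimal PyTorch sketch of the output-token idea described above (illustrative only, not SAM's actual implementation; all names are made up):

        import torch
        import torch.nn as nn

        class TokenAggregator(nn.Module):
            def __init__(self, dim=256, num_layers=2, num_heads=8):
                super().__init__()
                # step 1: a learnable dummy "output token" appended to every input sequence
                self.output_token = nn.Parameter(torch.zeros(1, 1, dim))
                layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

            def forward(self, tokens):                   # tokens: (batch, seq_len, dim)
                out_tok = self.output_token.expand(tokens.shape[0], -1, -1)
                x = torch.cat([tokens, out_tok], dim=1)  # append at the very end
                x = self.encoder(x)                      # steps 2-3: contextual embeddings
                return x[:, -1]                          # steps 4-5: grab the output-token embedding

        # usage: the returned (batch, dim) vector aggregates the whole input sequence
        emb = TokenAggregator()(torch.randn(4, 16, 256))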

  • @VictorVelazquezEspitia
    @VictorVelazquezEspitia 5 months ago

    Hey man, congrats on the great video. I'm doing my thesis on SAM right now, so this was a big help. May I ask which camera you used?

    • @avb_fj
      @avb_fj  3 months ago

      Good old iPhone. Good luck on your thesis man!

  • @SofieSimp
    @SofieSimp 10 months ago +1

    Nice video explaining the interactive training. I have one question: during the interactive training, is the loss calculated at each step or only at the end?
    To be more clear:
    Step 1: I sample a point at the middle of the ground truth mask
    Step 2: Feed the point as a prompt into the model
    Step 3: Get the best mask from the model
    Step 4: From the best mask, calculate error region and sample another positive OR negative point in the error region
    Step 5: Loop from step 2 until reached the maximum iteration
    Do I have to calculate the loss between step 3 and step 4 and then update the model before moving on to step 4, or do I calculate the loss at the end after step 5?

    • @avb_fj
      @avb_fj  10 months ago

      That's a great question. We should definitely calculate the loss between step 3 and step 4. Every iteration is regarded as an isolated training example, so basically for each training example we input an image and a prompt (with a dense mask), the model outputs predicted mask(s)... and then we apply the loss over our prediction and the ground truth.
      As far as "updating the model" is concerned, it is largely a design choice I think. It's not incorrect to update the weights between each iteration, but it's probably better to do gradient accumulation (basically aggregating the losses over multiple iterations) before updating the weights to get a more stable training curve.
      Hope that helps!
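
      To make the ordering concrete, here is a minimal sketch of that loop with the loss applied between steps 3 and 4 and accumulated across iterations (illustrative only; `model`, `seg_loss`, and the prompt format are hypothetical placeholders, not SAM's actual code):

        import torch

        def sample_point(mask):
            # pick a random pixel inside the (binary) region; assumes the region is non-empty
            ys, xs = torch.nonzero(mask, as_tuple=True)
            i = torch.randint(len(ys), (1,)).item()
            return (ys[i].item(), xs[i].item())

        def interactive_training_example(model, image, gt_mask, max_iters=8):
            points = [(sample_point(gt_mask > 0.5), 1)]   # step 1: positive click inside the GT mask
            dense_prompt = torch.zeros_like(gt_mask)      # no mask estimate yet
            total_loss = 0.0
            for _ in range(max_iters):                    # step 5: loop
                pred_masks, iou_scores = model(image, points, dense_prompt)   # step 2
                best = pred_masks[iou_scores.argmax()]                        # step 3
                total_loss = total_loss + seg_loss(best, gt_mask)             # loss between steps 3 and 4
                err = (best > 0.5) != (gt_mask > 0.5)                         # step 4: error region
                pt = sample_point(err)
                label = 1 if gt_mask[pt] > 0.5 else 0     # positive click on a miss, negative on a false positive
                points.append((pt, label))
                dense_prompt = best.detach()              # feed the prediction back as the dense prompt
            total_loss.backward()                         # gradient accumulation over the iterations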

    • @SofieSimp
      @SofieSimp 10 months ago +1

      @@avb_fj Thanks a lot! Really great explanation!

    • @avb_fj
      @avb_fj  10 months ago

      🙌🙌@@SofieSimp

  • @barbaraz5363
    @barbaraz5363 11 months ago

    Hi, thanks for your video! It explained things so well! I have just one question about the predicted IoU score: how is it calculated during inference? In the paper they just said that it's calculated between the predicted mask and the object it covers. I wonder how they get the area of the object it covers (because basically we don't have the GT).

    • @avb_fj
      @avb_fj  11 months ago +1

      Yeah it’s kinda tricky to understand it. During training, they have the GT, so they can calculate the IOU with the 3 predicted masks and train the model to predict IOU scores.
      During inference, the three IOU predictions are simply considered as “confidence scores” for each of the three predicted masks. This allows them to rank each of the mask outputs according to how good the iou prediction is.
      For example, if one of the masks has a predicted IOU of 0.99 it suggests that the network thinks it’s strongly overlapping the queried object. That said, this is still a network’s own confidence on its own prediction, and there is no way of knowing the correct iou score coz we won’t have the GT during inference. It’s all an additional helpful output that tells us the network’s confidence on each of the masks.
      Hope that helps!

    • @barbaraz5363
      @barbaraz5363 11 months ago

      @@avb_fj Thanks for your quick reply! If I understand correctly, the so-called "confidence scores" during inference are in fact computed by an MLP head with input (3, embedding_dim) and some hidden layers of 256 neurons, which in the end outputs a (3, 1) tensor representing a probability in (0, 1) for each mask after passing through a sigmoid activation?

    • @avb_fj
      @avb_fj  7 months ago

      Sorry for the late response, I must've missed the notification. Fwiw, what you said makes perfect sense to me. @@barbaraz5363
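
      To put shapes on that exchange, here is a minimal sketch of such an IoU-prediction head (illustrative only; the layer sizes follow the comment above, not SAM's exact implementation):

        import torch
        import torch.nn as nn

        iou_head = nn.Sequential(
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),       # one confidence score in (0, 1) per mask
        )

        mask_embeddings = torch.randn(3, 256)       # (3, embedding_dim): one embedding per predicted mask
        confidence = iou_head(mask_embeddings)      # (3, 1): predicted IoU / confidence scores
        best_mask_idx = confidence.argmax()         # used at inference to rank the three masks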

  • @EkShunya
    @EkShunya 1 year ago +1

    I like your energy.
    Can you help the community with the resources you refer to and the channels/people you follow?

    • @avb_fj
      @avb_fj  1 year ago

      Thanks for the comment! That’s great feedback, I’ll try to share more in the upcoming videos!

  • @miyutube1
    @miyutube1 3 months ago

    Very good. I am using SAM and want to understand it better in order to tune the parameters, so here I am trying to understand your video (one of the few that actually try to explain the concepts...). What is p_t in the focal loss definition?

    • @avb_fj
      @avb_fj  3 months ago

      Could you add a timestamp?

    • @miyutube1
      @miyutube1 3 months ago

      2:54

    • @avb_fj
      @avb_fj  3 months ago

      @@miyutube1 I see. “p” here is simply the probability the model outputs for the classification task; p_t is that probability taken for the ground-truth class (p when the label is positive, 1 - p otherwise). You can find more info on page 3 of this paper
      arxiv.org/pdf/1708.02002.pdf
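
      For reference, a minimal sketch of the focal loss from that paper, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is p for a positive label and 1 - p otherwise (illustrative code, not SAM's implementation):

        import torch

        def focal_loss(p, target, alpha=0.25, gamma=2.0):
            # p: predicted probability of the positive class, target: 0/1 labels (same shape)
            p_t = torch.where(target == 1, p, 1 - p)
            alpha_t = torch.where(target == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
            return -(alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()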

  • @Ye1324
    @Ye1324 1 year ago

    In what format is the mask data saved? Is it in tensors or a numpy array?

    • @avb_fj
      @avb_fj  1 year ago +1

      It is mostly a design/implementation choice. Generally people would save the mask data in an image format (like png) or in any uint8 format (including numpy arrays)… during training though we would need to load/convert them to tensors for easy gradient calculations…
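
      For example, a minimal sketch of that workflow (the file path and shapes are illustrative):

        import numpy as np
        import torch
        from PIL import Image

        mask = (np.random.rand(512, 512) > 0.5).astype(np.uint8) * 255   # stand-in binary mask
        Image.fromarray(mask).save("mask.png")                            # store as a uint8 PNG on disk

        loaded = np.array(Image.open("mask.png"))                         # load back as uint8
        mask_tensor = torch.from_numpy(loaded > 0).float()                # 0/1 float tensor for training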

  • @prafulmathur4567
    @prafulmathur4567 9 months ago

    Hi, the explanation is awesome. But it's still not clear how SAM handles nested annotations. The GT annotations have no hierarchy defined; each part is independently annotated. So how does SAM learn whole, part, and subpart for each object?

    • @avb_fj
      @avb_fj  9 months ago

      That’s a great question… you are right, the network does not explicitly output the part/subpart/whole labels! The annotators are just asked to annotate in that way. I believe what ends up happening is that the network automatically learns to map each output head to one of the three labeled GT masks. Because each output head also has its own unique output token embedding, they technically learn to associate with different output masks as well. They leave it up to backpropagation and gradient descent to handle the rest…
      Note that neither during training nor inference do they provide whole/part/subpart labels to the network… and during prediction/inference too, the network doesn’t explicitly output those labels. It just returns 3 distinct masks and their confidence scores…
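
      For reference, the SAM paper resolves this ambiguity by backpropagating only from the predicted mask that best matches the single GT annotation, so each output head is free to specialize (whole / part / subpart). A minimal sketch of that idea (`seg_loss` is a hypothetical placeholder for the segmentation loss):

        import torch

        def multimask_loss(pred_masks, gt_mask):       # pred_masks: (3, H, W), gt_mask: (H, W)
            losses = torch.stack([seg_loss(m, gt_mask) for m in pred_masks])
            return losses.min()                        # gradients flow only through the best-matching head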

  • @nitinsurya1991
    @nitinsurya1991 7 months ago

    - What could be the intuition for having an MLP for the IoU scores with an MSE loss on top?
    - From their repository, I don't see any interface for text prompt usage. Any examples available?

    • @avb_fj
      @avb_fj  7 months ago

      - Predicting the IoU scores helps during inference to determine which of the three predicted masks is most likely to be the "correct" mask to show the user. The three IOU predictions are simply considered “confidence scores” for each of the three predicted masks. This allows them to rank each of the mask outputs according to how high the IOU prediction is. They are basically asking the network to output how confident it is for each of its three predictions.
      For example, if one of the masks has a predicted IOU of 0.99 it suggests that the network thinks it’s strongly overlapping the queried object.
      Again, note that this is all an inference-time thing. During training, we have the Ground Truth mask and use MSE loss to train the network to output the correct IOU scores. During inference, we just have the three predicted masks and the network's own confidence score as IoU for each of the three masks. Hope that helps.
      - Reg the text-prompt usage, they did not release it in their web-app. They just documented it in their paper. Don't know if there are plans to release in the future, or if there are other ways to access it.
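
      To make the training signal concrete, here is a minimal sketch of regressing the predicted IoU scores to the actual IoU against the GT with an MSE loss (illustrative only, not SAM's actual code):

        import torch
        import torch.nn.functional as F

        def mask_iou(pred, gt):                         # pred, gt: binary (H, W) tensors
            inter = (pred & gt).sum().float()
            union = (pred | gt).sum().float()
            return inter / union.clamp(min=1)

        def iou_head_loss(pred_masks, pred_iou_scores, gt_mask):
            # pred_masks: (3, H, W) probabilities, pred_iou_scores: (3,), gt_mask: (H, W)
            true_iou = torch.stack([mask_iou(m > 0.5, gt_mask > 0.5) for m in pred_masks])
            return F.mse_loss(pred_iou_scores, true_iou)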

    • @nitinsurya1991
      @nitinsurya1991 7 months ago

      @@avb_fj Thanks. Have you tried whether the CLIP text embeddings would do the trick? They did mention that they trained the model with that input.

  • @timanb2491
    @timanb2491 6 months ago

    May I ask you a question: one type of prompt is a segmentation mask. If we have a segmentation mask as a prompt, why should we use SAM? We already have the binary segmentation mask.

    • @avb_fj
      @avb_fj  6 months ago

      Check out the part about interactive training at around 5:00
      Basically the dense mask prompt is used during the training phase to iteratively improve the network's segmentations. Kinda like asking the model, “Hey, you gave me this mask last time, but here’s another internal/external point prompt, give me an updated mask”.
      During inference, we can pass an image full of zeros as the dense mask into the model (meaning we have no idea where the segmentation should be) and ask the model to update it. Once it gives an initial estimation, we might recursively pass the network’s output logits (not binary, but the prob distribution) back into it as a dense prompt to iteratively improve the predicted mask.
      In other words, don’t assume that the mask we pass in as a prompt needs to be the “correct one”… it will be incorrect / a rough estimate of the correct mask, and the network’s job is to iteratively improve it till it converges somewhere.
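
      Put differently, inference can look roughly like this minimal sketch (the model signature and the 256x256 dense-prompt shape are assumptions for illustration, not SAM's documented interface):

        import torch

        def refine_mask(model, image, point_prompts, num_rounds=3):
            dense_prompt = torch.zeros(1, 256, 256)              # "no prior mask" to start with
            for _ in range(num_rounds):
                mask_logits, iou_scores = model(image, point_prompts, dense_prompt)
                dense_prompt = mask_logits[iou_scores.argmax()]  # recycle the logits (not binarized) as the next prompt
            return dense_prompt > 0                              # final binary mask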

  • @Grenoble7
    @Grenoble7 1 year ago +2

    Hello. Great, dense video. Suggestion: you are a bit too fast for me; I have to pause on every slide to read it. Usually I watch videos at 2x speed, but yours are the only ones I've seen on YouTube where I do the opposite! Maybe you could describe each slide in more detail to give us time to understand it? Just an idea.

  • @egonvanpraet
    @egonvanpraet 1 year ago +1

    Your content is very underrated in the algorithm. Keep making videos, they are great :) Would be great if you could explain MusicLM from Google.

    • @avb_fj
      @avb_fj  1 year ago

      Thanks for the suggestion. I’ll add it to my bucket list for next month!