How do Multimodal AI models work? Simple explanation

  • Published Sep 20, 2024

Comments • 27

  • @emc3000
    @emc3000 9 months ago +8

    Thank you for giving actual application examples of this stuff.

  • @CharlesMacKay88
    @CharlesMacKay88 8 months ago +4

    great video. thanks for condensing this into the most important facts and avoiding any clickbait or annoying stuff.

  • @leastofyourconcerns4615
    @leastofyourconcerns4615 9 months ago +2

    awesome short introduction to the subject! appreciate you guys for those vids!

    • @AssemblyAI
      @AssemblyAI  9 months ago +1

      Thanks for watching!

  • @mystikalle
    @mystikalle 9 months ago +3

    Great video! I would like you to create a similar easy to understand video about the article "What AI Music Generators Can Do (And How They Do It)". Thanks!

  • @faisalron
    @faisalron 4 months ago +1

    Great content, really easy to understand! Thanks.
    Btw, the speaker looks like Nicholas Galitzine... 🤣🤣

  • @asfandiyar5829
    @asfandiyar5829 9 months ago +1

    Thanks for the awesome video! Though I think it was a little too quick given the topic being covered.

  • @InyourlanguezF
    @InyourlanguezF 15 days ago

    But what about ChatGPT claiming to have built an "omni model" for GPT-4o, basically one neural network for audio, images and text? That is not multimodal, right?

  • @PrantikRoychowdhury-e3c
    @PrantikRoychowdhury-e3c several months ago

    Great explanation

  • @ShivangiTomar-p7j
    @ShivangiTomar-p7j several months ago

    Awesome. Thanks!!

  • @MiroKrotky
    @MiroKrotky 9 months ago

    I have the same nose as the speaker in the video, a little pushed to the side. Great vid. Best speaker on the channel

  • @pablofe123
    @pablofe123 8 months ago

    Brilliant, only six minutes.

  • @danielegrotti5231
    @danielegrotti5231 9 months ago

    Hi, I saw a few seconds of your new video on the Emergent Abilities of LLMs, but after some hours it disappeared... Could you please re-upload the video? It was so interesting! Thank you so much

    • @AssemblyAI
      @AssemblyAI  8 months ago

      Hi there - the video has been re-uploaded! Here's the link:
      th-cam.com/video/bQuVLKn10do/w-d-xo.html

  • @andrewdunbar828
    @andrewdunbar828 6 months ago +1

    Ah so they all convert to text in the pipeline? That's disappointing. I was wondering how they did the equivalent of tokenization for the other modalities. Text is rich but it's still inherently lossy or will introduce a certain kind of artefacting.

    • @andrewdunbar828
      @andrewdunbar828 6 months ago +1

      Actually I hunted around, and it seems that multimodal models do in fact tokenize the other modes; often the term "patch" is used as the equivalent of "token" for those modalities.
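
      (For illustration, here is a minimal ViT-style "patch as token" sketch in PyTorch: a dummy 224x224 RGB image is cut into 16x16 patches, and each flattened patch is linearly projected to an embedding, much like token embeddings for text. The sizes are hypothetical and not tied to any specific model.)
      import torch

      image = torch.randn(1, 3, 224, 224)  # dummy image batch
      patch_size, embed_dim = 16, 768

      # Cut the image into non-overlapping 16x16 patches and flatten each one.
      patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
      patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
      # patches.shape == (1, 196, 768): 196 "image tokens", one per patch

      # Linear projection to the model's embedding size, analogous to a token embedding table.
      project = torch.nn.Linear(3 * patch_size * patch_size, embed_dim)
      image_tokens = project(patches)  # (1, 196, 768), ready to feed a transformer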

  • @keithwins
    @keithwins 8 months ago

    Great!

  • @wealthassistant
    @wealthassistant 9 months ago

    How can ChatGPT decode images? It's mind-bogglingly good at recognizing text in photos. I don't see how you get that capability from training on images of cats and dogs.

    • @AssemblyAI
      @AssemblyAI  9 months ago +1

      Unfortunately, no paper describing GPT-4's architecture has been published, so it is unknown. It could somehow combine Optical Character Recognition with something like a Vision Transformer to be able to understand images and read text so well!
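
      (One public, concrete example of "Vision Transformer + text reading" is TrOCR, which pairs a ViT image encoder with a text decoder fine-tuned for OCR. This is only an illustration of the idea the reply speculates about, not a claim about how GPT-4 actually works; the image path is a placeholder.)
      from PIL import Image
      from transformers import TrOCRProcessor, VisionEncoderDecoderModel

      processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
      model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

      image = Image.open("photo_with_text.png").convert("RGB")  # placeholder path
      pixel_values = processor(images=image, return_tensors="pt").pixel_values
      generated_ids = model.generate(pixel_values)
      print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])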

    • @xspydazx
      @xspydazx 5 months ago

      When training, it learns captions for images (hence, when inputting them, you should give the most detailed description possible for each image); it then maps an image to its associated caption. It is not a database, so it needs many images of a cat to recognise a cat image. Using Haar cascades you can pick individual items out of an image, so for object detection you would create a dataset from a model that uses Haar cascades to identify, say, eyes in a picture (boxed); these recognised regions can be fed into the model with their descriptions.
      For medical imagery a whole case history and file can be added with an image; being very detailed, later images can bring the same detailed information back to the surface.
      As a machine learning problem, we need to remember how we trained networks to recognise pictures!
      We also have OCR, so these pretrained OCR images can also be labelled.
      So once we have such data, we can selectively take information from a single image: its description as well as the other objects in the picture. Captions do not include colour information, so for colour usage and actual image understanding we have a diffuser, hence Stable Diffusion! With colour understanding we can generate similar images (using a fractal).
      Hence full-stack model development,
      and that is without the RAG (which they will soon realise is a system that needs to be converted to an ETL process: the LLM is long-term memory, the RAG is working memory, the chat history is short-term memory). An ETL process will be required to fold the local information back into the main model, using the same tokenizer to tokenize the data into the DB so it can later be loaded quickly into the LLM during fine-tuning, then clearing the RAG. This should be performed as a backup, i.e. monthly or annually.
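
      (For illustration: one concrete way to get those Haar-cascade boxes is OpenCV's bundled cascades. This is a minimal sketch; the input path and output filenames are placeholders, and pairing each crop with a caption is left to whatever dataset format you use.)
      import cv2

      # Load one of OpenCV's bundled Haar cascades (eye and other cascades ship in the same folder).
      cascade = cv2.CascadeClassifier(
          cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
      )
      image = cv2.imread("photo.jpg")  # placeholder path
      gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

      # Each (x, y, w, h) box is a detected region; the crop could then be paired with a
      # text description to build an image/caption training set, as described above.
      for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
          cv2.imwrite(f"crop_{x}_{y}.png", image[y:y + h, x:x + w])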

  • @joelmaiza
    @joelmaiza 8 months ago

    For text there are LLMs?
    For images there are...?

    • @xspydazx
      @xspydazx 5 months ago

      from transformers import (
          AutoFeatureExtractor,
          AutoTokenizer,
          SpeechEncoderDecoderModel,
          VisionEncoderDecoderModel,
      )
      print('Add Vision...')
      # ADD HEAD
      # Combine a pre-trained image encoder and a pre-trained text decoder to form a Seq2Seq model
      Vmodel = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
          "google/vit-base-patch16-224-in21k", "LeroyDyer/Mixtral_AI_Tiny"
      )
      _Encoder_ImageProcessor = Vmodel.encoder
      _Decoder_ImageTokenizer = Vmodel.decoder
      _VisionEncoderDecoderModel = Vmodel
      # Attach the vision head (LM_MODEL is the base LLM object, defined elsewhere)
      LM_MODEL.VisionEncoderDecoder = _VisionEncoderDecoderModel
      # Add sub-components
      LM_MODEL.Encoder_ImageProcessor = _Encoder_ImageProcessor
      LM_MODEL.Decoder_ImageTokenizer = _Decoder_ImageTokenizer
      LM_MODEL
      This is how you add vision to an LLM (you can embed the head inside it).
      print('Add Audio...')
      # Add head
      # Combine a pre-trained speech encoder and pre-trained decoder to form a Seq2Seq model
      _AudioFeatureExtractor = AutoFeatureExtractor.from_pretrained("openai/whisper-small")
      _AudioTokenizer = AutoTokenizer.from_pretrained("openai/whisper-small")
      _SpeechEncoderDecoder = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
          "openai/whisper-small", "openai/whisper-small"
      )
      # Add pad tokens
      _SpeechEncoderDecoder.config.decoder_start_token_id = _AudioTokenizer.cls_token_id
      _SpeechEncoderDecoder.config.pad_token_id = _AudioTokenizer.pad_token_id
      LM_MODEL.SpeechEncoderDecoder = _SpeechEncoderDecoder
      # Add sub-components
      LM_MODEL.Decoder_AudioTokenizer = _AudioTokenizer
      LM_MODEL.Encoder_AudioFeatureExtractor = _AudioFeatureExtractor
      LM_MODEL
      This is how you add sound as well as vision. Make sure the device is CPU, as it takes at least 19 GB of RAM to create the vision model just from the config (plus the models already in memory). They take about a minute to run; if you start from a fresh Mistral model it also generates weights for every layer in memory, so it takes a few minutes.

  • @QuintinMassey
    @QuintinMassey 4 months ago +1

    A Woman, questionable (it is 2024 after all). A Female a little more certain (same reason) 😂

  • @sereneThePity
    @sereneThePity 4 months ago

    backstreet freestyle

  • @seakyle8320
    @seakyle8320 9 months ago +1

    1:59 "concept of a woman"? ask woke people.