Thank you for giving actual application examples of this stuff.
great video. thanks for condensing this into the most important facts and avoiding any clickbait or annoying stuff.
awesome short introduction to the subject! appreciate you guys for those vids!
Thanks for watching!
Great video! I would like you to create a similar easy to understand video about the article "What AI Music Generators Can Do (And How They Do It)". Thanks!
Great content, really easy to understand! Thanks.
Btw, the speaker looks like Nicholas Galitzine... 🤣🤣
Sure, awesome explanation, thanks.
Hey thanks for the info! This is crazy cool.
I'm brand new to all of this. I found this because I'm looking to run an LLM locally and have fluid TTS convos while watching a YT video, for example, or listening to a podcast and discussing it live together.
Is this possible yet with low latency? I'm chatting with GPT about it and it says yes, but I'd like to ask you: is a multi-modal split possible, where it can contextually process audio and video from a CPU source while recognizing my voice separately and carrying on a fairly complex convo?
I'm running a 4080 mobile card, which I guess can run up to 13B-parameter models well, but I'm eyeing the new 5080 too. Although it can't handle a lot more parameters, I'm wondering if the latency differences due to the architecture will be drastically better.
Hope this makes sense!
Great explanation! I have a PDF with text and tables only. Can I use Ollama with Llama 3.2 as my LLM to run this process locally?
Thanks for the awesome video! Though I think it was a little too quick given the topic being covered.
Loved the simple explanation along with technical details!
Great explanation
I have the same nose as the speaker in the video, a little pushed to the side. Great vid. Best speaker on the channel.
Awesome. Thanks!!
Hi, I saw a few seconds of your new video on the Emergent Abilities of LLMs, but after some hours it disappeared... Could you please re-upload the video? It was so interesting! Thank you so much.
Hi there - the video has been re-uploaded! Here's the link:
th-cam.com/video/bQuVLKn10do/w-d-xo.html
I think ChatGPT should have an ImageGPT and a SoundGPT as multi-modalities, using the same method it uses to generate text from pretrained text data (input query to output). For images, you'd feed a series of images of objects into an image model that then runs a transformer-style process, guessing which part of an image comes next after another part, building up a cinematic space like a movie creating a scene. Do the same with sound: given a random sequence of input sounds, it could trigger, like a memory, a composition of sounds that relate somehow to the input sounds, trying to guess the next piece of sound transformer-wise. Then combine all three modalities in a fusion mode.
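For what it's worth, the image half of this already exists: OpenAI's ImageGPT treats an image as a sequence of pixel-cluster tokens and autoregressively predicts the next one, which is essentially the "guess which part of the image comes next" idea. A rough sketch with the public Hugging Face checkpoint (assuming the transformers and torch packages; this only samples tokens and doesn't decode them back to pixels):
import torch
from transformers import ImageGPTForCausalImageModeling
# ImageGPT models an image as a sequence of colour-cluster tokens and predicts
# the next token autoregressively, exactly like a text GPT predicts the next word.
model = ImageGPTForCausalImageModeling.from_pretrained("openai/imagegpt-small")
# Start from the "start of image" token and sample a short continuation of image tokens.
sos = model.config.vocab_size - 1
context = torch.full((1, 1), sos, dtype=torch.long)
tokens = model.generate(input_ids=context, max_length=65, do_sample=True, top_k=40)
print(tokens.shape)  # 64 sampled image tokens plus the start token; a full 32x32 image is 1024 tokens
# ImageGPTImageProcessor's colour clusters would map these token ids back to pixels for display.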
For text there are LLMs?
For images there are...?
from transformers import VisionEncoderDecoderModel, SpeechEncoderDecoderModel, AutoFeatureExtractor, AutoTokenizer
# LM_MODEL below is assumed to be a base language model object you have already loaded
print('Add Vision...')
# ADD HEAD
# Combine pre-trained encoder and pre-trained decoder to form a Seq2Seq model
Vmodel = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
"google/vit-base-patch16-224-in21k", "LeroyDyer/Mixtral_AI_Tiny"
)
_Encoder_ImageProcessor = Vmodel.encoder
_Decoder_ImageTokenizer = Vmodel.decoder
_VisionEncoderDecoderModel = Vmodel
# Attach the vision encoder-decoder to the base model
LM_MODEL.VisionEncoderDecoder = _VisionEncoderDecoderModel
# Add Sub Components
LM_MODEL.Encoder_ImageProcessor = _Encoder_ImageProcessor
LM_MODEL.Decoder_ImageTokenizer = _Decoder_ImageTokenizer
LM_MODEL
This is how you add vision to an LLM (you can embed the head inside the model).
print('Add Audio...')
#Add Head
# Combine pre-trained encoder and pre-trained decoder to form a Seq2Seq model
_AudioFeatureExtractor = AutoFeatureExtractor.from_pretrained("openai/whisper-small")
_AudioTokenizer = AutoTokenizer.from_pretrained("openai/whisper-small")
_SpeechEncoderDecoder = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained("openai/whisper-small","openai/whisper-small")
# Set the decoder start and pad token ids
_SpeechEncoderDecoder.config.decoder_start_token_id = _AudioTokenizer.cls_token_id
_SpeechEncoderDecoder.config.pad_token_id = _AudioTokenizer.pad_token_id
LM_MODEL.SpeechEncoderDecoder = _SpeechEncoderDecoder
# Add Sub Components
LM_MODEL.Decoder_AudioTokenizer = _AudioTokenizer
LM_MODEL.Encoder_AudioFeatureExtractor = _AudioFeatureExtractor
LM_MODEL
This is how you add sound. (Make sure the device is the CPU, as it takes at least 19 GB of RAM just to create the vision model from config, plus the models already in memory. They take probably a minute to run; if you start a new Mistral model it also generates weights for each layer in memory, so that takes a few minutes.)
Diffusion models/GANs.
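For example, a text-to-image diffusion model can be run locally in a few lines with the diffusers library (a minimal sketch, assuming diffusers and torch are installed and a CUDA GPU is available; the checkpoint name is just one common choice):
import torch
from diffusers import StableDiffusionPipeline
# Load a pretrained text-to-image diffusion pipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
image = pipe("a cat wearing a space helmet, photorealistic").images[0]
image.save("cat.png")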
Brilliant, only six minutes.
Ah, so they all convert to text in the pipeline? That's disappointing. I was wondering how they did the equivalent of tokenization for the other modalities. Text is rich, but it's still inherently lossy or will introduce a certain kind of artefacting.
Actually, I hunted around, and it seems that multimodal models do in fact tokenize the other modes; the term "patch" is often used as the equivalent of "token" for the other modes.
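Right, for images a Vision Transformer simply cuts the picture into fixed-size patches and projects each one into an embedding, so a patch plays the same role a token plays for text. A minimal sketch of that patching step, assuming the transformers and Pillow packages and a local image named cat.jpg:
from PIL import Image
from transformers import AutoImageProcessor, ViTModel
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
image = Image.open("cat.jpg").convert("RGB")           # any RGB image
inputs = processor(images=image, return_tensors="pt")  # resized to 224x224 and normalised
outputs = model(**inputs)
# A 224x224 image with 16x16 patches gives 14*14 = 196 patch embeddings, plus one [CLS] token
print(outputs.last_hidden_state.shape)  # torch.Size([1, 197, 768])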
How can ChatGPT decode images? It's mind-bogglingly good at recognizing text in photos. I don't see how you get that capability from training on images of cats and dogs.
Unfortunately, no paper detailing GPT-4's architecture has been published, so it is unknown. It could somehow combine optical character recognition with something like a Vision Transformer to be able to understand images and read text so well!
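Incidentally, the "Vision Transformer plus text decoder" combination is exactly how open OCR models such as TrOCR work: a ViT encoder reads the image and an autoregressive decoder writes out the characters. A minimal sketch with the public Hugging Face checkpoint (assuming transformers and Pillow; sign.png is a placeholder photo of some printed text):
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
# TrOCR = ViT image encoder + transformer text decoder, trained to transcribe text in images
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")
image = Image.open("sign.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])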
When training, it learns captions for images (hence, when inputting them, you should give the most detailed description possible for each image); it then converts the image into its associated caption. Because it's not a database, it needs many images of a cat to recognise a cat image. Using Haar cascades you can pick individual items out of an image, so for object detection you would build a dataset from a model that uses Haar cascades to identify, say, eyes in a picture (boxed); these recognised regions can be fed into the model with their descriptions (a quick sketch of that step is at the end of this comment).
For medical imagery, a whole case history and file can be added alongside an image; because it is so detailed, later images can bring that same detailed information back to the surface.
As a machine-learning problem, we need to remember how we trained networks to recognise pictures!
We also have OCR, so those pretrained OCR images can be labelled too!
So once we have such data, we can selectively take information from a single image: its description as well as the other objects in the picture. (Captions do not include colour information.) So for colour usage and actual image understanding we have a diffuser... hence Stable Diffusion! With colour understanding we can generate similar images, using a fractal!
Hence FULL STACK MODEL DEVELOPMENT,
and that's not RAG! (Which, they will soon realise, is a system that will need to be converted into an ETL process: the LLM is the long-term memory, the RAG store the working memory, the chat history the short-term memory.) So an ETL process will be required to fold the local information back into the main model, using the same tokenizer to tokenize the data into the DB so it can be loaded into the LLM more quickly during fine-tuning, clearing the RAG store, which should be performed as a backup, e.g. monthly or annually!
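For the Haar-cascade step above, OpenCV ships pretrained cascades, so boxing eyes to build such a dataset can look roughly like this (a minimal sketch, assuming opencv-python is installed; face.jpg is just a placeholder photo):
import cv2
# OpenCV ships pretrained Haar cascades; here we box the eyes in a photo
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
img = cv2.imread("face.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Returns (x, y, w, h) boxes; each crop plus a text description becomes a training pair
eyes = eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for i, (x, y, w, h) in enumerate(eyes):
    crop = img[y:y + h, x:x + w]
    cv2.imwrite(f"eye_{i}.png", crop)
print(f"found {len(eyes)} eye regions")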
Great!
Hear me out:
We create a speech-to-image model and swap places, so that the embedding is not a textual database but more of a Gaussian visual map that retains concepts.
So it would be: speech to image > speech/image to speech/image (embedded image/audio) > and then speech/image to text.
This would more closely follow how humans think. The reason I think that's important is that text language is limited. I truly believe AI will be more powerful and capable than the bounds of our language, yet we confine it to that. Why not rework the architecture of AI to be able to think without language, have it be boundless in HOW it thinks, and then have it express itself through text or audio?
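Parts of this can already be sketched with off-the-shelf pieces: keep the "concepts" as image embeddings rather than text, and only drop down to language at the very end. A speculative illustration of such a non-textual concept map using CLIP image embeddings (assuming transformers, torch and Pillow; the image file names are placeholders):
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Build a small "visual concept map": each memory is stored as an image embedding, not as text
paths = ["dog.jpg", "beach.jpg", "guitar.jpg"]
images = [Image.open(p).convert("RGB") for p in paths]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    memory = model.get_image_features(**inputs)  # one concept vector per stored image
memory = memory / memory.norm(dim=-1, keepdim=True)
# A new perception (another image) is matched against the map purely in embedding space
query = processor(images=Image.open("puppy.jpg").convert("RGB"), return_tensors="pt")
with torch.no_grad():
    q = model.get_image_features(**query)
q = q / q.norm(dim=-1, keepdim=True)
print(paths[int((memory @ q.T).argmax())])  # nearest stored concept; no text involved until this point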
backstreet freestyle
A Woman, questionable (it is 2024 after all). A Female a little more certain (same reason) 😂
1:59 "concept of a woman"? ask woke people.