#234 MatFormer: Nested Transformer for Elastic Inference

Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, the substantial costs associated with training these models often limit the number of unique model sizes that can be offered. Consequently, practitioners are compelled to select a model that may not be optimally aligned with their specific latency and cost requirements. MatFormer is a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. During training, the parameters of multiple nested FFN blocks of varying sizes are optimized together, enabling the extraction of hundreds of accurate smaller models without incurring additional computational costs. The efficacy of MatFormer is validated across different model classes (decoders and encoders) and modalities (language and vision), demonstrating its potential for real-world deployment. An 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters, each exhibiting better validation loss and one-shot downstream evaluations than independently trained counterparts. Furthermore, smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval. Finally, speculative decoding with the accurate and consistent submodels extracted from MatFormer can lead to a significant reduction in inference latency.
In this video, I talk about the following: How are the MatFormer models trained? How does MatFormer perform?
For more details, see arxiv.org/pdf/2310.07707 and github.com/devvrit/matformer.
Kudugunta, Sneha, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, and Prateek Jain. "MatFormer: Nested Transformer for Elastic Inference." arXiv preprint arXiv:2310.07707 (2023).
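
As a rough illustration of the nested FFN idea in the description above, the pure-Python sketch below builds one FFN whose smaller sub-blocks reuse the first few hidden units of the full block. It is not the authors' code: the class name `NestedFFN`, the granularities (2, 4, 8), the ReLU activation, the bias-free layers, and the random initialization are all assumptions made for illustration.

```python
import random


class NestedFFN:
    """Toy FFN whose smaller sub-blocks reuse the first m hidden units."""

    def __init__(self, d_model, d_hidden, granularities, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        self.d_hidden = d_hidden
        # Allowed hidden widths, smallest to largest; the largest is the full FFN.
        self.granularities = sorted(granularities)
        assert self.granularities[-1] == d_hidden
        # w_in[j] holds the input weights of hidden unit j (length d_model).
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(d_model)]
                     for _ in range(d_hidden)]
        # w_out[j] holds the output weights of hidden unit j (length d_model).
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(d_model)]
                      for _ in range(d_hidden)]

    def forward(self, x, m):
        """Run the sub-block that uses only the first m hidden units."""
        assert m in self.granularities
        y = [0.0] * self.d_model
        for j in range(m):  # hidden units m..d_hidden-1 are skipped entirely
            pre = sum(w * xi for w, xi in zip(self.w_in[j], x))
            h = max(0.0, pre)  # ReLU, an assumption for this sketch
            for k in range(self.d_model):
                y[k] += h * self.w_out[j][k]
        return y

    def num_params(self, m):
        """Weight count of the sub-block with m hidden units (no biases)."""
        return 2 * m * self.d_model


ffn = NestedFFN(d_model=4, d_hidden=8, granularities=[2, 4, 8])
x = [1.0, -2.0, 0.5, 3.0]
outputs = {m: ffn.forward(x, m) for m in ffn.granularities}
param_counts = {m: ffn.num_params(m) for m in ffn.granularities}
```

Because every sub-block is a prefix of the full block, a smaller model can be extracted by slicing the weights, with no retraining. A training step in this spirit would combine the losses of `forward(x, m)` over the granularities so that the shared weights serve every size; that step is omitted here.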

Views: 33

Videos

#233 Stable Diffusion 3 and MM-DiT: Rectified flow transformers for high-resolution image synthesis

20:21

16 views, 2 hours ago

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it i...
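
The "straight line" between data and noise mentioned above can be written as x_t = (1 - t) x_0 + t ε, which has the constant velocity ε - x_0 along the path. The sketch below is a minimal pure-Python illustration of that interpolation only. The function names and the toy 3-dimensional sample are assumptions, and no model or training loop from the paper is included.

```python
import random


def interpolate(x0, noise, t):
    """Point on the straight line from data x0 (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity(x0, noise):
    """Constant velocity of the straight path: d x_t / d t = noise - x0."""
    return [b - a for a, b in zip(x0, noise)]


rng = random.Random(0)
x0 = [0.2, -1.0, 0.7]                       # toy "data" sample
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # standard Gaussian noise
v = velocity(x0, noise)

# Moving from the noise end against the velocity for unit time recovers the data.
recovered = [n - vi for n, vi in zip(noise, v)]
midpoint = interpolate(x0, noise, 0.5)
```

In a rectified flow model, a network would be trained to predict this velocity from x_t and t; sampling then integrates the predicted velocity from noise back to data.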

#232 Genie: Generative interactive environments

14:45

70 views, 1 day ago

Genie is the first generative interactive environment model trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotempo...

#231 AlphaFold Part 2

29:55

42 views, 14 days ago

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking ef...

#230 AlphaFold Part 1

59:54

251 views, 14 days ago

#229 MiniCPM-V: A GPT-4V Level MLLM on Your Phone

21:10

119 views, 21 days ago

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of ...

#228 ChapVidMR: Chapter-based Video Moment Retrieval using Natural Language Queries

22:25

83 views, 21 days ago

Video Moment Retrieval (VMR) is the task of linking a query with a relevant moment from a video. Although there has been recent work on the VMR task where a query is linked to a single moment, the corresponding task where the query needs to be linked to multiple moments has been understudied. In this paper, we aim to work on the VMR task primarily by leveraging chapters of YouTube videos, i....

#227 Neural Corpus Indexer for Document Retrieval

19:25

92 views, 21 days ago

Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, where the index is hard to optimize directly for the final retrieval target. This paper aims to show that an end-to-end deep neural network unifying training and indexing stages can significantly improve the recall performance of traditional methods. Neural Corpus Indexer (NCI) is a sequence-to-s...

#226 NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

11:02

90 views, 21 days ago

Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. The NV-Embed model significantly enhances the performance of an LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For model architecture, NV-Embed has a laten...

#225 ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos

37:00

54 views, 21 days ago

Previous studies on question generation from videos have mostly focused on generating questions about common objects and attributes and hence are not entity-centric. In this work, we focus on the generation of entity-centric information-seeking questions from videos. Such a system could be useful for video-based learning, recommending "People Also Ask" questions, video-based chatbots, and factch...

OpenAI o1

12:02

179 views, 21 days ago

OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These...

#223 Multimodal Models Part2 (as part of IIT Delhi course on Large Language Models (LLMs))

1:02:00

359 views, 1 month ago

Text generation with multimodal inputs. Applications include visual conversation, multimodal chatbots, scene understanding, knowledge-grounded image description, visual question answering, audio-visual integrated perception, common-knowledge concept recognition, capturing temporal dynamics in videos, story and song generation, comic understanding, and meeting notes with multiple speakers (speaker...

#222 Multimodal Models Part1 (as part of IIT Delhi course on Large Language Models (LLMs))

46:18

830 views, 1 month ago

Discussion of various multimodal encoder models, covering: vision-and-language tasks, Vision Transformers, VisualBERT, ViLBERT, CLIP, visually-rich document understanding, LayoutLM, VideoCLIP, and ImageBind.

#221 Imagen 3: Evaluating Image generation models

18:20

82 views, 1 month ago

Imagen 3 is a latent diffusion model that generates high quality images from text prompts. Imagen 3 is preferred over other state-of-the-art (SOTA) models as of Aug 2024. It generates images at 1024 × 1024 resolution, and can be followed by 2×, 4×, or 8× upsampling. It performs well at photorealism, and in adhering to long and complex user prompts. In this video, I talk about the following: Ima...

#216 OPRO: LLMs as optimizers

11:20

181 views, 1 month ago

Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. Optimization by PROmpting (OPRO) is a simple and effective approach to leverage LLMs as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new s...
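
The OPRO loop can be sketched roughly as follows: a text meta-prompt lists earlier solutions with their scores, new candidates are generated from it, and the scored candidates feed the next step. To keep the example runnable without any API, `propose` is a hypothetical stand-in for the LLM call: it reads the best listed solution back out of the prompt and perturbs it at random. The toy objective, the function names, the prompt wording, and the step count are likewise assumptions, not the paper's setup.

```python
import random
import re


def score(x):
    """Toy objective to maximize; the optimum is x = 7."""
    return -(x - 7) ** 2


def build_meta_prompt(trajectory, top_k=5):
    """Describe the task in text and list the best solutions so far."""
    best = sorted(trajectory, key=lambda pair: pair[1])[-top_k:]  # ascending score
    lines = ["Find an integer x with a higher score than all of these:"]
    lines += [f"x={x}, score={s}" for x, s in best]
    return "\n".join(lines)


def propose(meta_prompt, rng, n=4):
    """Hypothetical stand-in for an LLM call.

    A real OPRO step would send meta_prompt to an LLM. Here we read the
    best listed solution back out of the prompt and perturb it randomly.
    """
    xs = [int(v) for v in re.findall(r"x=(-?\d+)", meta_prompt)]
    best_x = xs[-1]  # the prompt lists solutions in ascending score order
    return [best_x + rng.randint(-3, 3) for _ in range(n)]


def opro(steps=20, seed=0):
    """Run the propose-and-score loop and return the best (x, score) pair."""
    rng = random.Random(seed)
    trajectory = [(0, score(0))]  # initial solution and its score
    for _ in range(steps):
        prompt = build_meta_prompt(trajectory)
        for x in propose(prompt, rng):
            trajectory.append((x, score(x)))
    return max(trajectory, key=lambda pair: pair[1])


best_x, best_score = opro()
```

Swapping `propose` for a real LLM call would leave the rest of the loop unchanged, since the optimizer only ever sees the text of the meta-prompt.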

#220 Large Language Models are Human-like Annotators. KR 2024 tutorial Part 4

41:07

145 views, 1 month ago

#219 Large Language Models are Human-like Annotators. KR 2024 tutorial Part 3

29:03

78 views, 1 month ago

#218 Large Language Models are Human-like Annotators. KR 2024 tutorial Part 2

30:08

90 views, 1 month ago

#217 Large Language Models are Human-like Annotators. KR 2024 tutorial Part 1

54:40

367 views, 1 month ago

#215 Llama 3.2

12:31

234 views, 2 months ago

#214 PaliGemma: A versatile 3B VLM for transfer

8:51

133 views, 2 months ago

#213 TravelPlanner: A Benchmark for Real-World Planning with Language Agents

18:13

115 views, 2 months ago

#212 Microsoft Phi 3.5

10:16

201 views, 2 months ago

#211 LLaVA-OneVision: Easy Visual Task Transfer

15:23

152 views, 2 months ago

#210 ToolkenGPT: Augmenting Frozen LMs with Massive Tools via Tool Embeddings

22:58

154 views, 2 months ago

#208 LLaMA 3.1

22:56

172 views, 2 months ago

#207 Segment Anything 2

21:08

167 views, 2 months ago

#206 A Graph RAG Approach to Query-Focused Summarization

12:15

530 views, 3 months ago

#209 Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

14:03

169 views, 3 months ago

#205 Chameleon: Plug-and-Play Compositional Reasoning with LLMs

12:12

131 views, 4 months ago

Comments

@cadetmanishtiwari8694 6 hours ago
hello sir
@rogerscott529 6 days ago
Aren't *all* behaviors in LLMs "emergent", in the sense that if you have a small enough model it can't do *any*thing, so *all* behaviors "emerge" as the model gets larger, and at unpredictable model sizes? Some of what they're doing inducing emergence seems like making up "useless" metrics that happen to be smoother in the model size region below which useful behavior emerges. The age at which viable babies appear is "emergent" vs. time in that there's a big step function at 9 months, but if you change the metric to baby's weight then it suddenly becomes relatively smooth. But if you care about viable babies that's not a relevant metric.
@littleyogi3341 11 days ago
Greatest Teacher
@amitk3098 18 days ago
Do you have this deck available please?
@dhirajshanbhag1731 18 days ago
Amazing video.
@dhirajshanbhag1731 19 days ago
This is genuinely a Gem.
@onlyshorts6837 19 days ago
hello sir , hope you are doing well , could you set the steps on how to create the dataset for other languages and how to do training and inferences ?
@TheGsinghg 25 days ago
Hi Manish. It would be great if you could also provide some insights into model training, why such models are important and what paths they open up for future or what the performance numbers truly indicate. Vanilla numbers don't mean much to us.
@sagardesai1253 1 month ago
great explanation thanks Manish!
@subhendukundu4595 1 month ago
Not good teaching quality..
@jessedbrown1980 1 month ago
Very interesting, thanks!
@nikhilmugganawar 1 month ago
How many videos as part of the course?
@sforshubham1 1 month ago
FF @10.25
@aliathar891 1 month ago
Wonderful tutorial, sir!!
@fayezalhussein7115 1 month ago
great, thank you so much
@zagreus6300 1 month ago
Thanx . It cleared some of my doubts🙌
@DED_Search 1 month ago
Great explanation. Is it possible to get the slides? Thanks.
@GodLike420s 1 month ago
Thanks a lot for this.
@Itsmethinking 1 month ago
Interesting view
@RajKumar-su4ms 1 month ago
How to join llm course
@HopsGuy 3 months ago
Do you plan to do an update on the purple llama 3 feature updates?
@debojitghosh1244 3 months ago
thanks sir
@son2388 3 months ago
Thanks for the great video! I want to ask further: how can I compare the matching of two feature vectors from two different images generated by BLIP-2?
@ashokmittal87 3 months ago
Thanks Manish ! Excellent work..
@anoubhav 4 months ago
Is there any Github for the slides? Thanks a tonne for these videos!
@parulsharma1456 4 months ago
Very detailed tutorial, thank you so much!
@HeyFaheem 4 months ago
TQSM for you great efforts. It would be great, if you could make a video on ReWOO & LLMCompiler architectures ans stateful agents
@mraarone 4 months ago
Does the end normalization in FA2 only stay stable with double precision or fewer tokens?
@sm_xiii 4 months ago
Hi Manish Content is awesome. Thank you for all your videos. One thing I wanted to mention, delivery speed of this video is optimal, at least for me. In some of your other videos, delivery was too fast, by the time I could internalise a critical commentary it moved to 4th point! Please maintain this speed, at least no faster than this. Thank you again. 😊
@MrGss1234 5 months ago
I love your videos and mastery over all contemporary research in this rapidly emerging field… keep up this amazing work
@thelonewarrior2413 5 months ago
Sir Does it take care of granularity before giving us the answer ? Reason I ask is joins could cause over reporting if granularity is not taken care of
@gouravagrwal4282 5 months ago
Hidden Gems in this noisy world of Machine learning ! Thank you so much sir.
@inishkohli273 5 months ago
You are the best sir, I went through the paper but it was very difficult for me to understand even after 3 4 repetitions, because terminology were too complex for early 2 month practitioners like me, but you described the whole paper so seamlessly, I am so grateful for your time and knowledge,,,keep up the good work, sir..Love from NY, bhaiyaa
@nileshkokane7724 5 months ago
Thanks for very informative videos. Can you also upload the slides for all the videos?
@ArkSriva 5 months ago
Read about it Good thing found a detailed video
@srkiancr7 5 months ago
One feedback: Please just dont read the slides. Try to go deeper and explain intuitively
@yb801 6 months ago
How many frames does the input video have?
@vini8123 6 months ago
Excellent explanation! Please make a video on Mamba 2 as well
@astaragmohapatra9 6 months ago
Great explanation. Looking forward to the Mamba 2 paper
@MeinDeutschkurs 6 months ago
Your channel is underrated! That was great!
@siddharthsingi6772 6 months ago
Great explanation! Thank you, Manish
@MercyPrasanna 7 months ago
Great explanation!
@Shubham_gupta18 7 months ago
Great video sir, Thanks for this explanation !!
@4141462 7 months ago
Would love to understand how single stage voice to text to audio works. Intresting to know if it's clever programming or a method.
@jdk997 7 months ago
Great explanation of this architecture.
@ShivamThakur-tp2br 7 months ago
Wonderful explanation sir 👍
@ashraf_isb 7 months ago
thank you for explaining this prof!
@visheshmittal468 7 months ago
Amazing explanation sir, Can you cover You Only Cache Once next it looks very promising
@ashraf_isb 7 months ago
very much helpful prof! thank you

Data Science Gems
