- 233
- 147 432
Data Science Gems
India
Joined 14 Mar 2022
These are recordings of the latest pieces of breakthrough research on deep learning for NLP and vision. The goal is to put up at least one update per week.
HomePage: sites.google.com/view/manishg/
LinkedIn: www.linkedin.com/in/manishsgupta/
Manish Gupta is a Principal Applied Researcher at Microsoft India R&D Private Limited at Hyderabad, India. He is also an Adjunct Faculty at IIIT, Hyderabad and a visiting faculty at ISB, Hyderabad. He received his Masters in Computer Science from IIT Bombay (2007) and Ph.D. from the Univ of Illinois at Urbana-Champaign in 2013. He worked for Yahoo! Bangalore from 2005-07. His research interests are in the areas of deep learning, natural language processing, web mining and data mining. He has published 150+ research papers in reputed refereed journals and conferences. He has also co-authored two books: one on Outlier Detection for Temporal Data and another one on Information Retrieval with Verbose Queries.
#234 MatFormer: Nested Transformer for Elastic Inference
Foundation models are applied in a broad spectrum of settings with different inference constraints, from massive multi-accelerator clusters to resource-constrained standalone mobile devices. However, the substantial costs associated with training these models often limit the number of unique model sizes that can be offered. Consequently, practitioners are compelled to select a model that may not be optimally aligned with their specific latency and cost requirements. MatFormer is a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. During training, the parameters of multiple nested FFN blocks of varying sizes are optimized jointly, enabling the extraction of hundreds of accurate smaller models without incurring additional computational costs. The efficacy of MatFormer is validated across different model classes (decoders and encoders) and modalities (language and vision), demonstrating its potential for real-world deployment. An 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters, each exhibiting better validation loss and one-shot downstream evaluations than independently trained counterparts. Furthermore, smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval. Finally, speculative decoding with the accurate and consistent submodels extracted from MatFormer can lead to a significant reduction in inference latency.
In this video, I talk about the following: How are the MatFormer models trained? How does MatFormer perform?
For more details, please look at arxiv.org/pdf/2310.07707 and github.com/devvrit/matformer
Kudugunta, Sneha, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, and Prateek Jain. "Matformer: Nested transformer for elastic inference." arXiv preprint arXiv:2310.07707 (2023).
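The nested FFN idea in the description above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name NestedFFN, the widths, the GELU activation, and the random initialization are all assumptions made for illustration, and the joint training of the nested blocks is omitted. It only shows how a smaller sub-model can be read out of a larger FFN by taking a prefix of its hidden units.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class NestedFFN:
    """Hypothetical nested FFN: every sub-model uses a prefix of the hidden units."""

    def __init__(self, d_model, d_ff, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.normal(0.0, 0.02, (d_model, d_ff))
        self.w_out = rng.normal(0.0, 0.02, (d_ff, d_model))

    def forward(self, x, width):
        # Use only the first `width` hidden units; the weights are shared
        # with every larger sub-model, so no extra parameters are stored.
        hidden = gelu(x @ self.w_in[:, :width])
        return hidden @ self.w_out[:width, :]

    def num_params(self, width):
        # Parameters touched by the sub-model of the given width (biases omitted).
        return 2 * self.w_in.shape[0] * width


ffn = NestedFFN(d_model=8, d_ff=32)
x = np.ones((2, 8))
# Three nested sub-models extracted from the same weight matrices.
outputs = {width: ffn.forward(x, width) for width in (8, 16, 32)}
for width, out in outputs.items():
    print(width, ffn.num_params(width), out.shape)
```

In a sketch like this, choosing a different width per layer would give many distinct sub-models from one set of weights, which is one way to read the "hundreds of smaller models" claim in the description.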
Views: 33
Videos
#233 Stable Diffusion 3 and MM-DiT: Rectified flow transformers for high-resolution image synthesis
16 views, 2 hours ago
Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it i...
#232 Genie: Generative interactive environments
70 views, 1 day ago
Genie is the first generative interactive environment model trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotempo...
#231 AlphaFold Part 2
42 views, 14 days ago
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking ef...
#230 AlphaFold Part 1
251 views, 14 days ago
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking ef...
#229 MiniCPM-V: A GPT-4V Level MLLM on Your Phone
119 views, 21 days ago
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of ...
#228 ChapVidMR: Chapter-based Video Moment Retrieval using Natural Language Queries
83 views, 21 days ago
Video Moment Retrieval (VMR) is the task of linking a query with a relevant moment from a video. Although, recently, there has been work on the VMR task where a query is linked to a single moment, the corresponding task where the query needs to be linked to multiple moments has been understudied. In this paper, we aim to work on the VMR task primarily by leveraging chapters of YouTube videos, i....
#227 Neural Corpus Indexer for Document Retrieval
92 views, 21 days ago
Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, where the index is hard to optimize directly for the final retrieval target. This paper aims to show that an end-to-end deep neural network unifying training and indexing stages can significantly improve the recall performance of traditional methods. Neural Corpus Indexer (NCI) is a sequence-to-s...
#226 NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
90 views, 21 days ago
Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. NV-Embed model significantly enhances the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For model architecture, NV-Embed has a laten...
#225 ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos
54 views, 21 days ago
Previous studies on question generation from videos have mostly focused on generating questions about common objects and attributes and hence are not entity-centric. In this work, we focus on the generation of entity-centric information-seeking questions from videos. Such a system could be useful for video-based learning, recommending "People Also Ask" questions, video-based chatbots, and factch...
OpenAI o1
179 views, 21 days ago
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These...
#223 Multimodal Models Part2 (as part of IIT Delhi course on Large Language Models (LLMs))
359 views, a month ago
Text Generation with multimodal inputs Applications like Visual conversation, Multimodal ChatBot, Scene Understanding, Knowledge-grounded image description, Visual Question Answering, Audio-visual integration perception ability, Common-knowledge concept recognition, Capture temporal dynamics in videos, Story and Song generation; comic understanding; Meeting notes with multiple speakers (speaker...
#222 Multimodal Models Part1 (as part of IIT Delhi course on Large Language Models (LLMs))
830 views, a month ago
Discussion on various multimodal encoder models as follows. Vision-and-Language Tasks Vision Transformers VisualBERT ViLBERT CLIP Visually-rich Document Understanding LayoutLM VideoCLIP ImageBind
#221 Imagen 3: Evaluating Image generation models
82 views, a month ago
Imagen 3 is a latent diffusion model that generates high quality images from text prompts. Imagen 3 is preferred over other state-of-the-art (SOTA) models as of Aug 2024. It generates images at 1024 × 1024 resolution, and can be followed by 2×, 4×, or 8× upsampling. It performs well at photorealism, and in adhering to long and complex user prompts. In this video, I talk about the following: Ima...
#216 OPRO: LLMs as optimizers
181 views, a month ago
Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. Optimization by PROmpting (OPRO) is a simple and effective approach to leverage LLMs as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new s...
#220 Large Language Models are Human-like Annotators. KR 2024 tutorial Part 4
145 views, a month ago
#219 Large Language Models are Human-like Annotators. KR 2024 tutorial Part 3
78 views, a month ago
#218 Large Language Models are Human-like Annotators. KR 2024 tutorial Part 2
90 views, a month ago
#217 Large Language Models are Human-like Annotators. KR 2024 tutorial Part 1
367 views, a month ago
#214 PaliGemma: A versatile 3B VLM for transfer
133 views, 2 months ago
#213 TravelPlanner: A Benchmark for Real-World Planning with Language Agents
115 views, 2 months ago
#211 LLaVA-OneVision: Easy Visual Task Transfer
152 views, 2 months ago
#210 ToolkenGPT: Augmenting Frozen LMs with Massive Tools via Tool Embeddings
154 views, 2 months ago
#206 A Graph RAG Approach to Query-Focused Summarization
530 views, 3 months ago
#209 Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
169 views, 3 months ago
#205 Chameleon: Plug-and-Play Compositional Reasoning with LLMs
131 views, 4 months ago
hello sir
Aren't *all* behaviors in LLMs "emergent", in the sense that if you have a small enough model it can't do *any*thing, so *all* behaviors "emerge" as the model gets larger, and at unpredictable model sizes? Some of what they're doing inducing emergence seems like making up "useless" metrics that happen to be smoother in the model size region below which useful behavior emerges. The age at which viable babies appear is "emergent" vs. time in that there's a big step function at 9 months, but if you change the metric to baby's weight then it suddenly becomes relatively smooth. But if you care about viable babies that's not a relevant metric.
Greatest Teacher
Do you have this deck available please?
Amazing video.
This is genuinely a Gem.
Hello sir, hope you are doing well. Could you lay out the steps for creating the dataset for other languages, and for training and inference?
Hi Manish. It would be great if you could also provide some insights into model training, why such models are important and what paths they open up for future or what the performance numbers truly indicate. Vanilla numbers don't mean much to us.
great explanation thanks Manish!
Not good teaching quality..
Very interesting, thanks!
How many videos as part of the course?
FF @10.25
Wonderful tutorial, sir!!
great, thank you so much
Thanx . It cleared some of my doubts🙌
Great explanation. Is it possible to get the slides? Thanks.
Thanks a lot for this.
Interesting view
How to join llm course
Do you plan to do an update on the purple llama 3 feature updates?
thanks sir
Thanks for the great video! I want to ask further: how can I compare the matching of two feature vectors from two different images generated by BLIP-2?
Thanks Manish ! Excellent work..
Is there any Github for the slides? Thanks a tonne for these videos!
Very detailed tutorial, thank you so much!
TQSM for your great efforts. It would be great if you could make a video on ReWOO & LLMCompiler architectures and stateful agents
Does the end normalization in FA2 only stay stable with double precision or fewer tokens?
Hi Manish Content is awesome. Thank you for all your videos. One thing I wanted to mention, delivery speed of this video is optimal, at least for me. In some of your other videos, delivery was too fast, by the time I could internalise a critical commentary it moved to 4th point! Please maintain this speed, at least no faster than this. Thank you again. 😊
I love your videos and mastery over all contemporary research in this rapidly emerging field… keep up this amazing work
Sir Does it take care of granularity before giving us the answer ? Reason I ask is joins could cause over reporting if granularity is not taken care of
Hidden Gems in this noisy world of Machine learning ! Thank you so much sir.
You are the best sir, I went through the paper but it was very difficult for me to understand even after 3 4 repetitions, because terminology were too complex for early 2 month practitioners like me, but you described the whole paper so seamlessly, I am so grateful for your time and knowledge,,,keep up the good work, sir..Love from NY, bhaiyaa
Thanks for very informative videos. Can you also upload the slides for all the videos?
Read about it Good thing found a detailed video
One feedback: Please just dont read the slides. Try to go deeper and explain intuitively
How many frames does the input video have?
Excellent explanation! Please make a video on Mamba 2 as well
Great explanation. Looking forward to the Mamba 2 paper
Your channel is underrated! That was great!
Great explanation! Thank you, Manish
Great explanation!
Great video sir, Thanks for this explanation !!
Would love to understand how single-stage voice to text to audio works. Interesting to know if it's clever programming or a method.
Great explanation of this architecture.
Wonderful explanation sir 👍
thank you for explaining this prof!
Amazing explanation sir, Can you cover You Only Cache Once next it looks very promising
very much helpful prof! thank you