AI Papers Academy
Writing in the Margins: Better LLM Inference Pattern for Long Context Retrieval
In this video, we explain the Writing in the Margins (WiM) method, introduced in a recent research paper titled "Writing in the Margins: Better Inference Pattern for Long Context Retrieval".
With WiM, the researchers achieved significant performance improvements on long input sequences for off-the-shelf large language models (LLMs) such as Phi-3, Qwen2, and Llama-3.1.
How does it work?
As part of the LLM inference process, the WiM method feeds the input context to the LLM in chunks rather than all at once. While each chunk is processed, the LLM is also instructed to generate a note about the information in that chunk. Finally, both the context and the notes (which we refer to as the margins) are available to the LLM when it produces the final response.
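To make the pattern concrete, here is a minimal Python sketch of WiM-style inference; the generate() helper, the prompts, and the chunk size are illustrative assumptions and not taken from the paper's code (linked below).

    # Minimal sketch of the Writing in the Margins (WiM) inference pattern (illustrative only).
    # `generate(prompt)` is assumed to be a wrapper around any instruction-tuned LLM.

    def chunk_text(text, chunk_size=4000):
        """Split the long input context into fixed-size character chunks."""
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    def writing_in_the_margins(generate, long_context, question):
        margins = []
        for chunk in chunk_text(long_context):
            # While each chunk is processed, ask the model for a short note
            # ("margin") about information in the chunk relevant to the query.
            note = generate(
                f"Context chunk:\n{chunk}\n\n"
                f"Write a short note about anything here relevant to: {question}"
            )
            margins.append(note)
        # The final answer is generated with the accumulated margins
        # (and optionally the original context) available to the model.
        notes = "\n".join(f"- {m}" for m in margins)
        return generate(
            f"Notes extracted from the document:\n{notes}\n\n"
            f"Question: {question}\nAnswer:"
        )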
Paper page - www.arxiv.org/abs/2408.14906
Code - github.com/writer/writing-in-the-margins
-----------------------------------------------------------------------------------------------
✉️ Join the newsletter - aipapersacademy.com/newsletter/
👍 Please like & subscribe if you enjoy this content
-----------------------------------------------------------------------------------------------
Chapters:
0:00 Introduction
0:58 Writing in the Margins (WiM)
3:15 WiM Example
3:53 WiM Results
Views: 565

Videos

Sapiens by Meta AI: Foundation for Human Vision Models
Views: 2.2K • a month ago
In this video we dive into Sapiens, a new family of models for four fundamental human-centric tasks, presented by Meta AI in a recent research paper titled "Sapiens: Foundation for Human Vision Models". The models' architecture is based on the Vision Transformer (ViT) and, for the first time, is trained on 1K-resolution images, five times larger than DINOv2's input image size! We cover the model's ...
Mixture of Nested Experts: Adaptive Processing of Visual Tokens | AI Paper Explained
Views: 519 • a month ago
In this video we dive into a recent research paper by Google, titled "Mixture of Nested Experts: Adaptive Processing of Visual Tokens". While standard Mixture of Experts (MoE) is successfully applied in LLMs, and also in computer vision, to increase model capacity without a proportional increase in computational cost, it comes with a large memory footprint. The Mixture of Nested Experts (MoNE) whi...
Introduction to Mixture-of-Experts (MoE)
Views: 2.2K • 2 months ago
In this video we go back to the extremely important Google paper which introduced the Mixture-of-Experts (MoE) layer, with authors including Geoffrey Hinton. The paper is titled "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer". MoE is widely used today in various top large language models, and interestingly, the paper was published at the beginning of 2017, while the Atten...
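As a rough illustration of the sparsely-gated MoE layer the paper introduces, here is a minimal NumPy sketch of top-k gating over a set of experts for a single token; the shapes, the softmax-over-selected-experts choice, and the toy linear experts are simplifying assumptions rather than the paper's exact formulation.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def moe_layer(x, experts, gate_w, k=2):
        """Sparsely-gated MoE for a single token vector x (illustrative sketch).

        experts: list of callables, each mapping a vector to a vector.
        gate_w:  gating matrix of shape (num_experts, dim)."""
        scores = gate_w @ x                      # one gating score per expert
        top_k = np.argsort(scores)[-k:]          # indices of the k highest-scoring experts
        weights = softmax(scores[top_k])         # renormalize over the selected experts
        # Only the selected experts are evaluated, which keeps the computation sparse.
        return sum(w * experts[i](x) for w, i in zip(weights, top_k))

    # Tiny usage example with random linear "experts".
    dim, num_experts = 8, 4
    rng = np.random.default_rng(0)
    experts = [lambda v, W=rng.normal(size=(dim, dim)): W @ v for _ in range(num_experts)]
    gate_w = rng.normal(size=(num_experts, dim))
    y = moe_layer(rng.normal(size=dim), experts, gate_w, k=2)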
Mixture-of-Agents (MoA) Enhances Large Language Model Capabilities
Views: 2.3K • 3 months ago
A new paper titled "Mixture-of-Agents Enhances Large Language Model Capabilities" shows a method to outperform GPT-4o on AlpacaEval 2.0 using open-source large language models (LLMs). In this video we explain the Mixture-of-Agents (MoA) method by diving into that research paper. Mixture-of-Agents (MoA) is inspired by the well-known Mixture-of-Experts (MoE) method, but unlike MoE, which embeds ...
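For intuition about the layered setup, here is a schematic Python sketch of a Mixture-of-Agents loop, assuming proposer and aggregator models wrapped as simple str -> str callables; the prompt wording and the number of layers are illustrative assumptions.

    def mixture_of_agents(proposers, aggregator, prompt, num_layers=2):
        """Illustrative MoA loop: several LLMs propose answers, and in each layer
        the proposers see the previous layer's answers before answering again.
        `proposers` is a list of callables and `aggregator` a callable (assumed wrappers)."""
        answers = [p(prompt) for p in proposers]          # first layer: independent answers
        for _ in range(num_layers - 1):
            combined = "\n\n".join(answers)
            layered_prompt = (
                f"{prompt}\n\nPrevious responses from other models:\n{combined}\n\n"
                "Write an improved response."
            )
            answers = [p(layered_prompt) for p in proposers]
        # A final aggregator model synthesizes the last layer's answers.
        return aggregator(
            f"{prompt}\n\nCandidate responses:\n" + "\n\n".join(answers) +
            "\n\nSynthesize these into a single best response."
        )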
Arithmetic Transformers with Abacus Positional Embeddings | AI Paper Explained
Views: 581 • 4 months ago
In this video we dive into a recent research paper titled "Transformers Can Do Arithmetic with the Right Embeddings". The paper introduces Abacus Embeddings, a new type of positional embedding. Using Abacus Embeddings, the researchers were able to train state-of-the-art transformers for number addition, with impressive logical extrapolation capabilities - a model that was trained on 20-digit ...
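For intuition, the sketch below assigns each digit a positional index counted from the least significant digit of its number (plus a random training-time offset), which is the core idea behind Abacus Embeddings; the character-level tokenization and the offset range are assumptions for illustration only.

    import random

    def abacus_position_ids(tokens, max_offset=10):
        """Illustrative Abacus-style positions for a character-tokenized input.

        Digits within the same number are indexed 1, 2, 3, ... starting from the
        least significant digit, so digits of equal significance across numbers
        share a position; non-digit tokens get 0. The random offset is meant to
        be applied at training time to encourage length generalization."""
        offset = random.randint(0, max_offset)
        ids = [0] * len(tokens)
        i = 0
        while i < len(tokens):
            if tokens[i].isdigit():
                j = i
                while j < len(tokens) and tokens[j].isdigit():
                    j += 1                      # the number spans tokens[i:j]
                for t in range(i, j):
                    ids[t] = (j - t) + offset   # least significant digit -> 1 + offset
                i = j
            else:
                i += 1
        return ids

    print(abacus_position_ids(list("987+65=")))  # digits: [3, 2, 1, 2, 1] + offset, non-digits: 0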
CLLMs: Consistency Large Language Models | AI Paper Explained
Views: 929 • 4 months ago
In this video we dive into Consistency Large Language Models (CLLMs), a new method introduced in a recent research paper to significantly improve the inference latency of large language models (LLMs). CLLMs efficiently decode multiple tokens in one forward pass, which makes response generation faster since there is no need to run a forward pass for each generated token. CLLMs rely...
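The multi-token decoding that CLLMs build on is Jacobi decoding: guess an n-token block, then let the model refine all positions in parallel until the block stops changing (CLLMs are fine-tuned so this converges in few iterations). Below is a minimal sketch assuming a next_tokens(prefix, block) helper that returns the model's greedy prediction for every position of the block in one forward pass; it is an illustration, not the paper's implementation.

    def jacobi_decode_block(next_tokens, prefix, block_len, pad_token=0, max_iters=50):
        """Illustrative Jacobi decoding of one block of `block_len` tokens.

        next_tokens(prefix, block) is assumed to return, in a single forward pass,
        the greedy next-token prediction at every position of `block` given `prefix`."""
        block = [pad_token] * block_len            # initial (arbitrary) guess
        for _ in range(max_iters):
            new_block = next_tokens(prefix, block) # refine all positions in parallel
            if new_block == block:                 # fixed point: matches token-by-token
                break                              # greedy autoregressive decoding
            block = new_block
        return block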
ReFT: Representation Finetuning for Language Models | AI Paper Explained
Views: 3K • 5 months ago
Can LoReFT be a rival for LoRA? According to the ReFT paper, it has the potential to replace LoRA in various cases. In this video we dive into the research paper that presents ReFT and LoReFT. We'll explain what representation fine-tuning (ReFT) is, and how it differs from previous parameter-efficient fine-tuning (PEFT) methods such as LoRA. ReFT is a family of methods that can be used to ada...
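For intuition, LoReFT edits a frozen model's hidden representation h in a learned low-rank subspace, roughly h + R^T(W h + b - R h) with a low-rank projection R; the NumPy snippet below is a sketch of that formula under assumed shapes, not the authors' implementation.

    import numpy as np

    def loreft_intervention(h, R, W, b):
        """Illustrative LoReFT-style edit of a hidden state h (shape: (d,)).

        R: (r, d) low-rank projection (rows assumed orthonormal), W: (r, d), b: (r,).
        Only R, W, and b are trained; the base model stays frozen."""
        return h + R.T @ (W @ h + b - R @ h)

    d, r = 16, 4
    rng = np.random.default_rng(0)
    R = np.linalg.qr(rng.normal(size=(d, r)))[0].T   # orthonormal rows, shape (r, d)
    h_edited = loreft_intervention(rng.normal(size=d), R, rng.normal(size=(r, d)), rng.normal(size=r))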
Stealing Part of a Production Language Model | AI Paper Explained
Views: 1.7K • 6 months ago
Many of the top LLMs today are closed source. What if we could discover their internal weights? In this video we dive into a recent research paper from Google DeepMind which presents an attack on large language models. The attack targets transformer-based LLMs that expose log probabilities as part of their API, which includes GPT-4 and PaLM-2. The researchers successfully used the attack to di...
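The key observation behind the attack is that full logit vectors of a transformer LM live in a subspace whose dimension is the hidden size, so stacking many of them and inspecting the singular values reveals the hidden dimension (and the final projection layer up to a linear transform). Here is a simulated NumPy sketch of that rank argument; the sizes and threshold are arbitrary assumptions and no real API is involved.

    import numpy as np

    # Simulate a model with hidden size 64 and vocab size 1000: logits = W @ h.
    rng = np.random.default_rng(0)
    hidden, vocab, n_queries = 64, 1000, 256
    W = rng.normal(size=(vocab, hidden))                    # final projection (unknown to the attacker)
    logits = (W @ rng.normal(size=(hidden, n_queries))).T   # one full logit vector per query

    # The attacker only sees `logits`; its numerical rank exposes the hidden size.
    singular_values = np.linalg.svd(logits, compute_uv=False)
    estimated_hidden = int((singular_values > 1e-6 * singular_values[0]).sum())
    print(estimated_hidden)  # -> 64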
The Era of 1-bit LLMs by Microsoft | AI Paper Explained
Views: 90K • 7 months ago
In this video we dive into a recent research paper by Microsoft: "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits". This paper introduces an interesting and exciting architecture for large language models, called BitNet b1.58, which significantly reduces LLMs' memory consumption and speeds up LLM inference. All of that while showing promising results that do not fall...
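For intuition about what "1.58 bits" means: each weight takes one of three values {-1, 0, +1}, and log2(3) ≈ 1.58. The NumPy sketch below shows an absmean-style ternary quantization of a weight matrix; treat it as an illustration of the idea rather than the paper's exact training recipe.

    import numpy as np

    def ternary_quantize(W, eps=1e-8):
        """Quantize weights to {-1, 0, +1} with a per-matrix scale (illustrative).

        Scale by the mean absolute value ("absmean"), round to the nearest of
        -1/0/+1, and keep the scale so outputs can be rescaled after the matmul."""
        scale = np.abs(W).mean() + eps
        W_q = np.clip(np.round(W / scale), -1, 1)
        return W_q.astype(np.int8), scale

    W = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
    W_q, scale = ternary_quantize(W)
    # A matmul with W_q needs only additions/subtractions; multiply by `scale` afterwards.
    y = scale * (W_q.astype(np.float32) @ np.ones(8, dtype=np.float32))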
V-JEPA by Meta AI - A Human-Like Computer Vision Video-based Model
Views: 5K • 7 months ago
In this video we dive into V-JEPA, a new collection of vision models created by Meta AI. V-JEPA stands for Video Joint-Embedding Predictive Architecture, and it is part of Meta AI's implementation of Yann LeCun's vision for a more human-like AI. In this video we dive deep into the research paper which presented V-JEPA, titled "Revisiting Feature Prediction for Learning Visual Representations ...
Self-Rewarding Language Models by Meta AI - Path to Open-Source AGI?
Views: 3.7K • 8 months ago
In this video we review a new paper titled "Self-Rewarding Language Models" by Meta AI. This paper was published on the same day that Mark Zuckerberg announced that Meta AI is working towards building an open-source AGI, and this paper may be a step in that direction. The paper introduces a method to self-align a pre-trained large language model (LLM) that can replace standard RLHF and RLAIF. Th...
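At a high level, the loop is: the model generates candidate responses, scores them itself with an LLM-as-a-judge prompt, and is then trained on the resulting preference pairs (DPO in the paper). Below is a schematic sketch of one iteration; model.generate, model.judge_score, and dpo_train are assumed placeholder helpers, not real APIs.

    def self_rewarding_iteration(model, prompts, dpo_train, num_candidates=4):
        """Schematic sketch of one self-rewarding iteration (all helpers are assumed).

        model.generate(prompt) -> str and model.judge_score(prompt, response) -> float
        are placeholder wrappers; dpo_train(model, pairs) -> model stands in for the
        preference-optimization step used in the paper."""
        preference_pairs = []
        for prompt in prompts:
            candidates = [model.generate(prompt) for _ in range(num_candidates)]
            # The same model scores its own candidates via an LLM-as-a-judge prompt.
            ranked = sorted(candidates, key=lambda r: model.judge_score(prompt, r))
            preference_pairs.append((prompt, ranked[-1], ranked[0]))  # (prompt, chosen, rejected)
        # Train on the self-generated preference pairs and return the updated model.
        return dpo_train(model, preference_pairs)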
Fast Inference of Mixture-of-Experts Language Models with Offloading
Views: 1.3K • 8 months ago
In this video we review a recent important paper titled "Fast Inference of Mixture-of-Experts Language Models with Offloading". Mixture of Experts (MoE) is an important strategy nowadays for improving the efficiency of transformer-based large language models (LLMs). However, MoE models usually have a large memory footprint since we need to load the weights of all experts. This makes it hard to ru...
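The offloading idea boils down to keeping only recently used experts in GPU memory and fetching the rest on demand (the paper combines an expert cache with speculative expert loading and quantization). Here is a minimal LRU expert-cache sketch in plain Python; the capacity and the load_expert callable are illustrative assumptions.

    from collections import OrderedDict

    class ExpertCache:
        """Illustrative LRU cache of MoE experts: hot experts stay on the GPU,
        cold experts are loaded from CPU/disk on demand, evicting the least
        recently used expert when the cache is full."""

        def __init__(self, load_expert, capacity=4):
            self.load_expert = load_expert          # callable: expert_id -> expert weights
            self.capacity = capacity
            self.cache = OrderedDict()              # expert_id -> weights, in LRU order

        def get(self, expert_id):
            if expert_id in self.cache:
                self.cache.move_to_end(expert_id)   # mark as most recently used
                return self.cache[expert_id]
            weights = self.load_expert(expert_id)   # cache miss: fetch from slower memory
            self.cache[expert_id] = weights
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)      # evict the least recently used expert
            return weights

    # Usage: cache = ExpertCache(load_expert=lambda i: f"weights-{i}"); cache.get(3)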
TinyGPT-V: Small but Mighty Multimodal Large Language Model
Views: 1.5K • 9 months ago
In this video we explain how the TinyGPT-V model was built, by reviewing the research paper that presents it: "TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones". TinyGPT-V is a small multimodal large language model (MLLM) that uses Phi-2 as its backbone LLM. By being based on Phi-2, TinyGPT-V has only 2.8B params, which makes it smaller compared to other MLLMs that are base...
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Views: 4.1K • 9 months ago
In this video we review a recent important paper from Apple, titled "LLM in a flash: Efficient Large Language Model Inference with Limited Memory". This paper presents a method to run large language models (LLMs) on devices that do not have enough memory to store the entire model's weights. This is exciting progress in LLM democratization, as it brings us closer to using top large language m...
Introduction to Vision Transformers (ViT) | An Image is Worth 16x16 Words
Views: 2.4K • 9 months ago
Introduction to Vision Transformers (ViT) | An Image is Worth 16x16 Words
Orca 2 by Microsoft: Teaching Small Language Models How to Reason
Views: 2.1K • 10 months ago
Orca 2 by Microsoft: Teaching Small Language Models How to Reason
LCM-LoRA: From Diffusion Models to Fast SDXL with Latent Consistency Models
Views: 2.9K • 10 months ago
LCM-LoRA: From Diffusion Models to Fast SDXL with Latent Consistency Models
CODEFUSION by Microsoft: A Pre-trained Diffusion Model for Code Generation
Views: 1.1K • 11 months ago
CODEFUSION by Microsoft: A Pre-trained Diffusion Model for Code Generation
Table-GPT by Microsoft: Empower LLMs To Understand Tables
Views: 7K • 11 months ago
Table-GPT by Microsoft: Empower LLMs To Understand Tables
Vision Transformers Need Registers - Fixing a Bug in DINOv2?
Views: 2.5K • 11 months ago
Vision Transformers Need Registers - Fixing a Bug in DINOv2?
Emu by Meta AI: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
Views: 794 • a year ago
Emu by Meta AI: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
NExT-GPT: Any-to-Any Multimodal LLM
Views: 7K • a year ago
NExT-GPT: Any-to-Any Multimodal LLM
Large Language Models As Optimizers - OPRO by Google DeepMind
Views: 3.1K • a year ago
Large Language Models As Optimizers - OPRO by Google DeepMind
FACET by Meta AI - Fairness in Computer Vision Evaluation Benchmark
Views: 431 • a year ago
FACET by Meta AI - Fairness in Computer Vision Evaluation Benchmark
Code Llama Paper Explained
Views: 2.1K • a year ago
Code Llama Paper Explained
WizardMath from Microsoft - Best Open Source Math LLM with Reinforced Evol-Instruct
Views: 3.5K • a year ago
WizardMath from Microsoft - Best Open Source Math LLM with Reinforced Evol-Instruct
Shepherd by Meta AI - A Critic for Large Language Models
Views: 696 • a year ago
Shepherd by Meta AI - A Critic for Large Language Models
Soft Mixture of Experts - An Efficient Sparse Transformer
Views: 4.8K • a year ago
Soft Mixture of Experts - An Efficient Sparse Transformer
Universal and Transferable LLM Attacks - A New Threat to AI Safety
Views: 2.6K • a year ago
Universal and Transferable LLM Attacks - A New Threat to AI Safety

Comments

  • @thienthuoan1081
    @thienthuoan1081 11 days ago

    Nice video, easy to understand the principle.

  • @aamir122a
    @aamir122a 24 days ago

    Is there an implementation somewhere?

  • @armaneshaghi6732
    @armaneshaghi6732 a month ago

    Are these models also good for segmentation?

  • @liangzijian4452
    @liangzijian4452 a month ago

    nice video!

  • @wainrebGilad
    @wainrebGilad a month ago

    thank you for the clear explanation

  • @ariamehrmaleki8964
    @ariamehrmaleki8964 a month ago

    Thanks for the video! Can you also do a video on how to use these models from GitHub in Google Colab, please?

  • @xxlvulkann6743
    @xxlvulkann6743 a month ago

    Great explanation! It is interesting to see how attention matrices aid in interpretability research and in getting better representations! I wonder how this could be applied to other modalities (such as audio).

  • @TurboKoder
    @TurboKoder 2 months ago

    Sorry, but this paper is more of a brief introduction to 1-bit LLMs, and the video itself doesn't explain anything beyond reading it out loud. There are multiple open questions, like what is a viable way to train such models, how it influences activation functions, and what the real benefit is here: it suggests that without multiplications today's GPUs would not be required, which is not really true. And requiring new, optimized hardware is not really a cool path to go forward.

  • @marzi869
    @marzi869 2 months ago

    Thanks, but please remove the background music.

  • @OpenAITutor
    @OpenAITutor 2 months ago

    I love this approach. I created a version using groq and open-webui! It rocks!!

  • @geraldkenneth119
    @geraldkenneth119 2 months ago

    It reminds me of BYOL, but with an enhanced training scheme

  • @stevenkies802
    @stevenkies802 2 months ago

    Another excellent episode. Your channel is underappreciated.

  • @karthickdurai2157
    @karthickdurai2157 3 months ago

    I think the WizardMath model has been removed from Hugging Face.

  • @menkiguo7805
    @menkiguo7805 3 months ago

    What is the background music, btw?

  • @fallinginside3001
    @fallinginside3001 3 months ago

    Thank you

  • @gabrielsandstedt
    @gabrielsandstedt 3 months ago

    How feasible is it to adapt BitNet b1.58's ternary quantization (-1, 0, 1) for quantum computing using qutrits, given the current state of qutrit-based hardware, error correction, and the development of specialized quantum algorithms?

  • @SparshGarg-n8e
    @SparshGarg-n8e 3 months ago

    Thanks a lot!

  • @RobBrogan
    @RobBrogan 3 months ago

    A little bit like how I’m using Perplexity that lets me refresh a response with a different model. Except I’m using my human brain to choose an ideal model or draw info from different ones. Or how back in the day, the best search engine was this one that used multiple services (dogpile? Can’t remember). Maybe my comparison is bad, but definitely look forward to a tool that combines multiple LLMs.

  • @vladyslavkorenyak872
    @vladyslavkorenyak872 4 months ago

    This channel is amazing! I feel inspired by the simplicity of the ideas and their results. So many low-hanging fruits!

  • @TheSparkoi
    @TheSparkoi 4 months ago

    Thank you so much for all your explanations of complex topics :)

  • @TheSparkoi
    @TheSparkoi 4 months ago

    Hey, do you think we can get more than 0.7 frames per second if you render only 500x500, with a 4090 as hardware?

  • @eladwarshawsky7587
    @eladwarshawsky7587 4 months ago

    Great job on the video. I read this paper a while ago, and this is a great explanation. I hear the accent, so if you're ever in Tel Aviv I'd be happy to meet up.

  • @jameswhitaker4357
    @jameswhitaker4357 5 months ago

    So interesting! 👀

  • @StrugglingIdiot
    @StrugglingIdiot 5 months ago

    Is it over already? I was sleeping. 😴

  • @aryamanarora4967
    @aryamanarora4967 5 months ago

    Thank you for making this excellent video about our work! Minor note: at the end you mention 18-minute training time for our instruction-following ReFT, but that number is only for the small 1K subset of Ultrafeedback (last row in table). It takes a couple hours to train on the whole dataset, but we wanted to show that ReFT is also data-efficient through that number.

    • @aipapersacademy
      @aipapersacademy 5 months ago

      Thank you Aryaman for the kind feedback and for the correction 🙏

  • @xuantungnguyen9719
    @xuantungnguyen9719 5 months ago

    Thanks

  • @SuperCombatarms
    @SuperCombatarms 6 months ago

    Is there any code associated with this study?

  • @ameynaik2743
    @ameynaik2743 6 months ago

    I believe this is applicable only to a single request? If you have a change of experts, you will most likely have many experts active across various requests. Is my understanding correct? Thank you.

  • @TommyJefferson1801
    @TommyJefferson1801 6 months ago

    Or else let's do distillation + a lower-bit transformer 😅

  • @caseyalanjones
    @caseyalanjones 6 months ago

    Interesting, but what does "JOINT" mean in this context?

    • @兴宣大院君-h4s
      @兴宣大院君-h4s 6 months ago

      Maybe it means the embeddings of the target and the context. They are joint?

    • @caseyalanjones
      @caseyalanjones 6 months ago

      @@兴宣大院君-h4s Yes, that could be - thanks!

  • @caseyalanjones
    @caseyalanjones 6 months ago

    Thanks!

  • @imaginebaggins2691
    @imaginebaggins2691 6 months ago

    Good content, but I think it would be better if you went a little more in depth into all the smaller details. For me it felt like it was going a bit too fast and I didn't understand parts of what was happening, so I would prefer a more in-depth explanation of the technical details.

  • @sanesanyo
    @sanesanyo 6 months ago

    Is there benchmarking data available for larger LLMs like GPT4-Turbo or Claude-3-Opus?

  • @oryxchannel
    @oryxchannel 6 months ago

    The thinking around BitNet b1.58 is intimately tied to the .gif in the article "Stanford engineers propose a simpler design for quantum computers". See the short .gif in action. Funding for that research began prior to 2021 and was provided largely by the US Department of Defense. Guess who virtually IS the US military, by virtue of having a $3T market cap to keep secret projects secret? That's right: Microsoft.

  • @arjavgarg5801
    @arjavgarg5801 6 months ago

    Model weights will make a lot more sense

  • @burthacklin
    @burthacklin 6 months ago

    This is something I predicted would happen in AI, and it's cool to see a concrete usage of it. Ternary computers are the most efficient computers and base 3 is the most efficient base, so this isn't surprising. Read up on radix economy to learn more.

    • @antonf.9278
      @antonf.9278 6 months ago

      How would you represent ternaries in hardware? Would you leave pins floating, force them to the middle with a voltage divider, or add a second pin?* Also, in general computing, multiplication by unknowns and division by non-powers of 2 are rare operations. All of that ignores the added complexity that would nullify the advantages of radix economy, because it would increase the complexity of division by abandoning the simple check in binary long division for the guess-and-check needed in bases larger than 2. *In the first case you could not run at high clock speeds, because stray capacitance and inductance would cause errors. Second case: transistors become inefficient at the midpoint between high and low, causing massive energy consumption and heating. Third case: a second line allows you to use nibbles, meaning you just ignore certain states out of principle and waste computational power.

    • @burthacklin
      @burthacklin 6 months ago

      @@antonf.9278 Just use negative voltages. Also, division by non-powers of 2 is VERY common in computing, as most division in applications will not be by a power of 2, like in machine learning.

  • @anilaxsus6376
    @anilaxsus6376 6 months ago

    But how is the accuracy?

  • @giacintoboccia9386
    @giacintoboccia9386 6 months ago

    We had a lecture about single-bit neural networks in one of my uni courses, some 5 years ago. It was interesting.

  • @maxvell77
    @maxvell77 6 months ago

    Thanks!

  • @maxvell77
    @maxvell77 6 months ago

    Well explained! Thanks for the well-written script, it helped me so much. Keep going!

  • @xianghaisheng7800
    @xianghaisheng7800 6 months ago

    It's a bit difficult to understand your accent, probably because I'm not a native speaker. Would you consider using an AI-synthesized voice?

    • @rkvkydqf
      @rkvkydqf 6 months ago

      Please don't. Most TTS engines have become my personal heuristic for low-effort spam (sometimes including automated content farms). Voice acting is a skill and will improve over time if you let it. There is an individuality, a subtle inflection and candidness of a person's interior thoughts matching the waveforms you hear, that neither a hired voice actor nor a TTS model could replicate.

  • @hypervanse
    @hypervanse 7 months ago

    I wonder why people don't use this approach from the beginning. It's like LLMs in assembly language. And as far as I know, every linear operator has a kernel. The kernel means that a linear operator H always maps the zero vector to itself. When we use a computer, we represent the zero vector as a column matrix of n zeros. Since the layers of LLMs are in the same vector space, we have H\vec{0} = \vec{0} for any H. I apologize for my bad LaTeX, but \vec{0} is supposed to be a vector. It's important to remember that 0 is the trivial element in the kernel. For example, let Z be the set of all integers, and let H be the multiplication operator. Then, in ordinary algebra, we have positive, zero, and negative integers. The operator is \cdot, not x. The multiplication operator is often used in quantum mechanics of many particles, where the vector space grows exponentially, just like the number of bits for multiple objects.

  • @chodnejabko3553
    @chodnejabko3553 7 months ago

    This might be even more of an advantage when we get dedicated hardware, since tri-state logic is already a thing in CMOS. A dedicated tri-state matrix multiplication architecture for this type of network should be easy to engineer with modern processes. NVIDIA should be all over that.

  • @Tohidkhan-lt4pd
    @Tohidkhan-lt4pd 7 months ago

    🎉😊❤

  • @adamhafchadi4924
    @adamhafchadi4924 7 months ago

    what is that accent?

    • @ilianos
      @ilianos 7 months ago

      was looking for the same

    • @Jonas-gm4my
      @Jonas-gm4my 6 months ago

      I would guess French.

  • @forheuristiclifeksh7836
    @forheuristiclifeksh7836 7 months ago

    2:13

  • @fernandos-bs6544
    @fernandos-bs6544 7 months ago

    I just found your channel. It is amazing. Congratulations. Your numbers will grow soon, I am sure. Great quality and great content.

  • @ntal5859
    @ntal5859 7 months ago

    So in summary, everything is either Yes = 1, Never mind = 0, No = -1. If only women were so simple to work out.

  • @yash1152
    @yash1152 7 months ago

    1:56 Is the "same performance" with "Pareto improvement" just an illustration of a theoretical prediction, or actual weight data from a real trit model?

  • @pmarreck
    @pmarreck 7 months ago

    This is great! FYI, you can create a model of your voice in ElevenLabs, do a voice-to-voice transformation, and out would come perfectly pronounced English. I found this out by accident: I created a model of Arnold Schwarzenegger's voice, and everything I made it say LOST the accent but kept his tone of voice, LOL.

    • @hypervanse
      @hypervanse 6 months ago

      That may be fun, but it can clearly be potentially much more dangerous than a password leak. You trained it with your voice, right? Would you want someone to make calls with hate speech using your voice, for example?