Run 70B Llama 3 Inference on a Single 4GB GPU

  • Published on May 2, 2024
  • Code : github.com/rohan-paul/LLM-Fin...
    🐦 Connect with me on Twitter : / rohanpaul_ai
    Airllm Github - github.com/lyogavin/Anima/tre...
    Check out the MASSIVELY UPGRADED 2nd Edition of my Book (with 1300+ pages of Dense Python Knowledge) 🐍🔥
    Covering 350+ Python 🐍 Core concepts (1300+ pages) 🚀
    🟠 Book Link - rohanpaul.gumroad.com/l/pytho...
    -----------------
    Hi, I am a Machine Learning Engineer | Kaggle Master. Connect with me on 🐦 TWITTER: / rohanpaul_ai - for daily in-depth coverage of Large Language Model bits
    ----------------
    You can find me here:
    **********************************************
    🐦 TWITTER: / rohanpaul_ai
    👨🏻‍💼 LINKEDIN: / rohan-paul-ai
    👨‍🔧 Kaggle: www.kaggle.com/paulrohan2020
    👨‍💻 GITHUB: github.com/rohan-paul
    🧑‍🦰 Facebook : / rohan.paul.562
    📸 Instagram: / rohan_paul_2020
    **********************************************
    Other Playlist you might like 👇
    🟠 MachineLearning & DeepLearning Concepts & interview Question Playlist - bit.ly/380eYDj
    🟠 ComputerVision / DeepLearning Algorithms Implementation Playlist - bit.ly/36jEvpI
    🟠 DataScience | MachineLearning Projects Implementation Playlist - bit.ly/39MEigt
    🟠 Natural Language Processing Playlist : bit.ly/3P6r2CL
    ----------------------
    #LLM #Largelanguagemodels #Llama2 #LLMfinetuning #opensource #NLP #ArtificialIntelligence #datascience #textprocessing #deeplearning #deeplearningai #100daysofmlcode #neuralnetworks #datascience #generativeai #generativemodels #OpenAI #GPT #GPT3 #GPT4 #chatgpt #genai

Comments • 49

  • @scottmiller2591 · months ago · +1

    Good writeup - covered when it's applicable, and the pros and cons. I would recommend using it on a machine with a lot of RAM, setting up a RAM disk, and using that for your cache - that would knock the latency down somewhat.
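    A minimal sketch of that idea, assuming a tmpfs is already mounted at /mnt/ramdisk and that the installed AirLLM version exposes a `layer_shards_saving_path` argument (the parameter name is taken from AirLLM's README, not from the video; check your version before relying on it):

    ```python
    # Sketch only: keep AirLLM's per-layer shards on a RAM-backed filesystem so each
    # layer load hits RAM instead of the SSD.
    # Assumed setup: `sudo mount -t tmpfs -o size=80G tmpfs /mnt/ramdisk`
    # (the size must cover the splitted/compressed shard files).
    from airllm import AutoModel

    model = AutoModel.from_pretrained(
        "garage-bAInd/Platypus2-70B-instruct",
        layer_shards_saving_path="/mnt/ramdisk/airllm_shards",  # RAM-backed shard cache (assumed kwarg)
    )
    ```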

  • @tshawtshi3040 · months ago

    I was thinking about this for a while. I'm glad someone did it. I think that, if done properly, you can get performance similar to having all the weights in VRAM.

  • @honestgoat · months ago · +4

    Using this method then, is it possible to run, say, a 350B model on a GPU with 20/24GB VRAM?
    Say, could Grok-1, which is 314B, run on a 3090/4090 using this method?
    I know it would be slow af, but it could work, right?

    • @RohanPaul-AI · months ago · +1

      Theoretically possible. The layered-inference approach just does the sequential loading and unloading of model layers. Of course, the latency will accumulate and result in extremely slow inference.
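      A rough sanity check of that (the layer count and fp16 precision below are assumptions for illustration, not figures from the video):

      ```python
      # Back-of-envelope sketch: per-layer VRAM needed if a ~314B-parameter model is
      # streamed one transformer layer at a time.
      total_params = 314e9     # roughly Grok-1 sized
      num_layers = 64          # assumed transformer layer count
      bytes_per_param = 2      # fp16

      per_layer_gb = total_params / num_layers * bytes_per_param / 1e9
      print(f"~{per_layer_gb:.1f} GB per layer")  # ~9.8 GB -> fits in a 20-24 GB GPU
      ```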

  • @gaborcsurke6937 · months ago

    The question is: if we have more VRAM, like 16 or 24GB, can that be used to mitigate the SSD bottleneck further? Maybe that way it can read not just one layer but multiple, and be even faster.

    • @RohanPaul-AI · months ago · +1

      Yes, I think it's possible, i.e. you can manage the number of layers allocated to the GPU, reducing the frequency of SSD reads.
      Here's the longer answer.
      - In the current implementation of their code (check the GitHub repo), the `AirLLMBaseModel` class in `airllm_base.py` loads and processes one layer at a time during the forward pass. However, you can modify the `forward` method to load and cache a certain number of layers based on the available GPU memory.
      For example, you can introduce a configuration parameter to specify the number of layers to cache in GPU memory. Then, in the `forward` method, you can load and store layers in a cache until the specified number of layers is reached. When processing the next layer, you can check whether it is already in the cache before loading it from SSD.
      Here's a simplified example of how you could modify the `forward` method to cache multiple layers:
      ```python
      def forward(self, ...):
          ...
          max_cached_layers = 4       # maximum number of layers to keep resident in GPU memory
          cached_layer_names = []     # names of the cached layers, in load order
          cached_layers = {}          # layer_name -> layer whose weights are already on the GPU

          for i, (layer_name, layer) in enumerate(zip(self.layer_names, self.layers)):
              if layer_name in cached_layers:
                  # Layer is already resident on the GPU, reuse it directly
                  layer = cached_layers[layer_name]
              else:
                  # Load the layer's weights from SSD to CPU, then move them onto the GPU
                  state_dict = self.load_layer_to_cpu(layer_name)
                  self.move_layer_to_device(state_dict)
                  cached_layer_names.append(layer_name)
                  cached_layers[layer_name] = layer

                  # Evict the oldest cached layer once the cache exceeds its budget
                  if len(cached_layer_names) > max_cached_layers:
                      oldest = cached_layer_names.pop(0)
                      cached_layers.pop(oldest)  # also offload/free its weights to reclaim GPU memory

              # Process the layer
              ...
      ```
      In this example, the `max_cached_layers` variable determines the maximum number of layers to cache in GPU memory, and `cached_layer_names` tracks the currently cached layers in load order. When processing a layer, the code first checks whether it is already cached. If not, it loads the layer from SSD, adds it to the cache, and evicts the oldest cached layer if the cache size exceeds the maximum.
      - By caching multiple layers in GPU memory, you reduce the number of SSD reads required during inference.
      Additionally, you may need to handle the case where a single layer itself exceeds the available GPU memory. In such scenarios, you might need to explore other techniques like tensor parallelism or model sharding to distribute the layer across multiple GPUs or devices.

  • @javiergimenezmoya86 · months ago · +12

    Is it possible to configure that library to use RAM instead of SSD? It would be useful if you have a computer with a lot of RAM (e.g. 64GB), because all the layers would fit in memory with 4-bit quantization.

    • @RohanPaul-AI · months ago · +3

      I was thinking the same about offloading to RAM, as it has become so cheap. However, on a quick search I could not find that option in the library yet. I will need to investigate more. If you find it, please let me know as well.

    • @i6od · months ago · +4

      ...isn't this question ironic? Doesn't an LLM naturally load into RAM/VRAM? The whole point of this project is to switch to an actual storage drive, so you can run the 70B from the drive instead of having issues with overloading VRAM/RAM.

    • @RohanPaul-AI · months ago · +1

      @i6od Indeed, this project brings a completely new way to deal with LLMs beyond RAM/VRAM.

    • @brianlink391 · months ago

      Really simple to do: just create a RAM drive with a simple application you can download, then put your model into the RAM drive and load it from there, and you're all set.

    • @poldiderbus3330 · months ago

      I would then just try to use a RAM disk.

  • @Linuslkm · months ago

    Have you tried it on a RAM disk? If so, could you make another video comparing performance?

    • @RohanPaul-AI · months ago · +1

      No, I haven't tried that yet, but I will.

  • @nexusphreez · months ago

    So my only question is: can this be integrated with Ollama?

    • @RohanPaul-AI · months ago

      I don't think Ollama supports this.

  • @krisKrag · months ago

    Isn't there a paper from Apple in 2023 doing this? The difference is that Apple targets efficiency in reading chunks specifically on its own hardware. YouTube censored my previous comment where I pasted the link and title of the paper :/

    • @RohanPaul-AI · months ago

      Yes I think you are talking about "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory"
      twitter.com/rohanpaul_ai/status/1737425137451073573

  • @jnchacon · months ago · +4

    Why SSD? Why not RAM?
    If I have enough RAM to hold the entire LLM, can the layers be read from RAM (RAM to VRAM)?

    • @brianmi40 · months ago · +1

      Google "RAM disk" - still a thing in Win 11...

    • @RohanPaul-AI · months ago · +1

      Yes, offloading to RAM will always be better given how cheap it is.
      But this library wanted a new way to deal with LLMs, bypassing RAM/VRAM as much as possible.

  • @RobertMcGovernTarasis · months ago

    How much disk space would this all need?

    • @RohanPaul-AI · months ago

      You just need to be able to accommodate the entire model on your SSD.
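      As a rough guide (assuming standard fp16 checkpoints; the 4-bit figure assumes the library's compression option is used):

      ```python
      # Rough sketch of the disk footprint for a 70B model (illustrative, not measured).
      params = 70e9
      print(f"fp16 : ~{params * 2 / 1e9:.0f} GB")    # ~140 GB of checkpoint shards
      print(f"4-bit: ~{params * 0.5 / 1e9:.0f} GB")  # ~35 GB if stored compressed
      ```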

  • @lou.later269 · months ago · +2

    Damn, imagine the same optimization for an 8B model - the speeds would rival Groq.

    • @RohanPaul-AI · months ago · +1

      Yes, but the actual speed may not improve much, as you still have to do disk I/O. So you will always be bottlenecked by your SSD read speed.
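      A ballpark illustration of that bottleneck (all figures below are assumed, order-of-magnitude numbers, not benchmarks):

      ```python
      # Upper bound on tokens/s if an 8B model's weights are streamed from SSD for
      # every generated token.
      model_gb = 4.0    # ~8B parameters at 4-bit
      nvme_read = 3.5   # GB/s, a typical NVMe sequential read rate (assumed)

      print(f"<= {nvme_read / model_gb:.1f} tok/s")  # ~0.9 tok/s - nowhere near Groq speeds
      ```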

    • @dinoscheidt · months ago · +4

      Exactly. The smaller the model, the higher the proportional IO overhead compared to compute... at 8B, paging memory in and out like this makes it far slower than it is right now. Large models spend so much time on compute per token that IO overheads like these can become negligible. But there are interesting developments like vLLM that use something like virtual memory management to pack a very large model into small GPU memory, skipping the need for IO speed (since there is no IO to disk) - everything stays in memory on the graphics card.

    • @RohanPaul-AI · months ago · +1

      @dinoscheidt Very well explained. Thanks.

    • @damien2198 · months ago

      To my understanding, Groq uses a similar trick, as their LPU has only 250MB (yes, MB) of memory.

    • @lostpianist · months ago

      @dinoscheidt Can't wait for vLLM Llama 3 400B. For a few years I've been hoping for something like that; then really top-level AI can be run locally by anyone with a reasonable computer and an OK graphics card... Will be amazing for productivity, gaming, etc.

  • @BrokenOpalVideos · months ago

    How many tokens per second would you get, though?

    • @RohanPaul-AI · months ago · +1

      Depends on the SSD read speed. It may vary, but on Mac hardware I was getting about 1 token every 2 seconds.

    • @Gatrehs · 29 days ago

      @RohanPaul-AI Is this a regular SSD or an NVMe?

    • @RohanPaul-AI · 29 days ago

      @Gatrehs It's NVMe.

  • @perelmanych · months ago

    There are many comments about loading layers from RAM instead of SSD. Basically, it doesn't make sense: you will get better performance doing all the computations on the CPU. Why? Very simple: when you run an LLM on the CPU, the main bottleneck is not CPU speed but RAM bandwidth, which is why it is much faster to run an LLM on a GPU - it has much higher memory bandwidth. With this lib you would have to copy each layer from RAM to VRAM and then compute the layer's output on the GPU. That doesn't make sense, since your CPU does the computation faster than it gets the data from RAM. So no magic here: if you want to run a very big model and it fits in RAM, just run it on the CPU.
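    A ballpark sketch of that bandwidth argument (all figures are assumed, order-of-magnitude numbers, not benchmarks):

    ```python
    # Per generated token, every weight must cross the slowest link at least once,
    # so link bandwidth / model size gives a rough upper bound on tokens per second.
    model_gb = 35     # ~70B parameters at 4-bit
    ram_bw = 50       # GB/s, typical dual-channel desktop RAM (assumed)
    pcie_bw = 25      # GB/s, realistic PCIe 4.0 x16 host-to-device copy rate (assumed)

    print(f"CPU reading from RAM : <= {ram_bw / model_gb:.1f} tok/s")   # ~1.4
    print(f"Copying RAM -> VRAM  : <= {pcie_bw / model_gb:.1f} tok/s")  # ~0.7
    ```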

  • @MuhammadAdnan-tq3fx · months ago · +1

    Is it possible to use it offline?

    • @RohanPaul-AI · months ago · +2

      Yes, you can point it at a locally downloaded model's path, like below:
      model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

    • @caseyhoward8261 · months ago

      @RohanPaul-AI Thank you! ❤