Prepare Fine-tuning Datasets with Open Source LLMs

  • Published Jan 5, 2025

Comments •

  • @nkhuang1390
    @nkhuang1390 1 year ago +10

    I purchased full access to your repo because I love and want to support the work you are doing. Some of the clearest and most articulate explanations of embeddings, fine-tuning, supervised vs. unsupervised methods, and data prep. Keep it up!

  • @AmbarPathak-w6c
    @AmbarPathak-w6c 6 days ago

    Is there a way to do the training locally on NVLink-paired RTX 4090 GPUs from raw data (multimodal PDFs) for a LLaVA 13B?

    • @TrelisResearch
      @TrelisResearch  6 days ago

      Yes! But it's probably better to use Qwen VL 7B. It's more powerful.

    • @AmbarPathak-w6c
      @AmbarPathak-w6c 5 days ago

      @@TrelisResearch even for dealing with multimodal PDFs at high volume?

    • @AmbarPathak-w6c
      @AmbarPathak-w6c 3 days ago

      @@TrelisResearch why do you think it will be more powerful?

  • @AmbarPathak-w6c
    @AmbarPathak-w6c 6 days ago

    Is it actually possible to do it on an RTX 4090 machine locally, without using any cloud API or cloud GPU provider, and using multimodal PDFs as your input data?

  • @devtest202
    @devtest202 10 months ago

    Hi, thanks!! A question about a model for which I have more than 2,000 PDFs. Do you recommend improving the handling of vector databases? When do you recommend fine-tuning, and when do you recommend a vector database?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      Start with a vector database, unless a) you need low latency and short prompts, or b) you want to do structured generation. Fine-tuning may give a small boost, but embeddings will be best.
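
The advice above can be sketched as a toy retrieval loop. This is a minimal illustration, not the Trelis code: the `cosine` and `top_k` helpers are hypothetical names, and in practice the vectors would come from an embedding model and live in a real vector database.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    # Return the indices of the k most similar document vectors
    scored = sorted(enumerate(doc_vecs),
                    key=lambda p: cosine(query_vec, p[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

# Toy 2-d "embeddings"; real ones have hundreds of dimensions
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], docs, k=2))  # → [0, 1]
```

The retrieved chunks are then stuffed into the prompt, which is why retrieval tends to mean longer prompts (and higher latency) than a fine-tuned model answering directly.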

  • @GrahamAndersonis
    @GrahamAndersonis 11 months ago

    On Runpod, how do I get/amend the Llama 70B API by TrelisResearch template to work with an exposed TCP port?
    The terminal says the connection is refused, both in the terminal and in VS Code (preferred).
    Other templates work fine.
    Doesn't work: SSH over exposed TCP (supports SCP & SFTP).
    Works: the basic SSH terminal (no support for SCP & SFTP).
    The basic SSH terminal is not going to work with VS Code, to my knowledge.
    Perhaps there is a way to edit the templates for these containers so they can work with VS Code?
    I'm really looking forward to digging into your tutorials :)

    • @sagardesai1253
      @sagardesai1253 11 months ago

      Hello @GrahamAndersonis,
      out of the box, Debian Linux does not come with an SSH server installed.
      1. In the Runpod image, you have to pass the public key, as well as TCP port 22.
      2. Run the following commands in the basic terminal:
      ####
      # Update package lists for upgrades and new package installations
      apt update;
      # Install the OpenSSH server non-interactively to avoid prompts during installation
      DEBIAN_FRONTEND=noninteractive apt-get install openssh-server -y;
      # Start the SSH service to enable remote connections
      service ssh start;
      ####
      3. After this, the Runpod instance will have SSH available to connect to.
      4. Use VS Code's Remote - SSH extension to connect to the pod as a remote server.
      5. This will have SCP and SFTP enabled.
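
Once SSH is running in the pod, connecting VS Code usually just needs a matching entry in the local `~/.ssh/config`. The values below are placeholders; substitute the IP and external port that Runpod shows for the exposed TCP connection:

```
# ~/.ssh/config (local machine) — hypothetical values
Host runpod
    HostName 203.0.113.10      # pod's public IP from the Runpod connect dialog
    User root
    Port 22022                 # external port mapped to the pod's port 22
    IdentityFile ~/.ssh/id_ed25519
```

VS Code's Remote - SSH extension can then connect via "Remote-SSH: Connect to Host..." using the `runpod` host alias.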

    • @TrelisResearch
      @TrelisResearch  10 months ago +1

      Hi Graham, yeah, I had this issue too and will post a workaround shortly. Ultimately the image would need to be updated for a permanent fix (but I don't control that image).

    • @GrahamAndersonis
      @GrahamAndersonis 10 months ago

      @@TrelisResearch fantastic work

  • @unshadowlabs
    @unshadowlabs 1 year ago +2

    Great video! How are you chunking the videos: by paragraph, sentence, word, character, etc.? Are you using any overlap in the chunks? Have you tested your system with a smaller Llama 2 model? What kind of results would one get from a Llama 2 13B, or even a 7B, that could possibly be run from home?

    • @TrelisResearch
      @TrelisResearch  1 year ago

      Howdy!
      Here, I chunk into 500- or 750-token chunks. If you chunk too small, the cropped sentence at the end has too much effect and you get hallucination. If you use chunks that are too big, you'll get too many questions (and LLMs often aren't able to respond consistently to very long lists of questions).
      Check out my supervised fine-tuning video; that's done on 13B. With enough data, you can get to reasonable quality. 7B is tough unless you have a lot of data (or are fine-tuning for structured responses).
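
The chunking described above can be sketched in a few lines. `chunk_tokens` is a hypothetical helper, not the repo's code, and a real pipeline would count tokens with the model's tokenizer rather than splitting on whitespace:

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks, with overlap so a
    sentence cropped at a boundary reappears intact in the next chunk."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Crude whitespace "tokens" for illustration only
tokens = ("word " * 1200).split()
print([len(c) for c in chunk_tokens(tokens)])  # → [500, 500, 300]
```

Each chunk then gets fed to the large model to generate question/answer pairs.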

    • @unshadowlabs
      @unshadowlabs 1 year ago

      @@TrelisResearch Thanks for the reply. I watched the whole series after I posted this. Very good series! :) What are your thoughts on using a 7B model just for the Q&A creation, and then fine-tuning the larger 70B model on that data? Is there any benefit to using such a large model in the Q&A creation step?

    • @TrelisResearch
      @TrelisResearch  1 year ago +1

      @@unshadowlabs Yeah, I think you need to use a big model for Q&A because you don't want hallucination in the Q&A set. Data quality is crucial, and 7B hallucinates too much.

  • @HemangJoshi
    @HemangJoshi 10 months ago

    I want to fine-tune on my code. I have multiple folders and files in each project that I want to fine-tune on. Can this private repo work for that? Basically, I want to fine-tune on my coding projects.

    • @TrelisResearch
      @TrelisResearch  10 months ago

      Yes, this can work. If you're dealing with a file structure, you may want to decide which files to include and then flatten them into one single .txt file. It can also help to include the directory structure within that .txt file so the LLM knows what it's looking at.
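
That flattening step can be sketched as below. `flatten_repo` is a hypothetical helper, not code from the repo: it walks a project, writes the directory structure as a header, then appends each selected file after a marker line.

```python
import os

def flatten_repo(root, extensions=(".py", ".md", ".txt")):
    """Concatenate selected files under `root` into one string,
    prefixed by a listing of the directory structure."""
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(extensions):
                paths.append(os.path.join(dirpath, name))
    parts = ["# Directory structure:"]
    parts += [f"#   {os.path.relpath(p, root)}" for p in paths]
    for p in paths:
        parts.append(f"\n# ===== {os.path.relpath(p, root)} =====")
        with open(p, encoding="utf-8") as f:
            parts.append(f.read())
    return "\n".join(parts)

# Hypothetical usage:
# open("dataset.txt", "w").write(flatten_repo("my_project", (".py",)))
```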

  • @MarinaRodriguesCrespo
    @MarinaRodriguesCrespo 10 months ago

    You used plain text for the dataset; is it better than the JSON format? When should one choose one or the other? Thanks for the video!

    • @TrelisResearch
      @TrelisResearch  10 months ago +1

      Well, if you have JSON available to start with, that's going to be even easier to process and modify to meet your needs. Plain text is hardest, as there is no structure to go on.
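
As a small illustration of why JSON is easier to work with, records with explicit fields can be mapped straight into training text. The field names and template here are hypothetical; match them to your data and your model's chat format.

```python
import json

records = json.loads("""[
  {"question": "What is chunking?", "answer": "Splitting text into fixed-size pieces."},
  {"question": "Why overlap chunks?", "answer": "So context cut at a boundary is not lost."}
]""")

def to_training_text(rec):
    # Hypothetical prompt template
    return f"Question: {rec['question']}\nAnswer: {rec['answer']}"

samples = [to_training_text(r) for r in records]
print(samples[0])
```

With plain text, the structure (what is a question, what is an answer) first has to be recovered, which is why it is the hardest starting point.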

  • @TheLokiGT
    @TheLokiGT 7 months ago

    Hi Ronan. Where is the code relevant to this video as of June 2024? In the Advanced Fine-tuning repo, there is no trace of it, AFAIK. Thanks.

    • @TrelisResearch
      @TrelisResearch  7 months ago +1

      Howdy, the code is in the supervised-fine-tuning branch.

    • @TheLokiGT
      @TheLokiGT 7 months ago

      @@TrelisResearch Thanks!

  • @babyfox205
    @babyfox205 10 months ago

    Is "Context" a keyword that this specific model knows? How would it notice it after the blob of text?

    • @TrelisResearch
      @TrelisResearch  9 months ago

      It should know "Context" like any other English word, and it will also have seen training data showing what it refers to.

  • @MarxOrx
    @MarxOrx 1 year ago +1

    Hi, I just paid for access to the repo for this video, but I wasn't aware of the option to buy access to all projects in the repo. Is there any way to pay the difference and upgrade? How can I get in touch with you for that? Love the work, btw!

    • @TrelisResearch
      @TrelisResearch  1 year ago

      Howdy, everyone gets emailed a receipt, so you can just respond to that email!

  • @carthagely122
    @carthagely122 11 months ago

    Thank you very much

  • @el.kochevnik
    @el.kochevnik 1 year ago +1

    Great 🤠

  • @enriquecolladofernandez8758
    @enriquecolladofernandez8758 1 year ago +1

    cheeeeez u give it to me man !