LOL “We don’t need this code, so let’s put it in a text cell”
At 10:32 the Unsloth comment says "We only need to update 1 to 10% of all parameters" - what does that mean? I recently created my own training data with 1015 questions and answers, but when I run the trainer for 1 epoch, it only does 127 steps - shouldn't it do more?
There are "base model" parameters, and then "adapter layer" parameters that are added at the end of the base model, when doing this LoRA fine-tuning. The comment is highlighting that we are only working with the adapter layers at the end when doing this fine tuning - which is around 1-10% of all the parameters. This is normal. You could do full-parameter fine-tuning (which updates the base model parameters), but that's not worth the high computational demands and complexity for most use cases.
Each of your steps processes one effective batch when fine-tuning. The effective batch size is per_device_train_batch_size * gradient_accumulation_steps * number_of_devices. For the demonstrated setup, the effective batch size is 8, meaning 127 steps covers up to 127 * 8 = 1016 of your Q&A examples. So you're using all your Q&A examples, and doing a full pass over your training data, in the 127 steps. You could bump the epochs value if you want to do multiple passes.
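If it helps, here's a rough sketch of that arithmetic in Python (the per-device batch size of 2 and gradient accumulation of 4 are assumptions based on typical Unsloth notebook defaults - substitute your own values):
"""
per_device_train_batch_size = 2   # assumed notebook value
gradient_accumulation_steps = 4   # assumed notebook value
number_of_devices = 1             # single Colab GPU

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * number_of_devices
)  # 2 * 4 * 1 = 8

num_examples = 1015
steps_per_epoch = -(-num_examples // effective_batch_size)  # ceiling division

print(effective_batch_size, steps_per_epoch)  # 8 127
"""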
Oh fantastic video as always - absolutely packed with detailed information so great work!
Great explanation. The background music was a little distracting.
Thanks for the feedback - we'll keep this in mind on future videos.
Really love the playlist! Great video.
Is it worth using Unsloth with Amazon SageMaker?
What is the difference between push_to_hub and push_to_hub_merged in 4-bit?
Great video btw, many thanks!!
Hello, I don't understand how, at 11:00, I can change "yahma/alpaca-cleaned" to a local .json file on my PC.
The Hugging Face datasets library is used in either case, to compile a dataset of training strings. The load_dataset("yahma/alpaca-cleaned") approach (or similar) is only for datasets hosted on Hugging Face. The Dataset.from_dict used in the video should work if you read in the data from your local JSON and use it for the dictionary's "text" value. Depending on how the text is structured in your JSON, you may need to do string interpolation - the resulting "text" values for the dataset need to be pure strings.
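As a rough sketch (assuming a local file named my_data.json containing a list of objects with "question" and "answer" keys - adjust the file name, fields, and headers to your own data):
"""
import json
from datasets import Dataset

# Read the local JSON file (assumed name and schema)
with open("my_data.json", "r") as f:
    records = json.load(f)

# Interpolate each record into a single training string; the tokenizer here is
# the one loaded earlier in the notebook
EOS_TOKEN = tokenizer.eos_token
texts = [
    f"### Question\n{r['question']}\n\n### Answer\n{r['answer']}{EOS_TOKEN}"
    for r in records
]

dataset = Dataset.from_dict({"text": texts})
"""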
@nodematic Thank you! I may have more questions in the future. :)
Great explanation. This could be a stupid question, but how do we fine-tune for triggering function calls?
Thanks for your question - it's definitely not a stupid one! In your dataset, have fields like "instruction", "prompt", and "function", and then do the string interpolation to create your "text" field (you could do it similarly to the video, but replace "### Story" with "### Prompt" and "### Summary" with "### Function"). Make sure your training set has a consistent format for the function to trigger, and a consistent fallback value for non-triggering cases. Overall, the process should be quite similar to the video.
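A rough sketch of what that interpolation could look like (the example records, headers, and "none" fallback below are assumptions - use whatever schema fits your data):
"""
# Example records - replace with your own dataset
examples = [
    {
        "instruction": "Identify the function to call for the user's request.",
        "prompt": "What's the weather in Paris tomorrow?",
        "function": 'get_weather(location="Paris", day="tomorrow")',
    },
    {
        "instruction": "Identify the function to call for the user's request.",
        "prompt": "Tell me a joke.",
        "function": "",  # no function applies
    },
]

def to_text(example):
    # Consistent fallback value for non-triggering cases (assumed to be "none")
    function = example["function"] or "none"
    return (
        f"{example['instruction']}\n\n"
        f"### Prompt\n{example['prompt']}\n\n"
        f"### Function\n{function}"
    )

texts = [to_text(e) for e in examples]
"""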
Your model itself won't be able to actually trigger the function - only identify the right function to trigger (and possibly the arguments to supply to the function). You'll need to execute the function as a "next step" in some broader pipeline, application, service, or script.
Hope I'm understanding the question correctly and that helps.
Great video. Loved the fun generated music. "We don't need this code, so let's put it in the text cell" =))
Really comprehensive and well-explained! Great work!
I wonder if it is also possible to fine-tune not a text generator but an image generator. Does someone have any ideas? I am super new to this field and pretty much in the dark so far. I couldn't find anything for image generation yet :/
Thanks for any suggestions!
We'll try to make a video on this. Thanks for the suggestion.
I have been having tremendous difficulty - can this be run locally in VS Code?
We haven't tested this, but it should work. The biggest concerns would be not having enough GPU memory on your local machine, or not having a clean Python package and CUDA setup.
@nodematic I have read about it more, and it looks like Windows isn't acting too friendly and most people are running Linux. :(
Great Video!
I had a question about RoPE scaling. How efficient is it, and to what extent does it help solve the LLM context window size issue?
Thanks!
RoPE is the standard way to solve the context window size issue with these open models. It can come at a quality cost, but it's basically the best method we have if you need to go beyond the model's default context window. Use it only if you truly need the additional tokens. In the video's example, the RoPE scaling is needed because you simply can't summarize a 16k-token story by only looking at the second half of it (8k tokens).
@nodematic Is there an easy API for RoPE?
I don't even need fine-tuning - I just need a chat completion API for 32k-context Llama 3.
Yes, you can use RoPE without fine-tuning (e.g., off-the-shelf Llama 3 with a 32k context). I would recommend using the Hugging Face libraries, which can be configured for RoPE scaling (for example, TGI RoPE scaling is detailed at huggingface.co/docs/text-generation-inference/en/basic_tutorials/preparing_model).
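As a rough, untested sketch of the transformers route (the exact rope_scaling dict format varies by transformers version - older versions use a "type" key, newer ones use "rope_type" - so check your version's docs):
"""
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumes you have access to the weights

# Linear RoPE scaling: 8192 native context * 4 = ~32k tokens
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {"type": "linear", "factor": 4.0}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, device_map="auto")
"""
If you're serving the model rather than loading it in a script, the linked TGI docs cover the equivalent server-side settings.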
What's the name of the song at 3:47? Sounds pretty cool.
That's a Udio-generated custom song, and isn't published.
Why aren't we tokenizing the fine-tuning dataset? Is it automatically done in the SFT trainer?
Yes, it's handled automatically by the SFTTrainer.
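Roughly, the setup looks like this - the trainer tokenizes the "text" column internally (argument names reflect common trl usage and may differ slightly between versions; the model, tokenizer, dataset, and max_seq_length come from earlier in the notebook):
"""
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,                    # the LoRA model from the FastLanguageModel setup
    tokenizer=tokenizer,
    train_dataset=dataset,          # a datasets.Dataset with a "text" column
    dataset_text_field="text",      # SFTTrainer tokenizes this field for you
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        output_dir="outputs",
    ),
)
"""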
It would be better if someone gave a noob tutorial or guide on how to prepare the dataset.
I do get that the data is a set of inputs and outputs, but I don't know how to label the data.
Great video and nice code - can you do this context length extension for the DeepSeek Coder model?
I believe it's possible, but I haven't tried yet and there isn't an existing Unsloth model for this. We'll look into it though and try to create a video. Thanks for the suggestion.
Nice video!
What should the format be for data extraction, if I want to extract data from a chunk?
Can I include something like:
"""
{Instruction or System Prompt}
### {Context or Chunks}
### {Question}
### {Answer}
"""
The "###" lines signify headers, so I wouldn't put your content on those lines - rather, they are used to categorize the line(s) of text below each header. If you're using a chunk of content (e.g., via some sort of RAG approach), yes, you could have that as a separate categorization. Something like:
"""
{instruction}
### Background
{chunk}
### Question
{question}
### Answer
{answer}
"""
For the best results, use the header terms in your instruction. For the example above, this could be something like: "Based on the provided background, which comes from documentation, FAQs, and/or support tickets, answer the supplied question as clearly and factually as possible. If the background is insufficient to answer the question, answer 'I don't know'."
Hi, I am fine-tuning the Llama 3 model but I am facing some issues. Your video was great. I was hoping to connect with you - can we connect?
Thanks. You can reach out via email at community@nodematic.com. We often do not have the staff to handle technical troubleshooting or architectural consulting, but we'll answer if we can.
Amazing video! I've been curious: if I had to train on a set of code samples with indentation (take Python code, for example), would the data still need to be in the standard format of 'instruction', 'output', and 'input'? With 150+ code samples of quite high complexity, would it be possible to train it? Are there any other ways to set up the dataset? And is Llama 3 capable of being trained on unstructured data?
Yes, you could use a different, non-Alpaca-style format. Where the video creates the "text" field via string interpolation, replace that with a text block of your code lines (including line breaks).
Llama 3 does well on HumanEval, so I suspect it would work well for your described use case. Just be careful with how you create your samples - getting the model to stop after generating the right line/block of code may not be easy (although you could trim things down with post-processing).
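A rough sketch of one possible format (the "task"/"code" field names and example record are placeholders; appending the tokenizer's EOS token is one way to help the model learn where to stop):
"""
from datasets import Dataset

# Example record - replace with your own code samples
code_samples = [
    {
        "task": "Write a Python function that returns the nth Fibonacci number.",
        "code": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
    },
]

EOS_TOKEN = tokenizer.eos_token  # tokenizer from the fine-tuning setup

def to_text(sample):
    # Line breaks and indentation in the code are preserved as-is
    return f"### Task\n{sample['task']}\n\n### Code\n{sample['code']}{EOS_TOKEN}"

dataset = Dataset.from_dict({"text": [to_text(s) for s in code_samples]})
"""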
Do we need to create the repo first, before the push-to-hub command?
No, just replace "hf/model" with your username (or organization name) and desired model name. Also, if you want a private repo, add a private=True argument to push_to_hub_merged.
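For example, something along these lines (replace the repo name and token with your own values - the repo is created automatically if it doesn't exist):
"""
model.push_to_hub_merged(
    "your-username/your-model-name",  # your HF username (or org) and desired model name
    tokenizer,
    save_method="merged_16bit",       # or "merged_4bit" / "lora"
    token="hf_...",                   # a Hugging Face write-access token
    private=True,                     # omit for a public repo
)
"""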
This is not for dummies - I could not understand anything.
I hate how everyone who does Unsloth tutorials isn't able to use a multi-GPU setup.
Not gonna lie, the AI song was a banger.
You lost me within 60 secs - how is this for dummies?
Hi, I keep getting this error: "TypeError: argument of type 'NoneType' is not iterable". It is originating from "usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py". Could you please share the requirements.txt? Also, it only happens when I try to push "merge_16bit" - merge_4bit works just fine!