Thanks, this gives me exactly what I needed to understand how to create a dataset for fine tuning. Most of the other videos skip over the details of the formatting and other parameters that go into creating your own dataset. Thanks again!
Thank you so much! This just gives me a really good basis on how I can start finetuning my own model! Because the model will in the end be as good as the training set.
FYI you're the man. idk why it was so hard to find a good pipeline to train literally went througfh all the libs and no one mentioned autotrainer advanced lol
If you don’t mind sharing, what’s the performance of a Mac like when fine tuning? I’m quite keen to see how long it takes to fine tune a 7B vs a 13B parameter model on a consumer machine on a small/medium sized dataset. Thanks for the tutorial, very helpful!
Thanks. very nice way you explained the concept. it gives boost to the knowledge and to the area where usually people have fear in mind to grasp but the way you explained it, to me it looks very easy. today i got the ability to fine tune the model myself. thanks a lot Sir. looking forward to more advanced topics from you.
@Prompt Engineering. Wow, exactly what I was looking for . I have another request, Can you please make a video on Prompt-Tuning/P-Tuning which is also a PEFT technique ?
I need help please. I just want to be pointed in the right direction since I'm new to this and since I couldn't really find any proper guide to summarize the steps for what I want to accomplish. I want to integrate a LLama 2 70B chatbot into my website. I have no idea where to start. I looked into setting up the environment on one of my cloud servers(Has to be private). Now I'm looking into training/fine-tuneing the chat model using our data from our DBs(It's not clear for me here but I assume it involves two steps, first I have to have the data in a CSV format since it's easier for me, second I will need to format it in Alpaca or Openassistant formats). After that, the result should be a deployment-ready model ? Just bullet points I'd highly appreciate that.
You will need to use that if you are using the instruct/chat version. Since I was fine tuning the base version, you can define your own format. Hope this helps
How could I limit it, for example I train it with several relevant paragraphs about the little prince novel, how do I limit it so that it only answers questions that are in the context of the little prince novel
Thank you very much. Is 300 rows a good number for training? I know it depends on many factors but I don't know how to identify if my dataset is bad or it's just too small.
I am facing the following problem. The model gets uploaded to the hugging face repo but without a config.json file. Any solutions? Also can the finetuned model run on the free google colab or should we shard it?
@@engineerprompt I am fine-tuning it on Google-Colab. In a post I made on your other video about fine-tuning Llama2 you mentioned that it seems to be a problem with the free tier of colab. I hope 🙏 you will find a fix , because not everyone owns a gpu.
Very coherent and well explained. Thank you kindly. I'm curious also if you have any advice about creating a dataset that would allow me to fine tune my model on my database schema? What I'd like to do is run my model locally, and ask it to interact with my database, and have it do so in a smooth and natural manner. I'm curious about how one would structure a database schema as a dataset for fine tuning. Any recommendations or advice would be greatly appreciated. Thanks again! Great videos!
I wonder if it is possible train LLAMA, on data where input are numbers and categorical variables(string), of fixed length, to predict a timer series of fixed length, anyone knows if this is possible? And how to fine the model if I have it locally
Can I use my files for data sets? I'm just planning to train a model that can remind me of commands in linux and its options so I don't have to keep reading manuals everytime I use commands that aren't regularly use.
Getting this error with AutoModelForCausalLM: from transformers import AutoModelForCausalLM MJ_Prompts does not appear to have a file named config.json. instead i had to import from peft import AutoPeftModelForCausalLM and use the AutoPeftModelForCausalLM to inference from the model. and one more question, did we train an adapter model here? please tell me how can i solve this. I am using free collab.
Getting the error: FileNotFoundError: Couldn't find a dataset script at /content/train.csv/train.csv.py or any data file in the same directory. Even though I'm running the autotrain code line in the same directory where my train.csv file present. I'm running on colab btw
Hello, thank you for sharing - is this method applicable to GGML / GPTQ models from say TheBloke's repo for example the 'Firefly Llama2 13B v1.2 - GPTQ' or would the training parameters need to be adjusted?
I haven't tried this with Quantized models so I am not sure how that will behave. One thing to keep in mind is that you want to use the "base" model not the chat version for best results. Will look at it and see if it can be done.
Thanks for the video. Two things please: 1. When you use autotrain package, then all details are hidden and one is not able to see what is being done and in what exact steps. I would suggest a video like that please if you have even same example. 2. Secondly, it is not clear to me what is the data vs label being fed into the model training phase, what is the loss function, how it is being calculated, etc...
I agree with you. Autotrain abstracts alot of details but if you are interested in more detailed setup. I would recommend to look for "fine-tune" videos on my channel. Here is one example: th-cam.com/video/lCZRwrRvrWg/w-d-xo.html
Probably you cant use it commercial purposes. But most of the open source models out there (at least the initial versions) were trained on data generated by chatgpt
For fine-tuning of the large language models (llama-2-13b-chat), what should be the format(.text/.json/.csv) and structure (like should be an excel or docs file or prompt and response or instruction and output) of the training dataset? And also how to prepare or organise the tabular dataset for training purpose?
why can't you fine-tune chat model of llama 2? the text completion of the fine tuned model I'm using is giving terrible results from my exact instructions in my prompt. I am using Puffin13B, but when feeding exact instructions it just cannot do them like I am prompting it to do.
Getting the error: ValueError: Batch does not contain any data (`None`). At the end of all iterable data available before expected stop iteration. Does anyone know what is the issue ? running on google collab thanks
@@engineerprompt So I watch the video 5 times, and it is still not clear what columns go where. You didn't even bother to open the .csv file so that we can see the schema. But you did show us the log file!
You just need a "text" column present in your train.csv file, the other columns will be ignored... if you want you can change which column will be used with --text_column column_name
I would like to ask, is the RTX4090 sufficient to fine-tune the 13B model, or can it only fine-tune the 7B model? Because I've noticed that the 13B model with default settings doesn't pose a problem for the RTX4090 in terms of parameter handling, but I'm uncertain whether a single RTX4090 is enough if data fine-tuning is required
Thanks SO MUCH for sharing this! Really helpful. I also trying to train on my own data on LLAMA 2. But I am facing a problem from deploy the model. I trained the model on AWS Sagenmaker and store the model in an S3 bucket. When I try to deploy the model and feed it with the prompt, I keep getting errors. my input follows the rule like ###Human....###Assistant. But I still have errors. I wonder if I use the wrong tokenizer. But I couldn't use AutoTokenizer.from_pretrained() in Sagemaker. Wonder if you have some advice!!
I did something similar working on my test data set to get this a bit more understood. I created a python script to merge all the data sets together. I still seem to be struggling to grasp the core training approach using SFT and what models work with what. Its' like the last puzzle missing
Hey sir, thank you for the great tutorial. I've some question, it seems in this training you didn't define "--model_max_length" parameter. Is there any differences if you define this parameter or not ?
It's really upto you how you want to define the format. Some models accepts instructions along with the user input, so really you get to decide based on your application.
Hola, soy novata en esto, lo que trato de hacer es un chat personalizado con llama2, por ejemplo tengo datos de una empresa x que son como 15 columnas X 300filas. Hasta ahora me responde aunque aún esta con sus respuestas ilógicas , EN FIN quiero saber es que si creo la columan text de {humano, assitant } para cada columna o pregunta posible Y como se prepararía los datos para entrenar el modelo Porfa, alguien me guía en esto?
Hey, i have fine-tuned the model using my own dataset but when running the bash command somehow the model did not get uploaded to hugginface but after the training completed i zipped the model and downloaded it. Now is there a way by which i can upload this fine-tuned model to hugginface now?
@@engineerprompt Thank you so much sir. But, I decided I'd do it without any LLMs. So, I wrote my own code using python and pandas. If you want, I could share the code to you?
So, if you have a problem, same on my, in google colab this code not will work, because on free google colab this script don't end job and don't create config.json, and you will have aproblems. And i think, that is reason, why this script don't push my model on huggingface hub. But your work is great, so thanks for that.
I am also facing the same problem. Actually in my case the model is uploaded to the huggingface repo but it is missing the config.json file. Any sollutions?
It is not allowed to use GPT-4's output, any of it even if the data fed into it is yours, to train other models than OpenAI's according to their terms.
Thanks, this gives me exactly what I needed to understand how to create a dataset for fine tuning. Most of the other videos skip over the details of the formatting and other parameters that go into creating your own dataset. Thanks again!
Thank you for your support. I'm glad it was helpful 😊
Datasets are key for fine tuning. This is a great video!
Yes! Thank you!
Thank you so much! This just gives me a really good basis on how I can start finetuning my own model! Because the model will in the end be as good as the training set.
FYI you're the man. idk why it was so hard to find a good pipeline to train literally went througfh all the libs and no one mentioned autotrainer advanced lol
Thank you!
If you don’t mind sharing, what’s the performance of a Mac like when fine tuning? I’m quite keen to see how long it takes to fine tune a 7B vs a 13B parameter model on a consumer machine on a small/medium sized dataset. Thanks for the tutorial, very helpful!
7B with 4 bit quantization takes about 12.9 GBs of GPU RAM, i dont think mac will be able to run it locally
You are a legend my man
🙏🙏🙏
Thanks for covering this topic!
My pleasure!
Thanks. very nice way you explained the concept. it gives boost to the knowledge and to the area where usually people have fear in mind to grasp but the way you explained it, to me it looks very easy. today i got the ability to fine tune the model myself. thanks a lot Sir. looking forward to more advanced topics from you.
Thanks and welcome!
@Prompt Engineering. Wow, exactly what I was looking for . I have another request, Can you please make a video on Prompt-Tuning/P-Tuning which is also a PEFT technique ?
Can I finetune llama 2 for pdf to question answers generation?
Wow. Thanks again, sir 🙏
Daaaamn bro that's brilliant
Thank you. More to come on fine-tuning :)
I need help please. I just want to be pointed in the right direction since I'm new to this and since I couldn't really find any proper guide to summarize the steps for what I want to accomplish.
I want to integrate a LLama 2 70B chatbot into my website. I have no idea where to start. I looked into setting up the environment on one of my cloud servers(Has to be private). Now I'm looking into training/fine-tuneing the chat model using our data from our DBs(It's not clear for me here but I assume it involves two steps, first I have to have the data in a CSV format since it's easier for me, second I will need to format it in Alpaca or Openassistant formats). After that, the result should be a deployment-ready model ?
Just bullet points I'd highly appreciate that.
I have a question. Why don't we use the conversation format given by llama2, which contains , something like that? thanks
You will need to use that if you are using the instruct/chat version. Since I was fine tuning the base version, you can define your own format. Hope this helps
How does this differ if I'm looking to fine-tune for Llama2 7b code instruct
Very informative
You're an AI champion. Thanks for the fine-tuning lectures 🙏🙏🙏
Thank you for your kind words!
@@engineerprompt Welcome brother!
How could I limit it, for example I train it with several relevant paragraphs about the little prince novel, how do I limit it so that it only answers questions that are in the context of the little prince novel
when I try to run the command in the terminal it gives error: autotrain [] llm: error: the following arguments are required: --project-name
Thank you man , that is exactly what i am looking for
Glad I could help
Thank you very much for the video. In the case of plaintext, how the dataset could be formatted?
Thank you very much. Is 300 rows a good number for training? I know it depends on many factors but I don't know how to identify if my dataset is bad or it's just too small.
Hello is this still relevant for fine tuning llama 3.1? Thank you.
there is a way to do it with the 13b original from llama2 already in my hard drive?
I am facing the following problem. The model gets uploaded to the hugging face repo but without a config.json file. Any solutions? Also can the finetuned model run on the free google colab or should we shard it?
Are you fine-tuning it locally or on Google Colab? I am doing it locally without any issues.
@@engineerprompt I am fine-tuning it on Google-Colab. In a post I made on your other video about fine-tuning Llama2 you mentioned that it seems to be a problem with the free tier of colab. I hope 🙏 you will find a fix , because not everyone owns a gpu.
You are a saviour
Thank you 😊
Thanks for great service to the community.
On my experiment, the Config.json is not created, is that normal?
Very coherent and well explained. Thank you kindly. I'm curious also if you have any advice about creating a dataset that would allow me to fine tune my model on my database schema? What I'd like to do is run my model locally, and ask it to interact with my database, and have it do so in a smooth and natural manner. I'm curious about how one would structure a database schema as a dataset for fine tuning. Any recommendations or advice would be greatly appreciated. Thanks again! Great videos!
Thanks for the informative video. I am wondering: Is there a way to do this, but with local LLMs?
I wonder if it is possible train LLAMA, on data where input are numbers and categorical variables(string), of fixed length, to predict a timer series of fixed length, anyone knows if this is possible? And how to fine the model if I have it locally
thanks for this video
Can I use my files for data sets? I'm just planning to train a model that can remind me of commands in linux and its options so I don't have to keep reading manuals everytime I use commands that aren't regularly use.
Getting this error with AutoModelForCausalLM:
from transformers import AutoModelForCausalLM
MJ_Prompts does not appear to have a file named config.json.
instead i had to import
from peft import AutoPeftModelForCausalLM
and use the AutoPeftModelForCausalLM to inference from the model.
and one more question, did we train an adapter model here?
please tell me how can i solve this. I am using free collab.
Getting the error:
FileNotFoundError: Couldn't find a dataset script at /content/train.csv/train.csv.py or any data file in the same directory.
Even though I'm running the autotrain code line in the same directory where my train.csv file present. I'm running on colab btw
The solution is to just provide the path of folder where the csv file is present. But don't write the csv file name...
it's very helpful thanks, is it the same process to create a data from multiple documents for a Qusetion Answering model ?
Yes, this will work
Hello, thank you for sharing - is this method applicable to GGML / GPTQ models from say TheBloke's repo for example the 'Firefly Llama2 13B v1.2 - GPTQ' or would the training parameters need to be adjusted?
I haven't tried this with Quantized models so I am not sure how that will behave. One thing to keep in mind is that you want to use the "base" model not the chat version for best results. Will look at it and see if it can be done.
And the model comes up with new prompts on its own?
thank you so much brother❤❤
Thanks for the video.
Two things please:
1. When you use autotrain package, then all details are hidden and one is not able to see what is being done and in what exact steps. I would suggest a video like that please if you have even same example.
2. Secondly, it is not clear to me what is the data vs label being fed into the model training phase, what is the loss function, how it is being calculated, etc...
I agree with you. Autotrain abstracts alot of details but if you are interested in more detailed setup. I would recommend to look for "fine-tune" videos on my channel. Here is one example:
th-cam.com/video/lCZRwrRvrWg/w-d-xo.html
Can I use this technique for document based question answers generation dataset?
Hey did you found any solution for Q & A model
Can I train my own LLM model using data generated by chatpgt, if the model is intended for academic/commercial use?
Probably you cant use it commercial purposes. But most of the open source models out there (at least the initial versions) were trained on data generated by chatgpt
For fine-tuning of the large language models (llama-2-13b-chat), what should be the format(.text/.json/.csv) and structure (like should be an excel or docs file or prompt and response or instruction and output) of the training dataset? And also how to prepare or organise the tabular dataset for training purpose?
Hey, did you find the answer for your question? If yes, please tell me what format shouldnthe dataset be for fine tuning please.
So if we are fine tuning the chat model, can we use same format as above? ; Human...., Assistant.....
Yes
why can't you fine-tune chat model of llama 2? the text completion of the fine tuned model I'm using is giving terrible results from my exact instructions in my prompt. I am using Puffin13B, but when feeding exact instructions it just cannot do them like I am prompting it to do.
Getting the error:
ValueError: Batch does not contain any data (`None`). At the end of all iterable data available before expected stop iteration.
Does anyone know what is the issue ?
running on google collab thanks
Do you need the 3 columns (Concept, Description, text) in train.csv or just 1 column (text) is enough?
Just one column
Just the last text column
@@engineerprompt So I watch the video 5 times, and it is still not clear what columns go where. You didn't even bother to open the .csv file so that we can see the schema. But you did show us the log file!
@@engineerprompt i wanna know too
You just need a "text" column present in your train.csv file, the other columns will be ignored... if you want you can change which column will be used with --text_column column_name
I would like to ask, is the RTX4090 sufficient to fine-tune the 13B model, or can it only fine-tune the 7B model? Because I've noticed that the 13B model with default settings doesn't pose a problem for the RTX4090 in terms of parameter handling, but I'm uncertain whether a single RTX4090 is enough if data fine-tuning is required
I don't think you can fine tune 13B with 24GB vRAM. Your best bet will be 7B
Thanks SO MUCH for sharing this! Really helpful. I also trying to train on my own data on LLAMA 2. But I am facing a problem from deploy the model. I trained the model on AWS Sagenmaker and store the model in an S3 bucket. When I try to deploy the model and feed it with the prompt, I keep getting errors. my input follows the rule like ###Human....###Assistant. But I still have errors. I wonder if I use the wrong tokenizer. But I couldn't use AutoTokenizer.from_pretrained() in Sagemaker. Wonder if you have some advice!!
on which GPU u training the model
Kudos on the excellent video! Your hard work is acknowledged. Could we expect a video about DemoGPT from you?
What hardware specifications would be needed to fine tune a 70b model?
Once the fine-tuning is complete, can you run the model using oogabooga?
I did something similar working on my test data set to get this a bit more understood. I created a python script to merge all the data sets together. I still seem to be struggling to grasp the core training approach using SFT and what models work with what. Its' like the last puzzle missing
would u pls give a sample of how the csv file looks like ? thanks a lot !
Let me see what I can do, the format is shown in the video.
Great bro 👍
Hey sir, thank you for the great tutorial. I've some question, it seems in this training you didn't define "--model_max_length" parameter. Is there any differences if you define this parameter or not ?
how or why do we decide on ###Human: ??
i see lots of variations on different videos. some use ->: , others use ###Input: etc. etc.
It's really upto you how you want to define the format. Some models accepts instructions along with the user input, so really you get to decide based on your application.
how I can fine tuning llama-2-7b-chat ,can I use your dataset format?
Can Chatgpt3.5 generate files?
Hola, soy novata en esto, lo que trato de hacer es un chat personalizado con llama2, por ejemplo tengo datos de una empresa x que son como 15 columnas X 300filas.
Hasta ahora me responde aunque aún esta con sus respuestas ilógicas , EN FIN quiero saber es que si creo la columan text de {humano, assitant } para cada columna o pregunta posible Y como se prepararía los datos para entrenar el modelo
Porfa, alguien me guía en esto?
Hey, i have fine-tuned the model using my own dataset but when running the bash command somehow the model did not get uploaded to hugginface but after the training completed i zipped the model and downloaded it. Now is there a way by which i can upload this fine-tuned model to hugginface now?
Sir, I don't have ChatGPT Plus. Are there any alternatives?
Look into the Groq, its a free API (at the moment).
@@engineerprompt Thank you so much sir. But, I decided I'd do it without any LLMs. So, I wrote my own code using python and pandas. If you want, I could share the code to you?
Is it possible to directly finetune gptq models?
Qlora
How do I use this local template in the localGPT project?
The localGPT code has support for custom prompt template. You will need to provide your template there.
So, if you have a problem, same on my, in google colab this code not will work, because on free google colab this script don't end job and don't create config.json, and you will have aproblems.
And i think, that is reason, why this script don't push my model on huggingface hub.
But your work is great, so thanks for that.
I am also facing the same problem. Actually in my case the model is uploaded to the huggingface repo but it is missing the config.json file. Any sollutions?
there is no code and sinppet for how to it work? I don't know the meaning
How can i contact you sir.
Check out my email
It is not allowed to use GPT-4's output, any of it even if the data fed into it is yours, to train other models than OpenAI's according to their terms.
Who cares? Who will know? 😅
your session crashed after using all ram error
How can I build my label with input and output? I found that llama 2 pieced the input and output together, can my label match the input_id?