Train a Custom GPT LLM Using Your Own Data From Scratch

  • Published Jan 21, 2025

Comments • 59

  • @SaturnKK1
    @SaturnKK1 8 months ago +3

    Keyvan here, thanks for your great video ❤

    • @StephenBlum
      @StephenBlum  8 months ago +1

      Hi Keyvan! femtoGPT is amazing 🤩 thank you!

    • @raass9316
      @raass9316 8 months ago +1

      how did you find this video 😵‍💫

  • @videos4mydad
    @videos4mydad 8 months ago +3

    Thank you Stephen for answering my last question. I have another one!
    Are you supposed to delimit between the different inputs in the training set?
    My set looks like:
    user_input= agent_output=
    user_input= agent_output=
    .
    .
    .
    Is the GPT reading my entire file as one input, do I need to separate each conversation, or does it matter?
    This way, when I use the GPT, I want the code to be:
    prompt = "user_input= agent_output="

    • @StephenBlum
      @StephenBlum  8 months ago +2

      Happy to help! Looks like you are diving right in. What you are doing right now is exactly the plan for one of the upcoming videos. Yes, delimiting is the correct approach; it's the industry's approach, actually. They use delimiters like yours for this purpose, and you have the correct input format. The Rust code will fully scan the entire dataset.txt file if you train it for long enough, so you'll want to train for as long as possible once you are ready. Good news: you can pause and resume training as much as you want. The training function samples your dataset.txt randomly; github.com/keyvank/femtoGPT/blob/main/src/gpt.rs#L28 is the code that does the sample selection. It takes a random range of text. Something we could do is update that function to look for "\n" denoting end of input. That could help improve the model, preventing it from bleeding between sequences. The industry calls that the EOS / STOP token, the EOS_TOKEN.

    • @videos4mydad
      @videos4mydad 8 months ago +1

      @@StephenBlum thank you! This is very helpful. The current code is not at all what I need; I want the sampling to be just one line at a time, from beginning to end!! I will make those changes!

    • @StephenBlum
      @StephenBlum  8 months ago +1

      @@videos4mydad nice! 🙌😄

    • @bruninhohenrri
      @bruninhohenrri 7 months ago +1

      @@StephenBlum I'm planning to make something like this too. My dataset is built like this: {question}\n{answer}

      I will try to update this function to look for "\n" at the end of input. I'm not a Rust dev, but I think I can do it with help from YouTube and ChatGPT 😅😆

    • @StephenBlum
      @StephenBlum  7 months ago

      @@bruninhohenrri excellent idea 👍 "\n" will be a good stop token for end of input. Rust + YT + ChatGPT = 🎉

  • @bruninhohenrri
    @bruninhohenrri 7 months ago +2

    Hey! I finally had time to test it. Now that I have the model, how do I run inference with it? Thanks for the video!

    • @StephenBlum
      @StephenBlum  7 months ago +1

      Great to hear! Now that you have the model, the code needs some modification to run in inference mode. It could make sense to set up an axum web server, or you can keep it as a CLI and run it on demand as needed. Here is the inference function:

      let inference = gpt.infer(
          &mut rng,
          &tokenizer.tokenize("Your Prompt Here\n"),
          100,
          inference_temperature,
          |_ch| {},
      )?;
      println!("{}", tokenizer.untokenize(&inference)); // print model response

      You could put that in a second binary file, or parameterize main.rs to run in "inference mode" or "training mode" based on a command-line parameter. Lots of options! I'm making a follow-on video in a few weeks to show how that would work in a new GitHub fork.
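
      A rough sketch of that mode switch (an assumption, not the actual femtoGPT code; run_training and run_inference are hypothetical helpers wrapping the existing training loop and the gpt.infer() call above):

      use std::env;

      fn run_training() -> Result<(), Box<dyn std::error::Error>> {
          // the existing training loop would go here
          Ok(())
      }

      fn run_inference() -> Result<(), Box<dyn std::error::Error>> {
          // rebuild the GPT, load training_state.dat, call gpt.infer() as above
          Ok(())
      }

      fn main() -> Result<(), Box<dyn std::error::Error>> {
          // e.g. `cargo run --release -- inference` vs `cargo run --release -- training`
          let mode = env::args().nth(1).unwrap_or_else(|| "training".into());
          match mode.as_str() {
              "inference" => run_inference(),
              _ => run_training(),
          }
      }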

    • @bruninhohenrri
      @bruninhohenrri 7 months ago

      @@StephenBlum I was trying to make it work by myself and I think it finally worked!
      I had to recreate the entire GPT object in a new binary file and even reload the dataset into the tokenizer. Not the optimal way to do it, but it's working!

      // imports assumed from femtoGPT's main.rs; the Result type of main is an assumption
      use femto_gpt::gpt::{TrainingState, GPT};
      use femto_gpt::tokenizer::{SimpleTokenizer, Tokenizer};
      use std::fs;
      use std::io::Read;
      use std::path::Path;

      fn main() -> Result<(), Box<dyn std::error::Error>> {
          #[cfg(not(feature = "gpu"))]
          let graph = femto_gpt::graph::CpuGraph::new();
          #[cfg(not(feature = "gpu"))]
          let is_gpu = false;
          #[cfg(feature = "gpu")]
          let graph = femto_gpt::graph::gpu::GpuGraph::new()?;
          #[cfg(feature = "gpu")]
          let is_gpu = true;

          let prompt: &str = "I just ";
          let training_state_path = Path::new("training_state.dat");
          let mut rng = rand::thread_rng();
          let inference_temperature = 0.7; // How creative? 0.0 min, 1.0 max

          let dataset_char =
              fs::read_to_string("dataset.txt").expect("Should have been able to read the file");
          let tokenizer = SimpleTokenizer::new(&dataset_char);

          // Model hyperparameters; these must match the values used during training.
          let batch_size = 32;
          let num_tokens = 64;
          let vocab_size = tokenizer.vocab_size();
          let embedding_degree = 64;
          let num_layers = 4;
          let num_heads = 4;
          let head_size = embedding_degree / num_heads;
          let dropout = 0.0;
          assert_eq!(num_heads * head_size, embedding_degree);
          println!("Vocab-size: {} unique characters", vocab_size);

          let mut gpt = GPT::new(
              &mut rng,
              graph,
              is_gpu.then(|| batch_size),
              vocab_size,
              embedding_degree,
              num_tokens,
              num_layers,
              num_heads,
              head_size,
              dropout,
          )?;
          gpt.sync()?;
          println!("Number of parameters: {}", gpt.num_params());

          // Reload the weights saved during training.
          if training_state_path.is_file() {
              let mut ts_file = fs::File::open(training_state_path).unwrap();
              let mut bytes = Vec::new();
              ts_file.read_to_end(&mut bytes).unwrap();
              let ts: TrainingState = bincode::deserialize(&bytes).unwrap();
              gpt.set_training_state(ts, true)?;
          }

          println!();
          println!("Starting the inference process...");
          println!();

          let inference = gpt.infer(
              &mut rng,
              &tokenizer.tokenize(prompt),
              50,
              inference_temperature,
              |_ch| {},
          )?;
          println!("{}", tokenizer.untokenize(&inference)); // print model response

          Ok(())
      }

    • @bruninhohenrri
      @bruninhohenrri 7 months ago +1

      @@StephenBlum I trained with a poor dataset. Now I think I'm going to make something cool, like a text-to-sql AI model :D

    • @StephenBlum
      @StephenBlum  7 months ago +1

      @@bruninhohenrri Nice! text-to-sql sounds amazing 🤩

  • @videos4mydad
    @videos4mydad 8 months ago +1

    I have just started the training on some data.
    How do I test it?
    Where do I give it a sentence and have it finish it?
    I think it's:

    let inference = gpt.infer(
        &mut rng,
        &tokenizer.tokenize("\n"),
        100,
        inference_temperature,
        |_ch| {},
    )?;

    Replace the "\n" with my prompt?
    Thanks

    • @StephenBlum
      @StephenBlum  8 months ago +1

      Oh yes, good question! After training is complete, you will want to use the model's inference capabilities so it can complete output sequences. Looking at your code, you are on the right track. During the training cycles there are moments where inference testing occurs and the output is printed on the screen; that is the same code you use to run inference. Inference uses the trained model to generate the string patterns of letters based on your training data. I will confirm the correct function in a follow-on comment here shortly 😄👍

    • @StephenBlum
      @StephenBlum  8 months ago +1

      Yes you found it. Confirmed. This is the right place to run your trained model using a prompt: `gpt.infer()` function ✅

  • @abdullahsohail803
    @abdullahsohail803 5 months ago

    What if the dataset consisted only of numeric data? How would we train our custom GPT model then?

    • @StephenBlum
      @StephenBlum  5 months ago

      good question! Yes, you can absolutely do this. Your alphabet is "0-9. ": the ten digits plus a period and a space character. You would then be able to predict the next "number" in the series based on your training set, for example a stock quote price.
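
      For example, an illustrative dataset.txt (made-up numbers) where each line is one price series for the model to learn to continue:

      102.5 103.1 103.0 104.2
      104.2 104.8 105.0 105.9
      105.9 105.4 106.1 107.3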

  • @ajaykumarsinghlondon
    @ajaykumarsinghlondon 4 months ago +2

    Hi Stephen, this is a great tutorial and perhaps the only one I could find. I'm running it right now and seeing some good results. I wanted to ask: how do I get this retrained model to answer questions? Code please if possible, as I'm not an expert in Rust. Cheers!

    • @StephenBlum
      @StephenBlum  4 months ago +1

      Hi Ajay! Good question: how to train the model into a question/answer model. You just have to change dataset.txt to the "Question: ..." and "Answer: ..." format. Then you can train the model to answer questions. Note that you have to prefix the input as "Question: your_question_here" and the model will reply with "Answer: model_answer". You'll need a lot of data to get a good result.
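
      For example, dataset.txt could look like this (illustrative content):

      Question: What color is the sky?
      Answer: The sky is blue.
      Question: How many legs does a spider have?
      Answer: A spider has eight legs.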

    • @ajaykumarsinghlondon
      @ajaykumarsinghlondon 4 months ago +2

      @@StephenBlum Thanks for the reply. The dataset is already formatted like that. My question is: how do I run a script or command (if one already exists), or write new code, to ask the model a question so it replies back with an answer?

    • @StephenBlum
      @StephenBlum  4 months ago +1

      @@ajaykumarsinghlondon ah yes, okay. So you'd actually have to write code for this. You need to define an interface, a command-line argument for example, and use that as the input to the Rust app, then print the output. You can create an updated src/main.rs file: remove the training sections, execute the gpt.infer() function, and print the output.

    • @StephenBlum
      @StephenBlum  4 months ago +1

      @@ajaykumarsinghlondon there is a comment on this page that shows the function you need to call: th-cam.com/video/jEyPQUyNhD0/w-d-xo.html&lc=Ugy7Zy4-ZvZTBQk5G4l4AaABAg.A3xgWTfwS0PA3xiq6nxncm (threaded comment). The code example from that comment thread is:

      let inference = gpt.infer(
          &mut rng,
          &tokenizer.tokenize("Your Prompt Here\n"),
          100,
          inference_temperature,
          |_ch| {},
      )?;
      println!("{}", tokenizer.untokenize(&inference)); // print model response

    • @StephenBlum
      @StephenBlum  4 months ago +1

      @@ajaykumarsinghlondon here is the code: gist.github.com/stephenlb/9e919c0c2523048aeda022b1fafe91b7

  • @utpalprajapati7775
    @utpalprajapati7775 8 months ago +1

    Is there any such program for Python/JavaScript devs?

    • @StephenBlum
      @StephenBlum  8 months ago

      yes, totally! For Python it's already built into PyTorch by Meta: pytorch.org/docs/stable/generated/torch.nn.Transformer.html

    • @utpalprajapati7775
      @utpalprajapati7775 8 months ago +1

      @@StephenBlum that's big! thanks for sharing ♥️

  • @MrStevemur
    @MrStevemur 8 months ago +1

    I'm guessing it wouldn't be any smarter than the predictive text feature on a phone, since it's only predicting which letter is most likely to come next. If you can understand the code, though, it could be interesting as an example of how these work.

    • @StephenBlum
      @StephenBlum  8 months ago

      With the transformer model, it should outperform your phone's predictive text feature. And the good news is that you can customize how much CPU/memory to allocate to make improvements as you need. It's really powerful! Testing training on a GPU, it was able to learn the entire 1 MB of text in a few minutes. Imagine: you can give it specific text, and way more than 1 MB. Lots of opportunity! 😄

  • @JehovahsaysNetworth
    @JehovahsaysNetworth 8 months ago +1

    I’m updating my AI for PHP, HTML, JS, and CSS, and it can build templates

    • @StephenBlum
      @StephenBlum  8 months ago

      Nice! Updating your AI for PHP + HTML + JS + CSS sounds like a great idea 😄 🙌 Tuning your AI for distinct use cases like this is powerful. I think we'll be seeing a lot more of this going forward, with better-performing, use-case-specific models. 🎉

  • @TimJSwan
    @TimJSwan 8 months ago +1

    Ironically, I was hoping to use Apple silicon for its Neural Engine, yet this project uses AMD and Intel.

    • @StephenBlum
      @StephenBlum  8 months ago

      Ah yes, you are right. It uses OpenCL, and Apple wants everyone to deprecate OpenCL and migrate to Metal. That is a drawback. There may be a way to set up a wrapper, though it seems like that could be a bit of effort. femtoGPT will still work on the CPU, and it will use every CPU core on your machine 📈

  • @StephenBlum
    @StephenBlum  8 months ago

    GitHub repository: github.com/keyvank/femtoGPT (download the code here)
    Command: cargo run --features gpu --release

  • @avi7278
    @avi7278 8 months ago +3

    There's going to come a day when you can train a GPT-5 level model on an old computer, and that's gonna be hilarious and quaint, like running 20 different game emulators on a Raspberry Pi or something.

    • @StephenBlum
      @StephenBlum  8 months ago +1

      😂 you are right! The Commodore Amiga level of quaint old computers. The day when GPT-5 level models run on an old computer is closer than we might think.

  • @matijsbrs
    @matijsbrs 8 months ago +1

    Really nice explanation!

    • @StephenBlum
      @StephenBlum  8 months ago

      Thank you! Your feedback is excellent and it helps me continue to make better videos 😄🙌

  • @HomeEngineer-wm5fg
    @HomeEngineer-wm5fg 8 months ago +1

    YES!!!! This Sir! Thank you!

    • @StephenBlum
      @StephenBlum  8 months ago

      You are very welcome! 😄 Happy to help. This is pretty exciting, as you can use it to recreate anyone's digital likeness. It is powerful and can even recreate a digital you if you have enough data 📈

  • @punch3n3ergy37
    @punch3n3ergy37 8 months ago +1

    Would've been nice if you showed how to make it into a chatbot :)

    • @StephenBlum
      @StephenBlum  8 months ago +1

      Nice! Good idea. This definitely needs to happen. Adding it to the planned videos list 🙌🎉😊

  • @shaunlim3759
    @shaunlim3759 8 months ago +1

    great video!

    • @StephenBlum
      @StephenBlum  8 months ago

      Thank you! 😊 If you have questions or ideas to cover, let me know! 😄🙌

  • @YourBrandDead
    @YourBrandDead 3 months ago +1

    Stephen, it's Steven, what's up twin 😂 Malik Yusef (Kanye's main collaborator) and I are launching a platform. I'm super creative, never claimed to be smart, so I tend to get myself into situations like these often, where I know it can be done, it's just the learning curve.. LMK if you have time to connect, would love to run the platform by you and possibly get you involved, however that looks 🙏🏼

    • @StephenBlum
      @StephenBlum  3 months ago

      Hi Steven! Yes, let's do it. Send an email to stephen@pubnub.com

  • @Whothebossis
    @Whothebossis 8 months ago +1

    you sharing gems😍😭😭

    • @StephenBlum
      @StephenBlum  8 months ago

      💎🙌🎉😄 great to hear thank you 😊

  • @hyperhippyhippohopper
    @hyperhippyhippohopper 8 months ago +1

    Are you Alex Honnold's brother?

    • @StephenBlum
      @StephenBlum  8 months ago

      Oh yeah! The free climber 🪨🧗 that is scary and impressive 😄

  • @bruninhohenrri
    @bruninhohenrri 8 months ago +1

    Let's GOOOOOOOI

    • @StephenBlum
      @StephenBlum  8 months ago

      🎉🎉🎉 😄🙌

  • @shaileshrana7165
    @shaileshrana7165 7 months ago +2

    Hey Stephen. Great explanation.
    I am trying to train a model on Jira tickets. Can you suggest how I should format the data in the dataset file?
    I want to give the description of the ticket, the comments with the commenter's name, the state changes, and the values of other parameters and their changes, like the assignee name.
    This is the kind of thing I have in mind:
    NUMBER: BACK-356 \n
    TITLE: Invoice dump job failure \n
    DESCRIPTION: The job for ingesting invoices from the Production tables has failed on June 26th, 2024. We need to resolve this because the financial reporting is due at the end of the month. \n
    ASSIGNEE: Ramesh Vesvaraya \n
    COMMENT: \n
    WRITER: Ram Gupta \n
    BODY: @Saurabh Sharma can you look into this.

    • @StephenBlum
      @StephenBlum  7 months ago

      Yes, that's perfect! You'll want to add a "STOP_TOKEN", something like an "END" character that indicates to the generator to stop fetching. It can be anything. The format you have is amazing! This is a good start 🎉🙌🚀 What I did is use a double newline, "\n\n", where the model outputs two new-line characters in a sequence. Your training data should add the "\n\n" to the end of each training sample, separating each Jira ticket.

    • @shaileshrana7165
      @shaileshrana7165 6 months ago +1

      @@StephenBlum Fantastic. Thanks.