This is awesome. For the next video, I have a suggestion: Suppose I have multiple PDF files containing a lot of information about my organization. How can I use a large language model (LLM) like the one you used above to create a dataset extracted from the knowledge provided in these PDFs?
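Something like the sketch below is what I'm imagining (my own rough attempt, assuming pypdf for extraction and an OpenAI-compatible client; the model name and file path are placeholders):

```python
# Rough sketch: extract text from a PDF, then ask an LLM to turn each chunk
# into Q&A pairs. pypdf, the client, model name, and file path are assumptions.
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()  # point base_url at your own endpoint if needed

def pdf_to_chunks(path, chunk_size=2000):
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

dataset = []
for chunk in pdf_to_chunks("org_handbook.pdf"):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any instruct model should work here
        messages=[{
            "role": "user",
            "content": "Write 3 question-answer pairs, one per line as "
                       f"'Q: ... A: ...', based only on this text:\n{chunk}",
        }],
    )
    dataset.append(resp.choices[0].message.content)
```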
First time learning that LLMs can generate datasets! Thanks a lot.
I also generate synthetic datasets... a secret tip for alignment: set the mood and tone as parameters in the prompt as well when generating the questions and responses (it makes the dataset a little more dynamic).
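To make the tip concrete, here is a minimal sketch of what I mean (the client, model name, and mood/tone lists are just my placeholders, assuming an OpenAI-compatible endpoint):

```python
# Minimal sketch of the tip above: vary mood and tone per sample so the
# generated questions/responses aren't all in one register.
import random
from openai import OpenAI

client = OpenAI()
MOODS = ["curious", "frustrated", "skeptical", "enthusiastic"]
TONES = ["formal", "casual", "terse", "playful"]

def generate_pair(topic):
    mood, tone = random.choice(MOODS), random.choice(TONES)
    prompt = (f"Write one user question about {topic} from a {mood} user, "
              f"then an assistant answer in a {tone} tone, "
              "formatted as 'Q: ...' and 'A: ...' on separate lines.")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```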
Hi @ByteBop, I would like to create a synthetic dataset for images. Do you know how to do it?
Hey bro, can you tell me whether it costs anything?
Also, can you share your code? I'm a newbie and want to learn.
Really great tutorial. Keep 'em coming. First time seeing a NIMs demo.
Thank you for this awesome tutorial. I request you to kindly make a video on synthetic dataset generation from PDF files.
Thank you so much
Super cool! Just did something extremely similar but in Google Sheets, so my non-tech peers can help.
This is amazing! Can you also explain how to create a classification model from the generated dataset?
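For example, would a simple baseline like this work? (A sketch using scikit-learn; the CSV file and the 'text'/'label' column names are hypothetical, not from the video.)

```python
# Baseline sketch: TF-IDF + logistic regression over a generated dataset.
# The 'text'/'label' column names and the CSV file are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("generated_dataset.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```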
If I have a specific number of subtopics and I don't want to generate new subtopics, how would I choose the number of datasets to generate?
Great insight!
Wow! Great tutorial, Mervin, this is exactly what I'm looking for for fine-tuning. I tested it and it worked perfectly. I have a question: from my understanding of your blog and video, this dataset is suitable for ORPO fine-tuning (AI feedback scores)? Can I still use it for SFT by filtering for the responses (rows) with the best scores?
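To be concrete, the filtering I have in mind is something like this sketch with the datasets library (the 'prompt'/'response'/'score' column names are my guess at the schema, not from your blog):

```python
# Sketch: keep only the highest-scoring response per prompt so the ORPO-style
# dataset can double as SFT data. Column names are assumptions about the schema.
from datasets import load_dataset

ds = load_dataset("json", data_files="generated.jsonl", split="train")

best = {}
for row in ds:
    key = row["prompt"]
    if key not in best or row["score"] > best[key]["score"]:
        best[key] = row

sft_rows = [{"prompt": r["prompt"], "response": r["response"]}
            for r in best.values()]
```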
Can we ask LLMs to outline, create questions about, reply to, and summarize BOOKS, and use that to fine-tune LLMs?
Hey, I am only able to generate 10 examples. How do I make sure it generates at least a thousand?
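For anyone else hitting this, the workaround I'm trying is to loop and batch the requests instead of asking for everything in one call (a rough sketch; generate_examples, the model, and the prompt are my placeholders, not the exact code from the video):

```python
# Rough sketch: batch the generation calls in a loop until the target count
# is reached. The prompt and model here are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def generate_examples(n):
    # Hypothetical helper: one call that returns n Q&A dicts.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Generate {n} question-answer pairs about the "
                              "topic, one JSON object per line with keys "
                              "'question' and 'answer'. Output JSON lines only."}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [json.loads(l) for l in lines if l.strip().startswith("{")]

TARGET, BATCH = 1000, 10
rows = []
while len(rows) < TARGET:
    rows.extend(generate_examples(BATCH))

with open("dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```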
Hey buddy, can't we create a synthetic dataset for images? I mean uploading images and getting responses to questions about them... how do we do it?
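Something like this sketch is what I'm picturing: send each image to a vision-capable chat model and ask for Q&A pairs about it (assuming an OpenAI-compatible vision endpoint; the model name and file path are placeholders):

```python
# Sketch: generate Q&A about a local image via a vision-capable chat model.
# Model name and file path are assumptions.
import base64
from openai import OpenAI

client = OpenAI()

with open("sample.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Write 3 question-answer pairs about this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```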
I'm still fuzzy about when creating synthetic data is useful in a practical scenario for me, as a single person, not a large company that needs to fine-tune LLMs. Can someone clarify? What's the real-world use for this?
LLMs can hallucinate on the questions you ask, especially for low-resource languages. For my own language, Bengali, they hallucinate a lot and give wrong answers about facts and events. Now, you can use RAG to curb hallucination, but RAG depends on the size of the context. If I want to build a specialized model that knows and answers facts about Bengali culture and recent events, it's useful if I can instead fine-tune with a dataset of those facts and recent events, so that the knowledge becomes part of the model itself and therefore it hallucinates less.

You can think of RAG as an open-book exam, where you can search the book for answers while taking the exam, whereas a fine-tuned model is you having the knowledge in your brain: depending on your memory and reasoning ability, you will give an accurate or a hallucinated answer. But if the knowledge becomes part of your memory accurately and you can retrieve it on demand, you no longer have to search your books every time someone asks you a question. So I hope you now understand why it may be useful to fine-tune. The ultimate goal is "no hallucination" and therefore better accuracy.
@@brishtiteveja Thanks for the response, I really appreciate it. I believe you're talking about real-world data, not data generated by the AI. My question was focused on the usefulness of synthetic data, though. Or am I interpreting synthetic data the wrong way?
@@brishtiteveja That's a great answer.
Can it be used for the Indonesian language?
Bro, please create a good model for Tamil.
We don't have the best GPUs.
If you do it, we can use it for many use cases.
ollama.com/mervinpraison
@@MervinPraison 🥰 Thank you
@@MervinPraison Pretty cool. Can you point me to how I can do something like this for another language? I am trying to help build one for the Yoruba language.
Use Claude to do this. No model in this world is even close to Claude currently. Don't believe the benchmarks. The difference is huge.