How to Create Synthetic Dataset EASILY? Step by Step Tutorial

  • Published 27 Dec 2024

Comments •

  • @Menasaat · 4 months ago +7

    This is awesome. For the next video, I have a suggestion: Suppose I have multiple PDF files containing a lot of information about my organization. How can I use a large language model (LLM) like the one you used above to create a dataset extracted from the knowledge provided in these PDFs?
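The PDF-to-dataset idea above can be sketched roughly: extract the text from each PDF (e.g. with a library like pypdf, not shown here), chunk it, and prompt an LLM for Q&A pairs grounded in each chunk. A minimal sketch under those assumptions; the chunk size and prompt wording are illustrative, not from the video:

```python
# Sketch: turn already-extracted PDF text into Q&A-generation prompts.
# PDF extraction itself (e.g. pypdf's PdfReader) is assumed and omitted.

def chunk_text(text: str, max_chars: int = 1500) -> list[str]:
    """Split document text into roughly paragraph-aligned chunks."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if len(current) + len(para) > max_chars and current:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def build_prompt(chunk: str, n_pairs: int = 5) -> str:
    """Prompt asking an LLM to write Q&A pairs grounded only in the chunk."""
    return (
        f"Using ONLY the text below, write {n_pairs} question/answer pairs "
        "as JSON lines with keys 'question' and 'answer'.\n\n"
        f"Text:\n{chunk}"
    )

chunks = chunk_text("First section about the org.\n\nSecond section with policies.")
prompts = [build_prompt(c) for c in chunks]
```

Grounding each prompt in a single chunk keeps the generated answers traceable back to the source PDF, which makes it easier to spot hallucinated pairs before fine-tuning.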

  • @litttlemooncream5049 · 5 months ago +1

    First time learning that LLMs can generate datasets! Thanks a lot.

  • @ByteBop911 · 4 months ago +2

    I also generate synthetic datasets... a secret tip for alignment: set the mood and tone as parameters in the prompt as well when generating the questions and responses (it makes the dataset a little more dynamic).

    • @kolasatheesh1719 · 4 months ago

      Hi @ByteBop911, I would like to create a synthetic dataset for images. Do you know how to do it?

    • @CryptoMaN_Rahul · 4 months ago +1

      Hey bro, can you tell me whether it costs anything?
      Also, can you share your code? I'm a newbie and want to learn.
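The mood-and-tone tip above can be sketched as a prompt template that takes mood and tone as parameters and sweeps their combinations, so the generated questions are phrased in varied ways. The parameter values and prompt wording below are illustrative assumptions, not from the video:

```python
import itertools

# Sketch: vary mood and tone as prompt parameters so generated questions
# aren't all phrased the same way. Values here are examples only.
MOODS = ["curious", "skeptical", "frustrated"]
TONES = ["formal", "casual", "concise"]

def question_prompt(topic: str, mood: str, tone: str) -> str:
    """Build one LLM prompt for a question in a given mood and tone."""
    return (
        f"Write one {tone} question a {mood} user might ask about {topic}. "
        "Return only the question text."
    )

# One prompt per (mood, tone) combination for a single topic:
prompts = [question_prompt("vector databases", m, t)
           for m, t in itertools.product(MOODS, TONES)]
```

Sweeping the full cross product (here 3 × 3 = 9 prompts per topic) is a simple way to make the resulting dataset more dynamic without changing the underlying topics.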

  • @maruc14 · 4 months ago

    Really great tutorial. Keep 'em coming. First time seeing a NIMs demo.

  • @atultiwari88 · 4 months ago +1

    Thank you for this awesome tutorial. I request you to kindly make a video on synthetic dataset generation from PDF files.
    Thank you so much.

  • @gr8tbigtreehugger · 5 months ago

    Super cool! Just did something extremely similar but in Google Sheets, so my non-tech peers can help.

  • @swetharavishankar4825 · 3 months ago

    This is amazing! Can you also explain how to create a classification model from the generated dataset?

  • @batigol_9 · 2 months ago

    If I have a specific number of subtopics and don't want to generate new subtopics, how would I choose the number of dataset rows to generate?

  • @MeinDeutschkurs · 5 months ago

    Great insight!

  • @chaithanyavamshi2898 · 5 months ago

    Wow! Great tutorial, Mervin. This is exactly what I'm looking for for fine-tuning. I tested it and it worked perfectly. I have a question: from my understanding of your blog and video, this dataset is suitable for ORPO fine-tuning (AI feedback scores)? Can I still use it for SFT by filtering the responses (rows) down to the best scores?
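The filtering idea in the question above can be sketched in a few lines: keep only the highest-scoring response per prompt, turning an ORPO-style scored dataset into plain SFT pairs. The field names (`prompt`, `response`, `score`) and the sample rows are assumptions for illustration:

```python
# Sketch: reduce a scored (ORPO-style) dataset to one best response per
# prompt, suitable as SFT pairs. Field names are illustrative assumptions.
rows = [
    {"prompt": "What is RAG?", "response": "Retrieval-augmented generation...", "score": 4.5},
    {"prompt": "What is RAG?", "response": "A type of cloth.", "score": 1.0},
    {"prompt": "Define SFT.", "response": "Supervised fine-tuning...", "score": 4.8},
]

def best_per_prompt(rows: list[dict]) -> list[dict]:
    """Keep only the highest-scoring row for each distinct prompt."""
    best: dict[str, dict] = {}
    for r in rows:
        if r["prompt"] not in best or r["score"] > best[r["prompt"]]["score"]:
            best[r["prompt"]] = r
    return list(best.values())

sft_rows = best_per_prompt(rows)  # one high-scoring response per prompt
```

A score threshold (e.g. dropping anything below 4.0 even if it is the best for its prompt) would be a natural extra filter before SFT.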

  • @vitalis · 5 months ago

    Can we ask LLMs to outline, create questions, answer, and summarise BOOKS, then use that to fine-tune LLMs?

  • @swetharavishankar4825 · 3 months ago

    Hey, I am able to generate only 10 examples. How do I make sure it generates at least a thousand?
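One common answer to the count problem above is to loop the generation call in batches rather than asking for a thousand rows in one shot, since models tend to cap or truncate long lists. A sketch of that pattern; `generate_batch` is a stub standing in for whatever LLM call the tutorial uses:

```python
# Sketch: build a large dataset by looping small generation batches.
# generate_batch is a placeholder for the real LLM call.

def generate_batch(topic: str, batch_size: int, start_id: int) -> list[dict]:
    """Stub: a real version would prompt the LLM for batch_size rows here."""
    return [{"id": start_id + i, "topic": topic} for i in range(batch_size)]

def generate_dataset(topic: str, total: int, batch_size: int = 10) -> list[dict]:
    """Accumulate batches until the requested total number of rows exists."""
    rows: list[dict] = []
    while len(rows) < total:
        remaining = total - len(rows)
        rows.extend(generate_batch(topic, min(batch_size, remaining), len(rows)))
    return rows

data = generate_dataset("machine learning", total=25)
```

In a real run you would also deduplicate rows between batches, since independent LLM calls on the same topic often repeat themselves.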

  • @kolasatheesh1719 · 4 months ago

    Hey buddy, can't we create a synthetic dataset for images? I mean uploading images and getting responses to questions... how do we do it?

  • @fascinatingfactsabout · 5 months ago

    I'm still fuzzy about when creating synthetic data is useful in a practical scenario for me, as a single person, not a large company that needs to fine-tune LLMs. Can someone clarify? What's the real-world use for this?

    • @brishtiteveja · 5 months ago +5

      LLMs can hallucinate on the questions you ask, especially for low-resource languages. For my own language, Bengali, it hallucinates a lot and gives wrong answers about facts/events. Now, you can use RAG to curb hallucination, but RAG depends on the size of the context. If I want to build a specialized model that knows and answers facts about Bengali culture and recent events, it's better if I can fine-tune with a facts and recent-events dataset, so that the knowledge becomes part of the model itself and therefore there's no hallucination. You can think of RAG as an open-book exam, where you can search the book for answers while taking the exam, versus a fine-tuned model, which is you having the knowledge in your brain. Of course, depending on your memory and reasoning ability, you will give an accurate or a hallucinated answer. But if it becomes part of your memory accurately and you can retrieve it on demand, you no longer have to search your books every time someone asks you a question. So I hope you now understand why it may be useful to fine-tune. The ultimate goal is "no hallucination" and therefore better accuracy.

    • @fascinatingfactsabout · 5 months ago

      @@brishtiteveja Thanks for the response, I really appreciate it. I believe you're talking about real-world data, not data generated by the AI. My question was focused on the usefulness of synthetic data, though. Or am I interpreting synthetic data the wrong way?

    • @john_blues · 4 months ago

      @@brishtiteveja That's a great answer.

  • @bocilmillenium7698 · 5 months ago

    Can it be used for the Indonesian language?

  • @commoncats5437 · 5 months ago +1

    Bro, create a good model for Tamil.
    We don't have the best GPUs.
    If you do it, we can create it for many use cases.

    • @MervinPraison · 5 months ago +1

      ollama.com/mervinpraison

    • @commoncats5437 · 5 months ago

      @@MervinPraison 🥰 Thank you

    • @john_blues · 4 months ago +1

      @@MervinPraison Pretty cool. Can you point me to how I can do something like this for another language? I am trying to help build one for the Yoruba language.

  • @TheBestgoku · 4 months ago

    Use Claude to do this. No model in the world is even close to Claude currently. Don't believe the benchmarks. The difference is huge.