Why Fine Tuning is Dead w/Emmanuel Ameisen

  • Published Oct 3, 2024
  • Arguments for why fine-tuning has become less useful over time, as well as some opinions as to where the field is going with Emmanuel Ameisen.
    Slides, notes, and additional resources are here: parlance-labs....
    00:00: Introduction and Background
    01:23: Disclaimers and Opinions
    01:53: Main Themes: Trends, Performance, and Difficulty
    02:53: Trends in Machine Learning
    03:16: Evolution of Machine Learning Practices
    06:03: The Rise of Large Language Models (LLMs)
    08:18: Embedding Models and Fine-Tuning
    11:17: Benchmarking Prompts vs. Fine-Tuning
    12:23: Fine-Tuning vs. RAG: A Comparative Analysis
    25:03: Adding Knowledge to Models
    33:14: Moving Targets: The Challenge of Fine-Tuning
    38:10: Essential ML Practices: Data and Engineering
    44:43: Trends in Model Prices and Context Sizes
    47:22: Future Prospects of Fine-Tuning
  • Howto & Style

Comments • 89

  • @elvissaravia
    @elvissaravia 2 months ago +17

    Very interesting and thought-provoking talk. I understand Emmanuel's take on why fine-tuning might be dead. However, in my opinion, maybe we still need it, just not as much as we used to, as LLMs get more powerful at analyzing, extracting, summarizing, and all the other capabilities that a wide range of tasks rely on. I prefer to think more deeply about the relationship between fine-tuning, RAG, and prompt engineering, and how to leverage all of them to build highly performant and reliable systems. Great talk! Keep it up!

  • @davidwright6839
    @davidwright6839 2 months ago +13

    The conceptual analogy that I like to use comes from cartography. The LLM is a map of regions called "concepts" that are projected into the multidimensional tensor space of tokens. Fine-tuning is a conformal map projection of this tensor space to create a "view" appropriate to a user's domain. Prompts are tokens that adjust the zoom level of the conformal map to view greater detail and narrow the possible output responses from the tensor space. RAG is like "street-view" images or satellite data that adjust the temporal window of the map beyond its training cutoff date. Prompts can be optimized for either the LLM or the fine-tuned map. If the prompt tokens are optimized for the LLM, fine-tuning is superfluous. If the prompt tokens are domain-specific for a conformal "view," the fine-tuned map should perform somewhat better.

    • @xspydazx
      @xspydazx 2 months ago +1

      Yes, I recently saw another fine-tuning technique to increase the probabilities of a particular series: by adding a list of entities to the content, you always have other keywords which will activate or extract the content!

    • @antonystringfellow5152
      @antonystringfellow5152 2 months ago +1

      Thanks, that's an excellent analogy.
      I didn't really have a good grasp of these points before I read this. It's easy for novices like me to get lost in the terminology.

    • @xspydazx
      @xspydazx 2 months ago

      When I was investigating building language models for the first time, I was really interested in what happens at each layer mathematically (forget the optimization function): what data is actually at each layer. The first ChatGPT told me that language models were a collection of n-gram language models.
      So I began with that type of model. I did not quite get it, so I dug deeper, only to find that embeddings were the key, and after checking out skip-grams / GloVe etc. I finally got somewhere.
      After that I discovered the transformer architecture and found that each layer is actually a word-to-word matrix of probabilities of the next word, since this is how n-gram models predict the next words; but the transformer has a massive vocabulary and many layers.
      I created a model step by step with GPT (it could not make a transformer then), but we made the components: the self-attention etc. These are the search function, enabling the later layers to refocus on selecting the next token. So at every layer the data is in embedding matrices, and you can view the journey of the token prediction as it travels through the network.
      Hence the layer stack: it was said that GPT had so many layers, but in truth we find 32 layers!

  • @agenticmark
    @agenticmark 3 months ago +16

    You fine-tune for BEHAVIOR; you use RAG for DATA.
    Fine-tuning is how the model interacts with the user; RAG is how the model gets factual information. That does not equal prompt engineering...

    • @xspydazx
      @xspydazx 2 months ago

      RAG is a bit like not trusting your own bot! It does have the knowledge, but you don't know how to get to it yet!

    • @thefryingpan1021
      @thefryingpan1021 2 months ago +2

      This is an oversimplified approach to RAG; RAG 100% can be used for behaviour if it is attached to an agentic workflow.

  • @vincenthenderson3733
    @vincenthenderson3733 2 months ago +5

    Around 8:30 to 10:10 - The RAG picture absolutely turns the problem into a search problem that is at least as important as the prompting problem. This is a far less trivial problem than most people realize. Using RAG requires deep thinking about the retrieval part, and this is notoriously difficult using embeddings only, at least if you want to optimize your token consumption, and overall inference time of your prompt chain. You'd greatly boost your RAG-based workflow by not only using embeddings but considering sticking a real search index behind it that is configured for the retrieval that you care about. That's a kind of LLM workflow optimization that I feel is not being talked about.
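
The point above about backing embeddings with a real search index can be sketched as a toy hybrid retriever. Everything here is illustrative: the corpus, the character-trigram "embedding" similarity, and the crude keyword score merely stand in for a real embedding model and a real search index (BM25, Elasticsearch, etc.).

```python
import math
from collections import Counter

# Toy corpus; in a real system the "embedding" score would come from a model
# and the keyword score from a real search index configured for your domain.
DOCS = [
    "fine-tuning adapts model weights to a narrow task",
    "retrieval augmented generation injects documents into the prompt",
    "a search index ranks documents by keyword relevance",
]

def keyword_score(query: str, doc: str) -> float:
    """Crude keyword overlap standing in for a real search index."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum((q & d).values())
    return overlap / (1 + math.log(1 + len(d)))

def embedding_score(query: str, doc: str) -> float:
    """Stand-in for cosine similarity between real embeddings:
    here, Jaccard similarity over character trigrams."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    a, b = grams(query.lower()), grams(doc.lower())
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_retrieve(query: str, alpha: float = 0.5, k: int = 2):
    """Blend the two signals; alpha is a tunable weight."""
    scored = [
        (alpha * embedding_score(query, d) + (1 - alpha) * keyword_score(query, d), d)
        for d in DOCS
    ]
    return [d for _, d in sorted(scored, reverse=True)[:k]]

top = hybrid_retrieve("how does retrieval put documents in the prompt?")
```

The design point is the blend: keyword search is precise on rare domain terms, embeddings catch paraphrases, and tuning `alpha` per workload is exactly the kind of retrieval optimization the comment argues is under-discussed.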

  • @poketopa1234
    @poketopa1234 2 months ago +3

    Man’s argument literally falls apart 5 minutes in. That being said, it’s still an interesting discussion and I appreciate the effort put into it.

  • @thegrumpydeveloper
    @thegrumpydeveloper 2 months ago +5

    I like the questions, but I really wish they had been asked at the end of the presentation rather than breaking the flow, with the answer often being a few slides or talking points down the way.

    • @hamelhusain7140
      @hamelhusain7140 2 months ago +2

      @@thegrumpydeveloper We did an experiment but didn't do the same in other videos. I didn't like it either!

  • @tufcat722
    @tufcat722 2 months ago +7

    I think what is misleading here is conflating machine learning with LLMs. The scope of LLMs is not the same as machine learning overall. Fine tuning of foundation models is not dead.
    Furthermore, aren’t the big LLM companies like Anthropic already doing extensive fine tuning on their own base models before releasing to the public? How does that fit with this idea?

  • @AtomicPixels
    @AtomicPixels 2 months ago +2

    Nothing remotely makes sense in this video, and I'm calling BS on those large roles you mention. XGBoost is a package. Deep learning is an objective; I'd actually recommend using XGBoost for any prediction checks in the deep learning process.

  • @vincenthenderson3733
    @vincenthenderson3733 2 months ago +3

    Around 24:00 - Another insight here, regarding the life-sciences guy's question, is that when we say "RAG" we tend to assume out-of-the-box, embedding-match RAG. But RAG is in many special cases best implemented with dedicated software parts that take the LLM query output and use other domain-specific NLP and business-rules software to actually do the retrieval of what you care about. In other words, build LLM workflows that are not only using LLMs. Get the LLM to do a task, then use that output to drive your advanced semantic retrieval (which you know works and embeds a lot of your subject-matter expertise) for the next step of your workflow, then use that output, which will typically be much more precise than a vanilla embedding match, to build your next LLM prompt.
    I would have advised the life-sciences guy that he's very likely not going to get much benefit from fine-tuning. You can't train a knowledge representation into an LLM using fine-tuning.
    Fine-tuning helps with task-specific input and output simplification and formatting, pruning, compliance, that sort of thing, not with the actual "logical" inference that the model does.
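
The multi-step workflow described above can be sketched as follows. All names are hypothetical: `fake_llm`, `KB`, and the dict lookup are stand-ins for a real model call and real domain-specific retrieval code.

```python
from typing import Callable, List

def extract_terms(llm: Callable[[str], str], question: str) -> List[str]:
    """Step 1: use the LLM only to pull out domain terms from the question."""
    reply = llm(f"List the key domain terms in: {question}")
    return [t.strip() for t in reply.split(",") if t.strip()]

def domain_retrieve(terms: List[str], knowledge: dict) -> List[str]:
    """Step 2: domain-specific retrieval. A plain dict lookup stands in for
    business-rule / NLP retrieval that is NOT an embedding search."""
    return [knowledge[t] for t in terms if t in knowledge]

def answer(llm: Callable[[str], str], question: str, knowledge: dict) -> str:
    """Step 3: build the next prompt from the precise retrieval results."""
    facts = domain_retrieve(extract_terms(llm, question), knowledge)
    context = "\n".join(facts)
    return llm(f"Using only these facts:\n{context}\nAnswer: {question}")

# Demo with a stubbed LLM and a one-entry knowledge base.
KB = {"BRCA1": "BRCA1 is a tumor suppressor gene."}

def fake_llm(prompt: str) -> str:  # stand-in for a real model call
    if prompt.startswith("List"):
        return "BRCA1"
    return "answer grounded in: " + prompt.split("facts:\n")[1].split("\n")[0]

print(answer(fake_llm, "What does BRCA1 do?", KB))
```

The structure, not the stubs, is the point: each LLM step is narrow, and the precision comes from the deterministic retrieval step in the middle.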

    • @xspydazx
      @xspydazx 2 months ago

      RAG is only for badly trained models, or for those who do not wish to train a model! It does have the knowledge, you just don't know how to get to it yet!

  • @esantirulo721
    @esantirulo721 2 months ago +6

    With LLMs, fine-tuning just makes the grounding problem more complicated: where does the output come from? The base model, the learned data, or nothing (hallucination)? That's why embedding-based search is great: you know what data you're generating your output from. In some industries (e.g., medical), being able to justify ("ground") an answer is mandatory. There are a few use cases for fine-tuning, if the cost of transforming the data into (prompt -> completion) pairs is not too expensive.

    • @xspydazx
      @xspydazx 2 months ago

      You can fine-tune a ground truth into the model, no problem.
      You can also always train the embeddings inside the model (you should use entity-based datasets when training the embeddings). When training for translation, it is advisable to train the head and the embeddings (each record is small, so you can do massive batches and really impact the embeddings); afterwards you should do normal fine-tuning on your new foreign-language or specialist dataset (not touching the embeddings), so the model can align its new usages!
      Repetitive data is also a way to embed truth!

  • @MagusArtStudios
    @MagusArtStudios months ago

    Been working on RAG in video games, engineering prompts using different models, and it has been going super smoothly. I trained a retrieval model on data and incorporated static elements into it for the video-game environment. Been so much fun.

  • @vincenthenderson3733
    @vincenthenderson3733 2 months ago +1

    Around 46:00 - context size is mostly a vanity metric AFAICT. I'd like to see data about how accuracy varies with the percentage of total nominal context that is actually used by the prompts. In fact, this could be one of the most beneficial uses of fine-tuning, for avoiding filling up the context with very long instructions.

  • @mrpocock
    @mrpocock 2 months ago +3

    I am fairly convinced that we are doing LLMs wrong. We should have language models that generate complete nonsense, and puppet them with knowledge models. So if RAG is how you do this, you want your non-RAG model to score essentially zero on all benchmarks except language structure, and inject everything with RAG or some other knowledge or skill injection.

  • @Bootcody
    @Bootcody months ago

    I love your work. Totally agree, RAG is the way to go. And I especially agree with the "Average time spent per task", where preparing the initial data is the most critical step. 👍

  • @rgolanng
    @rgolanng 2 months ago +1

    I agree that learning to prompt is far more efficient than fine-tuning. But this leads to random guys coming from nowhere (fintech background, etc.), knowing a little bit of prompting, and then calling themselves "prompt engineer" or "AI engineer". I know that an AI engineer just uses AI rather than actually developing it, but it's kind of different.

  • @lpls
    @lpls months ago +1

    Can't fine tuning be used to get smaller models to perform a specific task like a bigger model would, but faster and cheaper?

  • @vincenthenderson3733
    @vincenthenderson3733 2 months ago

    Around 19:10 - Absolutely, fine-tuning works better for some tasks, and it certainly doesn't work for knowledge injection, for which you must use RAG.
    But you also have to take into account the economics and logistics of it. Fine-tuning is a task-specific thing that you have to do, then you must maintain all your FT models and so on, which costs money and introduces complexity. RAG and prompting are far more nimble. It's not often in life that the easier solution is in fact better than the complex one.

  • @mrwhitecc
    @mrwhitecc 3 months ago +25

    I do not think he understands what happens to the model after fine-tuning. To give one example: if you have a unique reasoning pattern with no chance that a public pretraining dataset contains the correlated data, then SFT is the only way you can get the model to simulate the "reasoning" ability you want it to exhibit. Prompt engineering does not help at all, and neither does RAG.

    • @xspydazx
      @xspydazx 2 months ago +3

      Yes, you need to add the custom method to the model!
      Pretraining does not add methods; hence chat models (question/answer), instruct models (instruct/input/response), and context models (question/answer/context).
      These are only the basic methods these base models have been trained on, so even a coding task would need to be specifically trained for, hence Code Llama (one of the first code-specific models). Now people have added chain-of-thought etc. (but if you don't know the prompt they used, it might not work the same way as it was trained). Hence fine-tuning can also be called prompt tuning, i.e. pushing a specific prompt in deep so it's easy to get the expected response, given a prompt like "you are a helpful AI" (the most common prompt). Given that prompt, you should expect a higher-quality response due to the number of samples trained with it!

    • @mrwhitecc
      @mrwhitecc 2 months ago +1

      @@xspydazx indeed, totally agree

    • @javiergimenezmoya86
      @javiergimenezmoya86 2 months ago +1

      The more intelligent and advanced the model, the less important it is to do fine tuning.

    • @xspydazx
      @xspydazx 2 months ago +3

      @@javiergimenezmoya86 Not really. Right now they are testing people like you, to see who is satisfied with an off-the-shelf model.
      I suppose you buy off-the-shelf suits too?
      This is how they use you to gauge satisfaction, so they can get serious and stop releasing models!
      As you should know, it's not about the pretrained models; it's about the fine-tuned versions of these models, which have had methodologies trained in!

    • @xspydazx
      @xspydazx 2 months ago

      ​@@javiergimenezmoya86 It's because they do not understand the steps to train a model from scratch.
      The first stage is a corpus dump: this aligns the embeddings and tokenizer (the tokenizer can be trained separately with specific training data, i.e. NER data, enabling rich meanings and clusters in the token embedding space).
      The second stage is sequence-in, sequence-out, i.e. chat and response. After a large corpus dump we get some type of conversation, but not great; by giving the model some input/response data we can train the first level of meaningful responses, as it will have learned many phrases and predictions from the initial pretraining.
      Both these stages should push the whole model's parameters as well as train the embeddings in the transformer (positional embeddings), since the sequences contain the information we are searching for (positional matrices). Then you have a chat model!
      After this, the model also needs to be pretrained to perform tasks, i.e. question with context and response, or instruct/input/response. This adds the next logical layer of word usage to the model: given a task as input (containing as much relevant data as possible), here are the expected outputs, and we train for prediction, as with the chat model; the corpus model is trained for next-word generation.
      It is the combination of tasks trained on the SAME network with pretraining that gives us a base model which is multifunctional. When we fine-tune these models we can do another set of corpus dumps (domain-specific) to give the model domain knowledge and content, as well as designing many tasks, hence the Alpaca format being the best training format.
      Now they take models to another stage: messaging, as they need the model to output various message types in a messaging format. Here we introduce ROLES. This is also where functions come in: functions have been fine-tuned into models, but this needs to be done in pretraining now, same as tool use; these need to be separate outputs, populated when required. Same for image and audio input and output; this final complex output is what we truly seek. We need to get this information in template form, hence Pydantic (but it's slow), so this also needs to be in pretraining!
      For custom models we now know the pretraining data affects the quality of the later fine-tuning tasks. So now we are not relying on the Common Crawl as the corpus dump; we can use textbooks or school books, effectively teaching the model during pretraining with maths and geography, so the downstream training is more effective and the model is more pliable.
      Knowing the route to a model is important for understanding the importance of tuning one, and why fine-tuning can never be outdated! You could always use a base model and a RAG setup, and be restricted forever. But a true system goes from transactional database to data warehouse to data marts, or nuggets of facts, based on the overall picture; so the data would need to be uploaded into the model.
      The model is not storing data, but we could consider the data to be stored in a trie tree, where at every node there is a probability of stepping to the next node. The more times you travel down a path, the more that path becomes biased. So the model could indeed contain every possible combination, but with fine-tuning we are enforcing routes!
      If we actually wanted the model to know the Common Crawl, we would train the model for multiple epochs (we do not even need to consider the loss rate, as it would normalize to the dataset, and the Common Crawl is varied enough not to overfit). Hence, to enforce code or medical, you need to specifically fine-tune those routes and tasks in.
      As with good systems, with usage the probabilities of the output get better as the model becomes more able to select from past data; but we do not have this update feature yet, hence regular updates to the models! (RAG is just the transactional database.)
      One thing to remember: transformer networks do not replace DATABASES! Hence the requirement for RAG systems.
      The models have become managers!

  • @nyan-cp5du
    @nyan-cp5du 2 months ago +2

    The problem is if you don't do the cool thing, you help your company continue to generate revenue, but you don't get promoted and you spend the rest of your life writing SQL queries for shit pay

  • @SouhailEntertainment
    @SouhailEntertainment months ago

    00:00:00 - Introduction and Purpose of the Talk
    00:00:38 - Emmanuel's Background and Experience
    00:01:12 - Disclaimer and Scope of the Talk
    00:01:39 - Overview of Fine-Tuning: Trends, Performance, and Difficulty
    00:02:13 - Observed Trends in Machine Learning Over the Years
    00:05:22 - The Shift from Training to Fine-Tuning to Prompting
    00:06:16 - Future of Fine-Tuning in Context of LLMs
    00:06:45 - Extrapolating Trends in Fine-Tuning
    00:07:10 - Questions from the Audience on Trends
    00:08:26 - Comparing Fine-Tuning vs. Retrieval-Augmented Generation (RAG)
    00:11:12 - Importance of Context Injection and RAG
    00:14:21 - Detailed Comparison of Fine-Tuning and RAG
    00:16:33 - Audience Questions on Fine-Tuning vs. RAG
    00:18:26 - Performance Comparisons and Paper References
    00:21:34 - Limits of Fine-Tuning for Knowledge Embedding
    00:23:14 - Audience Example: Fine-Tuning for Precision Oncology
    00:25:04 - Adding Knowledge Through Fine-Tuning: Discussion
    00:26:47 - Challenges with Fine-Tuning for Specific Knowledge
    00:28:46 - Audience Question on Multilingual Fine-Tuning
    00:29:59 - Fine-Tuning for Specific Tasks like Code Models
    00:32:01 - Future of Model Training and Context Handling
    00:33:36 - Pre-Training vs. Fine-Tuning in Domain-Specific Models
    00:35:24 - Cost Considerations in Fine-Tuning
    00:37:13 - Examples of Effective Fine-Tuning
    00:38:09 - Evaluating Practical Utility of Fine-Tuning
    00:38:45 - Key Focus Areas in Machine Learning
    00:41:42 - Importance of Data Work and Infrastructure in ML
    00:43:28 - AI Engineering vs. Traditional ML Approaches
    00:45:58 - Trends in Model Pricing and Context Sizes
    00:48:04 - Dynamic Few-Shot Examples and RAG
    00:49:11 - Practical Uses and Best Practices in Prompt Engineering
    00:49:42 - Conclusion and Final Thoughts

  • @agenticmark
    @agenticmark 3 months ago +4

    Strange, prompt engineering over fine-tuning? If you don't want controllability, sure... Prompt engineering will disappear; fine-tuning will not.
    I train voice and chat models (fine-tuning), and I have trained dozens of agent foundational models that play Nintendo and Atari games, plus a bunch of classifiers. Training from scratch (foundational pretraining) is very, very costly. Fine-tuning is not.

    • @Player-oz2nk
      @Player-oz2nk 2 months ago +1

      What cloud workflows do you recommend for audio models? This is what I'm really interested in.

    • @xspydazx
      @xspydazx 2 months ago +1

      I find it's important to add as many unique prompts and methods as possible, and all types of tasks and response shapes...
      Then use a generic prompt! You will find really solid results each time. Forget your prompt from training! (Expect it to be embedded in the model somewhere deep!)

  • @kcm624
    @kcm624 2 months ago +1

    The questions that constantly keep interrupting the talk are super distracting.

  • @alaad1009
    @alaad1009 3 months ago +2

    Excellent conversation!!!

  • @zeryf4780
    @zeryf4780 2 months ago +1

    great and informative conversation! I wonder if there are more channels like yours!

  • @Tenebrisuk
    @Tenebrisuk 2 months ago +1

    It's a shame the host didn't actually let the guest answer the last question and instead proposed a different question; otherwise I found this very interesting.

  • @johnny017
    @johnny017 months ago

    Prompt engineering is like teaching my 3-year-old kid how to do things. You can run, but you cannot run on the street if there are cars, but you can run if the street is empty, etc. He has to see different situations to learn (fine-tuning 😅). It is very hard to get the model right with prompt engineering; it is more accurate and easier to evaluate with fine-tuning. The prompt also gets smaller, as we don't need to give 20 instructions on how to do things.

  • @tarikborogovac9614
    @tarikborogovac9614 2 months ago +1

    Could fine-tuning make the model less capable overall, i.e. forget general knowledge, reasoning, instruction following, and other abilities that help you answer questions even in the domain you are fine-tuning for? This type of pervasive capability loss may be hard to measure.

  • @muhannadobeidat
    @muhannadobeidat 2 months ago +1

    Nice discussion, thanks for sharing. I am 70% into it and still haven't heard examples or justification for why fine-tuning should be avoided. Lots of evaluation results, but those don't make sense if you are fine-tuning: you are mostly doing that to work on your custom data, and therefore generic evaluations may not apply nor portray the real performance of the fine-tuned model. I fine-tune, for example, to do better classification of service requests into categories and potential solutions.

  • @AtomicPixels
    @AtomicPixels 2 months ago +1

    JUSTICE ON CLICKBAIT D-bags: hahaha, finally got to the dude that points out his contradiction.

  • @jeremyh2083
    @jeremyh2083 2 months ago

    I'm in healthcare; my data isn't publicly available. Fine-tuning has pretty decent results because most things are single-shot. Agents/RAG should be fun, but again, our data being segregated is a hassle. I think Bloomberg has to be similar.

  • @enriquebruzual1702
    @enriquebruzual1702 2 months ago

    The success of a RAG app (all things being equal) comes from the context sent to the model, that is, from having a good vector DB and good search results.

  • @briancase9527
    @briancase9527 2 months ago

    Really good and useful talk.

  • @andrewcbuensalida
    @andrewcbuensalida months ago

    When gpt 3 upgrades to 3.5 or 4, is that upgrade caused by fine-tuning? Or a different mechanism? Or is it completely trained from scratch? Thanks for the talk by the way.

  • @cesar_chez
    @cesar_chez months ago

    Fine-tuning is not dead; just see the recent example of Genie, a fine-tuned version of GPT-4o that has reached the top of SWE-Bench, something impossible to do (right now) with prompting.

  • @chunheichau7947
    @chunheichau7947 2 months ago +2

    ONLY FOR LLM!!!

  • @SearchingForSounds
    @SearchingForSounds 2 months ago

    Another place I think this is echoed is in image diffusion models. IP-Adapter has become so powerful in Stable Diffusion that we're able to basically create instant models using 3-4 reference images and normalizing/averaging the tokens they create. By conditioning a prompt with those tokens via IP-Adapter, fine-tuning base models is, in all but the most niche cases, pointless now.

  • @Steve-lu6ft
    @Steve-lu6ft 2 months ago

    When you say we should be spending days working on prompts, how so? I'm assuming you have a high level overview of how these prompts should be structured in mind, but can you break it down and simplify it for me?

  • @MLwithZain
    @MLwithZain 2 months ago

    This paper ("Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs") only does continued pre-training (i.e. next-word prediction), not supervised fine-tuning, so the comparison is not entirely accurate.

  • @notfaang4702
    @notfaang4702 2 months ago

    I was expecting more from this talk. The TL;DR was: do RAG first, maybe fine-tune later; but it also very much depends on the case and what you want to do with the model.

  • @matty-oz6yd
    @matty-oz6yd months ago

    Why is he comparing fine tuning to RAG as though the use case for each is the same? What is the point of this presentation?

  • @peterbizik224
    @peterbizik224 2 months ago +1

    Nice session, thank you. I would love to see a reliable, stable base model that understands the languages. But the domain knowledge is always questionable in my opinion, as most (some?) of the books used for base-model training (technical books, advanced papers) are quite complex, and I am still not truly convinced that text + pictures + math were captured with very high precision.

    • @artifishially_stupid
      @artifishially_stupid 2 months ago +1

      Well put. Most of the LLMs contain way more knowledge than we need at the base model level. Nevertheless, I've achieved excellent results by uploading my own documents and instructing the chatbot to search my documents first. The documents are very technical and complex in their formatting, so I had to do A LOT of cleaning and preprocessing, added some annotation, and converted to txt files. Having done all of that, I'm not sure fine-tuning would add much in my particular case.

    • @peterbizik224
      @peterbizik224 2 months ago

      @@artifishially_stupid Well, that's the way, I guess. But for realistic corporate operational needs, once this gets presented to management with some level of competency, once it comes to "A LOT", it's a no-go :)

  • @rajeevsingh758
    @rajeevsingh758 2 months ago +3

    Are you comparing Fine tuning with Prompting ????? Who made this chart?

  • @codevacaphe3763
    @codevacaphe3763 2 months ago

    I have some experience with fine-tuning. Fine-tuning and transfer learning are still great techniques, but you have to have in-depth knowledge of the math to know how to apply them. LLMs are really hard to fine-tune; some of the time it will break the parameters learned from the previous context (just my opinion, though). Some CNN applications, on the other hand, get good fine-tuning results, since the CNN captures the shapes and patterns of the image, and after the CNN layers the key features are still maintained.

  • @hidroman1993
    @hidroman1993 months ago

    Let him talk, and then ask questions 🥰

  • @toreon1978
    @toreon1978 2 months ago

    35:41 Sorry, but I don't get that. Isn't the hard part creating the fine-tuning dataset? The actual execution doesn't cost that much, does it?

  • @ayushman_sr
    @ayushman_sr 2 months ago

    For some use cases, having a better prompt will add more tokens, hence more cost for the same inference. Any thoughts?

    • @xspydazx
      @xspydazx 2 months ago

      It's advisable to send an entity list in with your prompt, to attract a higher percentage of associated information in the output response.
      If your training also reflects this, you will generate more truthful responses, even for unseen concepts!

  • @xspydazx
    @xspydazx 2 months ago +3

    25:56 / 50:06: Can you train knowledge in? You should remember that the model is:
    1. a collection of regression models
    2. a collection of word-to-word matrices!

  • @toreon1978
    @toreon1978 2 months ago

    46:53 The context-window line is incorrect. For the models you mentioned, we are at around 100K average.

    • @xspydazx
      @xspydazx 2 months ago

      Yes, 128k, but some are unlimited.
      In fact they are all unlimited: it's up to you how you train your model; the problem is only the processing size it takes to load and train the model.
      Inside the tokenizer config files (Mistral, for instance) it says that the max token length was way past a trillion! So first it needs to be trained, then it can be used.
      This is what we expect from released models: that they have been trained on the largest possible contexts, leaving us able to choose a lower context according to our memory capacity. The larger the context you set for a model, the slower the response time, since it depends on your GPU, so even a small model could consume a large GPU stack. Small models with large context are the best combination for local execution, so these 4B and 3B models etc. need training on long context.
      After that, the pretrained models they release will be worth downloading as a home base model, or locking in as a GGUF model.
      Currently you need a training regime for any model you download if you're doing serious work, hence RAG!

  • @Douchebagus
    @Douchebagus 2 months ago

    In the context of diffusion-based image models, fine-tuning is infinitely more important than prompting, so I don't think your talk applies to all machine learning models.

  • @toreon1978
    @toreon1978 2 months ago

    11:16 I think I get the idea of where not to fine-tune. But for the 5% where it is needed, I always thought it's like carving out a specific part of the general knowledge and behavior LLMs have. So, to give concise, fitting financial advice, an LLM has to ignore a lot of (bad) knowledge and also suppress a lot of expansive responses. That's not something good prompting can achieve, right?

  • @darkmatter9583
    @darkmatter9583 2 months ago

    RAG? Quantize data? Favorite LLM? HELP

  • @agenticmark
    @agenticmark 3 months ago +2

    _very_ unscientific claim about the lines on that chart. try trading stocks with that mentality of guessing it will just keep going up!

  • @JL-1735
    @JL-1735 2 months ago +1

    This guest has quite a condescending attitude; I don't buy what he's saying as a result. On top of that, he came across as rude (his "This is not an Anthropic presentation" etc.; you can say that in a less aggressive way). Anyway, interesting topic, but it's too much founded on fluffy lines and reasoning that serves ... suppliers of big foundation models like Anthropic, since closely guarded models like Anthropic's are the opposite of easy to fine-tune. Anyway, Meta will prove open-weights models are the future, and yes, fine-tuning has its place; not everything is an LLM problem baked into the foundation model.

  • @AtomicPixels
    @AtomicPixels 2 months ago

    Jesus Christ, dude, *someone* has to make these models. Love how you say a bunch of stuff but never say what that is, then say to do something else... then mention you have nothing but opinions.
    It's ALWAYS valuable to make new loss functions. What ISN'T helpful is people like you reinforcing dependence on LLMs and advising naivety about them.

  • @yvettecrystal6075
    @yvettecrystal6075 2 months ago

    When fine-tuning an LLM with techniques like LoRA, what is the model actually doing? I know it is weight updating, but what does the model learn from it? Can anyone explain in an intuitive way?

    • @fneful
      @fneful 2 months ago

      The only thing LLMs are good at is predicting the next word given all previous words. Better weights mean better prediction (with more confidence) of the next word. You can think of it like this: if previously the model was in doubt among 5 candidate words, after fine-tuning it is choosing among, say, only 3.
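
The "doubt among 5 words, then 3" intuition can be made concrete as a drop in next-token entropy; the probabilities below are made up purely to illustrate.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a next-token distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

# Before fine-tuning: probability mass spread over ~5 candidate words.
before = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
# After fine-tuning: mass concentrated on ~3 candidates.
after = np.array([0.60, 0.25, 0.15, 0.00, 0.00])

print(entropy(before), entropy(after))  # the second value is smaller
```

Lower entropy is exactly "more confidence": fewer plausible next words, each more probable.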

    • @StarnikBayley
      @StarnikBayley 2 months ago +1

      The original LLM's weights are frozen and left intact; nothing changes in the weights of the original model. However, LoRA adds new trainable components alongside it. A silly example: a person's vision may not be able to pick out camouflage in a jungle. LoRA is like adding a new segment with its own weights to the person's brain, without modifying the existing brain, which enables the person to see green and brown with higher precision so that he can easily distinguish camouflage in the jungle.
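
Mechanically, the added piece is a low-rank update to each weight matrix. A minimal numpy sketch of the idea (not any specific library's API; the sizes and the toy "training" step are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                          # hidden size, LoRA rank (r << d)
W = rng.normal(size=(d, d))          # pretrained weight: FROZEN, never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, starts at zero

def forward(x: np.ndarray) -> np.ndarray:
    # LoRA forward pass: y = W x + B (A x).
    # With B = 0 at initialization, this is exactly the base model.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
assert np.allclose(forward(x), W @ x)  # identical to base model at init

# "Training" touches only A and B (2*d*r numbers instead of d*d):
B += rng.normal(size=(d, r)) * 0.1
delta = B @ A                        # the learned low-rank update to W
```

So the model "learns" a small additive correction `B @ A` to the frozen weights, steering behavior in the fine-tuning domain while the base knowledge stays untouched.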

  • @anthonyphan1922
    @anthonyphan1922 months ago +1

    Click bait