![The ML Tech Lead!](/img/default-banner.jpg)
The ML Tech Lead!
United States
Joined Sep 21, 2023
My name is Damien. I am a former ML Tech Lead at Meta with more than 10 years in the field of AI/ML! I share my knowledge of the field to help prepare the next generation of ML Engineers.
Understanding How LoRA Adapters Work!
LoRA Adapters are, to me, one of the smartest strategies used in Machine Learning in recent years! LoRA came as a very natural strategy for fine-tuning models. In my opinion, if you want to work with large language models, knowing how to fine-tune models is one of the most important skills to have these days as a machine learning engineer.
So, let me show you the mathematical foundation for those LoRA adapters, why they are useful, and how they are used. Let's get into it!
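The low-rank idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the exact code from the video; the `LoRALinear` wrapper and its hyperparameters (`r`, `alpha`) are made up for the example:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512), r=8)
out = layer(torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 512])
```

Only `A` and `B` receive gradients, which is where the fine-tuning savings come from.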
Views: 584
Videos
The Backpropagation Algorithm Explained!
525 views · 12 hours ago
The backpropagation algorithm is the heart of deep learning! It is the core reason why we can have advanced models like LLMs. In a previous video, we saw that we can use the computational graph that is built as part of deep learning models to compute any derivative of the network outputs with respect to the network inputs. I'll put the link in the description. Now we are going to see how we...
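As a rough sketch of the idea, here is a manual backward pass through a tiny two-layer computation using nothing but the chain rule (the network and all values are invented for the example):

```python
# Forward pass for y = w2 * relu(w1 * x + b1) + b2.
x, w1, b1, w2, b2 = 2.0, 0.5, 1.0, -3.0, 0.25
z = w1 * x + b1          # pre-activation
h = max(z, 0.0)          # ReLU
y = w2 * h + b2          # output

# Backward pass: walk the graph from y back to the inputs, one node at a time.
dy = 1.0
dw2 = dy * h
db2 = dy
dh = dy * w2
dz = dh * (1.0 if z > 0 else 0.0)  # ReLU derivative
dw1 = dz * x
db1 = dz
dx = dz * w1
print(dw1, dx)  # -6.0 -1.5
```

Each local derivative is cheap; backpropagation just composes them in reverse order of the forward pass.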
Understanding The Computational Graph in Neural Networks
870 views · 21 hours ago
Do you know what this computational graph used by deep learning frameworks like TensorFlow or PyTorch is? No? Let me tell you then! The whole logic behind how neural networks function is the back-propagation algorithm. This algorithm allows us to update the weights of the network so that it can learn. The key aspect of this algorithm is to make sure we can compute the derivatives or the gradients ...
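To see the graph in action, here is a minimal PyTorch snippet (an illustration only; the function is arbitrary). Each operation adds a node to the graph, and `backward()` traverses it in reverse:

```python
import torch

# PyTorch records the computational graph as we apply operations.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
y = (w * x + 1.0) ** 2   # mul -> add -> pow: three nodes in the graph
y.backward()              # reverse traversal computes all gradients

print(x.grad.item())  # dy/dx = 2*(w*x+1)*w = 42.0
print(w.grad.item())  # dy/dw = 2*(w*x+1)*x = 28.0
```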
How to Approach Model Optimization for AutoML
551 views · 14 days ago
Since I started my career in machine learning, I have worked hard to automate every aspect of my work. If I couldn't develop a fully production-ready machine learning model at the click of a button, I was doing something wrong! I find it funny how you can recognize a senior machine learning engineer by how little they work to achieve the same results as a junior one working 10 times as hard! AutoML ha...
Understanding CatBoost!
558 views · 14 days ago
CatBoost was developed by Yandex in 2017: CatBoost: unbiased boosting with categorical features. They realized that the boosting process induces a special case of data leakage. To prevent that, they developed two new techniques, the expanding mean target encoding and the ordered boosting. - The Gradient Boosted Algorithm Explained: th-cam.com/video/XWQ0Fd_xiBE/w-d-xo.html - Understanding XGBoos...
Implementing the Self-Attention Mechanism from Scratch in PyTorch!
616 views · 21 days ago
Let’s implement the self-attention layer! Here is the video where you can find the logic behind it: th-cam.com/video/W28LfOld44Y/w-d-xo.html
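For reference, a minimal single-head version might look like this (a sketch with made-up weight matrices, not necessarily the exact code from the video):

```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    """Single-head self-attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # token-to-token interaction scores
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ v                           # weighted average of the values

torch.manual_seed(0)
x = torch.randn(4, 8)                            # 4 tokens, embedding dim 8
wq, wk, wv = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # torch.Size([4, 8])
```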
What is the Vision Transformer?
555 views · 21 days ago
I find the Vision Transformer to be quite an interesting model! The self-attention mechanism and the transformer architecture were designed to help fix some of the flaws we saw in previous models that had applications in natural language processing. With the Vision Transformer, a few scientists at Google realized they could take images instead of text as input data and use that architecture as ...
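The patch-extraction step can be sketched like this (a toy illustration with plain Python lists; a real ViT works on tensors and adds a learned linear projection plus positional embeddings):

```python
# Split an image into non-overlapping patches and flatten each one,
# the way the Vision Transformer turns pixels into a "sequence" of tokens.
def image_to_patches(image, patch):
    """image: H x W grid (list of lists); returns one flat vector per patch x patch tile."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            patches.append([image[r + dr][c + dc]
                            for dr in range(patch) for dc in range(patch)])
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
patches = image_to_patches(image, patch=2)
print(len(patches), patches[0])  # 4 [0, 1, 4, 5]
```

Each flattened patch then plays the role a word token plays in the text transformer.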
Understanding XGBoost From A to Z!
928 views · 28 days ago
I often say that at some point in my career, I became more of an XGBoost modeler than a Machine Learning modeler. That's because if you were working on large tabular datasets, there was no point in trying another algorithm; it would provide close to optimum results without much effort. Yeah, ok, LightGBM and CatBoost are obviously as good and sometimes better, but I will always keep a special place i...
The Gradient Boosted Algorithm Explained!
1.1K views · 1 month ago
In the gradient-boosted trees algorithm, we iterate the following: - We train a tree on the errors made at the previous iteration - We add the tree to the ensemble, and we predict with the new model - We compute the errors made for this iteration.
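The three steps above can be sketched end to end with depth-1 "stumps" on toy 1-D data (the `fit_stump` helper, the learning rate, and all numbers are invented for the illustration):

```python
# Minimal gradient boosting with squared loss: each round fits a stump to the residuals.
def fit_stump(xs, residuals):
    best = None
    for t in sorted(set(xs)):
        left = [r for x_, r in zip(xs, residuals) if x_ <= t]
        right = [r for x_, r in zip(xs, residuals) if x_ > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - (lm if x_ <= t else rm)) ** 2 for x_, r in zip(xs, residuals))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x_: lm if x_ <= t else rm

def gradient_boost(xs, ys, n_rounds=20, lr=0.5):
    pred = [sum(ys) / len(ys)] * len(xs)                 # start from the mean
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]    # errors of the current ensemble
        tree = fit_stump(xs, residuals)                  # train a tree on those errors
        pred = [p + lr * tree(x_) for p, x_ in zip(pred, xs)]  # add it to the ensemble
    return pred

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2]
print(gradient_boost(xs, ys))
```

After a handful of rounds the ensemble's predictions sit close to the targets, even though each individual stump is very weak.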
How Can We Generate BETTER Sequences with LLMs?
383 views · 1 month ago
We know that LLMs are trained to predict the next word. When we decode the output sequence, we use the tokens of the prompt and the previously predicted tokens to predict the next word. With greedy decoding or multinomial sampling decoding, we use those predictions to output the next token in an autoregressive manner. But is this the sequence we are looking for, considering the prompt? Do we ac...
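The greedy autoregressive loop can be sketched like this; the `next_token_logits` "model" is a made-up stand-in for an LLM:

```python
# A toy autoregressive loop over a 4-token vocabulary.
def next_token_logits(tokens):
    # Hypothetical "model": prefers to follow each token with token+1, wrapping at 4.
    last = tokens[-1]
    return [3.0 if t == (last + 1) % 4 else 0.0 for t in range(4)]

def greedy_decode(prompt, n_steps):
    tokens = list(prompt)
    for _ in range(n_steps):
        logits = next_token_logits(tokens)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))  # argmax = greedy
    return tokens

print(greedy_decode([0], 5))  # [0, 1, 2, 3, 0, 1]
```

Greedy decoding picks the locally best token at each step; it is not guaranteed to find the globally most likely sequence, which is what motivates alternatives like beam search.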
What is this Temperature for a Large Language Model?
634 views · 1 month ago
How do LLMs generate text in a stochastic manner?
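A sketch of temperature sampling (the logit values are arbitrary): dividing the logits by the temperature before the softmax makes low temperatures near-greedy and high temperatures near-uniform:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, seed=0):
    """Scale logits by 1/T, softmax, then sample one token index."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token = random.Random(seed).choices(range(len(logits)), weights=probs)[0]
    return token, probs

logits = [2.0, 1.0, 0.1]
_, cold = sample_with_temperature(logits, temperature=0.1)   # nearly deterministic
_, hot = sample_with_temperature(logits, temperature=10.0)   # nearly uniform
print(round(cold[0], 3), round(hot[0], 3))
```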
From Words to Tokens: The Byte-Pair Encoding Algorithm
495 views · 1 month ago
Why do we keep talking about "tokens" in LLMs instead of words? It happens to be much more efficient to break the words into sub-words (tokens) for model performance!
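A toy version of the BPE training loop, which repeatedly merges the most frequent adjacent symbol pair (simplified; real tokenizers handle bytes, word boundaries, and much larger corpora):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = [list(w) for w in words]    # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = max(pairs.items(), key=lambda kv: kv[1])  # most frequent pair
        merges.append(a + b)
        new_corpus = []
        for symbols in corpus:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    merged.append(a + b)   # apply the merge everywhere it occurs
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["lower", "lowest", "low", "slow"], num_merges=2)
print(merges)  # ['lo', 'low']
```

Frequent sub-words like "low" become single tokens, which is why tokenizers compress common patterns so well.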
The Multi-head Attention Mechanism Explained!
917 views · 1 month ago
What ML Engineer Are You? How To Present Yourself On Your Resume
295 views · 1 month ago
For any engineering domain, hiring managers will typically look at two sets of skills: technical skills and leadership skills.
Understanding How Vector Databases Work!
15K views · 1 month ago
Today, we dive into the subject of vector databases. Those databases are often used in search engines by using the vector representations of the items we are trying to search. We dig into the different algorithms that allow us to search for vectors among billions or trillions of documents.
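At its core, the search is nearest-neighbor lookup over vectors. Here is an exact brute-force sketch (real vector databases use approximate indexes such as HNSW or IVF to scale; the documents and vectors below are made up):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def search(query, index, k=2):
    """Exact top-k search by cosine similarity over every stored vector."""
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc_cat": [0.9, 0.1, 0.0],
    "doc_dog": [0.8, 0.2, 0.1],
    "doc_car": [0.0, 0.1, 0.9],
}
print(search([1.0, 0.0, 0.0], index, k=2))  # ['doc_cat', 'doc_dog']
```

Brute force is O(N) per query, which is exactly what breaks at billions of documents and why approximate indexes exist.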
Understanding the Self-Attention Mechanism in 8 min
1.2K views · 2 months ago
Getting a Job in AI: The Different ML Jobs
252 views · 2 months ago
Revolutionizing Education with AI: Personalized Learning, Model Challenges, and Finance Insights
267 views · 7 months ago
Exploring Data Science Careers and Potential of Large Language Models
170 views · 7 months ago
Unlocking AI's Secrets: Career Journeys, Challenges, and the Future
212 views · 8 months ago
Working in AI as a Software Engineer!
267 views · 8 months ago
Let's Talk about AI with Etienne Bernard!
230 views · 9 months ago
Not really. I'm a US citizen who's been all over Europe. I'd say it's the same.
How long have you lived in Europe and what countries exactly?
nice content, keep it up!
Thanks, will do!
very clear explanation. thanks
I like your channel
Thank you!
I've got it now. I wonder why we can't calculate the x gradient by starting the backward pass closer to x instead of going through all the activations.
I am not sure I understand the question.
One of the best explanations on YouTube. Substantively and visually at the highest level :) Are you able to share those slides, e.g. via Git?
I cannot share the slide but you can see the diagrams in my newsletter: newsletter.theaiedge.io/p/understanding-the-self-attention
❤
too good
great video
good video
Phenomenal visuals and explanations. Best video on this concept I've ever seen.
I am liking reading that!
Is it rnn 😅
Love the way you teach every point please start teaching this way
More good content indeed good one❤
💯💯💯
Thank you for your videos
I will use your videos as an interview refresher... It is so easy to forget the details when everyday work floods in for a period of years.
I am glad to read that!
Thanks, I forgot some details about Gradient Boosted Algorithm and I was too lazy to look it up.
Please make more videos
Well I do!
Thank you. Can you explain the entire self-attention flow? (from positional encoding to final next-word prediction). I think it will be an entire series 😅
It is coming! It will take time
Thank you, Damien!!
Very good advice ❤
Awesome
Excellent!! Very good explanation. I need to work on my ear for French. But pausing and backing up the video helped. Great stuff!!
My accent + my speaking skills are my weaknesses. Working on it and I think I am improving!
@@TheMLTechLead Thanks for your reply but absolutely no apology necessary!! I think it is an excellent video and helpful information. Much appreciation for posting. I am a professor in a business school and always looking for insights into how to teach the technical side of technology in the context of business. Your explanation has been very helpful.
It's really good and useful... Looking forward to training an LLM from scratch next, and interested in KAN-FORMER...
ML is a black box but boosting seems to be more interpretable (potentially) if we can make the trees more sparse and orthogonal
Tree-based methods can naturally be used to measure Shapley values without approximation: shap.readthedocs.io/en/latest/tabular_examples.html
Do you mean that the new tree is predicting the error? In that case, wouldn't you subtract the new prediction from the previous predictions?
So we have an ensemble of trees F that predicts y such that F(x) = \hat{y}. The error is y - F(x) = e. We want to add a tree that predicts the error T(x) = \hat{e} = e + error = y - F(x) + error. Therefore F(x) + T(x) = y + error
share the resources in description
Why
“Give me the exam solutions pls”
Thank you for this video !
Hi, we should subtract the target from the cumulative sum, right? I didn't understand where you did it.
It is in the script shown in the video and I am not adding the target of the current row, which is equivalent to subtracting it if I were to add it.
@@TheMLTechLead Understood! Thanks for replying.
Can you help me understand how and why positional embeddings are effective in transformers (vision or text)? Can't the model just learn that through its existing weights? How does adding extra positional embeddings to the vision/text embeddings help? Even if we have a unique vector for each position, when we add those to the text embeddings the result won't be unique. Would the result after addition even have useful information, since we can get the same sums from multiple combinations? Let's say we have a text model with an input limit of only two tokens and an embedding size of 3. Text embeddings: [0, 1.1, 0.3], [0, 0.1, 1.3]; position embeddings: [0, 0, 1], [0, 1, 0]; embeddings after addition: [0, 1.1, 1.3], [0, 1.1, 1.3]. We get the same vectors. Is the magic in the actual function that we use for embeddings, or is it just empirically better and we can't fully understand it?
Inside the model, we compute the self-attentions. They are pretty much just a measure of interaction between the different tokens in the input sequence. Inside the attention layer, we have the queries, the keys, and the values. The keys and queries are used to compute the self-attentions, and the resulting hidden state is the weighted average of the values, where we use the attentions as weights. At that point, the order of the tokens is completely lost because we are just summing stuff together without knowing in what order it was before the sum. That is why we keep the position information through the positional encoding. We systematically add the same vector for the same position, so the model starts to understand how that shift relates to that position. The value of the same token varies depending on its position. To be fair, we do it a bit differently in 2024. Video coming!
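For illustration, the classic sinusoidal encoding from the original transformer paper can be written as follows (a sketch; as mentioned, modern models often use other schemes such as RoPE):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: a unique, fixed vector per position, added to token embeddings."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((i // 2) * 2 / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# Position 0 alternates sin(0)=0 and cos(0)=1; every later position gets a distinct vector,
# so the model can recover order information after the permutation-invariant attention sum.
print(pe[0][:2])  # [0.0, 1.0]
```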
@@TheMLTechLead Looking forward to it! After commenting, I read about RoPE (can't say I fully understood it) and learnable positional embeddings. P.S. I really liked your idea of using routing in attention. A bit of an ambitious goal, but I want to use it to train a small language model, or I will see if it is possible to simply add it to a pre-trained model without losing the learned weights.
@@jaskiratbenipal8255 I may not make a video about RoPE but I wrote something about it here: www.linkedin.com/posts/damienbenveniste_most-modern-llms-are-built-using-the-rope-activity-7188571849084096515-mmUk. For the routed self-attentions, I am looking forward to seeing somebody implement those and train a model with it.
@@TheMLTechLead I tried it. I trained a language model from scratch for next-character prediction (to have a small vocabulary). The results were good using normal attention: the model was able to form words and phrases and some gibberish that looked like words. With the routed attention (I tried 0.1 and 0.3 sparsity values), it started to diverge and the model was not converging at all after the first epoch. The training time did decrease from 34 to 24 mins.
Very nice video, thanks. The second type of NLP Engineer falls under the emerging AI Engineer role. An AI Engineer can use APIs to develop NLP as well as CV based applications (e.g. using stable diffusion APIs)
accent is tough to understand
That is my weakness!
@@TheMLTechLead It's not that bad. As an English speaker who knows/speaks French, I can tell you are French. It's not hard to hear and understand what you are saying. I'd say keep going and make any improvements along the way.
Not true at all. I am watching at 1.5x and can still understand everything.
Thanks @TheMLTechLead! You did a great job in summarizing the key ideas.
Awesome work. Thanks so much.
Thank you so much for this. You don't know how badly I needed this right now. Please extend this series to transformers, and possibly to a full LLM as well.
Lovely video, helpful for me in getting started with vector DBs.
Your way of explaining using animation is very good. Please remove the background sound, which is distracting.
Ok! Yeah I figured it was annoying!
So this is basically used for classification? For example, cat and dogs, right?
It can be used for any computer vision ML task.
@TheMLTechLead Great! I was thinking of image generation for a given prompt or user input... what would the process be?
Oh no, you would need a very different model for that. Although, the vision transformer can be an element of it.
@@TheMLTechLead got you!
Good job.
Thanks for your work. What you're doing is really great.
Great intro to XGBoost theory! Not being new to the subject, in the past I used the official docs when I needed to refresh my knowledge. I will be using your video now. Thank you, Damien!
so far the best teacher .....if possible I would love to join you ...
If I need somebody, I know where to look!
Great explanations! Easy to understand.
What's the difference between this algorithm and Boosting as explained in Hastie & Tibshirani's book published in 2013 (first edition)? It does seem the same.
Why do you expect them to be different?
Maybe you are asking what the difference is between boosting in general and gradient boosting in particular? To be fair, my video is not going deep enough to highlight the differences. In a coming video, I am going to go into the details of how XGBoost works, and I believe that should clear up the confusion.
Very good and intuitive explanation of the algorithm. Thank-you!
Informative! 🙏
Absolute masterpiece
These recent videos are great! They are short and to the point, with clear explanations!
Happy to hear that!
One question: when adding T into the numerator, do we also add it into ‘C’, the denominator?
Absolutely!
Great explanation