Stanford CS229: Machine Learning - Linear Regression and Gradient Descent | Lecture 2 (Autumn 2018)

  • Published on 28 Sep 2024

Comments • 278

  • @krishyket 1 year ago +356

    Dude is a multi-millionaire and took valuable time meticulously teaching students and us. Legend.

    • @The_Quaalude 8 months ago +30

      Bro needs to train his future employees

    • @vikram-aditya 8 months ago +7

      yes bro. i think the more people with the knowledge, the faster the breakthroughs in the field

    • @clerpington_the_fifth 7 months ago +5

      ...and FOR FREE.

    • @SaidurRahman-c8w 2 months ago +14

      To people like him, money is really irrelevant. These people are really the top 0.00001 of people in the world; all that matters to them is how they can contribute to their respective field and help make this world a better place. Money is just a by-product of that passion.

  • @calvin_713 1 year ago +65

    This course saves my life! The lecturer of the ML course I'm attending rn is just going thru those crazy math derivations, presuming that all the students have mastered it all before😂

    • @mahihoque4598 3 months ago +1

      My man was teaching as if these top-percentile brains had forgotten simple partial differentiation, and ours just don't even care😢

    • @mees8711 6 days ago

      This sounds like you're taking the same course I am taking lmao

  • @Lachipiedubinks 14 days ago +1

    This professor is so kind, clear and patient... The kind of professor I wanna be

  • @dimensionentangled4514 2 years ago +74

    We define a cost function based on the sum of squared errors. The job is to minimise this cost function with respect to the parameters. First, we look at (batch) gradient descent. Second, we look at stochastic gradient descent, which does not give us the exact value at which the minimum is achieved; however, it is much more efficient when dealing with big data. Third, we look at the normal equation. This equation directly gives us the value at which the minimum is achieved! Linear regression is one of the few models for which such an equation exists.

    • @xxdxma6700 2 years ago +12

      I wish you sat next to me in class 😂

    • @rajvaghasia9942 2 years ago +1

      Bro who named that equation as normal equation?

    • @alessandroderossi8930 2 years ago +6

      @@rajvaghasia9942 The name "normal equation" comes from the idea of perpendicularity ("normal to something" means "perpendicular to something"). The normal equation describes the projection of the sampled data onto the line I draw as a starting point (in the case of LINEAR regression). This projection carries information about the distances between the real (sampled) data and my "starting line", so to find the best-fitting line I have to find the weights and bias (Theta0, Theta1 and so on in this video) that minimize this distance. You can minimize it using gradient descent (too costly), stochastic gradient descent (using a set of partial derivatives rather than computing the full gradient of the loss function), or the normal equations. Understand? Here is an image from Wikipedia that makes this clearer (the green lines are the distances in question): en.wikipedia.org/wiki/File:Linear_least_squares_example2.svg

    • @JDMathematicsAndDataScience 2 years ago

      @@rajvaghasia9942 because we're in the matrix now bro! ha. For real though. It's about the projection matrix and the matrix representation/method of acquiring the beta coefficients.
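One common way to write down the perpendicularity and projection idea from the replies above, assuming the design matrix $X$, target vector $y$ and parameter vector $\theta$ used in the lecture: at the least-squares solution, the residual $y - X\theta$ is orthogonal (normal) to every column of $X$. This is offered as a sketch of the usual textbook explanation of the name, not as something stated in the video.

```latex
X^{\top}\,(y - X\theta) = 0
\quad\Longleftrightarrow\quad
X^{\top}X\,\theta = X^{\top}y
```

Read this way, $X\theta$ is the projection of $y$ onto the column space of $X$, which is where the projection matrix mentioned above comes in.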

    • @JDMathematicsAndDataScience 2 years ago

      I have been wondering why we need such an algorithm when we could just derive the least squares estimators. Have you seen any research comparing the gradient descent method of selection of parameters with the typical method of deriving the least squares estimators of the coefficient parameters?
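As a rough illustration of the comparison raised in this thread, the sketch below fits a one-feature linear model two ways: with batch gradient descent and with the closed-form normal equation (the least-squares estimator). The toy data, the learning rate `alpha`, the iteration count and the function names are all assumptions made up for this example; none of them come from the lecture.

```python
# Hypothetical toy data: y = 1 + 2x exactly, so the optimum is theta = (1, 2).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def batch_gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Minimise J(theta) = 1/2 * sum((theta0 + theta1*x - y)^2) iteratively."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Every step scans the whole training set once (one "batch").
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors)
        grad1 = sum(e * x for e, x in zip(errors, xs))
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


def normal_equation(xs, ys):
    """Solve X^T X theta = X^T y directly for the two-parameter case."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx  # zero only if all x values are identical
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1


gd_theta = batch_gradient_descent(xs, ys)
ne_theta = normal_equation(xs, ys)
print(gd_theta, ne_theta)
```

On this small, noiseless data set both routes should land on essentially the same parameters. The practical trade-off is cost: the closed form needs a linear solve that grows quickly with the number of features, while gradient descent only needs repeated passes over the data.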

  • @imad1996 1 year ago +8

    We learn, and teachers give us the information in a way that can help stimulate our learning abilities. So, we always appreciate our teachers and the facilities contributing to our development. Thank you.

  • @k-bobmakabaka4420 1 year ago +351

    when u paying 12k to your own university a year just so you can look up a course from a better school for free

    • @paulushimawan5196 1 year ago +6

      University needs to cost as little as possible.

    • @_night_spring_ 1 year ago +12

      while YouTube has unlimited free information and courses better than the tech universities and colleges 🙂

    • @Call-me-Avi 1 year ago +1

      Hahahahaahaha fucking hell thats what i am doing right fucking now.

    • @preyumkumar7404 9 months ago

      which uni is that...

    • @k-bobmakabaka4420 9 months ago

      @@preyumkumar7404 University of Toronto

  • @deepakbastola6302 1 month ago

    Dr. Ng is always the best. Keep motivating us with such classes.

  • @jaeen7665 6 months ago

    One of the greats, a legend in AI & Machine Learning. Up there with Prof. Strang and Prof LeCun.

  • @Eric-zo8wo 1 year ago +226

    0:41: 📚 This class will cover linear regression, batch and stochastic gradient descent, and the normal equations as algorithms for fitting linear regression models.
    5:35: 🏠 The speaker discusses using multiple input features, such as size and number of bedrooms, to estimate the price of a house.
    12:03: 📝 The hypothesis is defined as the sum of features multiplied by parameters.
    18:40: 📉 Gradient descent is a method to minimize a function J of Theta by iteratively updating the values of Theta.
    24:21: 📝 Gradient descent is a method used to update values in each step by calculating the partial derivative of the cost function.
    30:13: 📝 The partial derivative of a term with respect to Theta J is equal to XJ, and one step of gradient descent updates Theta J
    36:08: 🔑 The choice of learning rate in the algorithm affects its convergence to the global minimum.
    41:45: 📊 Batch gradient descent is a method in machine learning where the entire training set is processed as one batch, but it has a disadvantage when dealing with large datasets.
    47:13: 📈 Stochastic gradient descent allows for faster progress in large datasets but never fully converges.
    52:23: 📝 Gradient descent is an iterative algorithm used to find the global optimum, but for linear regression, the normal equation can be used to directly jump to the global optimum.
    58:59: 📝 The derivative of a matrix function with respect to the matrix itself is a matrix with the same dimensions, where each element is the derivative with respect to the corresponding element in the original matrix.
    1:05:51: 📝 The speaker discusses properties of matrix traces and their derivatives.
    1:13:17: 📝 The gradient of J of Theta is one-half times the gradient with respect to Theta of (Theta transpose X transpose minus y transpose) times (X Theta minus y).
    Recap by Tammy AI
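For reference, the cost function and the batch update summarised in the 18:40 to 30:13 entries can be written as follows. This assumes the lecture's convention of a 1/2 factor in front of the sum, $m$ training examples, hypothesis $h_\theta$ and learning rate $\alpha$.

```latex
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2,
\qquad
\theta_j := \theta_j - \alpha \sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}
```

The sum over all $m$ examples in every step is what makes this the batch version; the stochastic version applies the same update using a single example at a time.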

    • @Lucky-vm9dv 1 year ago +7

      How much we have to pay for your valuable overview on the entire class?
      Kudos to your efforts 👍

    • @MLLearner 4 months ago +1

      Thank you so much 👍🫡

    • @sarkersaadahmed 3 months ago +1

      Legend

    • @surajr4757 2 months ago +2

      @@Lucky-vm9dv Bro didn't read the last line, Recap by Tammy AI🙂

  • @adeelfarooq6319 1 month ago +1

    Linear regression and gradient descent are introduced as the first in-depth learning algorithm. The video covers the hypothesis representation, cost function, and optimization using batch and stochastic gradient descent. The normal equation is also derived as an efficient way to fit linear models.
    Highlights:
    00:11 Linear regression is a fundamental learning algorithm in supervised learning, used to fit models like predicting house prices. The algorithm involves defining hypotheses, parameters, and training sets to make accurate predictions.
    -Supervised learning involves mapping inputs to outputs, like predicting house prices based on features. Linear regression is a simple yet powerful algorithm for this task.
    -In linear regression, hypotheses are defined as linear functions of input features. Parameters like theta are chosen by the learning algorithm to make accurate predictions.
    -Introducing multiple input features in linear regression expands the model's capabilities. Parameters like theta are adjusted to fit the data accurately.
    13:01 Linear regression involves choosing parameters Theta to minimize the squared difference between the hypothesis output and the actual values for training examples, achieved through a cost function J of Theta. Gradient descent is used to find the optimal Theta values for minimizing J of Theta.
    -Explanation of input features X and output Y in linear regression, highlighting the importance of terminology and notation in defining hypotheses.
    -Defining the cost function J of Theta in linear regression as the squared difference between predicted and actual values, leading to the minimization of this function to find optimal parameters.
    -Introduction to gradient descent as an algorithm used to minimize the cost function J of Theta and find the optimal parameters for linear regression.
    18:47 Gradient descent is a method used to minimize a function by iteratively adjusting parameters. It involves taking steps in the direction of steepest descent to reach a local optimum.
    -Visualization of gradient descent involves finding values for Theta to minimize J of Theta, representing a 3D vector in 2D space.
    -Gradient descent algorithm involves updating parameters Theta using the learning rate and the partial derivative of the cost function with respect to Theta.
    -Determining the learning rate in practice involves starting with a common value like 0.01 and adjusting based on feature scaling for optimal function minimization.
    27:26 Understanding the partial derivative in gradient descent is crucial for updating parameters efficiently. The algorithm iterates through training examples to find the global minimum of the cost function, adjusting Theta values accordingly.
    -Explanation of the partial derivative calculation in gradient descent and its importance in updating parameters effectively.
    -Expanding on the concept of gradient descent with multiple training examples and the iterative process of updating Theta values for convergence.
    -Illustration of how the cost function J of Theta behaves in linear regression models, showing a quadratic function without local optima, aiding in efficient parameter optimization.
    36:30 Gradient descent is a key algorithm in machine learning, adjusting parameters to minimize errors. It's crucial to choose the right learning rate to efficiently converge.
    -Visualizing gradient descent with data points and parameter adjustments helps understand the algorithm's progression.
    -Batch gradient descent processes the entire dataset at once, suitable for small datasets but inefficient for large ones due to extensive computations.
    -The limitations of batch gradient descent in handling big data sets due to the need for repeated scans, leading to slow convergence and high computational costs.
    44:58 Stochastic gradient descent updates parameters using one training example at a time, making faster progress on large datasets compared to batch gradient descent, which is slower but more stable.
    -Comparison of stochastic and batch gradient descent. Stochastic is faster on large datasets but doesn't converge, while batch is slower but more stable.
    -Mini-batch gradient descent. Using a subset of examples for faster convergence compared to one at a time in stochastic gradient descent.
    -Importance of decreasing learning rate. Reducing steps size in stochastic gradient descent for smoother convergence towards the global minimum.
    53:39 The normal equation provides a way to find the optimal parameters in linear regression in one step, leading to the global optimum without iterative algorithms. Linear algebra notation simplifies deriving the normal equation and matrix derivatives for efficient computation.
    -The normal equation streamlines finding optimal parameters in linear regression, bypassing iterative methods for quick convergence to the global optimum.
    -Utilizing matrix derivatives and linear algebra notation simplifies the derivation process, reducing complex computations to a few lines for efficiency.
    -Understanding matrix functions mapping to real numbers and computing derivatives with respect to matrices enhances algorithm derivation and optimization in machine learning.
    1:03:52 The video explains the concept of the trace of a matrix, its properties, and how it relates to derivatives in matrix calculus, providing examples and proofs. It also demonstrates how to express a cost function in matrix vector notation for machine learning optimization.
    -Properties of the trace of a matrix are discussed, including the fact that the trace of a matrix is equal to the trace of its transpose, and the cyclic permutation property of the trace of matrix products.
    -The video delves into the derivative properties of the trace operator in matrix calculus, showcasing how the derivative of a function involving the trace of a matrix can be computed and proven.
    -The concept of expressing a cost function in matrix vector notation for machine learning optimization is explained, demonstrating how to set up the design matrix and compute the cost function using matrix operations.
    1:15:15 The video explains the normal equations in linear regression, where the derivative is set to 0 to find the optimum Theta value using matrix derivatives, leading to X transpose X Theta equals X transpose y.
    -Explanation of the normal equations in linear regression and setting the derivative to 0 to find the optimal Theta value using matrix derivatives.
    -Addressing the scenario of X transpose X being non-invertible due to redundant features and the solution using the pseudo inverse for linearly dependent features.
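The derivation outlined in the last few highlights can be condensed into a few lines. This is a sketch in the lecture's notation, assuming the design matrix $X$ stacks the training inputs as rows and $y$ stacks the targets.

```latex
\begin{aligned}
J(\theta) &= \tfrac{1}{2}\,(X\theta - y)^{\top}(X\theta - y) \\
\nabla_\theta J(\theta) &= X^{\top}X\,\theta - X^{\top}y \\
\nabla_\theta J(\theta) = 0 \;&\Longrightarrow\; X^{\top}X\,\theta = X^{\top}y \\
\theta &= (X^{\top}X)^{-1}X^{\top}y \quad \text{(when } X^{\top}X \text{ is invertible)}
\end{aligned}
```

When $X^{\top}X$ is singular, for example because two features are linearly dependent, the normal equations still have solutions, and the pseudo-inverse picks one of them: $\theta = X^{+}y$.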

  • @claudiosaponaro4565 1 year ago +2

    the best professor in the world.

  • @polymaththesolver5721 1 year ago +3

    Thank you Stanford for this amazing resource. Pls can I get a link to the lecture notes? Thanks

  • @raymundovazquezmusic216 1 year ago +9

    Can you update the lecture notes and assignments in the website for the course? Most of the links to the documents are broken

    • @stanfordonline 1 year ago +24

      Hi there, thanks for your comment and feedback. The course website may be helpful to you cs229.stanford.edu/ and the notes document docs.google.com/spreadsheets/d/12ua10iRYLtxTWi05jBSAxEMM_104nTr8S4nC2cmN9BQ/edit?usp=sharing

    • @adi29raj 1 year ago +3

      @@stanfordonline Where can I access the problem sets?

    • @salonisingla1665 1 year ago +2

      @@stanfordonline Please post this in the description to every video. Having this in an obscure reply to a comment will only lead to people missing it while scrolling.

  • @tanmayshukla8660 1 year ago +1

    Why do we take the transpose of each row, wouldn't it be stacking columns on top of each other?
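A possible way to see the answer, assuming the lecture's convention that each training input $x^{(i)}$ is a column vector of length $n+1$ (with $x_0 = 1$): transposing it turns it into a row, and the design matrix stacks those rows, one per training example.

```latex
X =
\begin{bmatrix}
(x^{(1)})^{\top} \\
(x^{(2)})^{\top} \\
\vdots \\
(x^{(m)})^{\top}
\end{bmatrix}
\in \mathbb{R}^{m \times (n+1)},
\qquad
X\theta =
\begin{bmatrix}
(x^{(1)})^{\top}\theta \\
\vdots \\
(x^{(m)})^{\top}\theta
\end{bmatrix}
=
\begin{bmatrix}
h_\theta(x^{(1)}) \\
\vdots \\
h_\theta(x^{(m)})
\end{bmatrix}
```

Stacking the untransposed column vectors on top of each other would instead give one long vector of length $m(n+1)$, and the product $X\theta$ would no longer yield one prediction per example.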

  • @HeisenbergHK 8 months ago +1

    Where can I find the notes and other videos and any material related to this class!?

  • @riajulchowdhury4218 2 months ago +1

    Where can I get the lecture notes? I can't access the files in the website.

  • @wonggran9983 2 years ago +3

    Fred has a one-hundred-sided die. Fred rolls the die
    once and gets side i. Fred then rolls the die again,
    a second roll, and gets side j, where side j is not side i.
    What is the probability of this event e? Assume the
    one hundred sides of the die all have
    an equal probability of facing up.

    • @Tryingitoutletsee 2 years ago +1

      1 - (1/10000) = 9999/10000

    • @ahmettolgakarabulut9380 2 years ago

      the probability of getting the same results for two rolls and they are both defined is 1/10000. So that we will subtract that from 1

    • @billr5842 1 year ago

      Wouldn't it be 99/100? The first roll can be any number so it doesn't really matter what's there. The second roll just needs to be one of the other 99 numbers. The first roll doesn't really change the probability. Of course, I barely know any math so I'm no expert lol

    • @emirkisa 1 year ago

      @@billr5842 you're right, the probability calculated above as 1/10000 is the probability of getting the same result for a "specific side", like getting "side 3" twice. But there are 100 different sides that each have a 1/10000 probability of occurring twice, so the probability 1/10000 is multiplied by the number of sides, 100, which makes the probability of getting the same result on both rolls equal to 1/100. Then 1 - 1/100 = 99/100
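The 99/100 answer can be checked by brute force. The sketch below enumerates all ordered pairs of rolls for a fair 100-sided die and counts those where the second roll differs from the first; the variable names are made up for this example.

```python
from fractions import Fraction
from itertools import product

SIDES = 100

# All equally likely ordered outcomes (first roll, second roll).
outcomes = list(product(range(1, SIDES + 1), repeat=2))

# Event e: the second roll shows a different side from the first.
favourable = sum(1 for i, j in outcomes if i != j)

probability = Fraction(favourable, len(outcomes))
print(probability)  # 99/100
```

There are 10,000 ordered pairs, of which 100 are doubles, leaving 9,900 favourable pairs.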

  • @TheLastStand226 1 year ago +2

    The notes from the description seem to have vanished. Does anyone have them?

  • @Jewishisgreat 1 year ago

    Knowledge is power

  • @putinscat1208 1 year ago

    I asked ChatGPT how to learn machine learning. #1 Coursera:
    Course: "Machine Learning" by Andrew Ng (Stanford University)

  • @samsondawit 1 year ago +1

    why is it that the cost function has the constant 1/2 before the summation and not 1/2m?

    • @ihebbibani7122 1 year ago +3

      I think it's because he is taking one learning example and not m learning examples

    • @samsondawit 1 year ago

      @@ihebbibani7122 ah I see

  • @MikeSieko17 10 months ago

    why even go to uni, wtf this is so much better than my lectures and it's free and it's recorded lmao wtf unis be doing they are dying fr

  • @lyndonyang1269 3 months ago

    anyone knows where to access the homework assignments as practice?

  • @faisalhussain4022 10 months ago +1

    Wondering if lecture notes are also available to download from somewhere ?

    • @williambrace6885 8 months ago +4

      hey bro I found them: cs229.stanford.edu/lectures-spring2022/main_notes.pdf

    • @kag46 8 months ago

      @@williambrace6885 thanks a lot!

  • @ChidinmaOnyeri 4 months ago

    Hi. Can anyone recommend any textbook that can help in further study of this course.
    Thank you

  • @ravishankar-k4l2n 8 months ago

    where do I find the lecture notes? Help

  • @chhaysith 1 year ago +1

    Dear Dr. Andrew, I saw your other video where the linear regression cost function is scaled by 1/2m, but in this video it is 1/2. What is the difference? (footnote 16:00)

    • @treqemad 1 year ago +1

      I don't really understand what you mean by 1/2m. However, from my understanding, the 1/2 is just for simplicity: when taking the derivative of the cost function, the power of 2 comes down as a multiplier and is cancelled by the half.

    • @googgab 1 year ago

      It should be 1/2m where m is the size of the data set. That's because we'd like to take the average sum of squared differences and not have the cost function depend on the size of the data set m
      th-cam.com/video/ZzeDtSmrRoU/w-d-xo.html
      He explains it here at 6:30 minutes

    • @aman-qj5sx 1 year ago

      @@googgab It should be ok if J depends on m since m isn't changing?

    • @labiditasnim623 1 year ago

      same question
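A short way to compare the two conventions discussed in this thread, assuming the training set size $m$ is fixed: dividing the cost by $m$ rescales it by a positive constant, so the minimising $\theta$ does not change.

```latex
J_{\text{avg}}(\theta)
= \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
= \frac{1}{m}\,J(\theta),
\qquad
\arg\min_\theta J_{\text{avg}}(\theta) = \arg\min_\theta J(\theta)
```

The gradient is rescaled by the same factor, so a gradient descent step on $J_{\text{avg}}$ with learning rate $\alpha$ matches a step on $J$ with learning rate $\alpha/m$. The averaged form mainly makes a given learning rate behave more consistently across data sets of different sizes.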

  • @sivavenkateshr 1 year ago

    This is really cool. ❤

  • @Nevermind1000 4 months ago

    Anyone know where to get the lecture notes for the lecture

    • @fordownload9611 4 months ago +1

      just search Stanford machine learning notes; one of the first results will be a PDF from the cs229 website

  • @ganeshgummadi7301 1 year ago

    54:13 Normal Equation

  • @gaurav210 29 days ago

    12:57
    44:00
    54:00

  • @7takeo 2 years ago

    Can we have class outside?

  • @R9000S 1 month ago

    The voice at 50:38, how is that possible?

    • @tripcodee 19 days ago

      lmfao I was gonna comment it

    • @tripcodee 19 days ago

      it's probably microphone glitch (hopefully)

  • @wishIKnewHowToLove 1 year ago

    why are the background voices like that? i feel like im in backrooms...

  • @kitsaruna 6 months ago +1

    Anyone learning here with me ...... Can we join for better understanding of concepts , a little discussion would be nice

    • @manishcicada 5 months ago

      Where are you from?

    • @kitsaruna 5 months ago

      @@manishcicada India , manish ji

    • @Ajay.m-sc1vc 4 months ago

      You must have finished the course by now

  • @cbpuzzle 1 year ago

    If I was taking this course, I would be sh1tt1ng my pants thinking about how abstract the midterm will be.

  • @waringrob 4 months ago

    He suddenly says he is going to do x. Eg minimize theta J but doesn’t tell us why. He just assumes we would know why. There are sooooooooo many times he does this. So within 30 seconds I get lost. Watch for five more mins and have absolutely no idea what he is doing or why. A good teacher tells us WHY he is doing stuff.

  • @ujjawalx7460 2 months ago

    27:03

  • @AryanSharma-dh4fb 1 year ago

    27:00

  • @JoonHaSong-z1x 2 years ago

    39:46 for tmr

  • @VloggySaurav 7 days ago

    Straight to the point, no nonsense

  • @fahyen6557 1 year ago +8

    why do all the students sound like darth vader

  • @manudasmd 1 year ago +94

    Feels like sitting in stanford classroom from india ...Thanks stanford. you guys are best

    • @gurjotsingh3726 11 months ago +7

      for real bro, me sitting in panjab, would have never come across how the top uni profs are, this is surreal.

    • @hamirmahal 7 months ago +4

      @@gurjotsingh3726 Sat sri akaal, good luck

  • @DagmawiAbate 1 year ago +32

    I am not good at math anymore, but I think math is simple if you get the right teachers like you. Tnks.

  • @토스트-d3r 1 year ago +46

    8:50 notations and symbols
    13:08 how to choose theta
    17:50 Gradient descent

    • @dens3254 1 year ago +3

      52:50 Normal equations

  • @zzh315 7 months ago +9

    "Wait, AI is just math?"
    "Always has been"

  • @LuisFuentes98 1 year ago +57

    Hey can I point out what an amazing teacher professor Andrew is?!
    Also, I love how he is all excited about the lesson he is giving! It just makes me feel even more interested in the subject.
    Thanks for this awesome course!

    • @tanishsharma136 1 year ago +2

      Look at Coursera, he founded that and has many free courses.

  • @abhishekagrawal896 4 months ago +16

    🎯 Key points for quick navigation:
    00:03 *🏠 Introduction to Linear Regression*
    - Linear regression is a learning algorithm used to fit linear models.
    - Motivation for linear regression is explained through a supervised learning problem.
    - Collecting a dataset, defining notations, and building a regression model are important steps.
    04:04 *📊 Designing a Learning Algorithm*
    - The process of supervised learning involves inputting a training set and outputting a hypothesis.
    - Key decisions in designing a machine learning algorithm include defining the hypothesis representation.
    - Understanding the workflow, dataset, and hypothesis structure is crucial in creating a successful learning algorithm.
    07:19 *🏡 Multiple Features in Linear Regression*
    - Introducing multiple input features in linear regression models.
    - The importance of adding additional features like the number of bedrooms to enhance prediction accuracy.
    - Notation, such as defining a dummy feature for simplifying hypotheses, is explained.
    13:03 *🎯 Cost Function and Parameter Optimization*
    - Choosing parameter values Theta to minimize the cost function J of Theta.
    - The squared error is used in linear regression as a measure of prediction accuracy.
    - Parameters are iteratively adjusted using gradient descent to find the optimal values for the model.
    24:18 *🧮 Linear Regression: Gradient Descent Overview*
    Explanation of gradient descent in each step:
    - Update Theta values for each feature based on the learning rate and partial derivative of the cost function.
    - Learning rate determination for practical applications.
    - Detailed explanation of the derivative calculation for one training example.
    27:11 *📈 Gradient Descent Algorithm*
    Derivation of the partial derivative with respect to Theta.
    - Calculating the partial derivative for a simple training example.
    - Update equation for each step of gradient descent using the calculated derivative.
    33:11 *📉 Optimization: Convergence and Learning Rate*
    Concepts of convergence and learning rate optimization in gradient descent:
    - Explanation of repeat until convergence in gradient descent.
    - Impact of learning rate on the convergence speed and efficiency.
    - Practical approach to determining the optimal learning rate during implementation.
    41:22 *📊 Batch Gradient Descent vs. Stochastic Gradient Descent*
    Comparison between batch gradient descent and stochastic gradient descent:
    - Description of batch gradient descent processing the entire training set in one batch.
    - Introduction to stochastic gradient descent processing one example at a time for parameter updates.
    - Illustration of how stochastic gradient descent takes a slightly noisy path towards convergence.
    47:22 *🏃 Stochastic Gradient Descent vs. Batch Gradient Descent*
    - Stochastic gradient descent is used more in practice with very large datasets.
    - Mini-batch gradient descent is another algorithm that can be used with datasets that are too large for batch gradient descent.
    - Stochastic gradient descent is often preferred due to its faster progress in large datasets.
    53:01 *📉 Derivation of the Normal Equation for Linear Regression*
    - The normal equation allows for the direct calculation of optimal parameter values in linear regression without an iterative algorithm.
    - Deriving the normal equation involves taking derivatives, setting them to zero, and solving for the optimal parameters theta.
    - Matrix derivatives and linear algebra notation play a crucial role in deriving the normal equation.
    57:52 *🧮 Matrix Derivatives and Trace Operator*
    - The trace operator allows for the sum of diagonal entries in a matrix.
    - Properties of the trace operator include the trace of a matrix being equal to the trace of its transpose.
    - Derivatives with respect to matrices can be computed using the trace operator for functions mapping to real numbers.
    01:12:49 *📈 Linear Regression Derivation Summary*
    - Deriving the gradient for the cost function J(Theta) involves taking the derivative of a quadratic function.
    01:15:19 *🧮 Deriving the Normal Equations*
    - Setting the derivative of J(Theta) to 0 leads to the normal equations X^T X Theta = X^T y.
    - Using matrix derivatives helps simplify the final equation for Theta.
    01:17:09 *🔍 Dealing with Non-Invertible X^T X*
    - When X^T X is non-invertible, it indicates redundant features or linear dependence.
    - The pseudo inverse can provide a solution in the case of linearly dependent features.
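To make the stochastic gradient descent entries in the key points above more concrete, here is a minimal sketch that updates the parameters from one training example at a time and slowly shrinks the learning rate. The toy data, the decay schedule, the fixed seed and the function name are assumptions chosen for this illustration; they are not taken from the lecture.

```python
import random

# Hypothetical toy data: y = 1 + 2x exactly, so the optimum is theta = (1, 2).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def stochastic_gradient_descent(xs, ys, alpha0=0.05, epochs=2000, seed=0):
    """Fit h(x) = theta0 + theta1 * x using one example per parameter update."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    step = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a fresh random order
        for i in indices:
            step += 1
            # Slowly decreasing learning rate (an assumed schedule).
            alpha = alpha0 / (1.0 + 0.001 * step)
            error = theta0 + theta1 * xs[i] - ys[i]
            # Gradient of the single-example squared error 1/2 * error^2.
            theta0 -= alpha * error
            theta1 -= alpha * error * xs[i]
    return theta0, theta1


theta0, theta1 = stochastic_gradient_descent(xs, ys)
print(theta0, theta1)
```

Because this toy data is noiseless, the iterates can settle very close to (1, 2). On real, noisy data the parameters would typically keep jittering around the minimum, which is why a decreasing learning rate is often suggested.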

    • @hmm7780 2 months ago +1

      Thanx Bro for this!!

  • @skillato9000 1 year ago +8

    1:01:06 Didn't know Darth Vader attended these lectures

  • @anushka.narsima 1 year ago +28

    Thank you so much Dr. Andrew! It took me some time but your stepwise explanation and notes have given me a proper understanding. I'm learning this to make a presentation for my university club. We all are very grateful!

    • @Amit_Kumar_Trivedi 1 year ago

      Hi I was not able to download the notes, 404 error, from the course page in description. Other PDFs are available on the course page. Are you enrolled or where did you download the notes from?

    • @anushka.narsima 1 year ago +17

      @@Amit_Kumar_Trivedi cs229.stanford.edu/lectures-spring2022/main_notes.pdf

    • @georgenyagura7742 1 year ago +1

      @@anushka.narsima thanks

  • @Honey-sv3ek 2 years ago +28

    I really don't have a clue about this stuff, but it's interesting and I can concentrate a lot better when I listen to this lecture so I like it

    • @AHUMAN5 2 years ago +1

      You can see his lecture on coursera about Machine learning. You will surely get what he is saying in this video.

    • @paulushimawan5196 1 year ago

      @@AHUMAN5 yes, that course is beginner-friendly. Everyone with basic high school math can take that course even without knowledge of calculus.

  • @diegoalias2935 2 years ago +19

    Really easy to understand. Thanks a lot for sharing!

    • @massimovarano407 1 year ago

      sure it is, it is a high school topic, at least in Italy

    • @gustavoramalho9454 1 year ago +10

      @@massimovarano407 I'm pretty sure multivariate calculus is not a high-school topic in Europe

  • @clinkclink7814 1 year ago +5

    Very clear explanations. Extra points for sounding like Stewie Griffin

  • @Z_nix 8 months ago +4

    How come everybody could understand this? The first lecture was very good. But this one... I couldn't understand a single concept. I have watched half of the video and I don't know what Linear Regression means. Professor Andrew Ng was writing all those equations and I was questioning like what's the meaning of this? Why are we doing this? What's the use of this? I opened the comment section and everybody is appreciating the Professor's teaching skills, and then there's me who couldn't understand anything in this video. Where am I going wrong? Why am I different than others?

    • @dimilands 8 months ago +2

      i dont know if i get your question about linear regression, but it is a way to finding a formula that would fit our data most accurately , so that our hypothesis is satisfied (i think)

    • @Z_nix
      @Z_nix 8 months ago +1

      @@dimilands Now I'm learning Machine Learning from somebody else and he is teaching very, very well. Andrew just sucks.

    • @dimilands
      @dimilands 8 months ago

      May I ask who your new 'source' is? @@Z_nix

    • @ponugotimanojkumar
      @ponugotimanojkumar 4 months ago

      @@Z_nix What's your source? I can't understand this lecture either

    • @Z_nix
      @Z_nix 4 months ago

      @@ponugotimanojkumar I'm learning it from the YouTube channel 'Campus X'.

  • @ambushtunes
    @ambushtunes 1 year ago +4

    Attending Stanford University from Nairobi, Kenya.

  • @learnfullstack
    @learnfullstack 1 year ago +3

    if board is full, slide up the board, if it refuses to go up, pull it back down, erase and continue writing on it.

  • @ikramadjissa370
    @ikramadjissa370 2 years ago +8

    Andrew Ng you are the best

  • @jeroenoomen8145
    @jeroenoomen8145 8 months ago +4

    Thank you to Stanford and Andrew for a wonderful series of lectures!

  • @anonymous-3720
    @anonymous-3720 1 year ago +3

    Which book is he using? and where do we find the homework?

  • @parthjoshi5892
    @parthjoshi5892 1 year ago +3

    Would anyone please share the lecture notes? When I click the link for the PDF notes on the course website, it shows an error saying the requested URL was not found on the server. It would really be great if someone could help me find the class notes.

    • @amaia7045
      @amaia7045 7 months ago

      I think i found them here : chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/cs229.stanford.edu/main_notes.pdf

  • @forheuristiclifeksh7836
    @forheuristiclifeksh7836 3 months ago +1

    16:00

  • @i183x4
    @i183x4 1 year ago +13

    8:50 notations and symbols
    13:08 how to choose theta
    17:50 Gradient descent
    8:42 - 14:42 - Terminologies completion
    51:00 - batch
    55:00 problem 1 set
    57:00 for p 0

    • @AshishRaj04
      @AshishRaj04 1 year ago

      The notes are not available on the website?

  • @HarshitSharma-YearBTechChemica
    @HarshitSharma-YearBTechChemica 8 months ago +2

    Does someone know how to get the lecture notes?
    They are not available on Stanford's website.

    • @logeshwaran1537
      @logeshwaran1537 7 months ago

      Same issue for me also...

  • @souravsengupta1311
    @souravsengupta1311 9 months ago +2

    Can't download the course class notes, please look into it

  • @sipraneye70
    @sipraneye70 1 year ago +3

    Where do I get the assignments for this lecture series?

  • @anikdas567
    @anikdas567 4 months ago

    I think he made a mistake when he defined the cost function at 16:17 (for "m" training examples). He just gave 1/2 as the constant, which works fine for 1 training example. But it felt a bit weird to use this for m training examples. It's like we are adding "m" quantities and dividing by 2? Shouldn't it be like an average? I searched Google and it showed the formula for the cost function. It showed 1/(2m) as the factor, which makes sense. The 2 is just a trick so that while differentiating it cancels with the power (which is 2). The 2 in the denominator can be absorbed into the learning rate (alpha), but missing the "m" in the denominator doesn't feel right. Can anyone please confirm or disprove this?

    • @NehaGupta-xw2xg
      @NehaGupta-xw2xg 4 months ago +1

      Oh, thank you so much for pointing this out, I was having a doubt about this

    • @lyricalrohit
      @lyricalrohit 3 months ago +1

      It doesn't matter if we introduce "m" in denominator or not. For a given dataset "m" is a constant value and the way of minimising variance which you mentioned is done by minimizing numerator only. The only contribution "m" and "2" will make is reduction of step size in each iteration which will make the computation longer.
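
One way to check the point made in this thread is a small numerical sketch. It assumes NumPy and a made-up toy dataset; the function name `batch_gradient_descent` and the step sizes are illustrative choices, not anything from the lecture. It minimizes the sum-of-squares cost once with the factor 1/2 and once with 1/(2m), rescaling the learning rate to compensate.

```python
import numpy as np

# Hypothetical toy data: y is roughly 1 + 2x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=50)
X = np.column_stack([np.ones_like(x), x])   # first column is the intercept term x_0 = 1
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)
m = len(y)

def batch_gradient_descent(X, y, alpha, scale, iters=5000):
    """Minimize scale * sum((X @ theta - y)**2) with batch gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2.0 * scale * X.T @ (X @ theta - y)   # gradient of the scaled cost
        theta = theta - alpha * grad
    return theta

# Cost with the factor 1/2, paired with a learning rate divided by m.
theta_half = batch_gradient_descent(X, y, alpha=0.05 / m, scale=0.5)
# Cost with the factor 1/(2m), paired with the undivided learning rate.
theta_avg = batch_gradient_descent(X, y, alpha=0.05, scale=0.5 / m)

print(theta_half)
print(theta_avg)
```

Both runs should land on the same theta, because a positive constant in front of the cost does not move its minimizer; it only rescales the gradient, which can be absorbed into alpha.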

  • @uekiarawari3054
    @uekiarawari3054 1 year ago +1

    Difficult words:
    cost function
    gradient descent
    convex optimization
    hypothesis fx
    target
    j of theta = cost/loss function
    partial derivatives
    chain rule
    global optimum
    batch gradient descent
    stochastic gradient descent
    mini batch gradient descent
    decreasing learning rate
    parameters oscillating
    iterative algorithm
    normal equation
    trace of a

  • @26d8
    @26d8 1 year ago +1

    The partial derivative was incomplete to me. Should we take the derivative 2/2 theta as well? Is that term a constant?
    Shouldn't we go with the product rule?
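
For the derivative step asked about here, a worked sketch for a single training example may help. It assumes the usual definitions, h_θ(x) = Σ_k θ_k x_k and J(θ) = ½ (h_θ(x) − y)². Under those assumptions only the chain rule is needed: the factor 1/2 is a constant, and θ_j appears only inside h_θ(x), so there is no product of two θ-dependent factors.

```latex
\begin{aligned}
\frac{\partial}{\partial\theta_j} J(\theta)
  &= \frac{\partial}{\partial\theta_j}\,\frac{1}{2}\bigl(h_\theta(x)-y\bigr)^2 \\
  &= 2\cdot\frac{1}{2}\,\bigl(h_\theta(x)-y\bigr)\cdot
     \frac{\partial}{\partial\theta_j}\bigl(h_\theta(x)-y\bigr) \\
  &= \bigl(h_\theta(x)-y\bigr)\cdot
     \frac{\partial}{\partial\theta_j}\Bigl(\sum_{k=0}^{n}\theta_k x_k - y\Bigr) \\
  &= \bigl(h_\theta(x)-y\bigr)\,x_j
\end{aligned}
```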

  • @just2579
    @just2579 10 days ago

    can someone please help me with the Last derivative Part @1:15:42
    thank you

  • @vseelix957
    @vseelix957 1 year ago +1

    my machine learning lecturer is so dogshit I thought this unit was impossible to understand. Now following these on study break before midsem and this guy is the best. I'd prefer that my uni just refers to these lectures rather than making their own

  • @Suliyaa_Agri
    @Suliyaa_Agri 4 months ago +1

    Andrew's voice is everything, and that blue shirt of his

  • @nikhithar3077
    @nikhithar3077 7 months ago +1

    39:38 We're subtracting because the cost function decreases fastest when the step points opposite to the gradient, i.e. the two vectors are at 180°. That is where the negative sign comes from.

  • @李丰-w9h
    @李丰-w9h 5 months ago +1

    At 40:10, what if we set the initial value at a point where the gradient is negative? Then shouldn't we increase theta rather than decrease it?

    • @anikdas567
      @anikdas567 4 months ago +1

      Even then we apply the same update rule, theta := theta - alpha * gradient. Why? The aim is to find a minimum, so if you start with a negative slope (aka gradient), you need to adjust the parameters (theta) so that the slope approaches zero, since the slope is zero at the minimum. If you look at the graph of a quadratic, the logic is immediate: it does not matter whether you start with a positive or a negative slope, the same rule moves theta toward the minimum. With a negative slope, subtracting alpha times a negative number increases theta; with a positive slope, it decreases theta.
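
The sign question in this thread can be checked with a tiny sketch. The cost J(theta) = (theta - 3)^2, the starting points and the step size below are made up for illustration. The same update, theta := theta - alpha * gradient, is applied from a start with a negative slope and from a start with a positive slope.

```python
def grad_J(theta):
    # Derivative of the illustrative cost J(theta) = (theta - 3)**2.
    return 2.0 * (theta - 3.0)

def descend(theta, alpha=0.1, steps=100):
    # Repeatedly apply the gradient descent update from the given start.
    for _ in range(steps):
        theta = theta - alpha * grad_J(theta)   # same rule on both sides of the minimum
    return theta

left_start, right_start = 0.0, 6.0   # slope is negative at 0.0 and positive at 6.0

# One update from each side, to see which way theta moves.
first_step_left = left_start - 0.1 * grad_J(left_start)
first_step_right = right_start - 0.1 * grad_J(right_start)

print(first_step_left, first_step_right)
print(descend(left_start), descend(right_start))
```

Starting left of the minimum, the gradient is negative, so subtracting it moves theta up; starting right of the minimum, theta moves down. Both runs head toward theta = 3.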

  • @techpasya974
    @techpasya974 6 months ago +1

    Are the lecture notes for this publicly available? I have been watching this playlist and I think the lecture notes would be very helpful.

    • @KorexeroK
      @KorexeroK 5 months ago

      cs229.stanford.edu/main_notes.pdf

  • @michaelcochran6260
    @michaelcochran6260 6 months ago +1

    Took me quite some time to realize this class was not being taught to darth vader

  • @atefehebrahimi4958
    @atefehebrahimi4958 11 days ago

    Guys, what is the website the professor mentions around 27:18?

  • @promariddhidas6895
    @promariddhidas6895 5 months ago +1

    i wish i had access to the problem sets for this course

    • @akshat_senpai
      @akshat_senpai 4 months ago

      Maybe on GitHub...

  • @labiditasnim623
    @labiditasnim623 1 year ago +1

    Why did he use 1/2 in the cost function and not 1/(2m)?

  • @wishIKnewHowToLove
    @wishIKnewHowToLove 1 year ago +1

    It's hard, but everything that's worth doing is

  • @mortyrickerson6322
    @mortyrickerson6322 1 year ago +2

    Fantastic. Thank you deeply for sharing

  • @jerzytas
    @jerzytas 11 months ago +1

    In the very last equation (the normal equation, 1:18:06), Transpose(X) appears on both sides of the equation. Can't this be simplified by dropping Transpose(X)?

    • @manasvi-fl6xq
      @manasvi-fl6xq 8 months ago +2

      No, because X is not necessarily a square matrix
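
A small numerical sketch of why X^T cannot simply be dropped, assuming NumPy and a made-up 5-by-2 design matrix (the numbers are illustrative only). It solves the normal equation (X^T X) theta = X^T y and then inspects the residual X theta - y.

```python
import numpy as np

# Hypothetical design matrix: m = 5 examples, 2 columns (intercept x_0 = 1 and one feature).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Normal equation: (X^T X) theta = X^T y. Here X^T X is 2x2 and invertible.
theta = np.linalg.solve(X.T @ X, X.T @ y)

residual = X @ theta - y
print(theta)
print(residual)         # not all zero: X theta = y has no exact solution for this data
print(X.T @ residual)   # approximately zero: this is what the normal equation requires
```

X is 5-by-2, so it has no inverse, and X theta = y is five equations in two unknowns. The normal equation only asks that the residual be orthogonal to the columns of X, which is a weaker condition than the residual being zero.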

  • @PhilosophyOfWinners
    @PhilosophyOfWinners 11 months ago +2

    Loving the lectures!!

  • @chandarayi5673
    @chandarayi5673 1 year ago +1

    I love you Sir Andrew, you inspire me a lot haha

  • @bharathanumandla6305
    @bharathanumandla6305 20 days ago

    When will the practice lectures start?

  • @AmanSainiIITIAN
    @AmanSainiIITIAN 26 days ago

    Where can I find the lecture notes?

  • @GameFlife
    @GameFlife 1 year ago +1

    I need those lecture notes ASAP, professor

  • @danilvinyukov2060
    @danilvinyukov2060 2 months ago

    1:17:31
    Can't we just get rid of the X transpose on the left of both sides of the equation? As I remember from linear algebra, if the same matrix multiplies both sides of an equation from the same side, it is redundant and can be removed.
    The result should be X(theta) = y => (theta) = X^(-1) y
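
A short worked note on the cancellation proposed here, assuming X is the m-by-(n+1) design matrix with more training examples than columns. Cancelling a matrix from the left of both sides is only valid when that matrix can be undone from the left, and a wide matrix such as X^T cannot be, since its rank is at most n+1 < m. For the same reason X^{-1} is not defined for a non-square X.

```latex
\begin{aligned}
X^{\top}X\,\theta = X^{\top}y
  \;&\Longleftrightarrow\; X^{\top}\bigl(X\theta - y\bigr) = 0,
  \qquad X \in \mathbb{R}^{m\times(n+1)},\; m > n+1 \\
X^{\top}\bigl(X\theta - y\bigr) = 0
  \;&\not\Longrightarrow\; X\theta - y = 0 \\
\theta \;&=\; \bigl(X^{\top}X\bigr)^{-1}X^{\top}y
  \qquad \text{when } X^{\top}X \text{ is invertible}
\end{aligned}
```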

  • @aliiq6572
    @aliiq6572 1 year ago +1

    Can I get notes for these lectures?

  • @ObaroJohnson-q8v
    @ObaroJohnson-q8v 2 months ago

    The formula looks like the variance formula. I'd be interested to know why we have that 1/2 in front of the loss function. Could we just use the variance formula instead, or is there a theory behind that? Thanks

  • @gauravpadole1035
    @gauravpadole1035 1 year ago +1

    can anyone pls explain what do we mean by "parameters" that is denoted by theta here?

    • @SteveVon7
      @SteveVon7 9 months ago +1

      Parameters are TRAINABLE numbers in the model, such as weights and biases, since the model's prediction is based on some combination of weight and bias values. So when the parameters (theta) are changed or 'trained', it means that the weights and biases are changed or trained.
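
To make the role of theta concrete, here is a minimal sketch of a linear hypothesis, assuming NumPy. The function name `h`, the house-size input and both parameter vectors are hypothetical values chosen for illustration. The input x is fixed; only theta changes, and the prediction changes with it.

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta_0 * x_0 + theta_1 * x_1, with x_0 = 1."""
    return float(np.dot(theta, x))

x = np.array([1.0, 2000.0])        # x_0 = 1 (intercept term), x_1 = a hypothetical house size

theta_a = np.array([50.0, 0.1])    # one choice of parameters
theta_b = np.array([0.0, 0.2])     # another choice of parameters

print(h(theta_a, x))
print(h(theta_b, x))
```

Training means searching for the theta whose predictions best match the targets y over the training set.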

  • @Legends-t2t
    @Legends-t2t 1 year ago +2

    How can I get the lecture notes PDF?

  • @shashankshekharjha6913
    @shashankshekharjha6913 3 months ago

    okay so the superscript i, ( 1 to m) represents the number of features, right? Because here m = 2 and I don't understand why m = # training examples

  • @akashrathod2285
    @akashrathod2285 21 days ago

    Is this man teaching aliens?

  • @Baru_Bangun_Tidur
    @Baru_Bangun_Tidur 1 year ago

    1:14:54 My answer is (X^T Xθ )+(X^T θ^T X)-(X^T Y)-(Y X^T). Is it the same, or is my answer wrong?
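
For comparison, a sketch of the matrix derivative under the standard definition J(θ) = ½ (Xθ − y)^T (Xθ − y), which is assumed here rather than copied from the board. Every term of the gradient must be a vector of the same length as θ, which is a quick sanity check on any intermediate answer.

```latex
\begin{aligned}
J(\theta) &= \tfrac{1}{2}\,(X\theta - y)^{\top}(X\theta - y)
           = \tfrac{1}{2}\bigl(\theta^{\top}X^{\top}X\theta
             - \theta^{\top}X^{\top}y - y^{\top}X\theta + y^{\top}y\bigr) \\
\nabla_{\theta} J(\theta)
          &= \tfrac{1}{2}\bigl(2\,X^{\top}X\theta - X^{\top}y - X^{\top}y\bigr)
           = X^{\top}X\theta - X^{\top}y
\end{aligned}
```

Setting this gradient to zero gives the normal equation X^T X θ = X^T y.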

  • @Gatsbi
    @Gatsbi 6 months ago

    Had to study basic calculus and linear algebra at the same time to understand a bit, but I don't get it fully yet.

  • @RHCIPHER
    @RHCIPHER 1 year ago +1

    This man is a great teacher

  • @puspjoc9975
    @puspjoc9975 3 months ago

    Where can I get the full, detailed notes? Anyone who knows, please reply.

  • @ZDixon-io5ww
    @ZDixon-io5ww 2 years ago +13

    47:00
    51:00 - batch
    55:00 problem 1 set
    57:00 for p 0

  • @olinabin2004
    @olinabin2004 1 year ago +1

    8:42 - 14:42 - Terminologies completion
    17:51 -- Checkpoint
    57:00 - run1

  • @Rubariton
    @Rubariton 1 year ago

    That feel when you need to pause the video every n-minutes and need to google the terminology coz highschool was too long ago