Gradient Descent Algorithm
<aside>
💡
Gradient descent is a mathematical technique that iteratively finds the weights (slope) and bias (intercept) that produce the model with the lowest loss.
</aside>
- Our goal is to minimize the error (RSS, or equivalently RMSE for a value in the original units). Searching for the minimizing weights directly is computationally expensive, so we use the gradient descent algorithm instead.
- Gradient descent is a method for finding the best model weights (e.g. $w_0,w_1$) by repeatedly moving them in the direction that reduces the error the most - like walking downhill toward the lowest point of a bowl-shaped error surface.
- We usually minimize a cost such as MSE (average squared error).
- The gradient is the slope of the cost with respect to each weight; it shows which direction increases the cost.
- We update each weight by stepping opposite to the gradient to reduce the cost.
- The algorithm works by iteratively adjusting the weights in the direction that reduces the error the most.
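The steps above can be sketched in plain Python for a simple linear model $\hat{y} = w_0 + w_1 x$ minimizing MSE. The data, starting weights, learning rate, and step count below are illustrative assumptions:

```python
# Minimal gradient descent sketch for simple linear regression,
# minimizing MSE. Data and hyperparameters are illustrative.

def gradient_descent(xs, ys, alpha=0.05, steps=500):
    w0, w1 = 0.0, 0.0  # start from arbitrary weights
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE with respect to w0 and w1
        grad_w0 = (-2 / n) * sum(y - (w0 + w1 * x) for x, y in zip(xs, ys))
        grad_w1 = (-2 / n) * sum((y - (w0 + w1 * x)) * x for x, y in zip(xs, ys))
        # Step opposite to the gradient to reduce the cost
        w0 -= alpha * grad_w0
        w1 -= alpha * grad_w1
    return w0, w1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # data generated by y = 1 + 2x
w0, w1 = gradient_descent(xs, ys)
print(round(w0, 2), round(w1, 2))  # → 1.0 2.0
```

Each iteration nudges both weights a little downhill; after enough steps they settle near the values that fit the data best.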

- The learning rate $α$ (alpha, sometimes written $η$) controls the step size:
- too large → overshoot / diverge.
- too small → very slow.
- Variants:
- **Batch GD** (gradient descent) uses all data each step.
- Stochastic GD (SGD) uses one sample per step.
- Mini-batch GD uses a small batch.
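The three variants differ only in how many samples feed the gradient estimate at each step. A sketch, with illustrative data and an assumed mini-batch size of 3:

```python
import random

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x + 1 for x in xs]  # illustrative data: y = 1 + 2x
data = list(zip(xs, ys))

def grad(w0, w1, pairs):
    """MSE gradient of y_hat = w0 + w1*x over the given sample pairs."""
    n = len(pairs)
    g0 = (-2 / n) * sum(y - (w0 + w1 * x) for x, y in pairs)
    g1 = (-2 / n) * sum((y - (w0 + w1 * x)) * x for x, y in pairs)
    return g0, g1

batch      = data                   # Batch GD: all samples per step
stochastic = [random.choice(data)]  # SGD: one random sample per step
minibatch  = random.sample(data, 3) # Mini-batch GD: a small random subset

for name, sample in [("batch", batch), ("SGD", stochastic), ("mini-batch", minibatch)]:
    print(name, grad(0.0, 0.0, sample))
```

SGD and mini-batch gradients are noisy estimates of the batch gradient, but each step is much cheaper to compute on large datasets.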
- Gradient descent is an optimizer; RMSE / MSE are error metrics. They’re different things:
- GD finds weights that minimize a chosen loss (often MSE).
- RMSE measures final prediction error in original units.
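To make the distinction concrete: RMSE is simply the square root of MSE, so it reports the same error back in the target's original units. The predictions below are illustrative assumptions:

```python
import math

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 7.0]  # illustrative model predictions

# MSE: the quantity gradient descent typically minimizes
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# RMSE: the same error, reported in the units of y
rmse = math.sqrt(mse)
print(mse, rmse)
```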
The Update Equation:
$$
\begin{align*}
\hat{y}_i &= w_0 + w_1 x_i
\\ \\
\text{MSE} &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\\ \\
w_{\text{new}} &= w_{\text{old}} - \alpha \frac{\partial}{\partial w}\text{Error}(w)
\end{align*}
$$
- $w_{\text{new}}$: the updated weight after one step (what we set next).
- $w_{\text{old}}$: the current weight value (what we have now).
- $α$ (alpha): the learning rate (controls step size).
- Small → slow but safe.
- Large → fast but may overshoot.
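One update step of this equation, worked numerically. The data point, starting weights, and $\alpha$ are illustrative assumptions:

```python
# One gradient descent update for the squared error on a single point.
x, y = 2.0, 5.0    # one illustrative data point
w0, w1 = 0.0, 1.0  # current weights -> y_hat = 0 + 1*2 = 2
alpha = 0.1

y_hat = w0 + w1 * x
# Partial derivatives of the squared error (y - y_hat)^2
grad_w0 = -2 * (y - y_hat)      # = -2 * 3     = -6
grad_w1 = -2 * (y - y_hat) * x  # = -2 * 3 * 2 = -12

# w_new = w_old - alpha * dError/dw
w0_new = w0 - alpha * grad_w0   # 0.0 - 0.1 * (-6)  = 0.6
w1_new = w1 - alpha * grad_w1   # 1.0 - 0.1 * (-12) = 2.2
print(round(w0_new, 2), round(w1_new, 2))  # → 0.6 2.2
```

The prediction (2) was below the target (5), so both gradients are negative and stepping opposite to them pushes both weights up, reducing the error on the next pass.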