Gradient Descent Algorithm
<aside>
💡
Gradient descent is a mathematical technique that iteratively finds the weights (slope) and bias (intercept) that produce the model with the lowest loss.
</aside>
- Our goal is to minimize the error (RSS, or equivalently RMSE for a value in the original units). Searching for the minimizing weights directly is computationally expensive, so we use the gradient descent algorithm instead.
- Gradient descent is a method for finding the best model weights (e.g. $w_0,w_1$) by repeatedly moving them in the direction that reduces the error the most - like walking downhill toward the lowest point of a bowl-shaped error surface.
- We usually minimize a cost such as MSE (average squared error).
- The gradient is the slope of the cost with respect to each weight; it shows which direction increases the cost.
- We update each weight by stepping opposite to the gradient to reduce the cost.
- The algorithm works by iteratively adjusting the weights in the direction that reduces the error the most.
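The steps above can be sketched in plain Python for a simple linear model $\hat{y} = w_0 + w_1 x$ minimizing MSE. The data, starting weights, learning rate, and step count below are illustrative assumptions:

```python
# Minimal gradient descent sketch for simple linear regression,
# minimizing MSE. Data and hyperparameters are illustrative.

def gradient_descent(xs, ys, alpha=0.05, steps=500):
    w0, w1 = 0.0, 0.0  # start from arbitrary weights
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE with respect to w0 and w1
        grad_w0 = (-2 / n) * sum(y - (w0 + w1 * x) for x, y in zip(xs, ys))
        grad_w1 = (-2 / n) * sum((y - (w0 + w1 * x)) * x for x, y in zip(xs, ys))
        # Step opposite to the gradient to reduce the cost
        w0 -= alpha * grad_w0
        w1 -= alpha * grad_w1
    return w0, w1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # data generated by y = 1 + 2x
w0, w1 = gradient_descent(xs, ys)
print(round(w0, 2), round(w1, 2))  # → 1.0 2.0
```

Each iteration nudges both weights a little downhill; after enough steps they settle near the values that fit the data best.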

- The learning rate $α$ (alpha, sometimes written $η$) controls the step size:
- too large → overshoot / diverge.
- too small → very slow.
- Variants:
- **Batch GD** (gradient descent) uses all data each step.
- Stochastic GD (SGD) uses one sample per step.
- Mini-batch GD uses a small batch.
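The three variants differ only in how many samples feed the gradient estimate at each step. A sketch, with illustrative data and an assumed mini-batch size of 3:

```python
import random

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x + 1 for x in xs]  # illustrative data: y = 1 + 2x
data = list(zip(xs, ys))

def grad(w0, w1, pairs):
    """MSE gradient of y_hat = w0 + w1*x over the given sample pairs."""
    n = len(pairs)
    g0 = (-2 / n) * sum(y - (w0 + w1 * x) for x, y in pairs)
    g1 = (-2 / n) * sum((y - (w0 + w1 * x)) * x for x, y in pairs)
    return g0, g1

batch      = data                   # Batch GD: all samples per step
stochastic = [random.choice(data)]  # SGD: one random sample per step
minibatch  = random.sample(data, 3) # Mini-batch GD: a small random subset

for name, sample in [("batch", batch), ("SGD", stochastic), ("mini-batch", minibatch)]:
    print(name, grad(0.0, 0.0, sample))
```

SGD and mini-batch gradients are noisy estimates of the batch gradient, but each step is much cheaper to compute on large datasets.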
- Gradient descent is an optimizer; RMSE / MSE are error metrics. They’re different things:
- GD finds weights that minimize a chosen loss (often MSE).
- RMSE measures final prediction error in original units.
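To make the distinction concrete: RMSE is simply the square root of MSE, so it reports the same error back in the target's original units. The predictions below are illustrative assumptions:

```python
import math

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 7.0]  # illustrative model predictions

# MSE: the quantity gradient descent typically minimizes
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# RMSE: the same error, reported in the units of y
rmse = math.sqrt(mse)
print(mse, rmse)
```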
The Update Equation:
$$
\begin{align*}
\hat{y}_i &= w_0 + w_1 x_i
\\ \\
\text{MSE} &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\\ \\
w_{\text{new}} &= w_{\text{old}} - \alpha \frac{\partial}{\partial w}\text{Error}(w)
\end{align*}
$$
- $w_{\text{new}}$: the updated weight after one step (what we set next).
- $w_{\text{old}}$: the current weight value (what we have now).
- $α$ (alpha): the learning rate (controls step size).
- Small → slow but safe.
- Large → fast but may overshoot.
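One update step of this equation, worked numerically. The data point, starting weights, and $\alpha$ are illustrative assumptions:

```python
# One gradient descent update for the squared error on a single point.
x, y = 2.0, 5.0    # one illustrative data point
w0, w1 = 0.0, 1.0  # current weights -> y_hat = 0 + 1*2 = 2
alpha = 0.1

y_hat = w0 + w1 * x
# Partial derivatives of the squared error (y - y_hat)^2
grad_w0 = -2 * (y - y_hat)      # = -2 * 3     = -6
grad_w1 = -2 * (y - y_hat) * x  # = -2 * 3 * 2 = -12

# w_new = w_old - alpha * dError/dw
w0_new = w0 - alpha * grad_w0   # 0.0 - 0.1 * (-6)  = 0.6
w1_new = w1 - alpha * grad_w1   # 1.0 - 0.1 * (-12) = 2.2
print(round(w0_new, 2), round(w1_new, 2))  # → 0.6 2.2
```

The prediction (2) was below the target (5), so both gradients are negative and stepping opposite to them pushes both weights up, reducing the error on the next pass.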