ML algorithms 1.01: Linear Regression
Introduction
Linear regression is usually the first algorithm anybody learns when they step into the world of Machine Learning, and its main advantage lies in its simplicity. A linear model has high bias and low variance. Normalizing the features helps the algorithm converge faster when using gradient descent. Missing values need to be imputed or dropped before fitting, and outliers, which distort the best-fit line, should be filtered using a boxplot or another method.
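The preprocessing steps above (imputation, boxplot-style outlier filtering, normalization) can be sketched as follows. The data here is a hypothetical toy array invented for illustration, not from the original text:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix with one missing value and one outlier.
X = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])

# Impute missing values with the column mean.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Filter outliers with the 1.5 * IQR rule that boxplots use.
q1, q3 = np.percentile(X_imputed[:, 0], [25, 75])
iqr = q3 - q1
mask = (X_imputed[:, 0] >= q1 - 1.5 * iqr) & (X_imputed[:, 0] <= q3 + 1.5 * iqr)
X_clean = X_imputed[mask]

# Normalize the remaining features so gradient descent converges faster.
X_scaled = StandardScaler().fit_transform(X_clean)
```

In practice the imputer and scaler should be fit on the training split only and reused on the test split.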
Assumptions
- The relation between target and predictor variables is linear
- The features are independent of each other
- Homoskedasticity: the residuals have the same finite variance across all values of the predictors
- There is no relationship between the residuals and the predictor variables, and the residuals are independent of each other
- The residuals are normally distributed
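Several of these assumptions can be checked empirically by fitting a model and inspecting its residuals. A minimal sketch, using hypothetical synthetic data that satisfies the assumptions by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: linear relation with homoskedastic Gaussian noise.
x = rng.uniform(0, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 200)

# Fit by ordinary least squares and inspect the residuals.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# For a well-specified model, residuals average to ~0 and show
# no systematic trend against the predictor.
print(residuals.mean())
print(np.corrcoef(x, residuals)[0, 1])
```

A residuals-versus-predictor plot makes the same check visually: a funnel shape suggests heteroskedasticity, a curve suggests a nonlinear relation.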
Advantages
- Performs very well when assumptions are true
- High model interpretability
Disadvantages
- Real-world data rarely meets all of the assumptions
- Model is prone to overfitting when there are many or highly correlated features
Model
Let X be the feature set with m samples and n features, with a leading column of ones so that the intercept is absorbed into the matrix product. Let y be the continuous response.
Parameters of the model are represented as:

𝜷 = (𝛽₀, 𝛽₁, …, 𝛽ₙ)ᵀ,  so that the prediction is ŷ = X𝜷

We may set the initial parameters 𝜷 of the model close to 0. Let us define the loss (cost) function of the model:

J(𝜷) = (1/2m) Σᵢ (ŷᵢ − yᵢ)²
The negative gradient of the cost function gives the direction in which to move to optimize 𝜷 for our model. We move down the negative gradient of the loss function with step size η (the learning rate):

∇J(𝜷) = (1/m) Xᵀ(X𝜷 − y),  𝜷 := 𝜷 − η ∇J(𝜷)

Note that although the intercept 𝛽₀ is updated during gradient descent, it is excluded from the regularization penalties discussed under Hyperparameter tuning.
The cost function of linear regression is convex, so we iteratively update 𝜷 until the gradient reaches 0. If we cannot reach a gradient of exactly 0, we iterate until its norm falls below some epsilon. After these steps we have obtained the optimal parameters 𝜷 for our model and the best-fit line for our data.
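The update loop above can be sketched in NumPy. The data, learning rate, and stopping tolerance below are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data generated from y = 4 + 3x plus small noise.
x = rng.uniform(0, 2, 100)
y = 4.0 + 3.0 * x + rng.normal(0, 0.1, 100)

X = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept
beta = np.zeros(2)                         # initialize parameters near 0
eta = 0.1                                  # learning rate (step size)
m = len(y)

for _ in range(5000):
    grad = X.T @ (X @ beta - y) / m        # gradient of the squared-error cost
    if np.linalg.norm(grad) < 1e-8:        # stop once the gradient is ~0
        break
    beta -= eta * grad                     # step down the negative gradient

print(beta)  # close to the true parameters (4, 3)
```

Because the cost is convex, this converges to the same solution regardless of the starting point, as long as η is small enough.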
Estimation
Now we can estimate the continuous variable through our model as shown below:

ŷ = X𝜷
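For small problems the optimal 𝜷 can also be obtained in closed form via the normal equations, 𝜷 = (XᵀX)⁻¹Xᵀy, instead of gradient descent. A minimal sketch with hypothetical, exactly linear data (y = 1 + 2x):

```python
import numpy as np

# Hypothetical exact linear data: y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

X = np.column_stack([np.ones_like(x), x])

# Closed-form least-squares solution of the normal equations (X^T X) beta = X^T y.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Estimate the response for new inputs: y_hat = X_new @ beta.
x_new = np.array([4.0, 5.0])
y_hat = np.column_stack([np.ones_like(x_new), x_new]) @ beta
print(y_hat)
```

The closed form costs O(n³) in the number of features, which is why gradient descent is preferred when n is large.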
Hyperparameter tuning
There isn’t much tuning required or possible in Linear Regression. We can use Lasso or Ridge regression techniques to penalize the cost function.
Ridge regression is used to reduce variance in the model by penalizing higher parameter values 𝜷.
Ridge regression cost function:

J(𝜷) = (1/2m) Σᵢ (ŷᵢ − yᵢ)² + λ Σⱼ 𝛽ⱼ²
In Ridge regression the parameters shrink towards 0 but never become 0. Lasso regression can force some parameters to 0. This implements automatic feature selection.
Lasso regression cost function:

J(𝜷) = (1/2m) Σᵢ (ŷᵢ − yᵢ)² + λ Σⱼ |𝛽ⱼ|
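The difference between the two penalties is easy to see on data where only one feature matters. The dataset and penalty strengths below are hypothetical choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Hypothetical data: only the first of five features drives the response.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks all coefficients toward 0 but keeps them nonzero;
# Lasso drives the irrelevant coefficients to exactly 0 (feature selection).
print(ridge.coef_)
print(lasso.coef_)
```

In sklearn the penalty strength λ is called `alpha`, and it is typically chosen by cross-validation (e.g. `RidgeCV`, `LassoCV`).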
Example code:

```python
from sklearn.linear_model import LinearRegression as LR

lr = LR()                     # ordinary least squares linear regression
lr.fit(X_train, y_train)      # learn the parameters 𝜷 from the training data
y_hat = lr.predict(X_test)    # estimate the response for unseen samples
```