
A Beginner's Guide to Machine Learning Regression Analysis

Regression analysis explained with examples, diagrams, animations, and cheat sheets.

CONTEXT-

To understand the motivation behind regression, consider the following basic example. The scatter plot below shows the number of college graduates in the United States from 2001 to 2012.


What if someone asked you, based on the available data, how many college graduates with master's degrees there will be in 2018? The number of college graduates with master's degrees increases almost linearly with each passing year. So, from a quick visual review, we can estimate the number to be between 2.0 and 2.1 million. Let's check the actual numbers. The graph below plots the same variable from 2001 to 2018. As can be seen, our estimate was within a few percent of the actual value.

Our mind solved the problem easily because it was a simple one (fitting a line to data). Regression analysis is the method of fitting a function to a collection of data points.


What is Regression Analysis and How Does It Work?

Regression analysis is the method of estimating the relationship between a dependent variable and one or more independent variables. To put it another way, it means fitting a function from a selected family of functions to the sampled data while accounting for some error. Regression analysis is one of the most basic estimation methods in machine learning. You fit a function to the available data and use it to predict the outcome for future or held-out data points. This function-fitting is beneficial in two ways.


  1. Within your data set, you can estimate missing data (Interpolation)

  2. Outside of your data set, you can make educated guesses about future data (Extrapolation)


Predicting the price of a house based on house features, predicting the effect of SAT/GRE scores on college admissions, predicting sales based on input parameters, predicting the weather, and so on are some real-world examples of regression analysis.

Let's go back to the college graduates example from earlier.


Interpolation:

Suppose we only have access to sparse data, such as the number of college graduates every four years, as depicted in the scatter plot below.


For the years in between, we'd like to estimate the number of college graduates. We can do this by fitting a line to the few data points we have. Interpolation is the term for this procedure.

Extrapolation:

Say we only have minimal data from 2001 to 2012 and want to forecast the number of college graduates from 2013 to 2018.

The number of college graduates with master's degrees increases almost linearly with each passing year, so fitting a line to the dataset makes sense. Using the 12 points to fit a line and then testing the line's prediction on the 6 future points shows that the prediction is very close.

In terms of mathematics,
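Using standard notation, with data points (x_i, y_i), a function family f_beta, and a loss function l (the same quantities described in the next section), the general regression problem can be written as:

\[
\hat{\beta} \;=\; \arg\min_{\beta} \; \sum_{i=1}^{N} \, l\big(y_i,\; f_{\beta}(x_i)\big)
\]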


Regression analysis types-


Let's take a look at some of the various approaches to regression. We may categorise regression into the following groups based on the family of functions (f beta) and the loss function (l) used.


1. Linear Regression

The aim of linear regression is to fit a hyperplane (a line for 2D data points) by minimising the mean-squared error over the data points.

Mathematically, linear regression solves the following problem:
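In standard least-squares notation, with data points (x_i, y_i) and the two parameters beta 0 and beta 1 described below:

\[
\min_{\beta_0,\,\beta_1} \; \sum_{i=1}^{N} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2
\]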

As a result, we must identify the two variables, denoted beta, that parameterise the linear function f(.). The figure below shows an example of linear regression (here with P = 5 data points); the fitted linear function has beta 0 = -90.798 and beta 1 = 0.046.
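As a minimal sketch, a line can be fitted to this kind of data with NumPy's polyfit (the numbers below are illustrative placeholders, not the exact figures behind the plots above):

```python
import numpy as np

# Years and master's-degree graduate counts in millions (illustrative values).
years = np.array([2001, 2003, 2005, 2007, 2009, 2011])
grads = np.array([1.35, 1.45, 1.55, 1.65, 1.75, 1.85])

# Least-squares line fit: grads ~ beta_1 * year + beta_0.
beta_1, beta_0 = np.polyfit(years, grads, deg=1)

# Interpolation (a year inside the data range) and extrapolation (a future year).
print(beta_0 + beta_1 * 2006)
print(beta_0 + beta_1 * 2018)
```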



2. Polynomial Regression


Linear regression assumes that the relationship between the dependent (y) and independent (x) variables is linear, so it fails to fit data points when their relationship is not linear. Polynomial regression instead fits a polynomial of degree m to the data points. In general, the more complex the function under consideration, the better its fitting capability. Mathematically, polynomial regression solves the following problem:
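In the same notation, with (m+1) coefficients beta 0, ..., beta m:

\[
\min_{\beta_0,\dots,\beta_m} \; \sum_{i=1}^{N} \Big(y_i - \sum_{j=0}^{m} \beta_j\, x_i^{\,j}\Big)^2
\]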

As a result, we must find the (m+1) variables denoted beta 0, ..., beta m. As can be seen, linear regression is a special case of polynomial regression with degree 1.

Consider the scatter plot of the following series of data points. When we use linear regression, we get a fit that simply fails to estimate the data points. However, when we use polynomial regression with degree 6, we get a much better fit, as shown below.

Image- [1] Scatter plot of data — [2] Linear regression on data — [3] Polynomial regression of degree 6


Linear regression struggled to estimate a good fitting function since the data points did not have a linear relationship between the dependent and independent variables. Polynomial regression, on the other hand, was able to capture the non-linear relationship.
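A short NumPy sketch of the same comparison on synthetic non-linear data (the data and degree are assumptions made for illustration):

```python
import numpy as np

# Synthetic non-linear data: a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(30)

# Degree-1 (linear) fit vs degree-6 polynomial fit.
linear_coeffs = np.polyfit(x, y, deg=1)
poly_coeffs = np.polyfit(x, y, deg=6)

# Mean-squared error of each fit on the data points;
# the degree-6 polynomial captures the non-linearity far better.
print(np.mean((np.polyval(linear_coeffs, x) - y) ** 2))
print(np.mean((np.polyval(poly_coeffs, x) - y) ** 2))
```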


3. Ridge Regression

In regression analysis, ridge regression addresses the problem of overfitting. Consider the same example as before. When a polynomial of degree 25 is fitted to 10 training points, the red data points are matched perfectly (centre figure below). However, this comes at the expense of the points in between (note the spike between the last two data points). Ridge regression attempts to solve this problem: by sacrificing some of the fit on the training points, it seeks to reduce the generalisation error.


Image- [1] Scatter plot of data — [2] Polynomial regression of degree 25 — [3] Polynomial ridge regression of degree 25


Ridge regression solves the following problem mathematically by changing the loss function.
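In the notation used above, with a scaling constant alpha > 0 multiplying the squared L2 norm of the weights:

\[
\min_{\beta} \; \sum_{i=1}^{N} \big(y_i - f_{\beta}(x_i)\big)^2 \;+\; \alpha \,\lVert \beta \rVert_2^2
\]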



The function f(x) may be linear or polynomial. When a function overfits the data points in the absence of ridge regression, the learned weights tend to be very large. Ridge regression prevents overfitting by adding the scaled L2 norm of the weights (beta) to the loss function, which limits the norm of the weights being learned. As a result, the trained model trades off fitting the data points perfectly (which requires weights with a large norm) against keeping the weight norm small. This trade-off is regulated by the scaling constant alpha > 0. A small alpha allows large-norm weights and overfitting of the training data points; a large alpha, on the other hand, yields a poor fit to the training data points but a very small weight norm. The optimal trade-off is achieved by carefully selecting the value of alpha.
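A scikit-learn sketch of this effect on synthetic data (the data, polynomial degree, and alpha are assumptions for illustration): the ridge penalty keeps the learned weights far smaller than the unpenalised polynomial fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Ten noisy training points from a sine curve.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.standard_normal(10)

degree = 9  # deliberately flexible for only 10 points

# Plain polynomial regression: fits the training points (almost) exactly.
plain = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)

# Ridge: same features, but the L2 penalty (alpha) shrinks the weights.
ridge = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1e-3)).fit(x, y)

# The unpenalised weights are typically orders of magnitude larger.
print(np.abs(plain.named_steps["linearregression"].coef_).max())
print(np.abs(ridge.named_steps["ridge"].coef_).max())
```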


4. LASSO Regression

Like ridge regression, LASSO regression acts as a regulariser to prevent overfitting on the training data points. LASSO, however, has an added advantage: it forces the trained weights to be sparse.

Ridge regression shrinks the norm of the learned weights, producing a set of weights with a smaller overall norm, but most (if not all) of the weights remain non-zero. LASSO, on the other hand, sets the majority of the weights to exactly zero. The result is a sparse set of weights that can be stored and evaluated much more efficiently than a dense one, while fitting the data points with comparable accuracy.


On the same example as before, the figure below attempts to visualise this concept. Both ridge and LASSO regression are used to fit the data points, and the fit and weights are plotted in ascending order. As can be seen, the majority of the weights in the LASSO regression are very close to zero.



By changing the loss function, LASSO regression solves the following problem mathematically.
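In the same notation, with the L1 norm of the weights and a scaling constant alpha > 0:

\[
\min_{\beta} \; \sum_{i=1}^{N} \big(y_i - f_{\beta}(x_i)\big)^2 \;+\; \alpha \,\lVert \beta \rVert_1
\]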


LASSO differs from ridge regression in that it uses the L1 norm of the weights rather than the L2 norm. The L1 norm in the loss function encourages sparsity in the learned weights. More detail on how the L1 norm enforces sparsity can be found under the topic of L1 regularisation.


The constant alpha > 0 controls the trade-off between fit and sparsity in the learned weights. A high alpha value leads to a poorer fit but a sparser set of learned weights. A small value of alpha, on the other hand, results in a close fit on the training data points (which may lead to overfitting) but a less sparse set of weights.
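A scikit-learn sketch of the sparsity effect on synthetic data (the data, degree, and alpha values are assumptions for illustration): ridge keeps most polynomial coefficients non-zero, while LASSO drives most of them to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Twenty noisy training points from a sine curve.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.standard_normal(20)

ridge = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1e-3)).fit(x, y)
lasso = make_pipeline(PolynomialFeatures(degree=10),
                      Lasso(alpha=1e-3, max_iter=100_000)).fit(x, y)

# Count non-zero coefficients: ridge keeps nearly all of them,
# LASSO zeroes out the majority (a sparse set of weights).
print(np.sum(ridge.named_steps["ridge"].coef_ != 0))
print(np.sum(lasso.named_steps["lasso"].coef_ != 0))
```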


5. ElasticNet Regression

ElasticNet regression combines ridge and LASSO regression: the loss term includes both the L1 and L2 norms of the weights, each with its own scaling constant. It is often used to overcome limitations of LASSO; in particular, the L1 penalty on its own is not strictly convex, whereas the added quadratic (L2) penalty makes the overall penalty strictly convex and the solution more stable.

ElasticNet regression solves the following problem mathematically by changing the loss function.
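In the same notation, with separate scaling constants alpha 1 and alpha 2 for the L1 and L2 penalties:

\[
\min_{\beta} \; \sum_{i=1}^{N} \big(y_i - f_{\beta}(x_i)\big)^2 \;+\; \alpha_1 \,\lVert \beta \rVert_1 \;+\; \alpha_2 \,\lVert \beta \rVert_2^2
\]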


6. Bayesian Regression

The aim of the frequentist approach (the regressions discussed above) is to find a single set of deterministic weights (beta) that explain the data. In Bayesian regression, instead of finding one value for each weight, we try to find the distribution of the weights, starting from a prior.

So we start with an initial distribution over the weights and, based on the data, nudge it in the right direction using Bayes' theorem, which relates the prior distribution to the posterior distribution through the likelihood of the observed data and the evidence.


When there are infinitely many data points, the posterior weight distribution becomes an impulse at the ordinary least squares solution, i.e. the variance reaches zero.


Finding the distribution of the weights, instead of a single set of deterministic values, serves two purposes.


  • It acts as a regularizer by naturally guarding against the problem of overfitting.

  • It provides a measure of confidence along with the chosen weights, which makes more sense than simply returning a single value.


Let us express the problem and its solution mathematically.


Consider a Gaussian prior with mean and covariance on the weights, i.e.
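Writing the prior mean and covariance as m_0 and S_0 (notation assumed here):

\[
\beta \;\sim\; \mathcal{N}(m_0,\; S_0)
\]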


We update this distribution based on the available data D. For the problem at hand, the posterior will be a Gaussian distribution with the following parameters.
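For linear regression with design matrix X, targets y, and a known noise variance sigma^2 (a standard result, stated here with assumed notation), the posterior is

\[
\beta \mid D \;\sim\; \mathcal{N}(m_N,\; S_N), \qquad
S_N = \big(S_0^{-1} + \sigma^{-2} X^{\top} X\big)^{-1}, \qquad
m_N = S_N \big(S_0^{-1} m_0 + \sigma^{-2} X^{\top} y\big)
\]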



Here's a link to a more comprehensive mathematical description.

Let's try to visualise it by looking at sequential Bayesian linear regression, which updates the weight distribution one data point at a time, as shown in the diagram that follows.

Bayesian regression nudges the posterior distribution in the right direction based on the input data (x, y)

The weights distribution gets closer to the actual underlying distribution as each data point is added.


As each new data point is taken into account, the animation below plots the original data, the inter-quartile range of the prediction, the marginal posterior distributions of the weights, and the joint distribution of the weights at each time step. As more points are added, the inter-quartile range narrows (green shaded area), the marginal distributions concentrate around the two weight parameters with variance approaching zero, and the joint distribution converges to the actual weights.
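A small sketch with scikit-learn's BayesianRidge on synthetic data (assumed values): it is not the sequential update shown in the animation, but it illustrates the same idea of learning a posterior over the weights and reporting predictive uncertainty rather than a single number.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Illustrative 1-D data: a noisy line.
rng = np.random.default_rng(3)
X = np.linspace(0, 10, 25).reshape(-1, 1)
y = 0.5 * X.ravel() + 1.0 + 0.3 * rng.standard_normal(25)

# BayesianRidge learns a Gaussian posterior over the weights
# (and estimates the noise precision) instead of a point estimate.
model = BayesianRidge().fit(X, y)

# return_std=True gives the predictive standard deviation at each query point,
# i.e. an uncertainty band rather than a single value.
mean, std = model.predict(np.array([[12.0]]), return_std=True)
print(mean, std)
```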

7. Logistic Regression

Logistic regression comes in handy in classification tasks, where the output must be the conditional probability of the class given the input data. Mathematically, logistic regression solves the following problem:
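A standard formulation, with a sigmoid sigma applied to f_beta(x) and the cross-entropy loss (consistent with the description below):

\[
\min_{\beta} \; -\sum_{i=1}^{N} \Big[\, y_i \log \sigma\big(f_{\beta}(x_i)\big) \;+\; (1 - y_i)\log\big(1 - \sigma(f_{\beta}(x_i))\big) \Big],
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]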


Consider the scatter plot below, where the data points are assigned to one of two categories: 0 (red) or 1 (yellow).


Image- [1] Scatter plot of data points — [2] Logistic regression trained on data points plotted in blue


Logistic regression applies a sigmoid function to the output of a linear or polynomial function, mapping it from (-∞, ∞) to (0, 1). The test data is then classified into one of the two groups using a threshold (usually 0.5).
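A minimal scikit-learn sketch on synthetic two-class data (values assumed for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two-class 1-D toy data: class 0 centred near x = 1, class 1 near x = 3.
rng = np.random.default_rng(4)
X = np.concatenate([rng.normal(1.0, 0.4, 20),
                    rng.normal(3.0, 0.4, 20)]).reshape(-1, 1)
y = np.array([0] * 20 + [1] * 20)

clf = LogisticRegression().fit(X, y)

# predict_proba returns the sigmoid output in (0, 1);
# predict applies the 0.5 threshold to assign a class.
print(clf.predict_proba([[2.0]]))
print(clf.predict([[2.0]]))
```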


It might appear that logistic regression is a classification algorithm rather than a regression algorithm. However, this is not the case. More information can be found in Adrian's article.

CONCLUSION-

In this post, we looked at different regression analysis approaches, their motivations, and how to use them. The table and cheat sheet that follow summarise the various approaches mentioned previously.
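In brief:

  • Linear regression: fits a line or hyperplane by minimising the mean-squared error.

  • Polynomial regression: fits a degree-m polynomial, capturing non-linear relationships.

  • Ridge regression: adds a scaled L2 penalty on the weights to curb overfitting.

  • LASSO regression: adds a scaled L1 penalty, shrinking most weights towards exactly zero.

  • ElasticNet regression: combines the L1 and L2 penalties with separate scaling constants.

  • Bayesian regression: learns a posterior distribution over the weights instead of point estimates.

  • Logistic regression: passes the output through a sigmoid to produce class probabilities for classification.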







 

Reference- Towards Data Science
