Linear and Logistic Regressions as Degenerate Neural Networks in Keras

If you are tasked with creating a prediction for some measure, you may wonder whether a simple linear or multiple regression would be sufficient (or, if the value to predict is binary, a logistic regression), or whether you should reach for a neural network. You may also wonder how much coding it would take to try out the different models. The good news is that the high-level neural network framework Keras is sufficient for all of these purposes, as I will show using a simple example.

Aaron Zhu’s overview of the regression models used here is a great start, so here we will only mention the basics.

Linear regressions

The goal with simple linear regression is to model an outcome (Y[i]), a continuous variable, as a linear function of a continuous input variable (X[i]) using two constants:

𝛃X[i] + 𝛂 = Y[i]*

The error in the prediction (Y[i]* - Y[i]) is usually measured using the sum of squared errors, because minimizing this error conveniently also maximizes the likelihood that 𝛂 and 𝛃 describe the underlying model given the observed data of X and Y (assuming the errors are normally distributed).
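To see why, note that if the prediction errors are independent and normally distributed, the log-likelihood of the observed data is, up to terms that do not depend on 𝛂 and 𝛃,

log likelihood(𝛂, 𝛃) = constant - Σ[i] (Y[i]* - Y[i])² / (2 · error variance)

so maximizing the likelihood is exactly the same as minimizing the sum of squared errors.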

Notice that the mapping from X to Y* is actually the same as the one provided by a fully connected (or dense) layer in a neural network. This dense layer is very simple: it has one input, and one output; 𝛃 is the weight associated with the single input, and 𝛂 is the bias.

A simple linear regression can be solved directly: the values of 𝛂 and 𝛃 can be found with some algebra. But if we decide to optimize their values with gradient descent instead, what we get is a very simple, degenerate neural network with a single dense layer and a squared error loss function. Both of these are readily available in Keras. I call this neural network degenerate because it has no activation function, so it does not even contain actual perceptrons.

Why stop here, though? If we have multiple input variables, X1, X2, ... Xn, we may consider multiple regression, where the single outcome (Y) is predicted using a vector of factors:

𝛃1X1[i] + 𝛃2X2[i] + ... + 𝛃nXn[i] + 𝛂 = Y[i]*

The direct algebraic method to obtain the 𝛃s and 𝛂 is to solve a system of linear equations. The alternative, gradient descent as above, can be less resource intensive when the number of inputs or observations grows large, and it again yields a degenerate neural network: this time a dense layer with n inputs and a single output, with 𝛃1, 𝛃2, ..., 𝛃n as the weights and 𝛂 as the bias. For the same reason as with simple linear regression, the loss function of choice is the sum of squared errors.
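In Keras, such a degenerate network takes only a few lines. Here is a minimal sketch; the number of inputs and the choice of optimizer are illustrative assumptions, not prescriptions:

from tensorflow import keras

n_inputs = 3  # illustrative: one input gives simple, several give multiple regression

# A single dense layer with one output and no activation computes
# 𝛃1X1 + 𝛃2X2 + ... + 𝛃nXn + 𝛂, exactly the (multiple) linear regression.
model = keras.Sequential([
    keras.Input(shape=(n_inputs,)),
    keras.layers.Dense(1),
])

# Minimizing the mean squared error is equivalent, up to a constant factor,
# to minimizing the sum of squared errors.
model.compile(optimizer="sgd", loss="mean_squared_error")

# model.fit(X, Y, epochs=100) would then run the gradient descent.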

Logistic regression

Sometimes, though, we need to predict not a continuous value but a true-or-false one, which is where logistic regression enters the picture. It is a linear regression whose outcome is fed into the logistic function 𝝈(x) = 1/(1+exp(-x)), a sigmoid function that maps all real numbers into the (0, 1) interval. This makes the output of a logistic regression interpretable as a probability: if we are trying to predict whether a person will buy a red or a green balloon, it could be the probability that they go for the red one.

𝝈(𝛃1X1[i] + 𝛃2X2[i] + ... + 𝛃nXn[i] + 𝛂) = Y[i]* = prob(person[i] buys red balloon)

Here the usual sum of squared errors loss would not directly maximize the likelihood that 𝛃1, 𝛃2, ..., 𝛃n, 𝛂 describe the underlying model. The loss function we need to use to achieve that is called binary cross-entropy. (See Aaron Zhu’s article for a derivation.) The logistic function makes it harder to find an optimal solution directly, so, as with multiple regression, gradient descent can offer a faster and less resource-intensive alternative.

It will come as no surprise that with gradient descent, optimizing a logistic regression is equivalent to training a simple neural network, built from components available in the arsenal of all relevant software packages. In fact, a logistic regression is a single perceptron, which in Keras can be modeled as a dense layer with one output and a sigmoid activation. Training this model with the binary cross-entropy loss function gives us exactly what we want.
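The corresponding Keras sketch differs from the linear one only in the activation and the loss (again, the number of inputs and the optimizer are illustrative choices):

from tensorflow import keras

n_inputs = 3  # illustrative

# A dense layer with one output and a sigmoid activation computes
# 𝝈(𝛃1X1 + ... + 𝛃nXn + 𝛂), exactly the logistic regression.
model = keras.Sequential([
    keras.Input(shape=(n_inputs,)),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Binary cross-entropy is the loss under which gradient descent
# maximizes the likelihood of the logistic model.
model.compile(optimizer="sgd", loss="binary_crossentropy")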

An example

The upshot is that there is no need to write separate code or call separate libraries to try linear and logistic regressions, or full-fledged neural networks. Simply varying the neural network model lets us try all three types of model quickly and easily.

To demonstrate this, we create a toy example. Given a person’s age, relationship status and number of children, we try to predict how many balloons they buy, and whether the balloons are red or green. (Or how many chairs they buy and whether they need a wide or narrow dining table.) We represent relationship status as a single value that is either -0.5 for partnered, or +0.5 for single.

In our toy example we choose an underlying model that is not entirely linear, to see whether full neural networks fare better than the linear models. We generate our inputs as uniform random values from the [-0.5, +0.5] interval, so that the inputs are already normalized: they have zero mean and equal variance. Then we calculate:

# The underlying model, applied to each person:
if relationship == -0.5:  # partnered
    number_of_balloons = 1. * children - .2 * age
    balloon_color = 1 if (.8 * children + .2 * age > 0) else 0  # 1 = red, 0 = green
else:  # single
    number_of_balloons = .8 * children + .5 * age
    balloon_color = 1 if (.5 * children + .5 * age > 0) else 0
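For reference, here is a minimal numpy sketch of how such a dataset could be generated; the sample count and variable names are my own choices rather than the gist’s exact code:

import numpy as np

n_samples = 10_000  # illustrative sample count
rng = np.random.default_rng()

# Relationship status: -0.5 (partnered) or +0.5 (single), half and half on average.
relationship = rng.choice([-0.5, 0.5], size=n_samples)
# Number of children and age: uniform on [-0.5, +0.5], so already zero-mean.
children = rng.uniform(-0.5, 0.5, size=n_samples)
age = rng.uniform(-0.5, 0.5, size=n_samples)

partnered = relationship == -0.5
num_balloons = np.where(partnered,
                        1.0 * children - 0.2 * age,
                        0.8 * children + 0.5 * age)
balloon_color = np.where(partnered,
                         0.8 * children + 0.2 * age > 0,
                         0.5 * children + 0.5 * age > 0).astype(int)

# Inputs in the order relationship, children, age (the order used when the weights are discussed below).
X = np.stack([relationship, children, age], axis=1)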

We create four models. To predict the number of balloons: a multiple linear regression and a neural network with no sigmoid function at the end, both trained with a sum of squared errors loss. To predict the color of the balloons: a logistic regression and a neural network with a final sigmoid function, both trained with binary cross-entropy as the loss.

See this gist for the code that trains these models and, for the regression ones, also displays the weights of the dense layers (corresponding to the 𝛃s and 𝛂).
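As a rough sketch of what the gist does (not its exact code; the hidden-layer size and the optimizer are my assumptions), the four models could be built along these lines:

from tensorflow import keras

def build_model(output_type, model_type):
    # output_type: "num_balloons" (continuous) or "color" (binary).
    # model_type: "regression" (a single dense layer) or "neural" (adds a hidden layer).
    layers = [keras.Input(shape=(3,))]  # relationship, children, age
    if model_type == "neural":
        # Illustrative hidden layer; the gist may use a different size or activation.
        layers.append(keras.layers.Dense(16, activation="relu"))
    if output_type == "color":
        layers.append(keras.layers.Dense(1, activation="sigmoid"))
        loss = "binary_crossentropy"
    else:
        layers.append(keras.layers.Dense(1))
        loss = "mean_squared_error"
    model = keras.Sequential(layers)
    model.compile(optimizer="adam", loss=loss)
    return model

# Training and weight inspection would then look something like:
# model = build_model("num_balloons", "regression")
# model.fit(X, num_balloons, validation_split=0.2, epochs=1000)
# print(model.layers[-1].weights)  # the kernel (the 𝛃s) and the bias (𝛂)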

One example run produced the following output - reproduced here with slight modifications for readability:

======= Output type: num_balloons Model type: regression
Epoch 1/1000
loss: 0.0526 - val_loss: 0.0205
Epoch 2/1000
loss: 0.0205 - val_loss: 0.0201
Epoch 3/1000
loss: 0.0204 - val_loss: 0.0207

Weights:
[<'dense/kernel:0' ([[-0.00324172], [0.43457505], [0.1423042]])>,
<'dense/bias:0' ([0.00032589])>]

======= Output type: num_balloons Model type: neural
Epoch 1/1000
loss: 0.0206 - val_loss: 1.1117e-04
Epoch 2/1000
loss: 1.2853e-04 - val_loss: 1.1776e-04

======= Output type: color Model type: regression
Epoch 1/1000
loss: 0.5130 - val_loss: 0.2387
Epoch 2/1000
loss: 0.2213 - val_loss: 0.2005
Epoch 3/1000
loss: 0.2009 - val_loss: 0.1935
Epoch 4/1000
loss: 0.1965 - val_loss: 0.1969

Weights:
[<'dense/kernel:0' ([[0.02176554], [14.170614], [8.668548]])>,
<'dense/bias:0' ([0.02175274])>]

======= Output type: color Model type: neural 
Epoch 1/1000
loss: 0.4589 - val_loss: 0.0651
Epoch 2/1000
loss: 0.0518 - val_loss: 0.0345
Epoch 3/1000
loss: 0.0271 - val_loss: 0.0199
Epoch 4/1000
loss: 0.0161 - val_loss: 0.0147
Epoch 5/1000
loss: 0.0119 - val_loss: 0.0098
Epoch 6/1000
loss: 0.0104 - val_loss: 0.0098

The first observation is that the neural models fared better than the regressions in both cases (a final validation loss of 1.18e-04 vs. 0.0207 for the number of balloons, and 0.0098 vs. 0.1969 for the color). As expected, they could model the non-linear relationships.

The weights returned by the regressions merit a bit more analysis and sanity checking. For the number of balloons, the multiple regression predicts

-0.003 relationship + 0.435 num_children + 0.142 age + 0.000 = num_of_balloons

Since the number of children and the age both have a mean of zero, the mean number of balloons returned by the underlying model is also zero for both relationship statuses. This is reflected in the fact that the factor for the relationship input and the bias are both practically zero.

The underlying model produces the same number of examples for the two relationship statuses, so we expect the best linear approximation of their combination to lie midway between the two linear expressions of the underlying model. Indeed this is what we find: the factor for the number of children is close to (1.0 + 0.8) / 2 = 0.45, and the factor for the age is close to (0.5 - 0.2) / 2 = 0.15.

For the color of the balloons, the logistic regression gives the model

𝝈(0.0218 relationship + 14.171 num_children + 8.669 age + 0.0218) = prob(red)

The same symmetries apply as before, and again we find that the factor for the relationship and the bias are close to zero, especially compared to the other two numbers. We again expect the linear part of the model to lie between the two linear expressions: (0.8 + 0.5) / 2 = 0.65 for the number of children and (0.2 + 0.5) / 2 = 0.35 for the age. However, because of the sigmoid function the model only cares whether its linear part is negative or positive, so there is an arbitrary scaling factor, and it is the ratio of the coefficients that we should compare. Taking this into account, the factors do make sense: 14.171 / 8.669 = 1.635, which is reasonably close to 0.65 / 0.35 = 1.857.

In fact, the scaling factor is not arbitrary: the large weights make the input to the sigmoid function large in absolute terms, which forces its output to be very close to zero or one.

Conclusion

We have seen how neural networks are supersets of linear and logistic regressions, and how, with the existing software components used to build neural networks, we can implement regression models very easily. Implementing regression models this way also makes it easy to upgrade them to a full neural network later if necessary.
