Decreasing accuracy in neural nets during training - things to try

Recently I attempted to train a classifier network, only to find that after an initial period of decreasing loss and increasing accuracy, the accuracy quickly dropped to 1/N, where N is the number of classes. In other words, the network became no better than a random guess, and it stayed stuck in this state.

I tried a number of things with more or less success, researched what the usual causes and remedies are, and thought I'd share my experience as the solution that ultimately helped was somewhat surprising. This was:

More stable loss function. I used TensorFlow and Keras, added a Keras softmax layer as the last layer of my model, and used Keras's categorical_crossentropy as the loss function. It turned out that this arrangement can be numerically unstable, causing unwanted behaviour during training. When I removed the softmax layer and switched to TensorFlow's tf.nn.softmax_cross_entropy_with_logits_v2, which takes the raw logits, the problem largely went away. This function computes the softmax and the cross-entropy in one go, with apparently much better numerical stability. (In TensorFlow 2.x the same behaviour is available as tf.nn.softmax_cross_entropy_with_logits.)

To use this as the loss function in a Keras model, I followed this structure:

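import tensorflow as tf

def my_loss(truth, prediction):
    # Labels are constants: do not backpropagate into them.
    truth = tf.stop_gradient(truth)
    # `prediction` must be raw logits (no softmax layer at the end of the model).
    loss = tf.nn.softmax_cross_entropy_with_logits_v2(
        labels=truth, logits=prediction)
    # Average the per-example losses over the batch.
    return tf.reduce_mean(loss)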

model = keras.models.Model(...)
model.compile(loss=my_loss, ...)
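
To see why computing softmax and cross-entropy separately can misbehave, here is a minimal sketch in plain Python. It is only an illustration of the arithmetic, not TensorFlow's or Keras's actual implementation, and the function names and sample logits are made up for the example. The two-step version loses the true class's probability to underflow when one logit dominates, while the fused log-sum-exp version still returns a finite loss.

```python
import math

def naive_cross_entropy(logits, true_index):
    # Step 1: softmax as a separate operation (shifted by the max so exp cannot overflow).
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 2: negative log of the true class's probability.
    # If that probability underflowed to 0, the loss is infinite.
    p = probs[true_index]
    return float("inf") if p == 0.0 else -math.log(p)

def fused_cross_entropy(logits, true_index):
    # Fused form: loss = logsumexp(z) - z[true]; no probability is ever materialised.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_index]

moderate = [2.0, 1.0, 0.1]     # hypothetical well-behaved logits
extreme = [1000.0, 0.0, 0.0]   # hypothetical logits after weights have blown up

print(naive_cross_entropy(moderate, 0), fused_cross_entropy(moderate, 0))
print(naive_cross_entropy(extreme, 1), fused_cross_entropy(extreme, 1))
```

For moderate logits the two functions agree; for the extreme case only the fused one gives a usable number, which is also what the gradient needs in order to stay informative.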

It was surprising that this was necessary, because the network was rather simple and the classes were completely balanced. When I displayed the output of the network just before the softmax layer, I found that once the accuracy got stuck at 1/N, the values for all classes were nearly equal, and quite large.
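
The observation above can be reproduced with a small sketch: when the pre-softmax values are all equal, the softmax output is uniform no matter how large they are, so the prediction carries no information. The helper names and the sample values below are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    # Shift by the max for numerical safety; the result is unchanged mathematically.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def summarize_logits(logits):
    # "spread" shows how far apart the class scores are,
    # "mean_abs" shows how large they are on average.
    return {
        "spread": max(logits) - min(logits),
        "mean_abs": sum(abs(z) for z in logits) / len(logits),
    }

stuck = [87.3, 87.3, 87.3, 87.3]   # made-up values: large and identical
probs = softmax(stuck)
summary = summarize_logits(stuck)

print(probs)     # every class gets probability 1/N
print(summary)   # zero spread, large magnitude
```

Logging a summary like this for a few batches during training is a cheap way to notice the moment the outputs collapse.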

Next are some of the other things I did, some of which did seem to help. Some may help you, too, although YMMV.

Check for bugs in the code. Many people suggested that such behaviour can occur if, due to a bug, a NaN or infinite value ends up in the loss function, for example as a result of taking the logarithm of 0 (which gives -inf, and NaN once it is multiplied by 0).
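
A minimal sketch of such a check in plain Python follows. The helper names and the epsilon value are assumptions made for this example; on real tensors, TensorFlow's tf.debugging.check_numerics serves a similar purpose.

```python
import math

def find_non_finite(values):
    """Return the indices of NaN or +/-inf entries in a flat list of floats."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]

def safe_log(p, eps=1e-12):
    # Clamp the argument away from 0 so the logarithm never sees log(0).
    return math.log(max(p, eps))

batch_losses = [0.1, float("nan"), 2.0, float("inf")]   # made-up values
bad = find_non_finite(batch_losses)
if bad:
    print("non-finite loss values at positions", bad)
```

Running a check like find_non_finite on the inputs, the labels, and the per-example losses narrows down where the bad value first appears.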

Try easy data. To ensure that the model is actually capable of converging, feed it some data that is really easy to train on, like using N words ("aaa", "bbb", "ccc", etc.) as the input with a one-to-one correspondence to classes as the target.
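
One way to generate such a toy dataset is sketched below. The function name, the repeats parameter, and the limit of 26 classes (one per lowercase letter) are assumptions of this example; turning the words into whatever input encoding your model expects is left out.

```python
import string

def make_easy_dataset(num_classes, repeats=10):
    """Build (word, one_hot_target) pairs: "aaa" -> class 0, "bbb" -> class 1, ..."""
    if not 1 <= num_classes <= len(string.ascii_lowercase):
        raise ValueError("this sketch supports between 1 and 26 classes")
    words = [string.ascii_lowercase[i] * 3 for i in range(num_classes)]
    samples = []
    for _ in range(repeats):
        for idx, word in enumerate(words):
            one_hot = [1.0 if j == idx else 0.0 for j in range(num_classes)]
            samples.append((word, one_hot))
    return samples

data = make_easy_dataset(3, repeats=2)
for word, target in data:
    print(word, target)
```

If the model cannot reach near-perfect accuracy on data like this, the problem is more likely in the model or training loop than in the real dataset.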

Reduce the learning rate. There were suggestions online that this behaviour can also be due to exploding gradients or oscillation around a minimum, which can potentially be avoided by reducing the learning rate.
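
The effect of the learning rate can be seen even on a toy problem. The sketch below runs plain gradient descent on f(x) = x^2, which has nothing to do with the original network and is only meant to show the mechanism: with a small step the iterate shrinks towards the minimum, while with a step that is too large it overshoots, flips sign on every update, and grows.

```python
def gradient_descent(lr, x0=1.0, steps=20):
    """Minimise f(x) = x**2 with a fixed learning rate; return every iterate."""
    x = x0
    history = [x]
    for _ in range(steps):
        grad = 2.0 * x          # derivative of x**2
        x = x - lr * grad       # each step multiplies x by (1 - 2 * lr)
        history.append(x)
    return history

small = gradient_descent(lr=0.1)   # factor 0.8 per step: converges
large = gradient_descent(lr=1.1)   # factor -1.2 per step: oscillates and diverges

print("lr=0.1 final x:", small[-1])
print("lr=1.1 final x:", large[-1])
```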

Change the optimizer. Related to the above suggestion, if the loss function is really funny, some optimizers may just not work well with it. Try different ones, and/or try setting their parameters better.

Change the activation function. Changing the activation functions throughout the model will change its characteristics completely. In other projects I found that ReLUs and Leaky ReLUs, although good at mitigating vanishing gradients, allow weights to grow very large. Of course, we have clamping and regularizers available to help with this, too. Also, ReLUs have a large dead region where the derivative is 0, and the network can become stuck there. Leaky ReLUs avoid this problem.
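
The difference in the dead region is easy to see from the derivatives. Below is a minimal sketch in plain Python; the slope alpha=0.01 is a common choice but an assumption here, not something taken from the model above.

```python
def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Zero gradient for all non-positive inputs: the "dead" region.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # A small but non-zero gradient keeps the unit trainable for negative inputs.
    return 1.0 if x > 0 else alpha

for x in (-2.0, 3.0):
    print(x, relu_grad(x), leaky_relu_grad(x))
```

A unit whose input is negative for every training example receives no update under ReLU, whereas under Leaky ReLU it still gets a small gradient and can recover.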

Turn regularizers and batch norm layers on/off. Again these will change the characteristics of the model considerably, which may be better or worse with the given loss function and optimizer.

Reduce the width of the network. I found that a network was more susceptible to this issue if it had more units in its layers. It has been suggested that wide and shallow networks are difficult to train, which may be related.

I hope you'll find some of these suggestions useful. For reference, let me list some of the sources I found:
