How Deep is your... Neural Network? How deep should it be?

How many hidden layers should a network have? How large or deep can, or should, a fully-connected neural network be?

All good questions. Here we explore some answers.

This book chapter has the best discussion I have found of how large or deep a fully-connected neural network can or should be:

TensorFlow for Deep Learning

Chapter 4. Fully Connected Deep Networks This chapter will introduce you to fully connected deep networks. Fully connected networks are the workhorses of deep learning, used for thousands of applications. … - Selection from TensorFlow for Deep Learning [Book]

At present, it looks like theoretically demonstrating (or disproving) the superiority of deep networks is far outside the ability of our mathematicians.

One way of thinking about fully connected networks is that each fully connected layer effects a transformation of the feature space in which the problem resides. The idea of transforming the representation of a problem to render it more malleable is a very old one in engineering and physics. It follows that deep learning methods are sometimes called “representation learning.”
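To make the "transformation of the feature space" idea concrete, here is a minimal sketch in plain Python. The weights are picked by hand for illustration (they are an assumption, not learned and not taken from the book): a two-unit ReLU layer re-maps the four XOR points so that a purely linear readout can then produce the right answer.

```python
def relu(z):
    """Rectified linear unit: pass positives through, clamp negatives to 0."""
    return max(0.0, float(z))


def hidden_layer(x1, x2):
    """Map an input point into a new 2-D feature space (hand-picked weights)."""
    h1 = relu(x1 + x2)        # grows with the number of active inputs
    h2 = relu(x1 + x2 - 1.0)  # non-zero only when both inputs are active
    return h1, h2


def readout(h1, h2):
    """A purely linear function of the transformed features."""
    return h1 - 2.0 * h2


xor_points = [(0, 0), (0, 1), (1, 0), (1, 1)]
for x1, x2 in xor_points:
    h = hidden_layer(x1, x2)
    print((x1, x2), "-> hidden", h, "-> output", readout(*h))
```

In the original (x1, x2) space no single straight line separates the XOR classes; in the (h1, h2) space a linear readout is enough. That is the sense in which a layer changes the representation of the problem.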

A lively discussion on stats.stackexchange.com is more practical:

How to choose the number of hidden layers and nodes in a feedforward neural network?

Is there a standard and accepted method for selecting the number of layers, and the number of nodes in each layer, in a feed-forward neural network? I’m interested in automated ways of building neu...

An answer quotes:

Determining the Number of Hidden Layers

Number of Hidden Layers	Result
0	Only capable of representing linearly separable functions or decisions
1	Can approximate any function that contains a continuous mapping from one finite space to another
2	Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy

From Introduction to Neural Networks for Java (second edition) by Jeff Heaton
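The first row of that table can be checked by hand. The sketch below is an illustration only, not code from Heaton's book: it grid-searches the weights and bias of a single threshold unit (a network with zero hidden layers) and looks for a setting that reproduces a target truth table. The grid range and resolution are arbitrary assumptions.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]  # linearly separable
XOR = [0, 1, 1, 0]  # not linearly separable


def fits(w1, w2, b, targets):
    """True if step(w1*x1 + w2*x2 + b) reproduces every target value."""
    return all(
        int(w1 * x1 + w2 * x2 + b > 0) == t
        for (x1, x2), t in zip(INPUTS, targets)
    )


def find_linear_unit(targets, steps=17, low=-2.0, high=2.0):
    """Grid-search weights and bias; return the first fit found, or None."""
    grid = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    for w1, w2, b in product(grid, repeat=3):
        if fits(w1, w2, b, targets):
            return w1, w2, b
    return None


print("AND:", find_linear_unit(AND))
print("XOR:", find_linear_unit(XOR))
```

The search finds weights for AND but returns None for XOR. A finer grid would not help: fitting XOR needs b <= 0, w1 + b > 0, w2 + b > 0 and w1 + w2 + b <= 0 at the same time, and those four conditions contradict each other.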

Another answer says:

More than 2 [Number of Hidden Layers] – Additional layers can learn complex representations (sort of automatic feature engineering) for later layers.

These nice academic folks wrote a whole paper exploring heuristics and things like genetic algorithms to find the optimal size and depth of a fully-connected neural network:

How many hidden layers and nodes?

(2009). How many hidden layers and nodes? International Journal of Remote Sensing: Vol. 30, No. 8, pp. 2133-2147.

Maximum accuracy was achieved with a network with 2 hidden layers, of which the topology was found using a genetic algorithm.

I have extracted Table 2 from the paper for your viewing pleasure:

Note that for deeper topologies (i.e. more hidden layers), the variance of accuracy and gap between max and min accuracies are far larger. This implies more time and effort is needed to figure out the best training method for a deeper network.

It seems that deeper networks can achieve higher accuracy due to better representation learning. However, they are much less stable during training, and many training iterations may be required to exceed the performance of a shallower fully-connected neural network. This implies that a system should be in place to permute or learn the hyper-parameter search space.
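One simple way to explore such a search space is random search, sketched below. Note that this is a swapped-in technique for illustration, not the genetic algorithm used in the paper. The search space values, the `random_search` helper, and the `toy_evaluate` stand-in are all hypothetical; in practice `evaluate` would train a network with the given configuration and return its validation accuracy.

```python
import random

# Hypothetical search space: these ranges are illustrative assumptions.
SEARCH_SPACE = {
    "hidden_layers": [1, 2, 3, 4],
    "nodes_per_layer": [16, 32, 64, 128, 256],
}


def sample_config(rng):
    """Draw one configuration uniformly at random from SEARCH_SPACE."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}


def random_search(evaluate, n_trials=20, repeats=3, seed=0):
    """Return (best_config, best_mean_score) over randomly sampled configs.

    Each config is scored `repeats` times and the scores are averaged,
    because a single run of an unstable network can be misleading.
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        scores = [evaluate(config, run) for run in range(repeats)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score


def toy_evaluate(config, run):
    """Placeholder for real training. Returns an arbitrary number in [0, 1);
    it is NOT a measurement and says nothing about which topology is best."""
    key = config["hidden_layers"] * 1000 + config["nodes_per_layer"] + run
    return random.Random(key).random()


best_config, best_score = random_search(toy_evaluate)
print(best_config, round(best_score, 3))
```

Averaging over repeated runs is the design choice that matters here: when accuracy varies widely between runs of the same topology, comparing single runs would mostly reward luck.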

As for how many nodes per hidden layer, the evidence seems to point towards larger numbers, combined with the dropout hyperparameter to avoid overfitting the model.
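For readers who have not met it, here is a minimal sketch of dropout, the overfitting control mentioned above, written in plain Python rather than with any particular framework. It uses the common "inverted" formulation; the function name and signature are assumptions made for this example.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout on a list of activations.

    During training each value is zeroed with probability `rate`, and the
    survivors are scaled by 1 / (1 - rate) so the expected value of each
    activation is unchanged. At inference time the input is returned as is.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [a * keep_scale if rng.random() >= rate else 0.0 for a in activations]


layer_output = [0.5, 1.2, 0.0, 2.0, 0.7, 1.1]
print(dropout(layer_output, rate=0.5, rng=random.Random(0)))
print(dropout(layer_output, rate=0.5, training=False))
```

Because a different random subset of nodes is silenced on every training step, a wide layer cannot rely on any single node, which is why a large layer plus dropout can generalize better than the node count alone would suggest.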