Paper Summary: Maxout Networks

Karan Uppal
6 min read · Jan 12, 2023


Goodfellow, Ian, et al. “Maxout networks.” International conference on machine learning. PMLR, 2013.

Link to original paper

Starting off the year with a paper by the father of GANs, we take a look at a unique activation function called Maxout. It is a piecewise linear function that outputs the maximum of a set of learned linear pre-activations and is designed to be used in conjunction with dropout, hence the name. The paper thoroughly analyses the performance of maxout and achieves state-of-the-art results on four benchmark datasets.

1. Introduction

The paper starts off by explaining Dropout. Dropout provides an inexpensive and simple means of both training a large ensemble of models that share parameters and approximately averaging together these models’ predictions.

The authors state that training with dropout differs significantly from ordinary SGD. Under the dropout regime, each update can be seen as making a significant update to a different model on a different subset of the training set. They also note that dropout model averaging is only an approximation when applied to deep models. Thus, explicitly designing models to minimize these weaknesses can lead to better optimization.

2. Review of Dropout

The authors give a brief description of dropout and its similarities with bagging. It differs from bagging in that each sub-model is trained for only one step and all of the sub-models share parameters. One problem that arises is that it is not obvious how to average the predictions of all the sub-models. It turns out that for a single-layer neural network with a softmax activation (logistic regression), this average (more precisely, the renormalised geometric mean of the sub-models' predictions) can be computed exactly by running the full model with all weights divided by 2. The same holds for a network composed entirely of linear layers. For more general networks, this weight scaling is only an approximation, but it performs well in practice.
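To make the weight-scaling trick concrete, here is a small NumPy check (my own illustration, not code from the paper): for a single softmax layer we can enumerate every dropout mask, take the renormalised geometric mean of the sub-models' predictions, and compare it with a single pass through the model with the weights divided by 2.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy single-layer softmax model (logistic regression); sizes are illustrative only.
d, n_classes = 8, 3
W = rng.normal(size=(d, n_classes))
b = rng.normal(size=n_classes)
x = rng.normal(size=d)

# Enumerate every dropout mask over the d inputs and take the renormalised
# geometric mean of the sub-models' softmax predictions.
log_probs = []
for mask in itertools.product([0.0, 1.0], repeat=d):
    log_probs.append(np.log(softmax((x * np.array(mask)) @ W + b)))
geo_mean = np.exp(np.mean(log_probs, axis=0))
geo_mean /= geo_mean.sum()

# Weight scaling: one pass through the full model with all weights divided by 2.
scaled = softmax(x @ (W / 2) + b)

print(np.allclose(geo_mean, scaled))  # True: exact for a single softmax layer
```

The two outputs match; for deeper, nonlinear networks the same scaling is only an approximation, which is the weakness maxout is designed to address.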

3. Description of Maxout

The maxout model uses a new type of activation function, which the authors call the maxout unit. A hidden layer of the model implements the following operations.
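For reference, these operations are (in the paper's notation): for an input x ∈ ℝᵈ, the layer forms k affine pre-activations per unit and takes their maximum,

```latex
z_{ij} = x^\top W_{\cdot ij} + b_{ij}, \qquad W \in \mathbb{R}^{d \times m \times k},\; b \in \mathbb{R}^{m \times k}
h_i(x) = \max_{j \in [1, k]} z_{ij}
```

so each of the m maxout units keeps only the largest of the k linear feature detectors attached to it.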

When training with dropout, we perform the elementwise multiplication with the dropout mask immediately prior to the multiplication by the weights in all cases, so that inputs to the max operator are never dropped.
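A minimal NumPy sketch of such a layer (the function name, shapes and defaults are my own choices for illustration, not the paper's code), with the dropout mask applied to the input before the affine transform as described above:

```python
import numpy as np

def maxout_layer(x, W, b, drop_prob=0.5, rng=None, train=True):
    """Dense maxout layer: x has shape (d,), W has shape (d, m, k), b has shape (m, k)."""
    if train:
        rng = rng or np.random.default_rng()
        mask = rng.random(x.shape) < (1.0 - drop_prob)
        x = x * mask                          # drop inputs, never the arguments of the max
    z = np.einsum('d,dmk->mk', x, W) + b      # k affine pre-activations per unit
    return z.max(axis=-1)                     # maxout: keep the largest piece per unit

# Example: 10-dimensional input, 4 maxout units with k = 3 pieces each.
rng = np.random.default_rng(0)
x = rng.normal(size=10)
W = rng.normal(size=(10, 4, 3))
b = np.zeros((4, 3))
print(maxout_layer(x, W, b, rng=rng).shape)   # (4,)
```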

A single maxout unit can be interpreted as making a piecewise linear approximation to an arbitrary convex function. Maxout networks are, in a sense, learning the activation function of each hidden unit. The example below showcases how it can approximate convex functions in 2D.
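A quick way to see this in one dimension (a toy example of my own, not from the paper): let the k affine pieces be tangent lines of a convex target such as x². The max over the pieces is then a piecewise linear approximation that tightens as k grows.

```python
import numpy as np

# One maxout unit on a scalar input: k affine pieces w_j * x + b_j, output = max_j.
# The pieces are the tangent lines of f(x) = x^2 at k anchor points, so the unit
# traces a piecewise linear (lower) approximation of the convex target.
k = 8
anchors = np.linspace(-3, 3, k)
w = 2 * anchors                 # slope of the tangent of x^2 at each anchor
b = -anchors**2                 # intercept of that tangent

x = np.linspace(-3, 3, 1000)
maxout = np.max(w[None, :] * x[:, None] + b[None, :], axis=1)

print(np.max(np.abs(maxout - x**2)))  # ~0.18 here; the gap shrinks as k grows
```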

In a convolutional network, a maxout feature map can be constructed by taking the maximum across k affine feature maps (i.e., pool across channels), the example for which is shown below.
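A small NumPy sketch of that cross-channel pooling (the layout and function name are my own; deep learning frameworks differ in how they order the channel axis):

```python
import numpy as np

def channel_maxout(feature_maps, k):
    """Maxout over channels: feature_maps has shape (c, h, w) with c = m * k.

    Groups the c affine feature maps into m groups of k and takes an
    elementwise max within each group, giving m maxout feature maps.
    """
    c, h, w = feature_maps.shape
    assert c % k == 0, "channel count must be a multiple of k"
    grouped = feature_maps.reshape(c // k, k, h, w)
    return grouped.max(axis=1)                # shape (m, h, w)

# Example: 12 affine feature maps pooled with k = 4 -> 3 maxout feature maps.
fmaps = np.random.default_rng(0).normal(size=(12, 16, 16))
print(channel_maxout(fmaps, k=4).shape)       # (3, 16, 16)
```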

4. Maxout is a universal approximator

A standard MLP with enough hidden units is a universal approximator. The authors argue that a maxout model with just two hidden units (provided each individual maxout unit may have arbitrarily many affine components) can approximate, arbitrarily well, any continuous function. An outline of the proof is illustrated below.

We will now go over the formal proof in a not-so-formal way:

Proposition 4.1: Any continuous piecewise linear function can be expressed as a difference of two convex piecewise linear functions. (For instance, the concave kink −|x| is not itself convex, but it can be written as 0 − |x|, the difference of two convex piecewise linear functions.)

Proposition 4.2: Let C ⊂ ℝⁿ be a compact domain, f : C → ℝ a continuous function, and ε > 0 any positive real number. Then there exists a continuous piecewise linear function g (depending on ε) such that |f(v) − g(v)| < ε for all v ∈ C.

The proof then combines the two propositions: given a continuous function f and a tolerance ε, Proposition 4.2 yields a continuous piecewise linear function g with |f − g| < ε; Proposition 4.1 writes g as the difference h₁ − h₂ of two convex piecewise linear functions; and each convex piecewise linear function can be computed exactly by a single maxout unit with sufficiently many affine pieces. A maxout network with two hidden units whose outputs are subtracted can therefore approximate f arbitrarily well.

5. Benchmark Results

To evaluate the performance of maxout networks, the authors experiment on four benchmark datasets (MNIST, CIFAR-10, CIFAR-100 and SVHN) and set the state of the art on all of them. For the exact model designs, refer to the paper.

6. Comparison to rectifiers

The authors compare the performance of maxout with ReLU, to verify whether the performance gain is due to improved preprocessing and larger models rather than to the use of maxout itself.

The authors run a large cross-validation experiment, shown above, which clearly indicates that maxout offers an improvement over rectifiers. They also find that their preprocessing and model sizes improve rectifier networks trained with dropout beyond the previous state-of-the-art results.

7. Model Averaging

When using dropout, we know that dividing the weights by 2 at test time gives exact model averaging for a single-layer softmax model. The same argument extends to multiple linear layers (although stacked linear layers have no more representational power than a single one). The authors then argue the following:

They further state that dropout training encourages maxout units to have large linear regions around the inputs they see. This matters because dropout does exact model averaging for any activation, provided the network is locally linear among the space of inputs to each layer that are visited by applying different dropout masks. Maxout units thus make the network behave locally like a linear one, making the approximate model averaging more exact than in a network of strongly nonlinear units.

To test this theory, they compared maxout with a tanh network. It was found that the KL divergence between the approximate model average and the sampled model average was smaller for maxout, indicating that maxout was doing more effective model averaging.
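As a sketch of how such a comparison can be set up (an untrained toy network of my own, purely to show the measurement; the paper does this with trained models), one can compare the weight-scaled prediction against the arithmetic mean of predictions under sampled dropout masks and report the KL divergence between the two:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Tiny 2-layer maxout classifier; sizes are arbitrary, chosen only for illustration.
d, m, k, n_classes = 20, 8, 4, 5
W1 = rng.normal(size=(d, m, k)); b1 = np.zeros((m, k))
W2 = rng.normal(size=(m, n_classes)); b2 = np.zeros(n_classes)
x = rng.normal(size=d)

def forward(x, mask_in=None, mask_h=None, scale=1.0):
    """mask_* drop layer inputs (training-style); scale=0.5 halves weights (inference-style)."""
    xin = x if mask_in is None else x * mask_in
    h = (np.einsum('d,dmk->mk', xin, W1 * scale) + b1).max(axis=-1)   # maxout layer
    h = h if mask_h is None else h * mask_h
    return softmax(h @ (W2 * scale) + b2)

# Approximate model average: a single pass with all weights divided by 2.
approx = forward(x, scale=0.5)

# Sampled model average: arithmetic mean over many dropout masks.
samples = [forward(x, mask_in=rng.random(d) < 0.5, mask_h=rng.random(m) < 0.5)
           for _ in range(5000)]
sampled = np.mean(samples, axis=0)

kl = np.sum(sampled * np.log(sampled / approx))
print(kl)  # the paper reports this gap is smaller for maxout than for tanh networks
```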

8. Optimization

The authors state that maxout is effective compared to rectifier units because it is easier to train with dropout than a ReLU network with cross-channel pooling, particularly when the network has many layers. They verify this empirically with the following experiment.

They trained very deep and narrow models on the MNIST dataset, noting the train and test errors for varying depths. Maxout optimization degrades gracefully with depth but pooled rectifier units worsen noticeably at 6 layers and dramatically at 7.

Next, the authors argue that ReLU includes 0 in its activation: once a unit outputs 0, no gradient flows through it, and the unit can saturate. Maxout does not suffer from this problem because gradient always flows through every maxout unit; even when a maxout unit outputs 0, that 0 is the output of one of its linear pieces, whose parameters can still be adjusted.

Active ReLU units become inactive at a greater rate than inactive units become active when training with dropout. But maxout units, which are always active, transition between positive and negative activations at about equal rates in each direction. This is illustrated in the above figure. They hypothesize that the high proportion of zeros and the difficulty of escaping them impairs the optimization performance of ReLU relative to maxout.

9. Conclusion

The authors propose a new activation function called maxout, which is particularly well suited to training with dropout. They empirically showcase the model averaging behaviour of maxout networks as well as their ability to train deeper networks. The state-of-the-art results on several standard benchmarks further demonstrate the usefulness of maxout networks.

10. Final Words

The paper is quite well written and easy to understand if one gives it some time. However, certain analyses (for example, the variance-of-the-gradient experiment) are not explained well and are not easily grasped.

11. References

Goodfellow, Ian, et al. “Maxout networks.” International conference on machine learning. PMLR, 2013.
