We consider a single-bottleneck NNGP model with normalized ReLU activation, one infinite-width hidden layer before the bottleneck, $D$ infinite-width hidden layers after the bottleneck, and bottleneck width $W$.
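Concretely, writing $k_{\mathrm{pre}}$ for the NNGP kernel of the single pre-bottleneck hidden layer and $k^{(D)}_{\mathrm{post}}$ for the kernel induced by the $D$ post-bottleneck layers (both built from the normalized ReLU activation with bias and weight variances $v_b$ and $v_w$; the notation here is ours, and we assume a Gaussian observation model with noise variance $v_n$), the generative structure can be summarized as
\begin{align*}
z_j &\sim \mathcal{GP}\bigl(0,\, k_{\mathrm{pre}}\bigr) \quad \text{independently for } j = 1, \dots, W,\\
f \mid z &\sim \mathcal{GP}\Bigl(0,\; (x, x') \mapsto k^{(D)}_{\mathrm{post}}\bigl(z(x), z(x')\bigr)\Bigr),\\
y &= f(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, v_n),
\end{align*}
where $z(x) = \bigl(z_1(x), \dots, z_W(x)\bigr) \in \mathbb{R}^W$ collects the bottleneck activations.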
We apply the bottleneck NNGP model to the %datasetname% dataset for various values of $D$ and $W$ and compare the resulting marginal likelihoods.
For each $D$ and $W$, we optimize the model hyperparameters $v_b$, $v_w$, and $v_n$ with maximum marginal likelihood.
We initialize the hyperparameters to $v_b=%vb%$, $v_w=%vw%$, and $v_n=%vn%$.
We train the model for $%iters%$ iterations of Adam with learning rate $%lr%$ and full batches.
The loss function is the negative marginal log-likelihood divided by the dataset size.
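For a fixed draw of the bottleneck activations, the model reduces to ordinary GP regression, so (assuming the Gaussian observation model above, and writing $K$ for the conditional post-bottleneck kernel matrix on the $N$ training inputs) the loss for that draw takes the standard form
\begin{equation*}
\mathcal{L} \;=\; -\frac{1}{N}\log p(\mathbf{y}\mid X)
\;=\; \frac{1}{N}\Bigl[\tfrac{1}{2}\,\mathbf{y}^{\top}(K + v_n I)^{-1}\mathbf{y}
\;+\; \tfrac{1}{2}\log\det(K + v_n I)
\;+\; \tfrac{N}{2}\log 2\pi\Bigr].
\end{equation*}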
We note that the loss function is stochastic since we sample the bottleneck layer during the forward pass;
we draw $%train_samples%$ samples at each iteration of training.
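The description above does not pin down the exact estimator; a natural choice, given $S$ independent bottleneck draws $z^{(1)}, \dots, z^{(S)}$ at an iteration, is to average the conditional marginal likelihood over the draws,
\begin{equation*}
p(\mathbf{y}\mid X) \;=\; \mathbb{E}_{z}\bigl[p(\mathbf{y}\mid X, z)\bigr]
\;\approx\; \frac{1}{S}\sum_{s=1}^{S} p\bigl(\mathbf{y}\mid X, z^{(s)}\bigr);
\end{equation*}
averaging the per-draw log-likelihoods instead would give a Jensen lower bound on the log marginal likelihood. Either way, the resulting loss is a noisy function of the hyperparameters.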
Due to this stochasticity, it can be difficult to tell if the learning rate is sufficiently small based only on the loss-vs-iterations learning curve.
Therefore, at each iteration we record the loss immediately before the parameter update and again immediately after it,
using the same draw from the bottleneck layer for both evaluations.
If the learning rate is small enough, the change between these two loss values should be negative at every iteration;
whenever the loss instead increases after the parameter update, we multiply the learning rate by $0.9$.
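A minimal sketch of this training loop and learning-rate safeguard is given below, assuming a PyTorch-style implementation; the functions \texttt{sample\_bottleneck} and \texttt{neg\_log\_marginal\_lik}, and all numerical values, are placeholders rather than the actual implementation.
\begin{verbatim}
import torch

# --- placeholders: stand-ins for the actual bottleneck-NNGP implementation ---
N, W, S, num_iters, lr0 = 100, 3, 8, 1000, 1e-2      # placeholder sizes/values
X, y = torch.randn(N, 5), torch.randn(N)
log_vb = torch.zeros((), requires_grad=True)          # hyperparameters on log scale
log_vw = torch.zeros((), requires_grad=True)
log_vn = torch.zeros((), requires_grad=True)
params = [log_vb, log_vw, log_vn]

def sample_bottleneck(num_samples):
    # placeholder for drawing bottleneck activations from the pre-bottleneck NNGP
    return torch.randn(num_samples, N, W)

def neg_log_marginal_lik(z):
    # placeholder for the (stochastic) negative log marginal likelihood
    vb, vw, vn = log_vb.exp(), log_vw.exp(), log_vn.exp()
    K = vb + vw * torch.einsum("sni,smi->nm", z, z) / z.shape[0]
    K = K + vn * torch.eye(N)
    dist = torch.distributions.MultivariateNormal(torch.zeros(N), covariance_matrix=K)
    return -dist.log_prob(y)

# --- training loop with the learning-rate safeguard described above ---
opt = torch.optim.Adam(params, lr=lr0)
for t in range(num_iters):
    z = sample_bottleneck(S)                       # fix the bottleneck draw for this iteration

    opt.zero_grad()
    loss_before = neg_log_marginal_lik(z) / N      # "red" loss, before the update
    loss_before.backward()
    opt.step()

    with torch.no_grad():
        loss_after = neg_log_marginal_lik(z) / N   # "blue" loss, same draw z

    if loss_after > loss_before:                   # update increased the loss:
        for group in opt.param_groups:             # the learning rate is too large
            group["lr"] *= 0.9
\end{verbatim}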
After training, we compute the final loss using the optimized model hyperparameters,
this time drawing $%test_samples%$ samples from the bottleneck layer.
The heat maps on the next page compare the final losses (negative marginal log-likelihood per data example) and the optimal hyperparameters over all bottleneck NNGP architectures that were tested.
``Depth'' and ``Width'' refer to the post-bottleneck depth $D$ and the bottleneck width $W$, respectively.
The remaining pages present learning curves for a sample of the architectures that were tested.
The ``Loss'' learning curve shows the loss at each iteration before the parameter update (red) and after the parameter update (blue),
where again the same draw from the bottleneck layer is used within each iteration.
The draws are IID across distinct iterations, hence the noise in the learning curves.
The ``Change in loss'' learning curve shows the post-update (blue) loss minus the pre-update (red) loss, expressed as a percentage of the pre-update loss.
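In symbols, writing $\ell^{\mathrm{red}}_t$ and $\ell^{\mathrm{blue}}_t$ for the pre- and post-update losses at iteration $t$, the plotted quantity is
\begin{equation*}
\Delta_t \;=\; 100 \times \frac{\ell^{\mathrm{blue}}_t - \ell^{\mathrm{red}}_t}{\ell^{\mathrm{red}}_t},
\end{equation*}
so negative values correspond to iterations at which the parameter update decreased the loss.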