TL;DR: Nearly isometric DNN initializations imply low parameter-space curvature and a lower condition number, but that's not always an advantage.
Abstract: Recently, mean-field theory has been used successfully to analyze the properties of wide, random neural networks. It gave rise to a prescriptive theory for initializing feed-forward neural networks with orthogonal weights, which ensures that both the forward-propagated activations and the backpropagated gradients are near \(\ell_2\) isometries and, as a consequence, that training is orders of magnitude faster. Despite strong empirical performance, the mechanisms by which critical initializations confer an advantage in the optimization of deep neural networks are poorly understood. Here we show a novel connection between the maximum curvature of the optimization landscape (gradient smoothness), as measured by the Fisher information matrix (FIM), and the spectral radius of the input-output Jacobian, which partially explains why more isometric networks can train much faster. Furthermore, given that orthogonal weights are necessary to ensure that gradient norms are approximately preserved at initialization, we experimentally investigate the benefits of maintaining orthogonality throughout training, and we conclude that manifold optimization of the weights performs well regardless of the smoothness of the gradients. Moreover, we observe a surprising yet robust property of highly isometric initializations: even though such networks have a lower FIM condition number \emph{at initialization}, and therefore by analogy with convex functions should be easier to optimize, they prove experimentally to be much harder to train with stochastic gradient descent. We conjecture that the FIM condition number plays a non-trivial role in the optimization.
Code: https://github.com/PiotrSokol/info-geom
Keywords: Fisher, mean-field, deep learning
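A minimal NumPy sketch of the isometry claim in the abstract (an illustration only, not the authors' code from the repository above): it compares the singular-value spread and condition number of the input-output Jacobian of a deep tanh network under orthogonal versus i.i.d. Gaussian initialization. The width, depth, and gain are arbitrary choices for the example; the critical gain used in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal(n, rng):
    # Haar-distributed orthogonal matrix via QR of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def jacobian_singular_values(weights, x):
    # Input-output Jacobian of a tanh MLP, accumulated by the chain rule:
    # d h_l / d h_{l-1} = diag(tanh'(W_l h_{l-1})) W_l
    J = np.eye(len(x))
    h = x
    for W in weights:
        pre = W @ h
        J = np.diag(1.0 - np.tanh(pre) ** 2) @ W @ J
        h = np.tanh(pre)
    return np.linalg.svd(J, compute_uv=False)

width, depth, gain = 256, 50, 1.0  # illustrative values; the critical gain for
                                   # tanh depends on the mean-field fixed point
x = rng.standard_normal(width) / np.sqrt(width)

ortho = [gain * orthogonal(width, rng) for _ in range(depth)]
gauss = [gain * rng.standard_normal((width, width)) / np.sqrt(width)
         for _ in range(depth)]

for name, ws in (("orthogonal", ortho), ("gaussian", gauss)):
    s = jacobian_singular_values(ws, x)
    print(f"{name:>10}: max sv {s.max():.3f}  min sv {s.min():.3e}  "
          f"condition number {s.max() / s.min():.3e}")
```

Under this toy setup, the orthogonal network's Jacobian singular values stay tightly clustered (condition number near 1), while the Gaussian-initialized network's spread over many orders of magnitude, which is the near-isometry contrast the abstract describes.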