(Received: 31-Jan.-2022, Revised: 30-Mar.-2022, Accepted: 22-Apr.-2022)
Although substantial advances have been made in training deep neural networks, one problem remains: the vanishing gradient. The very strength of deep neural networks, their depth, is also their weakness, because the vanishing gradient makes the deeper layers difficult to train thoroughly. This paper proposes "Phylogenetic Replay Learning", a learning methodology that substantially alleviates the vanishing-gradient problem. Unlike residual learning methods, it does not restrict the structure of the model. Instead, it leverages elements from neuroevolution, transfer learning and layer-by-layer training. We demonstrate that this new approach produces a better-performing model and, by calculating the Shannon entropy of the weights, show that the deeper layers are trained much more thoroughly and contain statistically significantly more information than when a model is trained in a traditional brute-force manner.
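As an illustration of the entropy measurement mentioned above, the following is a minimal sketch of how the Shannon entropy of a layer's weights could be estimated. The abstract does not state how the weights are discretized, so the histogram binning, the default of 256 bins and the function name `weight_entropy` are assumptions made here for illustration only.

```python
import numpy as np


def weight_entropy(weights, bins=256):
    """Estimate the Shannon entropy (in bits) of a set of weights.

    Assumption: the weights are discretized with an equal-width
    histogram; the bin count is a hypothetical choice, not taken
    from the paper.
    """
    w = np.asarray(weights, dtype=np.float64).ravel()
    counts, _ = np.histogram(w, bins=bins)
    # Keep only occupied bins so that log2 is always defined.
    p = counts[counts > 0] / counts.sum()
    # H = sum_i p_i * log2(1 / p_i)
    return float((p * np.log2(1.0 / p)).sum())


# Hypothetical stand-ins for two layers: one with widely spread
# weights and one whose weights are all identical.
rng = np.random.default_rng(0)
spread_layer = rng.normal(0.0, 1.0, size=10_000)
constant_layer = np.full(10_000, 0.01)

print(weight_entropy(spread_layer))
print(weight_entropy(constant_layer))
```

Under this estimator, a layer whose weights all fall into a single bin has zero entropy, while a layer whose weights are spread over many bins has higher entropy. The estimate depends on the bin count, so layers should only be compared under the same binning.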
