Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA.
Center for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China.
Abstract
Improving the performance of deep learning models and reducing their training times are ongoing challenges in deep neural networks. Several approaches have been proposed to address these challenges, one of which is to increase the depth of the neural networks. Such deeper networks not only take longer to train but also suffer from the vanishing gradient problem during training. In this work, we propose a gradient amplification approach for training deep learning models to prevent vanishing gradients, and we develop a training strategy that enables or disables gradient amplification across epochs trained with different learning rates. We perform experiments on VGG-19 and Resnet models (Resnet-18 and Resnet-34) and study the impact of the amplification parameters on these models in detail. Our proposed approach improves the performance of these deep learning models even at higher learning rates, thereby allowing these models to achieve higher performance with reduced training time.
References
[1] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, Deep residual learning for image recognition, in Proc. 2016 IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770-778.
G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Proc. Mag., vol. 29, no. 6, pp. 82-97, 2012.
G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, A survey on deep learning in medical image analysis, Med. Image Anal., vol. 42, pp. 60-88, 2017.
J. D. Wang, Y. Q. Chen, S. J. Hao, X. H. Peng, and L. S. Hu, Deep learning for sensor-based activity recognition: A survey, Pattern Recognit. Lett., vol. 119, pp. 3-11, 2019.
G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, Deep networks with stochastic depth, in Proc. 14th European Conf. on Computer Vision, Amsterdam, the Netherlands, 2016, pp. 646-661.
S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, Gradient flow in recurrent nets: The difficulty of learning long-term dependencies, in A Field Guide to Dynamical Recurrent Networks, J. F. Kolen and S. C. Kremer, eds. Wiley-IEEE Press, 2001.
B. Hanin, Which neural net architectures give rise to exploding and vanishing gradients? in Proc. Advances in Neural Information Processing Systems 31, Montréal, Canada, 2018, pp. 582-591.
[14] J. Schmidhuber, Learning complex, extended sequences using the principle of history compression, Neural Comput., vol. 4, no. 2, pp. 234-242, 1992.
V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in Proc. 27th Int. Conf. on Machine Learning, Haifa, Israel, 2010, pp. 807-814.
[16] X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in Proc. 14th Int. Conf. on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 2011, pp. 315-323.
[17] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167, 2015.
[18] Y. Yang and H. Wang, Multi-view clustering: A survey, Big Data Mining and Analytics, vol. 1, no. 2, pp. 83-107, 2018.
S. Kumar and M. Singh, A novel clustering technique for efficient clustering of big data in Hadoop ecosystem, Big Data Mining and Analytics, vol. 2, no. 4, pp. 240-247, 2019.
C. Darken, J. Chang, and J. Moody, Learning rate schedules for faster stochastic gradient search, in Proc. Neural Networks for Signal Processing II: Proc. of the 1992 IEEE Workshop, Helsingoer, Denmark, 1992, pp. 3-12.
[22] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., vol. 12, pp. 2121-2159, 2011.
T. Schaul, S. X. Zhang, and Y. LeCun, No more pesky learning rates, in Proc. 30th Int. Conf. on Machine Learning, Atlanta, GA, USA, 2013, pp. 343-351.
[28] S. L. Smith, P. J. Kindermans, C. Ying, and Q. V. Le, Don't decay the learning rate, increase the batch size, arXiv preprint arXiv:1711.00489, 2017.
[29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. M. Lin, A. Desmaison, L. Antiga, and A. Lerer, Automatic differentiation in PyTorch, in Proc. 31st Conf. on Neural Information Processing Systems, Long Beach, CA, USA, 2017.
[30] H. Liu, J. Li, Y. Q. Zhang, and Y. Pan, An adaptive genetic fuzzy multi-path routing protocol for wireless ad-hoc networks, in Proc. 6th Int. Conf. on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing and 1st ACIS Int. Workshop on Self-Assembling Wireless Network, Towson, MD, USA, 2005, pp. 468-475.
4.1 Setup
Our experiments are performed on the CIFAR-10 dataset, which consists of 60 000 color images of 10 classes, with 6000 images per class, each of 32 × 32 resolution. We implement our algorithms using Python and PyTorch [29] libraries. In our experiments, we employ several standard deep learning models and train them for 150 epochs. The number of epochs and the learning rate assigned to each epoch range can be chosen as desired; in this work, the first 100 epochs use a learning rate of 0.1 and the next 50 epochs use a learning rate of 0.01 (as shown in Fig. 1). The first 50 epochs are trained with a learning rate of 0.1 without gradient amplification, because during the first few epochs the model is in a transient phase and the network parameters undergo significant changes. This initial transient phase can span any number of epochs; in this work, we set it to 50 epochs. The next 50 epochs use the same learning rate of 0.1 but with gradient amplification applied during backpropagation (as shown in Fig. 2a). After identifying the best amplification parameters for epochs 51-100, we keep those parameters for those epochs, extend amplification to epochs 101-130 to identify the best parameters for that range, and train with no amplification for epochs 131-150, as shown in Fig. 2b. There are three important parameters when applying the gradient amplification method: the type of layers to be employed for amplification, the ratio of layers to be chosen from the selected layers for amplification, and the gradient amplification factor. The effects of varying each of these parameters are explained in detail in the subsections below. We run our experiments on Resnet and VGG models with different architectures.
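For concreteness, the epoch and learning-rate schedule described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the model choice (torchvision's ResNet-18), the batch size, the plain SGD optimizer, and the normalization statistics are assumptions, and the amplification toggle is left as a placeholder (see the Phase 3 sketch below).

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# CIFAR-10: 60 000 color images (32 x 32) in 10 classes. The normalization
# statistics are the commonly used CIFAR-10 values, not taken from the paper.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# Any of the standard architectures used in the paper could be plugged in here.
model = torchvision.models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()

def lr_for_epoch(epoch):
    # Epochs 1-100: lr = 0.1; epochs 101-150: lr = 0.01 (Fig. 1).
    return 0.1 if epoch <= 100 else 0.01

def amplify_during(epoch):
    # Step 1 schedule: no amplification during the transient epochs 1-50,
    # amplification for epochs 51-100, none afterwards (Fig. 2a).
    return 51 <= epoch <= 100

for epoch in range(1, 151):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_for_epoch(epoch))
    # Enable or disable amplification here, e.g., with the hypothetical
    # hook-based helper sketched in Phase 3 below.
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```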
Fig. 1 Experiment setting showing the number of epochs and the learning rates corresponding to those epochs for training all the models.
Fig. 2 Two-step training process carried out during performance analysis of the deep learning models. Experiments are first executed with the training schedule shown in Step 1. For Step 2, the ratio parameters for gradient amplification that give better performance in Step 1 are used for epochs 51-100, and experiments are performed by varying the ratio parameters for epochs 101-130, with no amplification from epochs 131-150. These settings show the number of epochs and the learning rates corresponding to those epochs while training the models.
Here we perform a three-phase analysis while evaluating our models.
Phase 1 In this phase, we choose the type of layers to be considered for amplification. There are several types of layers at which amplification can be applied, such as activation function layers, pooling layers, batch normalization layers, and convolution layers. Convolution layers apply kernel functions and extract important features from the data, and pooling layers accumulate features over a grid using several strategies, such as taking maximum values, minimum values, averages, fractional pooling, and so on. Since network parameter tuning during training can be sensitive to these values, in this work we do not perform amplification on these layers. Batch normalization layers normalize data over a batch of inputs, and activation function layers transform data non-linearly before forwarding it to the succeeding layers. In our work, we perform gradient amplification on batch normalization and activation function layers. ReLU is the activation function used in the Resnet and VGG models. From these two types of layers, either one or both can be considered for amplification. Once the type of layers is selected, we tag all layers of the selected type as the candidate group, and move to the next phase to determine the final set of amplification layers.
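A minimal sketch of this layer-type selection is shown below, assuming a torchvision ResNet-18; the helper name candidate_layers is ours, not the authors'.

```python
import torch.nn as nn
import torchvision

def candidate_layers(model, layer_types=(nn.BatchNorm2d, nn.ReLU)):
    # Collect every module of the selected type(s); passing a single class
    # restricts the candidate group to that layer type only.
    return [m for m in model.modules() if isinstance(m, layer_types)]

model = torchvision.models.resnet18(num_classes=10)
both = candidate_layers(model)                        # BatchNorm + ReLU layers
bn_only = candidate_layers(model, (nn.BatchNorm2d,))  # BatchNorm layers only
print(len(both), len(bn_only))
```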
Phase 2 Once the candidate group of layers is determined, the next task is to find the subset of layers that gives better performance. This requires identifying the subset size and selecting that many layers from the candidate group. Since the size is unknown, experiments are performed by setting the size to be a ratio of the size of the candidate group, with the ratio chosen from a predefined set of values between 0 and 1. When the ratio is 0, no layers are chosen and gradient amplification is not performed; when the ratio is 1, all layers in the candidate group are considered for amplification. The value 0 is included to verify whether the model performs better without gradient amplification or vice versa. Random selection is employed to pick the amplification subset of layers from the candidate group. We perform experiments with all these sizes and select the model with the best performance.
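The random subset selection of Phase 2 can be sketched as follows; the function name and the optional seeding argument are ours, and the set of ratio values actually tried is not reproduced here.

```python
import random

def select_amplification_layers(layers, ratio, seed=None):
    # Pick round(ratio * |layers|) layers uniformly at random.
    # ratio = 0 disables amplification; ratio = 1 amplifies every candidate.
    if seed is not None:
        random.seed(seed)
    k = round(ratio * len(layers))
    return random.sample(layers, k)

# Example: amplify a randomly chosen 30% of the candidate layers.
# amp_layers = select_amplification_layers(both, ratio=0.3)
```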
Phase 3 In this phase, the set of layers on which gradient amplification is applied is known. The only parameter left to explore is the amplification factor, i.e., the factor by which the gradients are amplified. To reduce the computational complexity of testing all combinations of the layer set, the ratio, and the amplification factor, experiments are first performed on all combinations of the layer set and the ratio (i.e., up to Phase 2); the best models are then chosen from Phase 2 and analyzed by varying the amplification factor. The factor is first varied from 1 to 10 to analyze the impact of amplification and then fine-tuned by varying it in smaller steps from 1 to 3 to determine the value that works best during training.
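One way to realize the amplification itself is with PyTorch backward hooks that scale the gradients flowing through the selected layers during backpropagation. The sketch below illustrates this idea under that assumption; it is not necessarily how the authors implemented it, and the helper name is ours.

```python
def enable_amplification(amp_layers, factor):
    # Register a full backward hook on each selected layer that multiplies the
    # gradient w.r.t. the layer's input by `factor` during backpropagation.
    handles = []
    for layer in amp_layers:
        def hook(module, grad_input, grad_output, factor=factor):
            return tuple(g * factor if g is not None else g for g in grad_input)
        handles.append(layer.register_full_backward_hook(hook))
    return handles

# Enable for epochs 51-100, e.g., handles = enable_amplification(amp_layers, 2.0),
# and disable afterwards with: for h in handles: h.remove()
# Note: full backward hooks do not support in-place modules, so ReLU layers may
# need to be constructed with inplace=False for this sketch to apply.
```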
Fig. 3 Overview of all the experiments performed by varying different parameters of gradient amplification.
Fig. 4 Performance of the models after training with the Step 2 strategy with gradient amplification (red) applied from epochs 51-100, compared to the mean accuracies of the original models (blue) with no gradient amplification. In each plot, the blue horizontal line shows the average testing accuracy of the original models without gradient amplification, and "Amp testing" refers to the testing accuracies of models with gradient amplification. The type of layer is shown in each subplot; the horizontal and vertical axes correspond to the ratio of amplified layers and the accuracies, respectively. These plots correspond to the parameter setting in which a ratio of 0.3 of the layers is amplified for epochs 51-100; the other parameter settings show similar performance patterns.
Fig. 5 Performance comparison of amplified models (red) as the amplification factor is varied from 1 to 10 (horizontal axis) vs. original models (blue).
Fig. 6 Performance comparison of amplified models (red) as the amplification factor is varied in small steps from 1 to 3 (horizontal axis) vs. original models (blue).
Fig. 7 Performance of the best models with gradient amplification over 150 epochs compared to the original models with no gradient amplification. Original training (gray) and testing (blue) accuracies, including their mean accuracies, are plotted along with amplified training (green) and testing (red) accuracies. These plots demonstrate that the models do not overfit when trained with amplification.
Table 1 Accuracy comparison of models with gradient amplification vs. mean accuracies of the corresponding original models across 5 runs.