Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA.
Center for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China.
Abstract
Improving the performance of deep learning models and reducing their training times are ongoing challenges in deep neural networks. Several approaches have been proposed to address these challenges, one of which is to increase the depth of the neural networks. Such deeper networks not only take longer to train but also suffer from the vanishing gradient problem during training. In this work, we propose a gradient amplification approach for training deep learning models to prevent vanishing gradients, and we develop a training strategy that enables or disables gradient amplification across epochs trained with different learning rates. We perform experiments on VGG-19 and Resnet models (Resnet-18 and Resnet-34) and study the impact of the amplification parameters on these models in detail. Our proposed approach improves the performance of these deep learning models even at higher learning rates, thereby allowing these models to achieve higher performance with reduced training time.
References
[1] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, Deep residual learning for image recognition, in Proc. 2016 IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770-778.
G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Proc. Mag., vol. 29, no. 6, pp. 82-97, 2012.
G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, A survey on deep learning in medical image analysis, Med. Image Anal., vol. 42, pp. 60-88, 2017.
J. D. Wang, Y. Q. Chen, S. J. Hao, X. H. Peng, and L. S. Hu, Deep learning for sensor-based activity recognition: A survey, Pattern Recognit. Lett., vol. 119, pp. 3-11, 2019.
G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, Deep networks with stochastic depth, in Proc. 14th European Conf. on Computer Vision, Amsterdam, the Netherlands, 2016, pp. 646-661.
S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, Gradient flow in recurrent nets: The difficulty of learning long-term dependencies, in A Field Guide to Dynamical Recurrent Networks, J. F. Kolen and S. C. Kremer, eds. Wiley-IEEE Press, 2001.
B. Hanin, Which neural net architectures give rise to exploding and vanishing gradients? in Proc. Advances in Neural Information Processing Systems 31, Montréal, Canada, 2018, pp. 582-591.
[14] J. Schmidhuber, Learning complex, extended sequences using the principle of history compression, Neural Comput., vol. 4, no. 2, pp. 234-242, 1992.
V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in Proc. 27th Int. Conf. on Machine Learning, Haifa, Israel, 2010, pp. 807-814.
[16] X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in Proc. 14th Int. Conf. on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 2011, pp. 315-323.
[17] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167, 2015.
[18] Y. Yang and H. Wang, Multi-view clustering: A survey, Big Data Mining and Analytics, vol. 1, no. 2, pp. 83-107, 2018.
S. Kumar and M. Singh, A novel clustering technique for efficient clustering of big data in Hadoop ecosystem, Big Data Mining and Analytics, vol. 2, no. 4, pp. 240-247, 2019.
C. Darken, J. Chang, and J. Moody, Learning rate schedules for faster stochastic gradient search, in Proc. Neural Networks for Signal Processing II: Proc. of the 1992 IEEE Workshop, Helsingoer, Denmark, 1992, pp. 3-12.
[22] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., vol. 12, pp. 2121-2159, 2011.
T. Schaul, S. X. Zhang, and Y. LeCun, No more pesky learning rates, in Proc. 30th Int. Conf. on Machine Learning, Atlanta, GA, USA, 2013, pp. 343-351.
[28] S. L. Smith, P. J. Kindermans, C. Ying, and Q. V. Le, Don't decay the learning rate, increase the batch size, arXiv preprint arXiv:1711.00489, 2017.
[29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. M. Lin, A. Desmaison, L. Antiga, and A. Lerer, Automatic differentiation in PyTorch, in Proc. 31st Conf. on Neural Information Processing Systems, Long Beach, CA, USA, 2017.
[30] H. Liu, J. Li, Y. Q. Zhang, and Y. Pan, An adaptive genetic fuzzy multi-path routing protocol for wireless ad-hoc networks, in Proc. 6th Int. Conf. on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing and 1st ACIS Int. Workshop on Self-Assembling Wireless Network, Towson, MD, USA, 2005, pp. 468-475.
4.1 Setup
Our experiments are performed on the CIFAR-10 dataset, which consists of 60 000 color images of 10 classes, with 6000 images per class, each of 32 × 32 resolution. We implement our algorithms using Python and PyTorch [29] libraries. In our experiments, we employ several standard deep learning models and train them for 150 epochs. The number of epochs and the learning rate assigned to each epoch range can be chosen as desired; in this work, the first 100 epochs use a learning rate of 0.1 and the next 50 epochs use a learning rate of 0.01 (as shown in Fig. 1). The first 50 epochs are trained with a learning rate of 0.1 without gradient amplification, because during the first few epochs the model is in a transient phase and the network parameters undergo significant changes. This initial transient phase can span any number of epochs; in this work, we set it to 50 epochs. The next 50 epochs use the same learning rate of 0.1 but with gradient amplification applied during backpropagation (as shown in Fig. 2a). After identifying the best amplification parameters for epochs 51-100, we keep those parameters for those epochs, extend amplification to epochs 101-130 to identify the best parameters for that range, and train with no amplification for epochs 131-150, as shown in Fig. 2b. There are three important parameters when applying the gradient amplification method: the type of layers to be employed for amplification, the ratio of layers to be chosen from the selected layers for amplification, and the gradient amplification factor. The effects of varying each of these parameters are explained in detail in the subsections below. We run our experiments on Resnet and VGG models with different architectures.
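For concreteness, the epoch and learning-rate schedule described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the model choice (torchvision's ResNet-18), the batch size, the plain SGD optimizer, and the normalization statistics are assumptions, and the amplification toggle is left as a placeholder (see the Phase 3 sketch below).

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# CIFAR-10: 60 000 color images (32 x 32) in 10 classes. The normalization
# statistics are the commonly used CIFAR-10 values, not taken from the paper.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# Any of the standard architectures used in the paper could be plugged in here.
model = torchvision.models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()

def lr_for_epoch(epoch):
    # Epochs 1-100: lr = 0.1; epochs 101-150: lr = 0.01 (Fig. 1).
    return 0.1 if epoch <= 100 else 0.01

def amplify_during(epoch):
    # Step 1 schedule: no amplification during the transient epochs 1-50,
    # amplification for epochs 51-100, none afterwards (Fig. 2a).
    return 51 <= epoch <= 100

for epoch in range(1, 151):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_for_epoch(epoch))
    # Enable or disable amplification here, e.g., with the hypothetical
    # hook-based helper sketched in Phase 3 below.
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```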
Fig. 1 Experiment setting showing the number of epochs and the learning rates corresponding to those epochs for training all the models.
Fig. 2 Two-step training process carried out during performance analysis of the deep learning models. Experiments are first executed with the training schedule shown in Step 1. For Step 2, the ratio parameters for gradient amplification that give better performance in Step 1 are used for epochs 51-100, and experiments are performed by varying the ratio parameters for epochs 101-130, with no amplification from epochs 131-150. These settings show the number of epochs and the learning rates corresponding to those epochs while training the models.
Here we perform a three-phase analysis while evaluating our models.
Phase 1 In this phase, we choose the type of layers to be considered for amplification. There are several types of layers at which amplification can be applied, such as activation function layers, pooling layers, batch normalization layers, and convolution layers. Convolution layers apply kernel functions and extract important features from the data, and pooling layers accumulate features over a grid using several strategies, such as taking maximum values, minimum values, averages, fractional pooling, and so on. Since network parameter tuning during training can be sensitive to these values, in this work we do not perform amplification on these layers. Batch normalization layers normalize data over a batch of inputs, and activation function layers transform data non-linearly before forwarding it to the succeeding layers. In our work, we perform gradient amplification on batch normalization and activation function layers. ReLU is the activation function used in the Resnet and VGG models. From these two types of layers, either one or both can be considered for amplification. Once the type of layers is selected, we tag all layers of the selected type as the candidate group, and move to the next phase to determine the final set of amplification layers.
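A minimal sketch of this layer-type selection is shown below, assuming a torchvision ResNet-18; the helper name candidate_layers is ours, not the authors'.

```python
import torch.nn as nn
import torchvision

def candidate_layers(model, layer_types=(nn.BatchNorm2d, nn.ReLU)):
    # Collect every module of the selected type(s); passing a single class
    # restricts the candidate group to that layer type only.
    return [m for m in model.modules() if isinstance(m, layer_types)]

model = torchvision.models.resnet18(num_classes=10)
both = candidate_layers(model)                        # BatchNorm + ReLU layers
bn_only = candidate_layers(model, (nn.BatchNorm2d,))  # BatchNorm layers only
print(len(both), len(bn_only))
```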
Phase 2 Once the candidate group of layers is determined, the next task is to find the subset of layers that gives better performance. This requires identifying the subset size and selecting that many layers from the candidate group. Since the size is unknown, experiments are performed by setting the size to be a ratio of the size of the candidate group, with the ratio chosen from a predefined set of values between 0 and 1. When the ratio is 0, no layers are chosen and gradient amplification is not performed; when the ratio is 1, all layers in the candidate group are considered for amplification. The value 0 is included to verify whether the model performs better without gradient amplification or vice versa. Random selection is employed to pick the amplification subset of layers from the candidate group. We perform experiments with all these sizes and select the model with the best performance.
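The random subset selection of Phase 2 can be sketched as follows; the function name and the optional seeding argument are ours, and the set of ratio values actually tried is not reproduced here.

```python
import random

def select_amplification_layers(layers, ratio, seed=None):
    # Pick round(ratio * |layers|) layers uniformly at random.
    # ratio = 0 disables amplification; ratio = 1 amplifies every candidate.
    if seed is not None:
        random.seed(seed)
    k = round(ratio * len(layers))
    return random.sample(layers, k)

# Example: amplify a randomly chosen 30% of the candidate layers.
# amp_layers = select_amplification_layers(both, ratio=0.3)
```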
Phase 3 In this phase, the set of layers on which gradient amplification is applied is known. The only parameter left to explore is the amplification factor, i.e., the factor by which the gradients are amplified. To reduce the computational complexity of testing all combinations of the layer set, the ratio, and the amplification factor, experiments are first performed on all combinations of the layer set and the ratio (i.e., up to Phase 2); the best models are then chosen from Phase 2 and analyzed by varying the amplification factor. The factor is first varied from 1 to 10 to analyze the impact of amplification and then fine-tuned by varying it in smaller steps from 1 to 3 to determine the value that works best during training.
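One way to realize the amplification itself is with PyTorch backward hooks that scale the gradients flowing through the selected layers during backpropagation. The sketch below illustrates this idea under that assumption; it is not necessarily how the authors implemented it, and the helper name is ours.

```python
def enable_amplification(amp_layers, factor):
    # Register a full backward hook on each selected layer that multiplies the
    # gradient w.r.t. the layer's input by `factor` during backpropagation.
    handles = []
    for layer in amp_layers:
        def hook(module, grad_input, grad_output, factor=factor):
            return tuple(g * factor if g is not None else g for g in grad_input)
        handles.append(layer.register_full_backward_hook(hook))
    return handles

# Enable for epochs 51-100, e.g., handles = enable_amplification(amp_layers, 2.0),
# and disable afterwards with: for h in handles: h.remove()
# Note: full backward hooks do not support in-place modules, so ReLU layers may
# need to be constructed with inplace=False for this sketch to apply.
```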
Fig. 3 Overview of all the experiments performed by varying different parameters of gradient amplification.
Fig. 4 Performance of the models after training with the Step 2 strategy with gradient amplification (red) applied from epochs 51-100, compared to the mean accuracies of the original models (blue) with no gradient amplification. In each plot, the blue horizontal line shows the average testing accuracy of the original models without gradient amplification, and "Amp testing" refers to the testing accuracies of models with gradient amplification. The type of layer is shown in each subplot; the horizontal and vertical axes correspond to the ratio of amplified layers and the accuracies, respectively. These plots correspond to the parameter setting in which a ratio of 0.3 of the layers is amplified for epochs 51-100; the other parameter settings show similar performance patterns.
Fig. 5 Performance comparison of amplified models (red) as the amplification factor is varied from 1 to 10 (horizontal axis) vs. original models (blue).
Fig. 6 Performance comparison of amplified models (red) as the amplification factor is varied in small steps from 1 to 3 (horizontal axis) vs. original models (blue).
Fig. 7 Performance of the best models with gradient amplification over 150 epochs compared to the original models with no gradient amplification. Original training (gray) and testing (blue) accuracies, including their mean accuracies, are plotted along with amplified training (green) and testing (red) accuracies. These plots demonstrate that the models do not overfit when trained with amplification.
Table 1 Accuracy comparison of models with gradient amplification vs. mean accuracies of the corresponding original models across 5 runs.