Neural Network Pruning with Gradient-Based Importance Estimation

Paweł Kubik

supervisor: Paweł Wawrzyński



Pruning is a neural network compression technique that removes weights from a network to reduce its size and computational cost. In a typical compression scheme one: 1) trains the original model; 2) prunes the model to meet performance constraints; 3) fine-tunes the pruned model to recover some of the lost prediction quality. Pruning methods limit the loss of prediction quality by removing the least important parts of the network. To assess this importance, we associate parts of the network with binary gate variables that control their inclusion in the final network and reuse the original model's training procedure to find optimal values for the gates. We enable gradient flow by relaxing the binary gate variables with the Gumbel-Softmax distribution. We sample a new set of gate values on each training step and back-propagate gradients to the distribution parameters with the reparametrization trick. We propose an additional loss function that ensures the pruned network meets the performance constraint by comparing the expected number of floating-point operations with a value specified as a hyperparameter. The constraint is global for the whole network, which means that, apart from selecting the most important weights within each layer, we also search for an efficient model structure, i.e., the size of each layer of the network. To further improve the final model, we apply knowledge distillation in the fine-tuning phase, with the original network used as the teacher.
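As an illustration of the two core mechanisms described above, the following is a minimal PyTorch sketch of relaxed gate sampling and an expected-FLOPs penalty. The function names, the per-unit FLOPs bookkeeping, and the exact squared-hinge form of the penalty are assumptions made for this sketch, not the thesis implementation.

```python
import torch


def sample_gates(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Draw relaxed binary gates in (0, 1) with the binary Gumbel-Softmax.

    Logistic noise (the difference of two Gumbel samples) is added to the
    gate logits, so the sample is a differentiable function of `logits`
    and gradients reach the distribution parameters via reparametrization.
    """
    u = torch.rand_like(logits)
    noise = torch.log(u) - torch.log1p(-u)  # Logistic(0, 1) sample
    return torch.sigmoid((logits + noise) / tau)


def flops_penalty(logits: torch.Tensor,
                  flops_per_unit: torch.Tensor,
                  target_flops: float) -> torch.Tensor:
    """Penalize the expected FLOPs exceeding a global budget.

    The expectation uses each gate's keep probability; `target_flops` is
    the hyperparameter the abstract refers to. The squared hinge is one
    possible choice of comparison, used here purely for illustration.
    """
    keep_prob = torch.sigmoid(logits)  # P(gate = 1) for each unit
    expected_flops = (keep_prob * flops_per_unit).sum()
    return (expected_flops / target_flops - 1.0).clamp(min=0.0) ** 2
```

In such a setup, the sampled gates would multiply the outputs of the units they control during the forward pass, and the penalty would be added to the task loss with a weighting coefficient; since the budget is global, the optimizer can trade capacity between layers rather than pruning each layer by a fixed ratio.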