Paweł Kubik
supervisor: Paweł Wawrzyński
Knowledge distillation improves the performance of a small student neural network with a signal from a larger teacher network. This signal takes the form of a loss function that either supplements or completely replaces a task-specific loss function. While intuition could suggest that selecting the teacher with the highest performance leads to the best results, various studies have shown otherwise: a large discrepancy between the models seems to have a negative impact on knowledge distillation. To mitigate this, we explore the effect of gradually replacing smaller intermediate teachers with a larger teacher throughout training. We achieve smooth transitions by mixing the outputs of the teachers with a weighted average. We then replace hand-crafted schedules with an automatic teacher selection mechanism based on gradient descent, deriving each teacher's selection weight from a trainable parameter. Both the student and the teacher selection mechanism are trained to minimize the Kullback–Leibler divergence between the student's output and the mixed teachers' outputs. Surprisingly, the selection mechanism successively switches from the smallest to the largest teacher, even though the objective function does not impose any direct incentive for such behavior.
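The following is a minimal PyTorch sketch of the teacher-mixing idea described above, included only for illustration; the names (student, teachers, selection_logits, temperature) and the single shared optimizer are assumptions, not the exact implementation.

```python
# Illustrative sketch (assumed names and setup): one distillation step in which
# both the student and the trainable teacher-selection weights minimize the
# KL divergence between the mixed teachers' distribution and the student's.
import torch
import torch.nn.functional as F

def distillation_step(student, teachers, selection_logits, batch, optimizer,
                      temperature=1.0):
    inputs, _ = batch

    # Selection weights: a softmax over one trainable parameter per teacher.
    weights = F.softmax(selection_logits, dim=0)            # [num_teachers]

    # Teachers are fixed; mix their output distributions with a weighted average.
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(t(inputs) / temperature, dim=-1) for t in teachers]
        )                                                    # [num_teachers, B, C]
    mixed_probs = (weights[:, None, None] * teacher_probs).sum(dim=0)   # [B, C]

    # KL(mixed teachers || student), written out explicitly so that gradients
    # reach both the student's parameters and the selection weights.
    student_log_probs = F.log_softmax(student(inputs) / temperature, dim=-1)
    kl = mixed_probs * (torch.log(mixed_probs.clamp_min(1e-12)) - student_log_probs)
    loss = kl.sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()   # updates flow to the student and to selection_logits
    optimizer.step()
    return loss.item()
```

In this sketch a single optimizer covering both `student.parameters()` and `selection_logits` plays the role of the gradient-based selection mechanism; whether the selection weights use a separate optimizer or learning rate is an implementation choice not specified here.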