BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
tl;dr
[ONE loss] + [ONE hyperparameter] + [NO external data] = GREAT PERFORMANCE
comes with HuggingFace-compatible weights
take the original BERT and randomly replace some of its layers with new, smaller ones during training. the probability of replacing a module increases over time, so by the end of training only the compact modules remain, giving a small model (see the sketch below the summary).
the approach uses only one loss function and one hyper-parameter, sparing humans the effort of hyper-parameter tuning.
also, it outperforms existing knowledge distillation approaches on the GLUE benchmark, offering a new perspective on model compression.
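below is a minimal pytorch-style sketch of the idea, not the authors' implementation: the class/function names, the linear schedule constants, and the training-loop snippet are illustrative assumptions.

```python
import random
import torch.nn as nn


class TheseusBlock(nn.Module):
    """One original (predecessor) module paired with a smaller successor.

    At each forward pass during training, the successor is used with
    probability p and the frozen predecessor otherwise; p is raised over
    time so the network gradually becomes successor-only.
    (Illustrative sketch, not the authors' code.)
    """

    def __init__(self, predecessor: nn.Module, successor: nn.Module):
        super().__init__()
        self.predecessor = predecessor
        self.successor = successor
        for param in self.predecessor.parameters():
            param.requires_grad = False  # only successor modules are trained
        self.p = 0.0  # current replacement probability

    def forward(self, hidden_states):
        if self.training and random.random() >= self.p:
            return self.predecessor(hidden_states)
        return self.successor(hidden_states)


def replacement_rate(step: int, base_rate: float = 0.3, slope: float = 1e-4) -> float:
    """Linear curriculum: start at base_rate and grow toward 1.0 over training."""
    return min(1.0, base_rate + slope * step)


# usage sketch: raise p for every block at each training step
# for step, batch in enumerate(loader):
#     for block in theseus_blocks:
#         block.p = replacement_rate(step)
#     loss = model(**batch).loss   # a single task loss, no extra distillation loss
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```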
paper: https://arxiv.org/abs/2002.02925
github: https://github.com/JetRunner/BERT-of-Theseus
#nlp #compressing #knowledge #distillation #bert