MobileBERT: BERT for Resource-Limited Devices

guest_blog 28 Jul, 2020 • 11 min read



The MobileBERT architectures


Architecture visualization of transformer blocks within (a) BERT, (b) MobileBERT teacher, and (c) MobileBERT student. The green trapezoids marked with “Linear” are referred to as bottlenecks. Source
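The bottleneck idea in the figure can be sketched in a few lines: each student block projects the wide inter-block features down to a narrow intra-block width, runs attention and feed-forward computation at that narrow width, and projects back up. A minimal numpy sketch, assuming the paper's student sizes (512 inter-block, 128 intra-block); the projections here are random placeholders, not trained layers:

```python
import numpy as np

rng = np.random.default_rng(0)

inter_size, intra_size, seq_len = 512, 128, 16  # sizes from the student config

W_down = rng.normal(0, 0.02, (inter_size, intra_size))  # input "Linear" bottleneck
W_up = rng.normal(0, 0.02, (intra_size, inter_size))    # output "Linear" bottleneck

h = rng.normal(size=(seq_len, inter_size))  # inter-block feature map

narrow = h @ W_down       # (16, 128): attention/FFN operate at this cheap width
restored = narrow @ W_up  # (16, 512): back to the inter-block width

print(narrow.shape, restored.shape)  # (16, 128) (16, 512)
```

The point of the two linear maps is that blocks exchange 512-dimensional features (so feature maps remain comparable to the teacher's for distillation) while the expensive per-block computation happens at 128 dimensions.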



Multi-Head Attention


Stacked FFN
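Because the bottleneck narrows the width at which the feed-forward sub-layer operates, the student stacks several FFNs per block to restore the parameter balance between attention and feed-forward computation. A hedged sketch, assuming four stacked FFNs with residual connections (the multiplier and sizes are illustrative):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # one feed-forward sub-layer: expand, ReLU, project back down
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, num_stacked = 128, 512, 4  # illustrative sizes

x = rng.normal(size=(16, d_model))
for _ in range(num_stacked):  # stacked FFNs, each with a residual connection
    params = (rng.normal(0, 0.02, (d_model, d_ff)), np.zeros(d_ff),
              rng.normal(0, 0.02, (d_ff, d_model)), np.zeros(d_model))
    x = x + ffn(x, *params)

print(x.shape)  # (16, 128)
```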

Operational optimizations


NoNorm equation, which replaces the layer normalization operation in the transformer blocks. The dot denotes the Hadamard product: element-wise multiplication between the two vectors.
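A minimal numpy sketch of the replacement, shown next to standard layer normalization for contrast. NoNorm keeps only the learned element-wise scale and shift, dropping the mean/variance statistics, which makes it much cheaper at inference time:

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-12):
    # standard layer normalization: normalize over the feature dimension,
    # then apply the learned scale and shift
    mu = h.mean(-1, keepdims=True)
    var = h.var(-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

def no_norm(h, gamma, beta):
    # NoNorm(h) = gamma ∘ h + beta: element-wise scale and shift only,
    # no per-example statistics
    return gamma * h + beta

h = np.array([[1.0, 2.0, 3.0]])
gamma, beta = np.ones(3), np.zeros(3)
print(no_norm(h, gamma, beta))   # with identity parameters, h is unchanged
```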

The motivation behind the teacher and student sizes

Proposed knowledge distillation objectives


Feature map transfer objective function. T is the sequence length, N the feature map size, and l the layer index.
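Using the caption's symbols, the feature map transfer loss can be written out as follows (a reconstruction from the definitions above; tr and st denote teacher and student):

```latex
\mathcal{L}_{FMT}^{\ell} \;=\; \frac{1}{TN} \sum_{t=1}^{T} \sum_{n=1}^{N}
\left( H_{t,\ell,n}^{tr} - H_{t,\ell,n}^{st} \right)^{2}
```

That is, a mean squared error between the teacher's and student's feature maps at each layer, averaged over sequence positions and feature dimensions.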


Attention map transfer objective function. T is the sequence length, A the number of attention heads, and l the layer index.
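In the same notation, the attention map transfer loss compares the per-head attention distributions of teacher and student (a reconstruction from the caption's definitions; tr and st denote teacher and student):

```latex
\mathcal{L}_{AT}^{\ell} \;=\; \frac{1}{TA} \sum_{t=1}^{T} \sum_{a=1}^{A}
D_{KL}\!\left( a_{t,\ell,a}^{tr} \,\big\|\, a_{t,\ell,a}^{st} \right)
```

Since attention maps are probability distributions over positions, KL divergence is the natural distance here, averaged over sequence positions and heads.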


Knowledge transfer techniques. (a) Auxiliary knowledge transfer, (b) joint knowledge transfer, (c) progressive knowledge transfer. Source
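The progressive strategy (c) can be sketched as a training schedule: the student is distilled one layer at a time, with all layers below the current one frozen. A hedged sketch of that schedule only; `progressive_schedule` is a stand-in, not the paper's actual training loop:

```python
def progressive_schedule(num_layers):
    # one stage per student layer: freeze everything below it,
    # distill that layer's feature and attention maps
    return [{"stage": l, "frozen": list(range(l)), "train": [l]}
            for l in range(num_layers)]

for step in progressive_schedule(4):
    print(step)
```

By contrast, joint transfer (b) optimizes all layer-wise losses at once, and auxiliary transfer (a) folds them into a single combined objective alongside the distillation loss.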

Experimental results


Experimental results on the GLUE benchmark. Source

It is therefore safe to conclude that it is possible to create a distilled model that is both performant and fast on resource-limited devices!

MobileBERT was fine-tuned on GLUE by itself, which shows that the proposed distillation process can produce a task-agnostic model!


If you found this summary helpful in understanding the broader picture of this particular research paper, please consider reading my other articles! I’ve already written a bunch and more will definitely be added. I think you might find this one interesting👋🏼🤖

About the Author

Author Viktor Karlsson – Software Engineer

I am a Software Engineer and MSc of Machine Learning with a growing interest in NLP. Trying to stay on top of recent developments within the ML field in general, and NLP in particular. Writing to learn!


