Source: https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8
BERT is a bidirectional transformer pre-trained on a large amount of unlabeled text to learn a language representation that can then be fine-tuned for specific machine learning tasks.
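As a quick illustration of this pre-train-then-fine-tune pattern, here is a minimal sketch (not from the original article) that loads a pre-trained BERT checkpoint with the Hugging Face transformers library and puts a task-specific classification head on top; the checkpoint name, two-label head, and toy example are assumptions for illustration only.

```python
# Minimal pre-train / fine-tune sketch with Hugging Face transformers.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # new classification head on the pre-trained encoder
)

# One labelled example stands in for a task-specific dataset.
batch = tokenizer("The movie was great!", return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # gradients flow into both the new head and the pre-trained encoder
```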
XLNet is a large bidirectional transformer that uses an improved training methodology, more data, and more computational power to beat BERT's prediction metrics on 20 language tasks.
RoBERTa (Robustly optimized BERT approach), introduced by Facebook, is a retraining of BERT with an improved training methodology, 1000% more data, and more compute power.
To improve the training procedure, RoBERTa removes the Next Sentence Prediction (NSP) task from BERT's pre-training and introduces dynamic masking, so that the masked tokens change across training epochs.
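To make dynamic masking concrete, here is a small sketch, assuming the Hugging Face transformers library: the data collator re-samples the masked positions every time a batch is built, so the same sentence is masked differently from epoch to epoch.

```python
# Dynamic masking sketch: masked positions are re-drawn on every pass.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

enc = tokenizer("RoBERTa masks tokens dynamically instead of fixing them once.")
for epoch in range(3):
    batch = collator([{"input_ids": enc["input_ids"]}])
    # The decoded sentence shows a different set of <mask> positions each time.
    print(epoch, tokenizer.decode(batch["input_ids"][0]))
```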
On the other hand, to reduce the computational (training, prediction) cost of BERT or related models, a natural choice is to use a smaller network that approximates their performance. There are many approaches to this, including pruning, distillation, and quantization; however, all of them result in lower prediction metrics.
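As a concrete sketch of the distillation idea (the general technique only, not any particular model's recipe), a student network can be trained to match the teacher's softened output distribution using a temperature-scaled KL divergence:

```python
# Knowledge-distillation loss sketch (generic, toy logits).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then penalise the student for diverging
    # from the teacher; the T**2 factor keeps gradient magnitudes comparable.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

# Toy usage with random logits over 10 classes.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
loss = distillation_loss(student, teacher)
loss.backward()
```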
DistilBERT learns a distilled (approximate) version of BERT, retaining 97% of its performance while using about 40% fewer parameters (paper). Specifically, it has no token-type embeddings or pooler and retains only half of the layers of Google's BERT. DistilBERT uses a technique called distillation, which approximates Google's BERT, i.e. the large neural network, with a smaller one.
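A small sketch, assuming the Hugging Face transformers library, makes the size difference tangible by comparing the parameter counts of the public base checkpoints (the counts in the comments are approximate):

```python
# Compare the sizes of the public BERT and DistilBERT base checkpoints.
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")               # 12 layers
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")   # 6 layers, no pooler or token-type embeddings

def count_params(model):
    return sum(p.numel() for p in model.parameters())

print("BERT:      ", count_params(bert))        # roughly 110M parameters
print("DistilBERT:", count_params(distilbert))  # roughly 66M parameters
```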
Advancing over BERT? BigBird, ConvBERT, DynaBERT…
Which model should you choose?
Initial improvements over BERT simply increased the data or compute power to outperform it. More recently, however, models have made conceptual and architectural advances over BERT, starting with StructBERT and ALBERT.
So, which one should you use?
Yes, there are now models that are a preferred choice over BERT; a short checkpoint-loading sketch follows the list below.
Long text: BigBird wins; a comparable fallback is Longformer.
Smaller network / speed: DynaBERT or ConvBERT.
Multilingual: XLM-RoBERTa if compute is not an issue, otherwise mBERT.
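Here is a hedged sketch of acting on these recommendations with the Hugging Face transformers library; the checkpoint names are commonly used public ones and are my assumptions, not prescribed by the article (DynaBERT is distributed separately and is omitted here).

```python
# Picking a checkpoint per use case; all names below are assumed public Hub checkpoints.
from transformers import AutoModel, AutoTokenizer

CHECKPOINTS = {
    "long_text": "google/bigbird-roberta-base",       # BigBird; fallback: "allenai/longformer-base-4096"
    "small_fast": "YituTech/conv-bert-base",           # ConvBERT
    "multilingual": "xlm-roberta-base",                # heavier; lighter option: "bert-base-multilingual-cased"
}

name = CHECKPOINTS["long_text"]
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
```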
BERT Technology introduced in 3-minutes
BERT's state-of-the-art performance is based on two things. First, novel pre-training tasks called Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Second, a lot of data and compute power to train BERT.
The NSP task allows BERT to learn relationships between sentences by predicting whether the second sentence in a pair is the true next sentence or not. For this, 50% correct pairs are supplemented with 50% random pairs, and the model is trained on both. BERT trains the MLM and NSP objectives simultaneously.
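The following sketch, in plain Python, illustrates how the two pre-training objectives build their training data: the MLM masking rule from the BERT paper (15% of tokens selected; of those, 80% replaced with [MASK], 10% with a random token, 10% left unchanged) and NSP's 50/50 pair construction. The toy vocabulary and helper names are illustrative only.

```python
# Toy construction of MLM and NSP training examples.
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=("the", "cat", "sat", "mat")):
    """MLM: select ~15% of tokens; replace 80% of them with [MASK], 10% with a
    random token, and leave 10% unchanged. The model must predict the originals."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)              # this position will be predicted
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(random.choice(vocab))
            else:
                masked.append(tok)
        else:
            labels.append(None)             # not predicted
            masked.append(tok)
    return masked, labels

def make_nsp_pair(sentences, i):
    """NSP: 50% of the time return the true next sentence (label 1),
    otherwise a random sentence from the corpus (label 0)."""
    if random.random() < 0.5 and i + 1 < len(sentences):
        return sentences[i], sentences[i + 1], 1
    return sentences[i], random.choice(sentences), 0
```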
BERT outperforms the previous state of the art on 11 NLP tasks by large margins.
