Source: https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8
BERT is a bidirectional transformer pre-trained on a large amount of unlabeled text to learn a language representation that can then be fine-tuned for specific machine learning tasks.
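As a quick illustration of this pre-train-then-fine-tune pattern, here is a minimal sketch (not from the original article) that loads a pre-trained BERT checkpoint with the Hugging Face transformers library and puts a task-specific classification head on top; the checkpoint name, two-label head, and toy example are assumptions for illustration only.

```python
# Minimal pre-train / fine-tune sketch with Hugging Face transformers.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # new classification head on the pre-trained encoder
)

# One labelled example stands in for a task-specific dataset.
batch = tokenizer("The movie was great!", return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # gradients flow into both the new head and the pre-trained encoder
```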
XLNet is a large bidirectional transformer that uses an improved training methodology, more data, and more computational power to beat BERT's prediction metrics on 20 language tasks.
RoBERTa (Robustly optimized BERT approach), introduced by Facebook, is a retraining of BERT with an improved training methodology, 1000% more data, and more compute power.
To improve the training procedure, RoBERTa removes the Next Sentence Prediction (NSP) task from BERT's pre-training and introduces dynamic masking, so that the masked tokens change across training epochs.
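To make dynamic masking concrete, here is a small sketch, assuming the Hugging Face transformers library: the data collator re-samples the masked positions every time a batch is built, so the same sentence is masked differently from epoch to epoch.

```python
# Dynamic masking sketch: masked positions are re-drawn on every pass.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

enc = tokenizer("RoBERTa masks tokens dynamically instead of fixing them once.")
for epoch in range(3):
    batch = collator([{"input_ids": enc["input_ids"]}])
    # The decoded sentence shows a different set of <mask> positions each time.
    print(epoch, tokenizer.decode(batch["input_ids"][0]))
```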
On the other hand, to reduce the computational (training, prediction) cost of BERT or related models, a natural choice is to use a smaller network that approximates their performance. There are many approaches to this, including pruning, distillation, and quantization; however, all of them result in lower prediction metrics.
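As a concrete sketch of the distillation idea (the general technique only, not any particular model's recipe), a student network can be trained to match the teacher's softened output distribution using a temperature-scaled KL divergence:

```python
# Knowledge-distillation loss sketch (generic, toy logits).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then penalise the student for diverging
    # from the teacher; the T**2 factor keeps gradient magnitudes comparable.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

# Toy usage with random logits over 10 classes.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
loss = distillation_loss(student, teacher)
loss.backward()
```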
DistilBERT learns a distilled (approximate) version of BERT, retaining 97% of its performance while using about 40% fewer parameters (paper). Specifically, it has no token-type embeddings or pooler and retains only half of the layers of Google's BERT. DistilBERT uses a technique called distillation, which approximates Google's BERT, i.e. the large neural network, with a smaller one.
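A small sketch, assuming the Hugging Face transformers library, makes the size difference tangible by comparing the parameter counts of the public base checkpoints (the counts in the comments are approximate):

```python
# Compare the sizes of the public BERT and DistilBERT base checkpoints.
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")               # 12 layers
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")   # 6 layers, no pooler or token-type embeddings

def count_params(model):
    return sum(p.numel() for p in model.parameters())

print("BERT:      ", count_params(bert))        # roughly 110M parameters
print("DistilBERT:", count_params(distilbert))  # roughly 66M parameters
```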
Advancing over BERT? BigBird, ConvBERT, DynaBERT…
Which model should you choose?
Initial improvements over BERT simply increased the data or compute power to outperform it. More recently, however, models have made conceptual and architectural advances over BERT, starting with StructBERT and ALBERT.
So, which one should you use?
Yes, there are now models that are a preferred choice over BERT; a short checkpoint-loading sketch follows the list below.
Long text: BigBird wins; a comparable fallback is Longformer.
Smaller network / speed: DynaBERT or ConvBERT.
Multilingual: XLM-RoBERTa if compute is not an issue, otherwise mBERT.
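Here is a hedged sketch of acting on these recommendations with the Hugging Face transformers library; the checkpoint names are commonly used public ones and are my assumptions, not prescribed by the article (DynaBERT is distributed separately and is omitted here).

```python
# Picking a checkpoint per use case; all names below are assumed public Hub checkpoints.
from transformers import AutoModel, AutoTokenizer

CHECKPOINTS = {
    "long_text": "google/bigbird-roberta-base",       # BigBird; fallback: "allenai/longformer-base-4096"
    "small_fast": "YituTech/conv-bert-base",           # ConvBERT
    "multilingual": "xlm-roberta-base",                # heavier; lighter option: "bert-base-multilingual-cased"
}

name = CHECKPOINTS["long_text"]
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
```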
BERT Technology introduced in 3-minutes
BERT's state-of-the-art performance is based on two things. First, novel pre-training tasks called Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Second, a lot of data and compute power to train BERT.
The NSP task allows BERT to learn relationships between sentences by predicting whether the second sentence in a pair is the true next sentence or not. For this, 50% correct pairs are supplemented with 50% random pairs, and the model is trained on both. BERT trains the MLM and NSP objectives simultaneously.
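The following sketch, in plain Python, illustrates how the two pre-training objectives build their training data: the MLM masking rule from the BERT paper (15% of tokens selected; of those, 80% replaced with [MASK], 10% with a random token, 10% left unchanged) and NSP's 50/50 pair construction. The toy vocabulary and helper names are illustrative only.

```python
# Toy construction of MLM and NSP training examples.
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=("the", "cat", "sat", "mat")):
    """MLM: select ~15% of tokens; replace 80% of them with [MASK], 10% with a
    random token, and leave 10% unchanged. The model must predict the originals."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)              # this position will be predicted
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(random.choice(vocab))
            else:
                masked.append(tok)
        else:
            labels.append(None)             # not predicted
            masked.append(tok)
    return masked, labels

def make_nsp_pair(sentences, i):
    """NSP: 50% of the time return the true next sentence (label 1),
    otherwise a random sentence from the corpus (label 0)."""
    if random.random() < 0.5 and i + 1 < len(sentences):
        return sentences[i], sentences[i + 1], 1
    return sentences[i], random.choice(sentences), 0
```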
BERT outperforms the previous state of the art on 11 NLP tasks by large margins.
