
BERT, RoBERTa, DistilBERT, XLNet — which one to use?

Source: https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8

BERT is a bi-directional transformer for pre-training over a lot of unlabeled textual data to learn a language representation that can be used to fine-tune for specific machine learning tasks.
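As a minimal sketch of that idea (the tooling choice is mine, not the article's), a pre-trained BERT checkpoint can be loaded with the Hugging Face `transformers` library and used as a frozen language representation:

```python
# Minimal sketch (assumed tooling): load a pre-trained BERT checkpoint
# and extract contextual representations for a sentence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT learns bidirectional representations.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; the [CLS] vector is commonly used
# as a sentence-level representation for downstream fine-tuning.
token_embeddings = outputs.last_hidden_state   # shape: (1, seq_len, 768)
cls_embedding = token_embeddings[:, 0, :]      # shape: (1, 768)
```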

XLNet is a large bidirectional transformer that uses an improved training methodology, larger data and more computational power to achieve better-than-BERT prediction metrics on 20 language tasks.

Introduced at Facebook, Robustly optimized BERT approach (RoBERTa) is a retraining of BERT with improved training methodology, 1000% more data and compute power.

To improve the training procedure, RoBERTa removes the Next Sentence Prediction (NSP) task from BERT’s pre-training and introduces dynamic masking, so that the masked tokens change across training epochs.
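A quick sketch of dynamic masking (the library and checkpoint are my own choices, not from the article): in the Hugging Face `transformers` collator, masks are re-sampled every time a batch is built, so the same sentence gets different masked positions on every pass.

```python
# Minimal sketch (assumed tooling): masks are sampled on the fly,
# so the same example is masked differently across epochs.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer(["RoBERTa re-samples the masked positions on every pass."])
features = [{"input_ids": ids} for ids in encoded["input_ids"]]

# Calling the collator twice on the same example yields different masks.
batch_epoch_1 = collator(features)
batch_epoch_2 = collator(features)
print(batch_epoch_1["input_ids"])
print(batch_epoch_2["input_ids"])
```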

On the other hand, to reduce the computational (training, prediction) times of BERT or related models, a natural choice is to use a smaller network to approximate the performance. There are many approaches to do this, including pruning, distillation and quantization; however, all of them result in lower prediction metrics.

DistilBERT learns a distilled (approximate) version of BERT, retaining 97% of its performance while using only half the number of parameters. Specifically, it does not have token-type embeddings or a pooler, and it retains only half of the layers from Google’s BERT. DistilBERT uses a technique called distillation, which approximates Google’s BERT, i.e. the large neural network, by a smaller one.
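To make the distillation idea concrete, here is a generic knowledge-distillation loss sketch in PyTorch. This is illustrative only; DistilBERT's actual training combines several losses, and the temperature and weighting below are assumptions of mine.

```python
# Generic knowledge-distillation sketch (not DistilBERT's exact objective):
# blend a soft-target loss against the teacher with a hard-label loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions with the temperature, then match them.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Standard cross-entropy against the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: batch of 4 examples, 3 classes.
student = torch.randn(4, 3)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))
```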

Advancing over BERT? BigBird, ConvBERT, DynaBERT…

Which model should you choose?

Source: https://towardsdatascience.com/advancing-over-bert-bigbird-convbert-dynabert-bca78a45629c

Initial improvements over BERT either increased data or compute power to outperform it. However, lately, models have made conceptual and architectural advancements over BERT, starting with StructBERT and ALBERT.

StructBERT: also known as ALICE, it incorporates language structures into the pre-training process. Specifically, it introduces a new objective function called WSO (word structural objective), designed to predict the order of words in a sentence. ... It is one of the first methods to outperform BERT while using comparable data and compute power.
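An illustrative sketch of how word-order training examples for such an objective could be built (the helper below is my own construction, not StructBERT's code): shuffle a small span of tokens and ask the model to recover their original order.

```python
# Conceptual sketch (not the official implementation) of data construction
# for a word-structure objective: corrupt a span, target its original order.
import random

def make_wso_example(tokens, span_len=3, seed=None):
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - span_len + 1)
    span = tokens[start:start + span_len]
    shuffled = span[:]
    rng.shuffle(shuffled)
    corrupted = tokens[:start] + shuffled + tokens[start + span_len:]
    # The training target is the original order of the shuffled span.
    return corrupted, span

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, target = make_wso_example(tokens, seed=0)
print(corrupted, target)
```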
 
ALBERT: the first architectural improvement over BERT, it uses an 18-times-smaller model to outperform BERT on 5 NLU tasks... ALBERT performs a low-dimensional projection of the huge token embedding matrix, cutting up to 20 million parameters. With limited chances of overfitting, ALBERT also removes dropout, yielding improved memory consumption.
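A back-of-the-envelope sketch of that factorization (the sizes below are illustrative values, roughly matching a BERT-base configuration): instead of storing a full vocabulary-by-hidden-size embedding table, ALBERT stores a small table and projects it up.

```python
# Illustrative parameter count for factorized embeddings (numbers assumed).
V = 30_000   # vocabulary size
H = 768      # hidden size
E = 128      # low-dimensional embedding size

bert_style_params = V * H            # one big V x H embedding table
albert_style_params = V * E + E * H  # V x E table + E x H projection

print(f"BERT-style embedding params:   {bert_style_params:,}")
print(f"ALBERT-style embedding params: {albert_style_params:,}")
print(f"Parameters saved:              {bert_style_params - albert_style_params:,}")
# The savings come out to roughly 19 million parameters with these sizes.
```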

So, which one to use?

Yes, we now have models that are a preferred choice over BERT; a loading sketch follows the recommendations below.

Long text: BigBird wins; comparable fallback: Longformer

Smaller network/speed: DynaBERT, ConvBERT

Multilingual: XLM-RoBERTa if compute is not an issue, else mBERT
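For those recommendations, here is a minimal loading sketch; the checkpoint names refer to the Hugging Face hub and are my own choices, not taken from the article.

```python
# Minimal sketch (checkpoint names assumed): swap models per scenario
# behind a single Auto* interface.
from transformers import AutoTokenizer, AutoModel

CHOICES = {
    "long_text":         "google/bigbird-roberta-base",
    "long_text_backup":  "allenai/longformer-base-4096",
    "multilingual":      "xlm-roberta-base",
    "multilingual_lite": "bert-base-multilingual-cased",
}

def load(scenario: str):
    name = CHOICES[scenario]
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

tokenizer, model = load("long_text")
```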

BERT Technology introduced in 3-minutes

Source: https://towardsdatascience.com/bert-technology-introduced-in-3-minutes-2c2f9968268c
 
BERT works in two steps. First, it uses a large amount of unlabeled data to learn a language representation in an unsupervised fashion; this is called pre-training. Then, the pre-trained model can be fine-tuned in a supervised fashion using a small amount of labeled training data to perform various supervised tasks.
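A hedged sketch of the second step (the dataset and labels below are hypothetical, and the library choice is mine): fine-tuning a pre-trained BERT checkpoint for binary classification.

```python
# Minimal fine-tuning sketch (hypothetical data, assumed tooling).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

texts = ["great movie", "terrible plot"]   # hypothetical labeled examples
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)    # cross-entropy on the [CLS] head
outputs.loss.backward()
optimizer.step()
```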

BERT’s state-of-the-art performance is based on two things. First, novel pre-training tasks called Masked Language Model (MLM) and Next Sentence Prediction (NSP). Second, a lot of data and compute power to train BERT.

The NSP task allows BERT to learn relationships between sentences by predicting whether the next sentence in a pair is the true next sentence or not. For this, 50% correct pairs are supplemented with 50% random pairs, and the model is trained on them. BERT trains on the MLM and NSP objectives simultaneously.
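A conceptual sketch of that pair construction (the helper is my own, not from the article): half of the pairs keep the true next sentence, the other half substitute a random one.

```python
# Conceptual sketch of NSP pair construction (50% true next, 50% random).
import random

def make_nsp_pairs(sentences, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))      # true next
        else:
            pairs.append((sentences[i], rng.choice(sentences), 0)) # random
    return pairs

docs = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]
for a, b, is_next in make_nsp_pairs(docs):
    print(is_next, "|", a, "->", b)
```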

BERT outperforms the previous state of the art on 11 NLP tasks by large margins.

 
 
BERT is available as open source (https://github.com/google-research/bert) and pre-trained for 104 languages, with implementations in TensorFlow and PyTorch.
 
 
