Deep Learning for Matching in Search and Recommendation - Chapter 3: Deep Learning for Matching - Reading Notes

The main reason for the success is due to deep learning’s strong ability in learning of representations for inputs (i.e., queries, documents, users, and items) and learning of nonlinear functions for matching.

Overview of Deep Learning

Deep Neural Networks

The Feed-forward Neural Networks (FFN), also called Multilayer Perceptron (MLP), are neural networks consisting of multiple layers of units, which are connected layer by layer without a loop.

Besides sigmoid function, other functions such as tanh and Rectified Linear Units (ReLU) are also utilized.

In learning, training data of input-output pairs are fed into the network as ground-truth. A loss is calculated for each instance by contrasting the ground truth and the prediction by the network, and the training is performed by adjusting the parameters so that the total loss is minimized. The well-known back-propagation algorithm is employed to conduct the minimization.
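To make this concrete, below is a minimal sketch of an MLP trained with back-propagation, written in PyTorch; the layer sizes, loss function, and toy data are illustrative assumptions, not taken from the book.

import torch
import torch.nn as nn

mlp = nn.Sequential(              # layers connected one after another, no loops
    nn.Linear(16, 32),            # input layer -> hidden layer
    nn.ReLU(),                    # nonlinear activation (could be sigmoid or tanh)
    nn.Linear(32, 1),             # hidden layer -> output layer
)

x = torch.randn(100, 16)          # toy input instances
y = torch.randn(100, 1)           # toy ground-truth outputs

loss_fn = nn.MSELoss()            # contrasts prediction and ground truth
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01)

for _ in range(10):               # adjust parameters so the total loss is minimized
    optimizer.zero_grad()
    loss = loss_fn(mlp(x), y)
    loss.backward()               # back-propagation computes the gradients
    optimizer.step()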

Convolutional Neural Networks (CNN) are neural networks that make use of convolution operations in at least one of the layers. They are specialized neural networks for processing data that has a grid-like structure, e.g., time series data (1-D grid of time intervals) and image data (2-D grid of pixels). ... a typical convolutional network consists of multiple stacked layers: convolutional layer, detector layer, and pooling layer. In the convolutional layer, convolution functions are applied in parallel to produce a set of linear activations. In the detector layer, the set of linear activations are run through a nonlinear activation function. In the pooling layer, pooling functions are used to further modify the set of outputs.

The convolutional layer, on the other hand, uses convolutional kernel vectors (or matrices) to model the local features of each position (unit), where the weights of kernels are shared across positions (units). Thus it has much sparser connections between layers.
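A minimal sketch of the convolutional layer, detector layer, and pooling layer stack described above, assuming 1-D sequence input; the channel sizes and kernel width are illustrative assumptions.

import torch
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3, padding=1),  # convolutional layer: kernel weights shared across positions
    nn.ReLU(),                                                            # detector layer: nonlinearity over the linear activations
    nn.MaxPool1d(kernel_size=2),                                          # pooling layer: further modifies (downsamples) the outputs
)

x = torch.randn(4, 8, 32)      # batch of 4 sequences, 8 channels, length 32
out = conv_block(x)            # shape: (4, 16, 16)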

Recurrent Neural Networks (RNN) are neural networks for processing sequence data x(1), . . . , x(T). Unlike FFNs which can only handle one instance at a time, RNN can handle a long sequence of instances with a variable length.
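A minimal sketch of how an RNN processes sequences x(1), ..., x(T) of variable length, carrying a hidden state from step to step; the sizes are illustrative assumptions.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

for T in (5, 12, 30):                     # sequences of different lengths
    x = torch.randn(1, T, 8)              # one sequence of T instances
    outputs, h_T = rnn(x)                 # outputs: one hidden state per step; h_T: final state
    print(T, outputs.shape, h_T.shape)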

Attention is a useful tool in deep learning. It was originally proposed to dynamically and selectively collect information from the source sentence in an encoder-decoder model for neural machine translation.

Attention based Model: Figure 3.5 shows an encoder-decoder model with the additive attention mechanism.

Transformer: Transformer (Vaswani et al., 2017) is another attention based neural network under the encoder and decoder framework. Different from the aforementioned model which sequentially reads the input sequence (left-to-right or right-to-left), Transformer reads the entire input sequence at once. This characteristic enables it to learn the model by considering both the left and the right context of a word.

Each decoder component or layer in the decoder consists of a self attention sub-layer, an encoder-decoder attention sub-layer, and a feedforward network (FFN) sub-layer. The sub-layers have the same architecture as that of the encoder component.
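The core operation behind the Transformer's self-attention sub-layers is scaled dot-product attention (Vaswani et al., 2017). The sketch below illustrates it in a few lines: every position attends to every other position at once, which is how both left and right context are taken into account. The dimensions and random weights are illustrative assumptions.

import math
import torch

def self_attention(x, W_q, W_k, W_v):
    q, k, v = x @ W_q, x @ W_k, x @ W_v              # queries, keys, values
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    weights = torch.softmax(scores, dim=-1)          # attention weights over all positions
    return weights @ v                               # weighted sum of the values

d = 16
x = torch.randn(10, d)                               # 10 token representations
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)               # shape: (10, 16)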

Autoencoders are neural networks that aim to learn the hidden information of the input, by compressing the input into a latent-space representation and then reconstructing the output from the representation.
In the model, high-dimensional data is first converted into a low-dimensional latent representation by a multilayer encoder neural network. Then, the data is reconstructed from the latent representation by a multilayer decoder neural network.
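A minimal sketch of an autoencoder with a multilayer encoder and decoder; the dimensions and the reconstruction loss are illustrative assumptions.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8))   # high-dimensional input -> latent representation
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 64))   # latent representation -> reconstruction

x = torch.randn(100, 64)
z = encoder(x)                       # latent-space representation
x_hat = decoder(z)                   # reconstruction of the input
loss = nn.MSELoss()(x_hat, x)        # reconstruction loss to be minimized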

Word embedding is a basic way of representing words in Natural Language Processing (NLP) and Information Retrieval (IR). Embeddings of words are usually created based on the assumption that the meaning of a word can be determined by its contexts in documents.

The classical word embedding models (e.g., Word2Vec and GloVe) have a fundamental shortcoming: they generate and utilize the same embeddings of the same words in different contexts. Therefore, they cannot effectively deal with the context-dependent nature of words. Contextualized word embeddings aim at capturing lexical semantics in different contexts.
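The sketch below uses gensim's Word2Vec (gensim 4.x API assumed) to illustrate this shortcoming: the word "bank" receives a single vector regardless of whether it appears in a river context or a financial context. The toy corpus and hyperparameters are illustrative assumptions.

from gensim.models import Word2Vec

corpus = [
    ["she", "sat", "on", "the", "river", "bank"],
    ["he", "deposited", "money", "at", "the", "bank"],
]
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: skip-gram

vec = model.wv["bank"]                          # one context-independent vector per word
print(model.wv.most_similar("bank", topn=3))    # neighbors mix both senses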

Among the models, BERT is the most widely used. BERT is a masked language model (a denoising auto-encoder) that aims to reconstruct the original sentences from the corrupted ones. That is, in the pre-training phase, the input sentence is corrupted by replacing some original words with “[MASK]”. The learning objective, therefore, is to predict the masked words to recover the original sentence.

The learning of BERT consists of two stages: pre-training and fine-tuning. In pre-training, sentence pairs collected from a large corpus are used as training data. The model parameters are determined using two training strategies: masked language modeling and next sentence prediction. (1) In masked language modeling, 15% of the randomly chosen words in the two sentences are replaced with the token “[MASK]” before feeding them into the model. The training goal, then, is to predict the original masked words, based on the contexts provided by the non-masked words in the sentences. (2) In next sentence prediction, the model receives pairs of sentences as input. The training goal is to predict if the second sentence in the pair is the subsequent sentence in the original document.
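A minimal sketch of the corruption step in masked language modeling: roughly 15% of the words are replaced with "[MASK]" and the originals become prediction targets. This is a simplified illustration, not BERT's exact recipe (which also keeps or randomly replaces a fraction of the chosen words).

import random

def mask_tokens(tokens, mask_prob=0.15):
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append("[MASK]")
            targets.append(tok)          # word the model must predict
        else:
            corrupted.append(tok)
            targets.append(None)         # not a prediction target
    return corrupted, targets

sentence_pair = "the cat sat on the mat [SEP] it was very soft".split()
print(mask_tokens(sentence_pair))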

Deep learning for matching, referred to as deep matching, has become the state-of-the-art technology in search and recommendation (Mitra and Craswell, 2019). Compared with traditional machine learning approaches, deep learning approaches improve matching accuracy in three ways:

(1) using deep neural networks to construct richer representations for matching of objects (i.e., query, document, user, and item), 

(2) using deep learning algorithms to construct more powerful functions for matching, and 

(3) learning the representations and matching functions jointly in an end-to-end fashion. 

Another advantage of deep matching approaches is their flexibility in extending to multi-modal matching, where a common semantic space can be learned to universally represent data of different modalities.

The matching framework takes two matching objects as its input and outputs a numerical value to represent the matching degree.


The input layer receives the two matching objects which can be word embeddings, ID vectors, or feature vectors.

The representation layer converts the input vectors into the distributed representations. Neural networks such as MLP, CNN, and RNN can be used here, depending on the type and nature of the input. 

The interaction layer compares the matching objects (i.e., the two distributed representations) and outputs a number of (local or global) matching signals. Matrices and tensors can be used for storing the signals and their locations.

The aggregation layer aggregates the individual matching signals into a high-level matching vector. Operations in deep neural networks such as pooling and concatenation are usually adopted in this layer.

The output layer takes the high-level matching vector and outputs a matching score. Linear model, MLP, Neural Tensor Networks (NTN), or other neural networks can be utilized.
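Putting the layers together, here is a minimal sketch of the matching framework, assuming both objects arrive as fixed-size feature vectors: an MLP representation layer, an interaction layer producing a matrix of matching signals, pooling-based aggregation, and an MLP output layer yielding the matching score. All sizes are illustrative assumptions.

import torch
import torch.nn as nn

class DeepMatcher(nn.Module):
    def __init__(self, in_dim=32, rep_dim=16):
        super().__init__()
        self.represent = nn.Sequential(nn.Linear(in_dim, rep_dim), nn.ReLU())          # representation layer
        self.output = nn.Sequential(nn.Linear(rep_dim, 8), nn.ReLU(), nn.Linear(8, 1))  # output layer

    def forward(self, x, y):
        rx, ry = self.represent(x), self.represent(y)      # distributed representations of the two objects
        signals = rx.unsqueeze(-1) * ry.unsqueeze(-2)      # interaction layer: matrix of local matching signals
        vector = signals.max(dim=-1).values                # aggregation layer: pooling into a matching vector
        return self.output(vector)                         # matching score

matcher = DeepMatcher()
score = matcher(torch.randn(4, 32), torch.randn(4, 32))    # batch of 4 object pairs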

=======================================================

Presentation -> https://youtu.be/DN7OE_mlldg

FFN = MLP

Multiple layers with no loops; feed-forward because the information flows in one direction without going back, and there are no connections between neurons of the same layer.

Example of an MLP for an image classifier (it can have n layers)

Ground truth: input-output pairs

CNN = convolution operation in at least one of the layers

Fully connected layers become locally connected to reduce the cost of processing the input

RNN = neural networks designed specifically for streaming input sequences

Allows loops to carry information from earlier steps forward

Attention Model: Encoder > Attention Layer > Decoder

This layer filters out noise from the input

Transformers for translation

Autoencoders

Word Embeddings: word representation for IR and NLP

Word2Vec: CBOW & Skip-Gram ... the word is represented in the same way regardless of context

BERT: words can have different representations if they appear in different contexts. Two phases: pre-training and fine-tuning. Pre-training (Masked Language Model & Next Sentence Prediction)

Deep Matching architecture: receives the two objects X and Y to be compared in the input layer, the objects are converted into embedding format, ... the output layer can be an MLP


