Deep Learning for Matching in Search and Recommendation - Chapter 3: Deep Learning for Matching - Reading Notes
The main reason for the success is deep learning's strong ability to learn representations of the inputs (i.e., queries, documents, users, and items) and to learn nonlinear functions for matching.
Overview of Deep Learning
Deep Neural Networks
Feed-forward Neural Networks (FFN), also called Multilayer Perceptrons (MLP), are neural networks consisting of multiple layers of units, which are connected layer by layer without loops.
Besides the sigmoid function, other functions such as tanh and Rectified Linear Units (ReLU) are also utilized.
In learning, training data of input-output pairs are fed into the network as ground truth. A loss is calculated for each instance by contrasting the ground truth with the prediction of the network, and training is performed by adjusting the parameters so that the total loss is minimized. The well-known back-propagation algorithm is employed to conduct the minimization.
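A minimal sketch of such a feed-forward network trained by back-propagation, assuming PyTorch; the layer sizes, data, and hyperparameters below are purely illustrative.

```python
# Minimal MLP (feed-forward network) trained with back-propagation.
# Sizes, data, and hyperparameters are illustrative only.
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(20, 64),  # input layer -> hidden layer
    nn.ReLU(),          # nonlinear activation (sigmoid or tanh would also work)
    nn.Linear(64, 2),   # hidden layer -> output layer (2 classes)
)

x = torch.randn(32, 20)         # a batch of 32 input vectors
y = torch.randint(0, 2, (32,))  # ground-truth labels

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.1)

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(mlp(x), y)   # contrast the prediction with the ground truth
    loss.backward()             # back-propagation computes the gradients
    optimizer.step()            # adjust parameters to reduce the total loss
```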
Convolutional Neural Networks (CNN) are neural networks that make use of convolution operations in at least one of the layers. They are specialized neural networks for processing data that has a grid-like structure, e.g., time series data (1-D grid of time intervals) and image data (2-D grid of pixels). ... a typical convolutional network consists of multiple stacked layers: convolutional layer, detector layer, and pooling layer. In the convolutional layer, convolution functions are applied in parallel to produce a set of linear activations. In the detector layer, the set of linear activations are run through a nonlinear activation function. In the pooling layer, pooling functions are used to further modify the set of outputs.
The convolutional layer, on the other hand, uses convolutional kernel vectors (or matrices) to model the local features at each position (unit), where the kernel weights are shared across positions (units). Thus it has much sparser connections between layers.
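As an illustration of the convolution/detector/pooling stack described above, here is a sketch of a 1-D convolutional block, assuming PyTorch; channel counts and kernel size are arbitrary.

```python
# Convolutional layer -> detector (nonlinearity) -> pooling, over a 1-D grid.
# Kernel weights are shared across positions, giving sparse (local) connections.
import torch
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),                                                            # detector layer
    nn.MaxPool1d(kernel_size=2),                                          # pooling layer
)

x = torch.randn(4, 8, 100)  # batch of 4 sequences, 8 channels, length 100 (e.g., time series)
out = conv_block(x)         # shape: (4, 16, 50)
```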
Recurrent Neural Networks (RNN) are neural networks for processing sequence data x(1), . . . , x(T). Unlike FFNs, which can only handle one instance at a time, RNNs can handle a long sequence of instances with variable length.
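A minimal sketch of an RNN consuming a variable-length sequence, assuming PyTorch; the dimensions are illustrative.

```python
# An RNN carries a hidden state across time steps, so it can process
# a whole sequence x(1), ..., x(T) of variable length T.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=32, batch_first=True)

x = torch.randn(1, 7, 10)  # one sequence of T=7 steps, each step a 10-d vector
outputs, h_T = rnn(x)      # outputs: hidden state at every step; h_T: final hidden state
```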
Attention is a useful tool in deep learning. It was originally proposed to dynamically and selectively collect information from the source sentence in an encoder-decoder model for neural machine translation.
Attention based Model: Figure 3.5 shows an encoder-decoder model with the additive attention mechanism.
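A minimal sketch of the additive attention computation, assuming PyTorch; the projection sizes and tensors are illustrative and do not follow the book's notation.

```python
# Additive (Bahdanau-style) attention: the current decoder state scores every
# encoder state, and the context vector is the weighted sum of encoder states.
import torch
import torch.nn as nn

d_enc, d_dec, d_att = 32, 32, 16
W_h = nn.Linear(d_enc, d_att, bias=False)  # projects the encoder states
W_s = nn.Linear(d_dec, d_att, bias=False)  # projects the decoder state
v = nn.Linear(d_att, 1, bias=False)        # scoring vector

enc_states = torch.randn(10, d_enc)  # source sentence of length 10
dec_state = torch.randn(1, d_dec)    # current decoder state

scores = v(torch.tanh(W_h(enc_states) + W_s(dec_state)))  # (10, 1) alignment scores
alphas = torch.softmax(scores, dim=0)                      # attention weights, sum to 1
context = (alphas * enc_states).sum(dim=0)                 # dynamically selected source information
```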
Transformer: The Transformer (Vaswani et al., 2017) is another attention-based neural network under the encoder-decoder framework. Different from the aforementioned model, which reads the input sequence sequentially (left-to-right or right-to-left), the Transformer reads the entire input sequence at once. This characteristic enables it to learn the model by considering both the left and the right context of a word.
Each decoder component, or layer, in the decoder consists of a self-attention sub-layer, an encoder-decoder attention sub-layer, and a feed-forward network (FFN) sub-layer. The self-attention and feed-forward sub-layers have the same architecture as their counterparts in the encoder component.
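The core operation that lets the Transformer read the entire sequence at once is (scaled dot-product) self-attention; a minimal single-head sketch follows, assuming PyTorch, with random matrices standing in for the learned projections.

```python
# Scaled dot-product self-attention: every position attends to every other
# position, so both the left and right context of a word are used at once.
import math
import torch

d_model = 64
x = torch.randn(9, d_model)          # a sequence of 9 token representations

W_q = torch.randn(d_model, d_model)  # stand-ins for the learned projection matrices
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
attn = torch.softmax(Q @ K.T / math.sqrt(d_model), dim=-1)  # (9, 9) attention weights
out = attn @ V                                              # each output mixes the whole sequence
```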
Autoencoders are neural networks that aim to learn the hidden information of the input, by compressing the input into a latent-space representation and then reconstructing the output from the representation.
In the model, high-dimensional data is first converted into a low-dimensional latent representation by a multilayer encoder neural network. Then, the data is reconstructed from the latent representation by a multilayer decoder neural network.
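A minimal autoencoder sketch, assuming PyTorch; the 784 input dimensions and 32-d latent space are arbitrary choices.

```python
# Autoencoder: a multilayer encoder compresses the input into a low-dimensional
# latent representation, and a multilayer decoder reconstructs the input from it.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.randn(16, 784)                 # batch of high-dimensional inputs
z = encoder(x)                           # latent-space representation
x_hat = decoder(z)                       # reconstruction
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction error to minimize
```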
Word embedding is a basic way of representing words in Natural Language Processing (NLP) and Information Retrieval (IR). Embeddings of words are usually created based on the assumption that the meaning of a word can be determined by its contexts in documents.
The classical word embedding models (e.g., Word2Vec and GloVe) have a fundamental shortcoming: they generate and utilize the same embeddings of the same words in different contexts. Therefore, they cannot effectively deal with the context-dependent nature of words. Contextualized word embeddings aim at capturing lexical semantics in different contexts.
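The shortcoming can be seen in a toy sketch with a static embedding table (assuming PyTorch; the vocabulary and dimensions are made up): the same vector is returned for a word no matter where it appears.

```python
# A static embedding table (Word2Vec/GloVe style) assigns one vector per word,
# independent of context.
import torch
import torch.nn as nn

vocab = {"the": 0, "bank": 1, "river": 2, "loan": 3}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

sent_a = torch.tensor([vocab[w] for w in ["the", "river", "bank"]])
sent_b = torch.tensor([vocab[w] for w in ["the", "bank", "loan"]])

# "bank" gets the identical vector in both sentences, even though its meaning
# differs; contextualized embeddings (e.g., BERT) address this limitation.
assert torch.equal(embedding(sent_a)[2], embedding(sent_b)[1])
```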
Among these models, BERT is the most widely used. BERT is a masked language model (a denoising auto-encoder) that aims to reconstruct the original sentences from corrupted ones. That is, in the pre-training phase, the input sentence is corrupted by replacing some of the original words with “[MASK]”. The learning objective, therefore, is to predict the masked words so as to recover the original sentence.
The learning of BERT consists of two stages: pre-training and fine-tuning. In pre-training, sentence pairs collected from a large corpus are used as training data. The model parameters are determined using two training strategies: masked language modeling and next sentence prediction. (1) In masked language modeling, 15% of the words in the two sentences, chosen at random, are replaced with the token “[MASK]” before they are fed into the model. The training goal, then, is to predict the original masked words based on the context provided by the non-masked words in the sentences. (2) In next sentence prediction, the model receives pairs of sentences as input. The training goal is to predict whether the second sentence in the pair is the subsequent sentence in the original document.
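A simplified sketch of the masked-language-modeling corruption step (plain Python, sampling each token independently with probability 0.15; real BERT tokenization and the full masking scheme are more involved):

```python
# Replace roughly 15% of the tokens with "[MASK]"; the masked words become
# the prediction targets during pre-training.
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            corrupted.append(mask_token)
            targets.append(tok)      # the model must recover this original word
        else:
            corrupted.append(tok)
            targets.append(None)     # not a prediction target
    return corrupted, targets

tokens = "the model learns to reconstruct the original sentence".split()
corrupted, targets = mask_tokens(tokens)
```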
Deep learning for matching, referred to as deep matching, has become the state-of-the-art technology in search and recommendation (Mitra and Craswell, 2019). Compared with traditional machine learning approaches, deep learning approaches improve the matching accuracy in three ways:
(1) using deep neural networks to construct richer representations for matching of objects (i.e., query, document, user, and item),
(2) using deep learning algorithms to construct more powerful functions for matching, and
(3) learning the representations and matching functions jointly in an end-to-end fashion.
Another advantage of deep matching approaches is their flexibility of extending to multi-modal matching where the common semantic space can be learned to universally represent data of different modalities.
The matching framework takes two matching objects as its input and outputs a numerical value to represent the matching degree.
The input layer receives the two matching objects which can be word embeddings, ID vectors, or feature vectors.
The representation layer converts the input vectors into the distributed representations. Neural networks such as MLP, CNN, and RNN can be used here, depending on the type and nature of the input.
The interaction layer compares the matching objects (i.e., the two distributed representations) and outputs a number of (local or global) matching signals. A matrix or tensor can be used to store the signals and their locations.
The aggregation layer aggregates the individual matching signals into a high-level matching vector. Operations in deep neural networks such as pooling and concatenation are usually adopted in this layer.
The output layer takes the high-level matching vector and outputs a matching score. Linear model, MLP, Neural Tensor Networks (NTN), or other neural networks can be utilized.
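Putting the five layers together, here is a minimal sketch of a matching model in this framework, assuming PyTorch; the layer choices (linear representations, dot-product interactions, top-k pooling, an MLP output) and all sizes are illustrative, not a specific model from the book.

```python
# Input -> representation -> interaction -> aggregation -> output (matching score).
import torch
import torch.nn as nn

class DeepMatcher(nn.Module):
    def __init__(self, emb_dim=32, hidden=64, k=10):
        super().__init__()
        self.k = k
        self.repr_x = nn.Linear(emb_dim, hidden)   # representation layer for object X
        self.repr_y = nn.Linear(emb_dim, hidden)   # representation layer for object Y
        self.output = nn.Sequential(               # output layer (an MLP)
            nn.Linear(k, 16), nn.ReLU(), nn.Linear(16, 1)
        )

    def forward(self, x_emb, y_emb):
        rx = torch.relu(self.repr_x(x_emb))        # representation of X, shape (len_x, hidden)
        ry = torch.relu(self.repr_y(y_emb))        # representation of Y, shape (len_y, hidden)
        signals = rx @ ry.T                        # interaction layer: matrix of local matching signals
        match_vec = signals.flatten().topk(self.k).values  # aggregation layer: fixed-size matching vector
        return self.output(match_vec)              # matching score

query_emb, doc_emb = torch.randn(5, 32), torch.randn(12, 32)  # e.g., query and document embeddings
score = DeepMatcher()(query_emb, doc_emb)
```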
=======================================================
Presentation -> https://youtu.be/DN7OE_mlldg
FFN = MLP
Multiple layers with no loops; feed-forward because information flows in one direction without going back, and there are no connections between neurons in the same layer.
Example of an MLP for an image classifier (it can have n layers)
Ground truth: input-output pairs
CNN = convolution operation in at least one of the layers
Fully connected layers become locally connected to reduce the cost of processing the input
RNN = neural networks specifically designed for streaming input sequences
Allows loops so that information from previous steps can be passed along
Attention Model: Encoder > Attention Layer > Decoder
This layer filters out the noise from the input
Transformers for translation
Autoencoders
Word Embeddings: word representations for IR and NLP
Word2Vec: CBOW & Skip-Gram ... the word is represented in the same way regardless of context
BERT: words can have different representations if they appear in different contexts. Two phases: pre-training and fine-tuning. Pre-training (Masked Language Model & Next Sentence Prediction)
Deep Matching architecture: receives the two objects X and Y to be compared at the input layer; the objects are converted into embedding form, ... the output layer can be an MLP