
Jupyter Notebook - Embeddings I

KGTK + Embeddings slides

Playing with the embeddings

What can we do with the embeddings now that we have computed them? For applications such as query answering or entity resolution, we need a representation in which similar concepts have similar embeddings. Let's perform a small trial. We will use the customary cosine similarity metric to measure vector similarity, invoking an existing function from the sklearn package in Python:

In [4]:
from sklearn.metrics.pairwise import cosine_similarity
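
As a quick sanity check (a minimal sketch with two made-up 3-dimensional vectors, not part of the original notebook), cosine_similarity takes 2-D arrays and returns the matrix of pairwise similarities, with values in [-1, 1]:

In [ ]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# two toy vectors, just to illustrate the call
a = np.array([[1.0, 0.0, 1.0]])
b = np.array([[0.5, 0.5, 1.0]])

print(cosine_similarity(a, a))   # identical vectors give [[1.]]
print(cosine_similarity(a, b))   # similar directions give a value close to 1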

Let's first load all embeddings into a key-value dictionary:

In [5]:
embeddings={}
with open('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-TransE-author.tsv', 'r') as f:
    header = next(f)  # skip the header row
    for line in f:
        # each row: node id, label and a comma-separated embedding vector
        node1, label, embedding = line.split()
        embeddings[node1] = embedding.split(',')
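
Since split(',') leaves the vector components as strings, an optional variation (not in the original notebook) is to convert each vector into a numeric numpy array while loading, which also makes later vectorized operations easier:

In [ ]:
import numpy as np

# optional variation: same TSV as above, but store each vector as a float array
embeddings = {}
with open('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-TransE-author.tsv', 'r') as f:
    next(f)  # skip the header row
    for line in f:
        node1, label, embedding = line.split()
        embeddings[node1] = np.array(embedding.split(','), dtype=float)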

Compare Sérgio's vectors (as author of his own Lattes CV) with those of other professors, such as Hermann and Fernanda

In [6]:
emb_Fernanda_lattes = embeddings['lattes:5068302552861597#author-5068302552861597']
emb_Sergio_lattes = embeddings['lattes:8164403687403639#author-8164403687403639']
emb_Hermann_lattes = embeddings['lattes:6075905438020841#author-6075905438020841']
In [7]:
sim1=cosine_similarity([emb_Sergio_lattes], [emb_Sergio_lattes])
sim2=cosine_similarity([emb_Sergio_lattes], [emb_Fernanda_lattes])
sim3=cosine_similarity([emb_Sergio_lattes], [emb_Hermann_lattes])
sim4=cosine_similarity([emb_Fernanda_lattes], [emb_Hermann_lattes])
In [8]:
print(sim1)
print(sim2)
print(sim3)
print(sim4)
[[1.]]
[[0.63489655]]
[[0.60878335]]
[[0.54897511]]
In [6]:
print(len(embeddings))
2805
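
It can also be useful to check the dimensionality of the vectors (a one-line sketch; the value depends on how the TransE embeddings were generated):

In [ ]:
# dimensionality of the TransE vectors loaded above
print(len(next(iter(embeddings.values()))))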

N x N similarity matrix

In [7]:
import os

# 1. create the file
f = open('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-sim-author.csv', 'w', newline='', encoding='utf-8')
In [8]:
for node1 in embeddings:
    emb_node1 = embeddings[node1]
    for node2 in embeddings:
        emb_node2 = embeddings[node2]
        if node1 != node2:
            sim = cosine_similarity([emb_node1], [emb_node2])
            line = str(node1) + ";" + str(node2) + ";" + str(sim) + '\n'
            # 2. write the line
            f.write(line)
In [9]:
# 3. close the file
f.close()
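
The nested loop above issues one cosine_similarity call per ordered pair, roughly 7.9 million calls. As an alternative sketch (not in the original notebook), the whole N x N matrix can be computed in a single call by stacking the vectors into a matrix:

In [ ]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

nodes = list(embeddings.keys())
X = np.array([embeddings[n] for n in nodes], dtype=float)  # N x d matrix of vectors
S = cosine_similarity(X)                                    # N x N similarity matrix (about 2805 x 2805)
print(S.shape)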
In [22]:
import pandas as pd
df = pd.read_csv('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-sim-author.csv',sep=";")
df.columns=["node1", "node2", "cosine similarity"]
df.head()
Out[22]:
node1 node2 cosine similarity
0 lattes:5068302552861597#author-idm46689580673264 lattes:5068302552861597#author-idm46689581412016 [[0.48241659]]
1 lattes:5068302552861597#author-idm46689580673264 lattes:5068302552861597#author-idm46689579763632 [[0.57903481]]
2 lattes:5068302552861597#author-idm46689580673264 lattes:5068302552861597#author-idm46689578778352 [[0.62988191]]
3 lattes:5068302552861597#author-idm46689580673264 lattes:5068302552861597#author-idm46689578018320 [[0.21852978]]
4 lattes:5068302552861597#author-idm46689580673264 lattes:5068302552861597#author-idm46689579587712 [[0.70800659]]
In [11]:
df[df['cosine similarity']==df['cosine similarity'].max()]
Out[11]:
node1 node2 cosine similarity
833215 lattes:5068302552861597#author-idm46689580002496 lattes:5068302552861597#author-idm46689579444000 [[9.7185601e-05]]
1203212 lattes:5068302552861597#author-idm46689579444000 lattes:5068302552861597#author-idm46689580002496 [[9.7185601e-05]]
In [12]:
df.sort_values(by=['cosine similarity'])
Out[12]:
node1 node2 cosine similarity
3709654 lattes:8164403687403639#author-idm45713128004400 lattes:6075905438020841#author-idm45437522424208 [[-0.00010638]]
7762793 lattes:6075905438020841#author-idm45437522424208 lattes:8164403687403639#author-idm45713128004400 [[-0.00010638]]
1266888 lattes:5068302552861597#author-idm46689580876240 lattes:6075905438020841#author-idm45437522348928 [[-0.00012201]]
6410394 lattes:6075905438020841#author-idm45437522348928 lattes:5068302552861597#author-idm46689580876240 [[-0.00012201]]
4303465 lattes:8164403687403639#author-idm45713128248336 lattes:6075905438020841#author-idm45437522644176 [[-0.00012432]]
... ... ... ...
7554819 lattes:6075905438020841#author-idm45437522572768 lattes:5068302552861597#author-idm46689580971456 [[9.02537389e-05]]
7746059 lattes:6075905438020841#author-idm45437525981488 lattes:8164403687403639#author-idm45713128574160 [[9.65800708e-05]]
3962008 lattes:8164403687403639#author-idm45713128574160 lattes:6075905438020841#author-idm45437525981488 [[9.65800708e-05]]
833215 lattes:5068302552861597#author-idm46689580002496 lattes:5068302552861597#author-idm46689579444000 [[9.7185601e-05]]
1203212 lattes:5068302552861597#author-idm46689579444000 lattes:5068302552861597#author-idm46689580002496 [[9.7185601e-05]]

7865219 rows × 3 columns
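
Note that str(sim) writes values such as "[[0.63489655]]", so the cosine similarity column is read back from the CSV as text and max()/sort_values() compare it lexicographically, which is why entries like "[[9.7185601e-05]]" surface as the "maximum" above. A minimal sketch (assuming the CSV layout above) to turn it into a numeric column:

In [ ]:
# strip the surrounding "[[ ]]" and convert to float so max() and sort_values() become numeric
df['cosine similarity'] = df['cosine similarity'].str.strip('[]').astype(float)
df.sort_values(by=['cosine similarity'], ascending=False).head()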

Connection to AllegroGraph

In [2]:
from franz.openrdf.connect import ag_connect

from franz.openrdf.query.query import QueryLanguage

# Create a connection to an AllegroGraph repository. Utility Function

with ag_connect(repo='lattes-professores-mai21', host='127.0.0.1', port='10035', user='veronica.santos', password='jupyter1') as conn:
        print (conn.size())
3894265
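
Since the with block closes the connection when it exits, a small variation (same parameters as above, not part of the original notebook) keeps conn open for the cells that follow and closes it explicitly at the end:

In [ ]:
# keep the connection open for the following cells; call conn.close() when done
conn = ag_connect(repo='lattes-professores-mai21', host='127.0.0.1', port='10035',
                  user='veronica.santos', password='jupyter1')
print(conn.size())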
In [9]:
q = """ SELECT distinct ?s ?p ?o
          { ?s ?p ?o . 
           (?s ?o) fti:match 'Sergio Lifschitz'.
           VALUES ?p { foaf:name foaf:citationName }.           
           } 
           ORDER BY ?o 
           LIMIT 10 """

query = conn.prepareTupleQuery(query=q)
query.evaluate(output=True)
-------------------------------------------------------------------------------------------------------------------------------------------------------
| s                                                                           | p                 | o                                                 |
=======================================================================================================================================================
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713129572864 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126013456 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713129579856 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126051216 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-8164403687403639  | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713129586944 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126069392 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713127349840 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126060592 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/1929107472151568#author-idm45772854072544 | foaf:citationName | LIFSCHITZ, SERGIO                                 |
-------------------------------------------------------------------------------------------------------------------------------------------------------

As can be seen, the same author ends up with several URIs because of the XML-to-RDF conversion process. Could embeddings be used to identify which nodes correspond to the same individual (owl:sameAs)?
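
One simple way to start exploring this question (a hypothetical sketch, not part of the original notebook; the 0.7 threshold is an arbitrary choice) is to flag high-similarity pairs among the author URIs of a single Lattes CV as owl:sameAs candidates:

In [ ]:
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical sketch: owl:sameAs candidates among the author URIs of one Lattes CV
uris = [n for n in embeddings if n.startswith('lattes:8164403687403639#author')]
candidates = []
for i, n1 in enumerate(uris):
    for n2 in uris[i + 1:]:
        sim = cosine_similarity([embeddings[n1]], [embeddings[n2]])[0][0]
        if sim > 0.7:                        # arbitrary threshold
            candidates.append((n1, n2, sim))
print(len(candidates))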

In [12]:
q = """ SELECT distinct ?s 
          { ?s ?p ?o . 
           (?s ?o) fti:match 'Sergio Lifschitz'.
           VALUES ?p { foaf:name foaf:citationName }.           
           } """

tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()

min = 10000
max = -10000
cont = 0
soma = 0

# Compare against the node for the "main" URI generated for each author in their own Lattes CV
node2 = 'lattes:8164403687403639#author-8164403687403639'
emb_node2 = embeddings[node2]

with result:
   for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/","lattes:").replace(">", "")
        try:        
            emb_node1 = embeddings[node1]
            if node1 != node2: 
                sim_Sergio_lattes = cosine_similarity([emb_node1], [emb_node2])
                if sim_Sergio_lattes < 0.4 : print (sim_Sergio_lattes, node1)
                if sim_Sergio_lattes < min : min = sim_Sergio_lattes
                if sim_Sergio_lattes > max : max = sim_Sergio_lattes
                soma = soma + sim_Sergio_lattes
                cont = cont + 1
        except Exception as e: 
            pass
        
print ('Min: ', min, ' Max: ', max, ' Mean: ', soma/cont)
[[0.1426685]] lattes:6075905438020841#author-idm45437522294304
[[0.30746482]] lattes:6075905438020841#author-idm45437523343888
[[0.26812012]] lattes:6075905438020841#author-idm45437522284448
[[0.24533419]] lattes:5068302552861597#author-idm46689578875248
[[0.13399867]] lattes:6075905438020841#author-idm45437522304928
[[0.25648873]] lattes:6075905438020841#author-idm45437523330960
Min:  [[0.13399867]]  Max:  [[0.77043258]]  Mean:  [[0.64274928]]
In [13]:
q = """ SELECT distinct ?s 
          { ?s ?p ?o . 
           (?s ?o) fti:match 'Sergio Lifschitz'.
           VALUES ?p { foaf:name foaf:citationName }.           
           } """

tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()

min = 10000
max = -10000
cont = 0
soma = 0

# Compare against the node for a "secondary" URI generated for each author of a publication
node2 = 'lattes:8164403687403639#author-idm45713129572864'
emb_node2 = embeddings[node2]

with result:
   for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/","lattes:").replace(">", "")
        try:        
            emb_node1 = embeddings[node1]
            if node1 != node2: 
                sim_Sergio_lattes = cosine_similarity([emb_node1], [emb_node2])
                if sim_Sergio_lattes < 0.4 : print (sim_Sergio_lattes, node1)
                if sim_Sergio_lattes < min : min = sim_Sergio_lattes
                if sim_Sergio_lattes > max : max = sim_Sergio_lattes
                soma = soma + sim_Sergio_lattes
                cont = cont + 1
        except Exception as e: 
            pass
        
print ('Min: ', min, ' Max: ', max, ' Mean: ', soma/cont)
[[0.23115076]] lattes:6075905438020841#author-idm45437522294304
[[0.23160824]] lattes:6075905438020841#author-idm45437523343888
[[0.34148057]] lattes:6075905438020841#author-idm45437522284448
[[0.28484893]] lattes:5068302552861597#author-idm46689578875248
[[0.16531615]] lattes:6075905438020841#author-idm45437522304928
[[0.18799357]] lattes:6075905438020841#author-idm45437523330960
Min:  [[0.16531615]]  Max:  [[0.96494461]]  Mean:  [[0.77628674]]

The same processing, now with the publications. Each publication receives a URI based on each author's Lattes CV

In [14]:
embeddings={}
with open('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-TransE-pub.tsv', 'r') as f:
    header = next(f)
    for line in f:
        node1, label, embedding=line.split()
        embeddings[node1]=embedding.split(',')
In [24]:
q = """ SELECT distinct ?s ?p ?o
          { ?s ?p ?o . 
           (?s ?o) fti:match 'database tuning'.
           VALUES ?p { dc:title }.           
           } 
           ORDER BY ?o 
 """

query = conn.prepareTupleQuery(query=q)
query.evaluate(output=True)
----------------------------------------------------------------------------------------------------------------------------------------------------
| s                                                       | p        | o                                                                           |
====================================================================================================================================================
| http://www.nima.puc-rio.br/lattes/5068302552861597#P533 | dc:title | An Ontological Perspective for Database Tuning Heuristics                   |
| http://www.nima.puc-rio.br/lattes/8164403687403639#P612 | dc:title | An Ontological Perspective for Database Tuning Heuristics.                  |
| http://www.nima.puc-rio.br/lattes/8164403687403639#P584 | dc:title | Database tuning with partial indexes                                        |
| http://www.nima.puc-rio.br/lattes/8164403687403639#P656 | dc:title | Tun-OCM: A model-driven approach to support database tuning decision making |
| http://www.nima.puc-rio.br/lattes/5068302552861597#P584 | dc:title | Tun-OCM: A model-driven approach to support database tuning decision making |
----------------------------------------------------------------------------------------------------------------------------------------------------
In [26]:
tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()

min = 10000
max = -10000
cont = 0
soma = 0
node_list = []

with result:
   for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/","lattes:").replace(">", "")
        node_list.append(node1)

for node1 in node_list:
    for node2 in node_list: 
        if node1 != node2:         
            try:        
                emb_node1 = embeddings[node1]
                emb_node2 = embeddings[node2]
                sim_publ = cosine_similarity([emb_node1], [emb_node2])
                if sim_publ > 0.5 : print (sim_publ, node1, node2)
                if sim_publ < min : min = sim_publ
                if sim_publ > max : max = sim_publ
                soma = soma + sim_publ
                cont = cont + 1
            except Exception as e: 
                pass
        
print ('Min: ', min, ' Max: ', max, ' Mean: ', soma/cont)
[[0.65243417]] lattes:5068302552861597#P533 lattes:8164403687403639#P612
[[0.5973676]] lattes:5068302552861597#P533 lattes:5068302552861597#P584
[[0.65243417]] lattes:8164403687403639#P612 lattes:5068302552861597#P533
[[0.58288298]] lattes:8164403687403639#P656 lattes:5068302552861597#P584
[[0.5973676]] lattes:5068302552861597#P584 lattes:5068302552861597#P533
[[0.58288298]] lattes:5068302552861597#P584 lattes:8164403687403639#P656
Min:  [[0.35136259]]  Max:  [[0.65243417]]  Mean:  [[0.47769548]]

Comments on the result:

An Ontological Perspective for Database Tuning Heuristics

[[0.65243417]] lattes:5068302552861597#P533 lattes:8164403687403639#P612

Tun-OCM: A model-driven approach to support database tuning decision making

[[0.58288298]] lattes:8164403687403639#P656 lattes:5068302552861597#P584

The two publications within the same Lattes CV are more "similar" than the previous combination

[[0.5973676]] lattes:5068302552861597#P584 lattes:5068302552861597#P533

In [30]:
tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()

min = 10000
max = -10000
cont = 0
soma = 0
node_list = []

with result:
   for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/","lattes:").replace(">", "")
        node_list.append(node1)

for node1 in node_list:
    for node2 in node_list: 
        lattes1 = node1[node1.index(':')+1:node1.index('#')]
        lattes2 = node2[node2.index(':')+1:node2.index('#')]
        if node1 != node2 and lattes1 != lattes2:         
            try:        
                emb_node1 = embeddings[node1]
                emb_node2 = embeddings[node2]
                sim_publ = cosine_similarity([emb_node1], [emb_node2])
                if sim_publ > 0.5 : print (sim_publ, node1, node2)
                if sim_publ < min : min = sim_publ
                if sim_publ > max : max = sim_publ
                soma = soma + sim_publ
                cont = cont + 1
            except Exception as e: 
                pass
        
print ('Min: ', min, ' Max: ', max, ' Mean: ', soma/cont)
[[0.65243417]] lattes:5068302552861597#P533 lattes:8164403687403639#P612
[[0.65243417]] lattes:8164403687403639#P612 lattes:5068302552861597#P533
[[0.58288298]] lattes:8164403687403639#P656 lattes:5068302552861597#P584
[[0.58288298]] lattes:5068302552861597#P584 lattes:8164403687403639#P656
Min:  [[0.35136259]]  Max:  [[0.65243417]]  Mean:  [[0.46576804]]
In [ ]:
 

Comments

  1. I will try another way of publishing the Jupyter notebook here on the blog

    Replies
    1. I exported it as HTML and as slides and copied only the body (excluding what was between the head tags). It is easier to view this way.

  2. Professor Daniel's remarks

    Regarding related work, the following conversation took place in the group recently - I think it is worth looking at the papers:

    "This paper (https://openreview.net/forum?id=vsxYOZoPvne) at ESWC 2021 try to measure whether graph embedding can preserve graph semantics, and whether there is a difference between different graph training algorithms. It shows that RESCAL is better than TransE, and ComplEx doesn't have very good performance, which I think is consistent to some observations you have got so far.This paper (https://openreview.net/forum?id=BkxSmlBFvr) from ICLR 2020 shows that hyperparameters have a huge impact on graph embeddings, and with some tuning, RESCAL is a very competitive model. I'm not sure what metrics are used, but trying different algorithms could yield significant improvements as well.”

    - One of the key points is that the way embeddings are computed for graphs captures the notion of context differently. Part of the research is to identify (as the paper above suggests) which types of embeddings make the most sense in this case, and eventually even to adapt/propose a new type of embedding.

    In today's discussion a certain consensus emerged that, for computing similarity in a given domain, embeddings by themselves are too "coarse"; they can be used as a first filter to generate candidates, but these need further processing to arrive at "acceptable" answers.


