Jupyter Notebook - Embeddings I

KGTK + Embeddings slides

Playing with the embeddings

What can we do with the embeddings now that we have computed them? For applications like query answering or entity resolution, we need a representation where similar concepts have similar embeddings. Let's perform a small trial. We will use cosine similarity, the customary metric, to measure how similar two vectors are. We invoke an existing function from the sklearn package in Python:

In [4]:
from sklearn.metrics.pairwise import cosine_similarity
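
As a quick illustration of what this metric computes, here is a dependency-free sketch of the textbook formula cos(u, v) = u·v / (|u| |v|). The helper name `cosine` and the toy vectors are assumptions made for this example; they are not taken from the Lattes embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors for illustration only
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])                # opposite directions
print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the vectors, not on their length. sklearn's `cosine_similarity` computes the same quantity but expects 2-D inputs and returns a 2-D array, which is why the results further down print as `[[...]]`.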

Let's first load all embeddings into a key-value dictionary:

In [5]:
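embeddings = {}
with open('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-TransE-author.tsv', 'r') as f:
    header = next(f)  # skip the header row
    for line in f:
        node1, label, embedding = line.split()
        # Assumption: the third column holds the vector as comma-separated floats
        embeddings[node1] = [float(x) for x in embedding.split(',')]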

Compare the vectors of Sérgio (as the author of his own Lattes CV) with those of other professors such as Hermann and Fernanda

In [6]:
emb_Fernanda_lattes = embeddings['lattes:5068302552861597#author-5068302552861597']
emb_Sergio_lattes = embeddings['lattes:8164403687403639#author-8164403687403639']
emb_Hermann_lattes = embeddings['lattes:6075905438020841#author-6075905438020841']
In [7]:
sim1=cosine_similarity([emb_Sergio_lattes], [emb_Sergio_lattes])
sim2=cosine_similarity([emb_Sergio_lattes], [emb_Fernanda_lattes])
sim3=cosine_similarity([emb_Sergio_lattes], [emb_Hermann_lattes])
sim4=cosine_similarity([emb_Fernanda_lattes], [emb_Hermann_lattes])

N x N similarity matrix

In [7]:
import os

# 1. create the file
f = open('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-sim-author.csv', 'w', newline='', encoding='utf-8')
In [8]:
for node1 in embeddings:
    emb_node1 = embeddings[node1]
    for node2 in embeddings:
        emb_node2 = embeddings[node2]
        if node1 != node2:
            sim = cosine_similarity([emb_node1], [emb_node2])
            # 2. write the rows (str(sim) stores the 1x1 array as text, e.g. "[[0.48241659]]")
            line = str(node1) + ";" + str(node2) + ";" + str(sim) + '\n'
            f.write(line)
In [9]:
# 3. close the file
f.close()
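
The double loop above computes every pair twice (cosine similarity is symmetric) and stores each score as the text of a 1x1 array. A minimal sketch of an alternative visits each unordered pair once with `itertools.combinations` and writes a plain float with the `csv` module. The `embeddings` dictionary below holds made-up vectors, `cosine` is a hypothetical helper, and an in-memory buffer stands in for the output file.

```python
import csv
import io
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up embeddings standing in for the real dictionary
embeddings = {
    "lattes:A#author-1": [1.0, 0.0],
    "lattes:A#author-2": [1.0, 1.0],
    "lattes:B#author-3": [0.0, 1.0],
}

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer, delimiter=";")
writer.writerow(["node1", "node2", "cosine similarity"])
for node1, node2 in combinations(embeddings, 2):
    # Each unordered pair is visited exactly once
    writer.writerow([node1, node2, cosine(embeddings[node1], embeddings[node2])])

# Read the rows back to check what was written
rows = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter=";"))
print(len(rows) - 1)  # number of unordered pairs
```

For N nodes this writes N(N-1)/2 rows instead of N(N-1), and the similarity column can be read back as a number without further cleaning.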
In [22]:
import pandas as pd
df = pd.read_csv('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-sim-author.csv',sep=";")
df.columns=["node1", "node2", "cosine similarity"]
df.head()
node1 node2 cosine similarity
0 lattes:5068302552861597#author-idm46689580673264 lattes:5068302552861597#author-idm46689581412016 [[0.48241659]]
1 lattes:5068302552861597#author-idm46689580673264 lattes:5068302552861597#author-idm46689579763632 [[0.57903481]]
2 lattes:5068302552861597#author-idm46689580673264 lattes:5068302552861597#author-idm46689578778352 [[0.62988191]]
3 lattes:5068302552861597#author-idm46689580673264 lattes:5068302552861597#author-idm46689578018320 [[0.21852978]]
4 lattes:5068302552861597#author-idm46689580673264 lattes:5068302552861597#author-idm46689579587712 [[0.70800659]]
In [11]:
df[df['cosine similarity']==df['cosine similarity'].max()]
node1 node2 cosine similarity
833215 lattes:5068302552861597#author-idm46689580002496 lattes:5068302552861597#author-idm46689579444000 [[9.7185601e-05]]
1203212 lattes:5068302552861597#author-idm46689579444000 lattes:5068302552861597#author-idm46689580002496 [[9.7185601e-05]]
In [12]:
df.sort_values(by=['cosine similarity'])
node1 node2 cosine similarity
3709654 lattes:8164403687403639#author-idm45713128004400 lattes:6075905438020841#author-idm45437522424208 [[-0.00010638]]
7762793 lattes:6075905438020841#author-idm45437522424208 lattes:8164403687403639#author-idm45713128004400 [[-0.00010638]]
1266888 lattes:5068302552861597#author-idm46689580876240 lattes:6075905438020841#author-idm45437522348928 [[-0.00012201]]
6410394 lattes:6075905438020841#author-idm45437522348928 lattes:5068302552861597#author-idm46689580876240 [[-0.00012201]]
4303465 lattes:8164403687403639#author-idm45713128248336 lattes:6075905438020841#author-idm45437522644176 [[-0.00012432]]
... ... ... ...
7554819 lattes:6075905438020841#author-idm45437522572768 lattes:5068302552861597#author-idm46689580971456 [[9.02537389e-05]]
7746059 lattes:6075905438020841#author-idm45437525981488 lattes:8164403687403639#author-idm45713128574160 [[9.65800708e-05]]
3962008 lattes:8164403687403639#author-idm45713128574160 lattes:6075905438020841#author-idm45437525981488 [[9.65800708e-05]]
833215 lattes:5068302552861597#author-idm46689580002496 lattes:5068302552861597#author-idm46689579444000 [[9.7185601e-05]]
1203212 lattes:5068302552861597#author-idm46689579444000 lattes:5068302552861597#author-idm46689580002496 [[9.7185601e-05]]

7865219 rows × 3 columns
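
Note that the scores in these tables appear to be the text form of 1x1 arrays (for example `[[9.7185601e-05]]`). If so, `max()` and `sort_values` compare them as strings rather than as numbers, which would explain why values close to zero rank above scores such as 0.7. The sketch below uses four values copied from the outputs above and a hypothetical helper `to_float` to show the difference once the brackets are stripped.

```python
# Values copied from the outputs above, as they appear in the CSV column
raw = ["[[0.48241659]]", "[[0.70800659]]", "[[9.7185601e-05]]", "[[-0.00010638]]"]

def to_float(cell):
    """Strip the array brackets written by str(sim) and parse the number."""
    return float(cell.strip("[]"))

text_max = max(raw)                   # lexicographic comparison of strings
numeric_max = max(raw, key=to_float)  # comparison by numeric value
print(text_max, numeric_max)
```

With pandas, the same conversion could be applied to the whole column with `df['cosine similarity'].str.strip('[]').astype(float)` before calling `max()` or `sort_values`.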

Connecting to AllegroGraph

In [2]:
from franz.openrdf.connect import ag_connect

from franz.openrdf.query.query import QueryLanguage

# Create a connection to an AllegroGraph repository. Utility Function

with ag_connect(repo='lattes-professores-mai21', host='', port='10035', user='veronica.santos', password='jupyter1') as conn:
        print (conn.size())
In [9]:
q = """ SELECT distinct ?s ?p ?o
          { ?s ?p ?o . 
           (?s ?o) fti:match 'Sergio Lifschitz'.
           VALUES ?p { foaf:name foaf:citationName }.           
          }
           ORDER BY ?o 
           LIMIT 10 """

query = conn.prepareTupleQuery(query=q)
query.evaluate(output=True)  # print the result as a table
| s                                                                           | p                 | o                                                 |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713129572864 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126013456 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713129579856 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126051216 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-8164403687403639  | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713129586944 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126069392 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713127349840 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126060592 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
| http://www.nima.puc-rio.br/lattes/1929107472151568#author-idm45772854072544 | foaf:citationName | LIFSCHITZ, SERGIO                                 |

As can be seen, the same author has several URIs as a result of the XML-to-RDF conversion process. Would it be possible to use embeddings to identify which nodes correspond to the same individual (owl:sameAs)?
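
One simple way to explore this question is to treat every node whose embedding is close enough to a reference node as an owl:sameAs candidate. The sketch below is only an illustration: the helper names `cosine` and `same_as_candidates`, the toy vectors and node names, and the 0.4 cut-off are all assumptions, not results from the real graph.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def same_as_candidates(reference, candidates, embeddings, threshold):
    """Return (node, score) pairs whose similarity to `reference` reaches `threshold`."""
    ref = embeddings[reference]
    scored = []
    for node in candidates:
        # Skip the reference itself and nodes that have no embedding
        if node == reference or node not in embeddings:
            continue
        score = cosine(ref, embeddings[node])
        if score >= threshold:
            scored.append((node, score))
    # Most similar candidates first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up vectors and node names, for illustration only
embeddings = {
    "lattes:X#author-main": [1.0, 0.0, 0.0],
    "lattes:X#author-idm1": [0.9, 0.1, 0.0],
    "lattes:Y#author-idm2": [0.1, 1.0, 0.0],
}
found = same_as_candidates("lattes:X#author-main", list(embeddings), embeddings, threshold=0.4)
print(found)
```

A threshold alone is a coarse criterion, so the pairs it returns are best read as candidates to be checked by other means rather than as confirmed matches.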

In [12]:
q = """ SELECT distinct ?s 
          { ?s ?p ?o . 
           (?s ?o) fti:match 'Sergio Lifschitz'.
           VALUES ?p { foaf:name foaf:citationName }.           
           } """

tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()

min = 10000
max = -10000
cont = 0
soma = 0

# Compare against the node for the "main" URI generated for each author from their own Lattes CV
node2 = 'lattes:8164403687403639#author-8164403687403639'
emb_node2 = embeddings[node2]

with result:
    for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/","lattes:").replace(">", "")
        try:
            emb_node1 = embeddings[node1]
            if node1 != node2:
                sim_Sergio_lattes = cosine_similarity([emb_node1], [emb_node2])
                if sim_Sergio_lattes < 0.4 : print (sim_Sergio_lattes, node1)
                if sim_Sergio_lattes < min : min = sim_Sergio_lattes
                if sim_Sergio_lattes > max : max = sim_Sergio_lattes
                soma = soma + sim_Sergio_lattes
                cont = cont + 1
        except Exception as e:
            pass  # skip URIs that have no embedding in the dictionary
print ('Minimum: ', min, ' Maximum: ', max, ' Mean: ', soma/cont)
[[0.1426685]] lattes:6075905438020841#author-idm45437522294304
[[0.30746482]] lattes:6075905438020841#author-idm45437523343888
[[0.26812012]] lattes:6075905438020841#author-idm45437522284448
[[0.24533419]] lattes:5068302552861597#author-idm46689578875248
[[0.13399867]] lattes:6075905438020841#author-idm45437522304928
[[0.25648873]] lattes:6075905438020841#author-idm45437523330960
Minimum:  [[0.13399867]]  Maximum:  [[0.77043258]]  Mean:  [[0.64274928]]
In [13]:
q = """ SELECT distinct ?s 
          { ?s ?p ?o . 
           (?s ?o) fti:match 'Sergio Lifschitz'.
           VALUES ?p { foaf:name foaf:citationName }.           
           } """

tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()

min = 10000
max = -10000
cont = 0
soma = 0

# Compare against the node for a "secondary" URI generated for each author of a publication
node2 = 'lattes:8164403687403639#author-idm45713129572864'
emb_node2 = embeddings[node2]

with result:
    for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/","lattes:").replace(">", "")
        try:
            emb_node1 = embeddings[node1]
            if node1 != node2:
                sim_Sergio_lattes = cosine_similarity([emb_node1], [emb_node2])
                if sim_Sergio_lattes < 0.4 : print (sim_Sergio_lattes, node1)
                if sim_Sergio_lattes < min : min = sim_Sergio_lattes
                if sim_Sergio_lattes > max : max = sim_Sergio_lattes
                soma = soma + sim_Sergio_lattes
                cont = cont + 1
        except Exception as e:
            pass  # skip URIs that have no embedding in the dictionary
print ('Minimum: ', min, ' Maximum: ', max, ' Mean: ', soma/cont)
[[0.23115076]] lattes:6075905438020841#author-idm45437522294304
[[0.23160824]] lattes:6075905438020841#author-idm45437523343888
[[0.34148057]] lattes:6075905438020841#author-idm45437522284448
[[0.28484893]] lattes:5068302552861597#author-idm46689578875248
[[0.16531615]] lattes:6075905438020841#author-idm45437522304928
[[0.18799357]] lattes:6075905438020841#author-idm45437523330960
Minimum:  [[0.16531615]]  Maximum:  [[0.96494461]]  Mean:  [[0.77628674]]

The same processing applied to publications. Each publication receives a URI based on the Lattes CV of each author

In [14]:
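embeddings = {}  # assumption: start a fresh dictionary holding only publication embeddings
with open('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-TransE-pub.tsv', 'r') as f:
    header = next(f)  # skip the header row
    for line in f:
        node1, label, embedding = line.split()
        # Assumption: the third column holds the vector as comma-separated floats
        embeddings[node1] = [float(x) for x in embedding.split(',')]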
In [24]:
q = """ SELECT distinct ?s ?p ?o
          { ?s ?p ?o . 
           (?s ?o) fti:match 'database tuning'.
           VALUES ?p { dc:title }.           
          }
           ORDER BY ?o """

query = conn.prepareTupleQuery(query=q)
query.evaluate(output=True)  # print the result as a table
| s                                                       | p        | o                                                                           |
| http://www.nima.puc-rio.br/lattes/5068302552861597#P533 | dc:title | An Ontological Perspective for Database Tuning Heuristics                   |
| http://www.nima.puc-rio.br/lattes/8164403687403639#P612 | dc:title | An Ontological Perspective for Database Tuning Heuristics.                  |
| http://www.nima.puc-rio.br/lattes/8164403687403639#P584 | dc:title | Database tuning with partial indexes                                        |
| http://www.nima.puc-rio.br/lattes/8164403687403639#P656 | dc:title | Tun-OCM: A model-driven approach to support database tuning decision making |
| http://www.nima.puc-rio.br/lattes/5068302552861597#P584 | dc:title | Tun-OCM: A model-driven approach to support database tuning decision making |
In [26]:
tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()

min = 10000
max = -10000
cont = 0
soma = 0
node_list = []

with result:
    for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/","lattes:").replace(">", "")
        node_list.append(node1)  # collect the publication nodes returned by the query

for node1 in node_list:
    for node2 in node_list:
        if node1 != node2:
            try:
                emb_node1 = embeddings[node1]
                emb_node2 = embeddings[node2]
                sim_publ = cosine_similarity([emb_node1], [emb_node2])
                if sim_publ > 0.5 : print (sim_publ, node1, node2)
                if sim_publ < min : min = sim_publ
                if sim_publ > max : max = sim_publ
                soma = soma + sim_publ
                cont = cont + 1
            except Exception as e:
                pass  # skip nodes that have no embedding in the dictionary
print ('Minimum: ', min, ' Maximum: ', max, ' Mean: ', soma/cont)
[[0.65243417]] lattes:5068302552861597#P533 lattes:8164403687403639#P612
[[0.5973676]] lattes:5068302552861597#P533 lattes:5068302552861597#P584
[[0.65243417]] lattes:8164403687403639#P612 lattes:5068302552861597#P533
[[0.58288298]] lattes:8164403687403639#P656 lattes:5068302552861597#P584
[[0.5973676]] lattes:5068302552861597#P584 lattes:5068302552861597#P533
[[0.58288298]] lattes:5068302552861597#P584 lattes:8164403687403639#P656
Minimum:  [[0.35136259]]  Maximum:  [[0.65243417]]  Mean:  [[0.47769548]]

Comments on the result:

An Ontological Perspective for Database Tuning Heuristics

[[0.65243417]] lattes:5068302552861597#P533 lattes:8164403687403639#P612

Tun-OCM: A model-driven approach to support database tuning decision making

[[0.58288298]] lattes:8164403687403639#P656 lattes:5068302552861597#P584

The two publications within the same Lattes CV are more "similar" than the previous pair

[[0.5973676]] lattes:5068302552861597#P584 lattes:5068302552861597#P533

In [30]:
tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()

min = 10000
max = -10000
cont = 0
soma = 0
node_list = []

with result:
    for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/","lattes:").replace(">", "")
        node_list.append(node1)  # collect the publication nodes returned by the query

for node1 in node_list:
    for node2 in node_list:
        # Lattes CV identifier: the part between 'lattes:' and '#'
        lattes1 = node1[node1.index(':')+1:node1.index('#')]
        lattes2 = node2[node2.index(':')+1:node2.index('#')]
        if node1 != node2 and lattes1 != lattes2:
            try:
                emb_node1 = embeddings[node1]
                emb_node2 = embeddings[node2]
                sim_publ = cosine_similarity([emb_node1], [emb_node2])
                if sim_publ > 0.5 : print (sim_publ, node1, node2)
                if sim_publ < min : min = sim_publ
                if sim_publ > max : max = sim_publ
                soma = soma + sim_publ
                cont = cont + 1
            except Exception as e:
                pass  # skip nodes that have no embedding in the dictionary
print ('Minimum: ', min, ' Maximum: ', max, ' Mean: ', soma/cont)
[[0.65243417]] lattes:5068302552861597#P533 lattes:8164403687403639#P612
[[0.65243417]] lattes:8164403687403639#P612 lattes:5068302552861597#P533
[[0.58288298]] lattes:8164403687403639#P656 lattes:5068302552861597#P584
[[0.58288298]] lattes:5068302552861597#P584 lattes:8164403687403639#P656
Minimum:  [[0.35136259]]  Maximum:  [[0.65243417]]  Mean:  [[0.46576804]]
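
Each pair appears twice in this output because the nested loop visits both (a, b) and (b, a). A possible refinement, sketched below, moves the CV-identifier extraction into a helper and uses `itertools.combinations` so that each cross-CV pair is produced once. The helper name `lattes_id` is hypothetical; the node names are the five publication URIs returned by the 'database tuning' query, in the abbreviated `lattes:` form.

```python
from itertools import combinations

def lattes_id(node):
    """Return the CV identifier between 'lattes:' and '#'."""
    return node.split(":", 1)[1].split("#", 1)[0]

# Publication nodes returned by the 'database tuning' query
node_list = [
    "lattes:5068302552861597#P533",
    "lattes:8164403687403639#P612",
    "lattes:8164403687403639#P584",
    "lattes:8164403687403639#P656",
    "lattes:5068302552861597#P584",
]

# Keep only unordered pairs whose publications come from different Lattes CVs
cross_cv_pairs = [
    (a, b) for a, b in combinations(node_list, 2) if lattes_id(a) != lattes_id(b)
]
print(len(cross_cv_pairs))
```

With two publications from one CV and three from the other, this leaves six pairs to score instead of the twelve ordered pairs visited by the nested loop.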


  1. I will try another way of publishing the Jupyter notebook here on the blog

    1. I exported it as HTML and slides and copied only the body (excluding what was between the head tags). It is easier to view this way

  2. Remarks from Professor Daniel

    Regarding related work, the following conversation took place in the group recently; I think the papers are worth a look:

    "This paper (https://openreview.net/forum?id=vsxYOZoPvne) at ESWC 2021 try to measure whether graph embedding can preserve graph semantics, and whether there is a difference between different graph training algorithms. It shows that RESCAL is better than TransE, and ComplEx doesn't have very good performance, which I think is consistent to some observations you have got so far. This paper (https://openreview.net/forum?id=BkxSmlBFvr) from ICLR 2020 shows that hyperparameters have a huge impact on graph embeddings, and with some tuning, RESCAL is a very competitive model. I'm not sure what metrics are used, but trying different algorithms could yield significant improvements as well."

    - One of the key points is that each way of computing embeddings for graphs captures the notion of context differently. Part of the research is to identify (as the paper above suggests) which types of embeddings make the most sense in this case, and eventually even to adapt or propose a new type of embedding.

    In today's discussion a certain consensus emerged that, for computing similarity in a given domain, embeddings on their own are too "coarse"; they can be used as a first filter to generate candidates, but these need further processing to reach "acceptable" answers.
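
The idea of using embeddings as a first filter and then refining the candidates could be sketched as follows. The titles and embedding similarities are copied from the notebook outputs above. The second stage (a title comparison with `difflib.SequenceMatcher`), the helper name `title_similarity`, and the 0.9 cut-off are assumptions chosen for illustration, not something proposed in the discussion.

```python
from difflib import SequenceMatcher

# Titles and embedding similarities copied from the notebook outputs above
titles = {
    "lattes:5068302552861597#P533": "An Ontological Perspective for Database Tuning Heuristics",
    "lattes:8164403687403639#P612": "An Ontological Perspective for Database Tuning Heuristics.",
    "lattes:8164403687403639#P656": "Tun-OCM: A model-driven approach to support database tuning decision making",
    "lattes:5068302552861597#P584": "Tun-OCM: A model-driven approach to support database tuning decision making",
}
embedding_sim = {
    ("lattes:5068302552861597#P533", "lattes:8164403687403639#P612"): 0.65243417,
    ("lattes:8164403687403639#P656", "lattes:5068302552861597#P584"): 0.58288298,
    ("lattes:5068302552861597#P584", "lattes:5068302552861597#P533"): 0.5973676,
}

def title_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Stage 1: coarse filter on embedding similarity (same 0.5 cut-off as the notebook)
candidates = [pair for pair, sim in embedding_sim.items() if sim > 0.5]

# Stage 2: refine the candidates with a domain-specific check (assumed cut-off of 0.9)
accepted = [
    (a, b) for a, b in candidates
    if title_similarity(titles[a], titles[b]) >= 0.9
]
print(accepted)
```

All three pairs pass the embedding filter, but only the two pairs with matching titles survive the second stage; the pair of different publications from the same Lattes CV is dropped even though its embedding score (0.597) is higher than that of one true duplicate (0.583).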


