Playing with the embeddings¶
What can we do with the embeddings now that we have computed them? For applications like query answering or entity resolution, we need a representation where similar concepts have similar embeddings. Let's perform a small trial. We will use the customary cosine similarity metric to measure how similar two vectors are, invoking an existing function from the sklearn package in Python:
from sklearn.metrics.pairwise import cosine_similarity
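For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for toy vectors; the function name `cosine` and the sample vectors are illustrative and not part of the notebook. Note that sklearn's `cosine_similarity` returns a 2-D array, which is why the results further down print as `[[...]]`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

a = [1.0, 0.0]
b = [1.0, 1.0]
c = [0.0, 1.0]
print(cosine(a, a))  # same direction
print(cosine(a, b))  # 45 degrees apart
print(cosine(a, c))  # orthogonal
```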
Let's first load all embeddings into a key-value dictionary:
embeddings = {}
with open('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-TransE-author.tsv', 'r') as f:
    header = next(f)  # skip the header line
    for line in f:
        node1, label, embedding = line.split()
        # store the vector as floats rather than as strings
        embeddings[node1] = [float(x) for x in embedding.split(',')]
Compare the vectors of Sérgio (the author node of his own Lattes CV) with those of other professors, such as Hermann and Fernanda
emb_Fernanda_lattes = embeddings['lattes:5068302552861597#author-5068302552861597']
emb_Sergio_lattes = embeddings['lattes:8164403687403639#author-8164403687403639']
emb_Hermann_lattes = embeddings['lattes:6075905438020841#author-6075905438020841']
sim1=cosine_similarity([emb_Sergio_lattes], [emb_Sergio_lattes])
sim2=cosine_similarity([emb_Sergio_lattes], [emb_Fernanda_lattes])
sim3=cosine_similarity([emb_Sergio_lattes], [emb_Hermann_lattes])
sim4=cosine_similarity([emb_Fernanda_lattes], [emb_Hermann_lattes])
print(sim1)
print(sim2)
print(sim3)
print(sim4)
[[1.]]
[[0.63489655]]
[[0.60878335]]
[[0.54897511]]
print(len(embeddings))
2805
N x N similarity matrix
# 1. create the file
with open('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-sim-author.csv', 'w', newline='', encoding='utf-8') as f:
    for node1 in embeddings:
        emb_node1 = embeddings[node1]
        for node2 in embeddings:
            emb_node2 = embeddings[node2]
            if node1 != node2:
                sim = cosine_similarity([emb_node1], [emb_node2])
                # str(sim) writes the 1x1 result array as text, e.g. [[0.48241659]]
                line = str(node1) + ";" + str(node2) + ";" + str(sim) + '\n'
                # 2. write the line
                f.write(line)
# 3. the file is closed automatically when the with block ends
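The double loop above calls `cosine_similarity` once per ordered pair, which for 2805 nodes means 2805 × 2804 = 7,865,220 calls. A possible alternative, sketched here with NumPy on a toy stand-in for the `embeddings` dictionary (the names `toy_embeddings`, `unit` and `pairs` are illustrative), is to normalize all vectors once and obtain every pairwise cosine from a single matrix product. Calling sklearn's `cosine_similarity(matrix)` with one argument should give the same full matrix.

```python
import numpy as np

# Toy stand-in for the `embeddings` dictionary (node id -> list of floats).
toy_embeddings = {
    "n1": [1.0, 0.0],
    "n2": [0.0, 1.0],
    "n3": [1.0, 1.0],
}

nodes = list(toy_embeddings)
matrix = np.array([toy_embeddings[n] for n in nodes], dtype=float)

# Normalize each row to unit length; dot products of unit vectors are cosines.
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
sim = unit @ unit.T  # shape (N, N)

# Keep each unordered pair once (upper triangle, diagonal excluded).
rows, cols = np.triu_indices(len(nodes), k=1)
pairs = [(nodes[i], nodes[j], float(sim[i, j])) for i, j in zip(rows, cols)]
for node1, node2, value in pairs:
    print(f"{node1};{node2};{value:.6f}")
```

Because cosine similarity is symmetric, keeping only the upper triangle halves the output without losing information.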
import pandas as pd
df = pd.read_csv('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-sim-author.csv',sep=";")
df.columns=["node1", "node2", "cosine similarity"]
df.head()
index | node1 | node2 | cosine similarity |
---|---|---|---|
0 | lattes:5068302552861597#author-idm46689580673264 | lattes:5068302552861597#author-idm46689581412016 | [[0.48241659]] |
1 | lattes:5068302552861597#author-idm46689580673264 | lattes:5068302552861597#author-idm46689579763632 | [[0.57903481]] |
2 | lattes:5068302552861597#author-idm46689580673264 | lattes:5068302552861597#author-idm46689578778352 | [[0.62988191]] |
3 | lattes:5068302552861597#author-idm46689580673264 | lattes:5068302552861597#author-idm46689578018320 | [[0.21852978]] |
4 | lattes:5068302552861597#author-idm46689580673264 | lattes:5068302552861597#author-idm46689579587712 | [[0.70800659]] |
df[df['cosine similarity']==df['cosine similarity'].max()]
index | node1 | node2 | cosine similarity |
---|---|---|---|
833215 | lattes:5068302552861597#author-idm46689580002496 | lattes:5068302552861597#author-idm46689579444000 | [[9.7185601e-05]] |
1203212 | lattes:5068302552861597#author-idm46689579444000 | lattes:5068302552861597#author-idm46689580002496 | [[9.7185601e-05]] |
df.sort_values(by=['cosine similarity'])
index | node1 | node2 | cosine similarity |
---|---|---|---|
3709654 | lattes:8164403687403639#author-idm45713128004400 | lattes:6075905438020841#author-idm45437522424208 | [[-0.00010638]] |
7762793 | lattes:6075905438020841#author-idm45437522424208 | lattes:8164403687403639#author-idm45713128004400 | [[-0.00010638]] |
1266888 | lattes:5068302552861597#author-idm46689580876240 | lattes:6075905438020841#author-idm45437522348928 | [[-0.00012201]] |
6410394 | lattes:6075905438020841#author-idm45437522348928 | lattes:5068302552861597#author-idm46689580876240 | [[-0.00012201]] |
4303465 | lattes:8164403687403639#author-idm45713128248336 | lattes:6075905438020841#author-idm45437522644176 | [[-0.00012432]] |
... | ... | ... | ... |
7554819 | lattes:6075905438020841#author-idm45437522572768 | lattes:5068302552861597#author-idm46689580971456 | [[9.02537389e-05]] |
7746059 | lattes:6075905438020841#author-idm45437525981488 | lattes:8164403687403639#author-idm45713128574160 | [[9.65800708e-05]] |
3962008 | lattes:8164403687403639#author-idm45713128574160 | lattes:6075905438020841#author-idm45437525981488 | [[9.65800708e-05]] |
833215 | lattes:5068302552861597#author-idm46689580002496 | lattes:5068302552861597#author-idm46689579444000 | [[9.7185601e-05]] |
1203212 | lattes:5068302552861597#author-idm46689579444000 | lattes:5068302552861597#author-idm46689580002496 | [[9.7185601e-05]] |
7865219 rows × 3 columns
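A caveat when reading the tables above: the similarity column was written with `str(sim)`, so it holds text such as `[[0.48241659]]`, and `max()` and `sort_values` compare that text character by character. This would explain why `[[9.7185601e-05]]` is reported as the maximum and why the negative values sort first. It also seems likely that `read_csv` consumed the first data row as the header, since the file has no header line and 2805 × 2804 = 7865220 is one more than the row count shown. The sketch below reproduces the effect in plain Python with values taken from the tables; the helper `parse_similarity` is illustrative.

```python
# Values copied from the tables above, in the text form written by str(sim).
raw = ["[[0.48241659]]", "[[0.70800659]]", "[[9.7185601e-05]]", "[[-0.00010638]]"]

def parse_similarity(text):
    """Turn text such as '[[0.48241659]]' into the float 0.48241659."""
    return float(text.strip("[]"))

print(max(raw))                           # lexicographic maximum of the strings
print(max(raw, key=parse_similarity))     # numeric maximum
print(sorted(raw, key=parse_similarity))  # ascending numeric order
```

In pandas, the same conversion could be applied to the column with `df['cosine similarity'].str.strip('[]').astype(float)`, and passing `header=None` to `read_csv` would keep the first row as data.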
Connecting to AllegroGraph¶
from franz.openrdf.connect import ag_connect
from franz.openrdf.query.query import QueryLanguage
# Create a connection to an AllegroGraph repository (utility function)
with ag_connect(repo='lattes-professores-mai21', host='127.0.0.1', port='10035', user='veronica.santos', password='jupyter1') as conn:
    print(conn.size())
3894265
q = """ SELECT distinct ?s ?p ?o
{ ?s ?p ?o .
(?s ?o) fti:match 'Sergio Lifschitz'.
VALUES ?p { foaf:name foaf:citationName }.
}
ORDER BY ?o
LIMIT 10 """
query = conn.prepareTupleQuery(query=q)
query.evaluate(output=True)
s | p | o |
---|---|---|
http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713129572864 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126013456 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713129579856 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126051216 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
http://www.nima.puc-rio.br/lattes/8164403687403639#author-8164403687403639 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713129586944 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126069392 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713127349840 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
http://www.nima.puc-rio.br/lattes/8164403687403639#author-idm45713126060592 | foaf:citationName | LIFSCHITZ, S.;LIFSCHITZ, Sergio;LIFSCHITZ, SÉRGIO |
http://www.nima.puc-rio.br/lattes/1929107472151568#author-idm45772854072544 | foaf:citationName | LIFSCHITZ, SERGIO |
As can be seen, the same author has several URIs as a result of the XML-to-RDF conversion process. Could embeddings be used to identify which nodes correspond to the same individual (owl:sameAs)?¶
q = """ SELECT distinct ?s
{ ?s ?p ?o .
(?s ?o) fti:match 'Sergio Lifschitz'.
VALUES ?p { foaf:name foaf:citationName }.
} """
tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()
min_sim = 10000   # renamed from min/max to avoid shadowing the built-ins
max_sim = -10000
cont = 0
soma = 0
# Compare against the node for the "main" URI generated for each author from their own Lattes CV
node2 = 'lattes:8164403687403639#author-8164403687403639'
emb_node2 = embeddings[node2]
with result:
    for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/", "lattes:").replace(">", "")
        try:
            emb_node1 = embeddings[node1]
            if node1 != node2:
                sim_Sergio_lattes = cosine_similarity([emb_node1], [emb_node2])
                if sim_Sergio_lattes < 0.4: print(sim_Sergio_lattes, node1)
                if sim_Sergio_lattes < min_sim: min_sim = sim_Sergio_lattes
                if sim_Sergio_lattes > max_sim: max_sim = sim_Sergio_lattes
                soma = soma + sim_Sergio_lattes
                cont = cont + 1
        except KeyError:
            pass  # skip nodes that have no embedding
print('Minimum: ', min_sim, ' Maximum: ', max_sim, ' Mean: ', soma/cont)

[[0.1426685]] lattes:6075905438020841#author-idm45437522294304
[[0.30746482]] lattes:6075905438020841#author-idm45437523343888
[[0.26812012]] lattes:6075905438020841#author-idm45437522284448
[[0.24533419]] lattes:5068302552861597#author-idm46689578875248
[[0.13399867]] lattes:6075905438020841#author-idm45437522304928
[[0.25648873]] lattes:6075905438020841#author-idm45437523330960
Minimum: [[0.13399867]] Maximum: [[0.77043258]] Mean: [[0.64274928]]
q = """ SELECT distinct ?s
{ ?s ?p ?o .
(?s ?o) fti:match 'Sergio Lifschitz'.
VALUES ?p { foaf:name foaf:citationName }.
} """
tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()
min_sim = 10000   # renamed from min/max to avoid shadowing the built-ins
max_sim = -10000
cont = 0
soma = 0
# Compare against the node for a "secondary" URI generated for each author of a publication
node2 = 'lattes:8164403687403639#author-idm45713129572864'
emb_node2 = embeddings[node2]
with result:
    for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/", "lattes:").replace(">", "")
        try:
            emb_node1 = embeddings[node1]
            if node1 != node2:
                sim_Sergio_lattes = cosine_similarity([emb_node1], [emb_node2])
                if sim_Sergio_lattes < 0.4: print(sim_Sergio_lattes, node1)
                if sim_Sergio_lattes < min_sim: min_sim = sim_Sergio_lattes
                if sim_Sergio_lattes > max_sim: max_sim = sim_Sergio_lattes
                soma = soma + sim_Sergio_lattes
                cont = cont + 1
        except KeyError:
            pass  # skip nodes that have no embedding
print('Minimum: ', min_sim, ' Maximum: ', max_sim, ' Mean: ', soma/cont)

[[0.23115076]] lattes:6075905438020841#author-idm45437522294304
[[0.23160824]] lattes:6075905438020841#author-idm45437523343888
[[0.34148057]] lattes:6075905438020841#author-idm45437522284448
[[0.28484893]] lattes:5068302552861597#author-idm46689578875248
[[0.16531615]] lattes:6075905438020841#author-idm45437522304928
[[0.18799357]] lattes:6075905438020841#author-idm45437523330960
Minimum: [[0.16531615]] Maximum: [[0.96494461]] Mean: [[0.77628674]]
The same processing for publications. Each publication receives a URI based on the Lattes CV of each author¶
embeddings = {}
with open('C:\\Users\\versa\\OneDrive - puc-rio.br\\Vika\\NIMA\\kgtk\\lattes-g_emb-TransE-pub.tsv', 'r') as f:
    header = next(f)  # skip the header line
    for line in f:
        node1, label, embedding = line.split()
        # store the vector as floats rather than as strings
        embeddings[node1] = [float(x) for x in embedding.split(',')]
q = """ SELECT distinct ?s ?p ?o
{ ?s ?p ?o .
(?s ?o) fti:match 'database tuning'.
VALUES ?p { dc:title }.
}
ORDER BY ?o
"""
query = conn.prepareTupleQuery(query=q)
query.evaluate(output=True)
s | p | o |
---|---|---|
http://www.nima.puc-rio.br/lattes/5068302552861597#P533 | dc:title | An Ontological Perspective for Database Tuning Heuristics |
http://www.nima.puc-rio.br/lattes/8164403687403639#P612 | dc:title | An Ontological Perspective for Database Tuning Heuristics. |
http://www.nima.puc-rio.br/lattes/8164403687403639#P584 | dc:title | Database tuning with partial indexes |
http://www.nima.puc-rio.br/lattes/8164403687403639#P656 | dc:title | Tun-OCM: A model-driven approach to support database tuning decision making |
http://www.nima.puc-rio.br/lattes/5068302552861597#P584 | dc:title | Tun-OCM: A model-driven approach to support database tuning decision making |
tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()
min_sim = 10000   # renamed from min/max to avoid shadowing the built-ins
max_sim = -10000
cont = 0
soma = 0
node_list = []
with result:
    for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/", "lattes:").replace(">", "")
        node_list.append(node1)
for node1 in node_list:
    for node2 in node_list:
        if node1 != node2:
            try:
                emb_node1 = embeddings[node1]
                emb_node2 = embeddings[node2]
                sim_publ = cosine_similarity([emb_node1], [emb_node2])
                if sim_publ > 0.5: print(sim_publ, node1, node2)
                if sim_publ < min_sim: min_sim = sim_publ
                if sim_publ > max_sim: max_sim = sim_publ
                soma = soma + sim_publ
                cont = cont + 1
            except KeyError:
                pass  # skip nodes that have no embedding
print('Minimum: ', min_sim, ' Maximum: ', max_sim, ' Mean: ', soma/cont)

[[0.65243417]] lattes:5068302552861597#P533 lattes:8164403687403639#P612
[[0.5973676]] lattes:5068302552861597#P533 lattes:5068302552861597#P584
[[0.65243417]] lattes:8164403687403639#P612 lattes:5068302552861597#P533
[[0.58288298]] lattes:8164403687403639#P656 lattes:5068302552861597#P584
[[0.5973676]] lattes:5068302552861597#P584 lattes:5068302552861597#P533
[[0.58288298]] lattes:5068302552861597#P584 lattes:8164403687403639#P656
Minimum: [[0.35136259]] Maximum: [[0.65243417]] Mean: [[0.47769548]]
Comments on the result:
An Ontological Perspective for Database Tuning Heuristics
[[0.65243417]] lattes:5068302552861597#P533 lattes:8164403687403639#P612
Tun-OCM: A model-driven approach to support database tuning decision making
[[0.58288298]] lattes:8164403687403639#P656 lattes:5068302552861597#P584
The two publications within the same Lattes CV are more "similar" than the previous pair
[[0.5973676]] lattes:5068302552861597#P584 lattes:5068302552861597#P533
tuple_query = conn.prepareTupleQuery(QueryLanguage.SPARQL, q)
result = tuple_query.evaluate()
min_sim = 10000   # renamed from min/max to avoid shadowing the built-ins
max_sim = -10000
cont = 0
soma = 0
node_list = []
with result:
    for binding_set in result:
        s = str(binding_set.getValue("s"))
        node1 = s.replace("<http://www.nima.puc-rio.br/lattes/", "lattes:").replace(">", "")
        node_list.append(node1)
for node1 in node_list:
    for node2 in node_list:
        # the Lattes CV identifier sits between ':' and '#'
        lattes1 = node1[node1.index(':')+1:node1.index('#')]
        lattes2 = node2[node2.index(':')+1:node2.index('#')]
        # compare only publications that come from different CVs
        if node1 != node2 and lattes1 != lattes2:
            try:
                emb_node1 = embeddings[node1]
                emb_node2 = embeddings[node2]
                sim_publ = cosine_similarity([emb_node1], [emb_node2])
                if sim_publ > 0.5: print(sim_publ, node1, node2)
                if sim_publ < min_sim: min_sim = sim_publ
                if sim_publ > max_sim: max_sim = sim_publ
                soma = soma + sim_publ
                cont = cont + 1
            except KeyError:
                pass  # skip nodes that have no embedding
print('Minimum: ', min_sim, ' Maximum: ', max_sim, ' Mean: ', soma/cont)

[[0.65243417]] lattes:5068302552861597#P533 lattes:8164403687403639#P612
[[0.65243417]] lattes:8164403687403639#P612 lattes:5068302552861597#P533
[[0.58288298]] lattes:8164403687403639#P656 lattes:5068302552861597#P584
[[0.58288298]] lattes:5068302552861597#P584 lattes:8164403687403639#P656
Minimum: [[0.35136259]] Maximum: [[0.65243417]] Mean: [[0.46576804]]
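Each pair appears twice in the output above because the nested loops visit both (node1, node2) and (node2, node1). As a minimal sketch, `itertools.combinations` yields every unordered pair once; the helper `lattes_id` is an illustrative name, and the node list is copied from the query result shown earlier.

```python
from itertools import combinations

# Publication nodes returned by the 'database tuning' query above.
node_list = [
    "lattes:5068302552861597#P533",
    "lattes:8164403687403639#P612",
    "lattes:8164403687403639#P584",
    "lattes:8164403687403639#P656",
    "lattes:5068302552861597#P584",
]

def lattes_id(node):
    """Return the CV identifier found between 'lattes:' and '#'."""
    return node.split(":", 1)[1].split("#", 1)[0]

# Unordered pairs of publications that come from different Lattes CVs.
cross_cv_pairs = [
    (a, b) for a, b in combinations(node_list, 2)
    if lattes_id(a) != lattes_id(b)
]
for a, b in cross_cv_pairs:
    print(a, b)
```

Note that halving the pairs leaves the mean unchanged, since every similarity value was simply being counted twice.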
I'll try another way of publishing the Jupyter notebook here on the blog.
I exported it as HTML and as slides and copied only the body (I left out what was between the head tags). It is easier to view now.
Comments from Professor Daniel
Regarding related work, the following conversation took place in the group recently; I think the papers are worth a look:
"This paper (https://openreview.net/forum?id=vsxYOZoPvne) at ESWC 2021 try to measure whether graph embedding can preserve graph semantics, and whether there is a difference between different graph training algorithms. It shows that RESCAL is better than TransE, and ComplEx doesn't have very good performance, which I think is consistent to some observations you have got so far. This paper (https://openreview.net/forum?id=BkxSmlBFvr) from ICLR 2020 shows that hyperparameters have a huge impact on graph embeddings, and with some tuning, RESCAL is a very competitive model. I'm not sure what metrics are used, but trying different algorithms could yield significant improvements as well."
- One of the key points is that different ways of computing graph embeddings capture the notion of context differently. Part of the research is to identify (as the paper above suggests) which types of embeddings make the most sense in this case, and possibly to adapt or propose a new type of embedding.
In today's discussion a certain consensus emerged that, for computing similarity in a given domain, embeddings on their own are too "coarse"; they can be used as a first filter to generate candidates, but these need further processing to reach "acceptable" answers.
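The "first filter, then further processing" idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the 2-D vectors are made up, the 0.9 threshold is arbitrary, the name "SILVA, Maria" is invented, and comparing accent-stripped, case-folded citation names is only one possible second stage, not a method proposed in the discussion.

```python
import math
import unicodedata

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def normalize(name):
    """Strip accents and case so that 'SÉRGIO' and 'Sergio' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

# Toy records: (node id, citation name, made-up embedding).
records = [
    ("a1", "LIFSCHITZ, Sergio", [0.9, 0.1]),
    ("a2", "LIFSCHITZ, SÉRGIO", [0.8, 0.3]),
    ("a3", "SILVA, Maria", [0.85, 0.2]),
    ("a4", "LIFSCHITZ, Sergio", [-0.2, 1.0]),
]
THRESHOLD = 0.9  # hypothetical cut-off for the embedding filter

reference = records[0]
# Stage 1: embeddings generate candidates.
candidates = [r for r in records[1:] if cosine(r[2], reference[2]) >= THRESHOLD]
# Stage 2: candidates are checked with an additional, domain-specific test.
matches = [r[0] for r in candidates if normalize(r[1]) == normalize(reference[1])]
print([r[0] for r in candidates], matches)
```

In this toy data the second stage discards a candidate that is close in embedding space but has a different name, while the first stage misses a node with the same name and a distant embedding. That second case mirrors the low-similarity nodes printed in the author experiments above.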