
Microsoft: DAT278x
From Graph to Knowledge Graph – Algorithms and Applications

Module 4: Knowledge Graph Fundamentals and Construction

Knowledge graph fundamentals

A brief history of knowledge graph

If we take a knowledge graph to be knowledge represented in graph form, the earliest, primitive form of such representation is the existential graph, introduced by Charles Sanders Peirce in the late 1800s.

"The Existential Graphs of Charles S. Perice": a book published in 1973 by Don Roberts.

Semantic network (term coined in 1956 by Richard Richens, an early researcher in computational linguistics)

The semantic network was created for the natural-language machine translation tasks of that time.

An example of a semantic network is WordNet. It groups English words into sets of synonyms, each with a short general definition, and records various semantic relations between the synonym sets. Example relations include: A is part of B, B contains A, A is the same as B, or A is the opposite of B.

Psychological experiments indicate that humans organize their knowledge of concepts in a hierarchical fashion, i.e., a graph structure.

Linked Data (term coined in 2006 by Tim Berners-Lee)

It is a common framework to publish and share data across different applications and domains. Linked Data, according to Berners-Lee, follows three very simple principles:
(1) all conceptual things have a name that starts with http;
(2) these http names can be used to look up and retrieve everything related to them;
(3) the retrieved content includes not only attributes but also relations to other http names.

Examples: DBpedia, WikiData
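As an illustration of the three principles, the sketch below (not from the course) dereferences a DBpedia resource URI over HTTP and asks for its RDF description; the specific URI and the use of the requests library are assumptions for illustration only.

# Sketch: dereferencing a Linked Data URI (assumed example, public DBpedia).
import requests

uri = "http://dbpedia.org/resource/Tim_Berners-Lee"  # an http name for a conceptual thing
resp = requests.get(uri, headers={"Accept": "text/turtle"},  # ask for RDF instead of HTML
                    allow_redirects=True, timeout=30)
print(resp.status_code)
# The returned Turtle contains both attributes of the resource and links
# (relations) to other http names, matching the three Linked Data principles.
print(resp.text[:500])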

Expert systems (1970s)

In the AI domain, an expert system is a computer system that tries to emulate the decision-making ability of human experts. It is composed of two sub-systems, a Knowledge Base and an Inference Engine. The Knowledge Base contains facts and rules, while the Inference Engine applies the rules to the known facts to infer, or deduce, new facts.

The KB is composed of facts (assertional knowledge) and rules; inference discovers new facts from the known facts using the defined rules.
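A minimal, purely illustrative sketch of the KB + inference-engine idea (a toy forward-chaining loop, not how any production expert system is implemented): facts are triples, rules are simple if-then patterns, and the engine applies rules to known facts until nothing new can be deduced.

# Toy forward-chaining inference: facts + rules -> new facts (illustrative sketch).
facts = {("Socrates", "is_a", "human")}
# Rule: if (?x, is_a, human) then (?x, is, mortal)
rules = [(("is_a", "human"), ("is", "mortal"))]

changed = True
while changed:
    changed = False
    for (cond_p, cond_o), (new_p, new_o) in rules:
        for s, p, o in list(facts):
            if p == cond_p and o == cond_o and (s, new_p, new_o) not in facts:
                facts.add((s, new_p, new_o))  # deduce a new fact
                changed = True

print(facts)  # now also contains ("Socrates", "is", "mortal")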

Knowledge Graph (by Google in 2012) - Google’s knowledge base

The entity information boxes, powered by the Knowledge Graph, were added to Google's search results in 2012.

Knowledge graph - Ontology (information science)

Here, ontology means a set of concepts and categories in a certain subject area, together with their properties and the relations between them.



The terms knowledge graph and knowledge base (KB) are mostly interchangeable in many circumstances. In this course, we use the term knowledge graph to emphasize its strong relation with graph theory and its definition as knowledge in graph form.


Knowledge graph representation

- Graph data structure: adjacency matrix (a square matrix indicating whether two nodes in the graph are connected or not) / adjacency list
- Database: relational table with an explicit schema (SQL; the schema defines the data contents, such as their names and types) vs. schema-less (NoSQL, with an implicit schema)
- Resource Description Framework (RDF by W3C)
• Subject, Predicate, Object (S,P,O) = triples
• “schema” + “data” = metadata data model

The types of entities and relations are defined in machine-understandable dictionaries, called ontologies. The standard ontology language is the Web Ontology Language, or OWL.
Together, RDF and OWL enable an efficient and effective approach to represent knowledge. In short, RDF provides a graph-based data model to describe the objects, while OWL offers a standard way of defining vocabularies for data annotation.

Knowledge graph logical representation

Adjacency matrix
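A small sketch (illustrative, not from the course) of the two classic in-memory representations, adjacency matrix and adjacency list, for the same tiny graph:

# Tiny undirected graph with nodes 0..3 and edges (0,1), (0,2), (2,3).
edges = [(0, 1), (0, 2), (2, 3)]
n = 4
# Adjacency matrix: a square n x n matrix; entry [i][j] = 1 iff i and j are connected.
matrix = [[0] * n for _ in range(n)]
for i, j in edges:
    matrix[i][j] = matrix[j][i] = 1
# Adjacency list: for each node, the list of its neighbors.
adj_list = {i: [] for i in range(n)}
for i, j in edges:
    adj_list[i].append(j)
    adj_list[j].append(i)
print(matrix)    # [[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
print(adj_list)  # {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}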

In relational model

If it is a homogeneous graph, meaning we have a single type of node and a single type of connection between nodes, then we have one table for nodes and one table for edges. For a heterogeneous graph, usually with multiple types of nodes and relations (edges), we need multiple tables to represent both the nodes and the relations. Such a representation is highly dependent on the data and the application, and the schema is hard to generalize.

Homogeneous: nodes in 1 table, edges in 1 table
Heterogeneous: nodes in multiple tables, edges in multiple tables
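A sketch of the homogeneous case (one node table, one edge table), assuming an in-memory SQLite database with made-up column names:

# Sketch: a homogeneous graph stored in a relational model (SQLite, illustrative schema).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
cur.executemany("INSERT INTO nodes VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
cur.execute("INSERT INTO edges VALUES (1, 2)")
# A heterogeneous graph would instead need one table per node type and per
# relation type (e.g. persons, companies, works_at, located_in, ...).
for row in cur.execute(
        "SELECT a.name, b.name FROM edges e "
        "JOIN nodes a ON a.id = e.src JOIN nodes b ON b.id = e.dst"):
    print(row)  # ('Alice', 'Bob')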

In RDF 

A family of specifications; a metadata data model; ontologies in OWL for schema definition.

(Subject, Predicate, Object)
• Each node has a universal id (S)
• Its attributes are represented as: (S, attributeName (P), attributeValue (O))
• An edge connecting two nodes (e.g., S1, S2) is represented as: (S1, relationName (P), S2 (O))

Node + Edge: SINGLE table

Hence, regardless of the number of types of nodes and edges, a single table is enough to represent the knowledge graph. However, a companion ontology in OWL is required to define the types and details of the entities and relations.
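A minimal sketch of the (S, P, O) idea using the rdflib library (one option among several; the namespace and triples below are invented for illustration):

# Sketch: attributes and edges both stored as (S, P, O) triples in one graph.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
# Attribute of a node: (S, attributeName, attributeValue)
g.add((EX.Seattle, EX.population, Literal(737015)))
# Edge between two nodes: (S1, relationName, S2)
g.add((EX.Bill_Gates, EX.bornIn, EX.Seattle))
# Regardless of how many node/edge types we have, it is all one triple "table".
for s, p, o in g:
    print(s, p, o)
print(g.serialize(format="turtle"))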



Pros:
• Universal
• Simple
• Schema-less in form ("schema" defined in the ontology) = it is schema-less on the surface, meaning there is no explicit schema defining table columns. However, as with every schema-less approach, an implicit schema is still required, and here it is defined by the ontology.

Cons:
• Can become very complex even for "simple / direct" relationships

KG construction overview

Sources: human knowledge (manual), structured databases (ETL), documents (unstructured – NLP)

Challenges (extracting knowledge from documents):
Incomplete -> the information in the documents may not be complete. It is quite likely that some hidden knowledge or hidden assumptions are not explicitly articulated in the documents.
Inconsistent -> machines are not reading a single document; they read on the scale of thousands or even millions of documents. It is unavoidable that, across different documents, or sometimes even within the same document, there are contradictions or inconsistent statements about some fact. Which statement should we trust, and what should we ignore or discard?
Ambiguous -> some language statements can be ambiguous, with multiple interpretations. Depending on the context, which interpretation should the machine choose?


We can classify the problems into supervised, semi-supervised, and unsupervised ones. 

For the head section of the data (the most frequent values and use cases), human-labeled data are collected to guide the machine to learn. This involves a lot of human effort and is slow, but yields very high precision: the more labeled data available (at higher cost and more human effort), the better the quality and precision of the results.
The torso section of the data usually has a larger number of distinct values, with less or comparable coverage to the head scenarios and use cases. Here it is more common to use semi-supervised learning: some human-labeled data lets the computer first learn how to label more data, and then learn from that larger labeled set.
The last part is the long tail, which is very sparse, has a huge number of distinct values, and covers the tail spectrum of the use cases. Unsupervised learning is applied here, with minimal human labeling intervention: machines learn and generate with minimal supervision. This is fast and has great coverage, but tends to be noisy and low quality.

In reality, we need to evaluate the application scenarios and the nature of the data at hand to best design an appropriate solution for each case.

NLP Fundamentals – Information Extraction

NLP is part of Artificial Intelligence.

How can we leverage machine power and intelligence to turn such unstructured, noisy, and huge collections of documents into a more structured, clean data set that can be indexed, searched, and understood by humans easily, and that further facilitates human decision making?

Tasks: syntax, semantics, discourse, speech

Sentence level:
1) Part-of-speech (POS) tagging: determine, for each word of a sentence, which part of speech it is. For example, is it a noun, a verb, or an adjective?
2) Named-entity recognition: identify entity mentions in text and classify them into predefined types.
3) Dependency parsing: focuses on the relationships between the words in a sentence, marking things like primary objects and predicates. (A short sketch of all three tasks follows below.)
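A short sketch of the three sentence-level tasks using spaCy (assuming the small English model en_core_web_sm has been downloaded; the example sentence is invented):

# Sketch: POS tagging, NER and dependency parsing with spaCy
# (assumes `python -m spacy download en_core_web_sm` has been run).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Microsoft was founded by Bill Gates in 1975.")

for token in doc:
    # POS tag, dependency label, and the head word this token depends on
    print(token.text, token.pos_, token.dep_, token.head.text)

for ent in doc.ents:
    # Named entities with their predefined types (ORG, PERSON, DATE, ...)
    print(ent.text, ent.label_)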



Document level: understanding what a pronoun refers to, or which noun or name it matches; this task is called coreference resolution.

It, He, or She: who is being referred to in a sentence depends on the previous sentences/clauses.

In NLP, Information Extraction (IE) is the task of automatically extracting structured information from unstructured or semi-structured documents. The main tasks involved in IE include, but are not limited to, the three below:

• Named Entity Recognition (WHO)
• Entity Linking (WHO / WHAT)
• Relation Extraction (HOW)

These three tasks are also the key pieces that are used for knowledge graph construction.

How to identify / recognize entities (nodes) = Named Entity Recognition

Named Entity Recognition, or NER, is the task of identifying entity mentions in a given text and classifying them into a predefined set of types, such as person, organization, date, or number. In short, NER has two key steps: detect mentions and identify the type of each mention.

Types can be people, organizations, places, numbers, dates, etc.

Two encoding methods: IO (inside, outside) and IOB (inside, outside, beginning). A B- prefix before a tag indicates that the tag is the beginning of a chunk, while an I- prefix indicates it is inside a chunk.
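A small illustration of the two encodings for an invented sentence (the tag names are illustrative), plus a helper that recovers entity chunks from IOB tags:

# Sketch: IO vs IOB encoding of named-entity chunks (illustrative tags).
tokens = ["Bill", "Gates", "founded", "Microsoft", "."]
io_tags  = ["I-PER", "I-PER", "O", "I-ORG", "O"]   # IO: inside / outside only
iob_tags = ["B-PER", "I-PER", "O", "B-ORG", "O"]   # IOB: B- marks chunk beginnings

def iob_to_spans(tokens, tags):
    """Recover (entity_text, type) chunks from IOB tags."""
    spans, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        spans.append((" ".join(current), etype))
    return spans

print(iob_to_spans(tokens, iob_tags))  # [('Bill Gates', 'PER'), ('Microsoft', 'ORG')]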

Traditional machine learning sequence models can be used to solve NER. They use various features extracted from the documents (feature engineering); a sketch of such features follows the algorithm list below.

Algorithms
• Naïve Bayes (NB)
• Hidden Markov Model (HMM)
• Conditional Random Field (CRF)
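The feature-engineering step mentioned above could look like the sketch below: a minimal, illustrative set of per-token features (the feature names are made up) of the kind these sequence models typically consume.

# Sketch: simple per-token features for a traditional NER sequence model.
def token_features(tokens, i):
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization is a strong NER cue
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Bill", "Gates", "founded", "Microsoft", "."]
print(token_features(tokens, 0))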

Recently, with the rise in popularity of deep learning approaches, researchers have also applied LSTMs (Long Short-Term Memory networks) and CNNs (Convolutional Neural Networks), often together, to the NER problem.

Entity Linking

Entity linking basically tries to link free-text mentions to entities. The text can be of any format or nature: formal publications such as scientific articles, news documents, blog posts, tweets, or search engine queries. It is the natural next step after NER. The entities usually come from a knowledge base such as Freebase or Wikipedia.

• Enables a semantic search experience
• Used for knowledge graph population
• Used as a feature for improving:
  • Classification
  • Retrieval
  • Question answering
  • Semantic similarity


The entity linking task usually follows these three main steps:

• Select candidate entity links (mention detection): determine which phrases are linkable, or we can call such phrases as mentions.
• Generate candidate entities/Link; May include NILs (null values, i.e., no target in KB)
• Use “context” to disambiguate/filter/improve

There are two measures to disambiguate entity candidates. The first is called commonness: the commonness of a sense is defined by the number of times it is used as a link destination in Wikipedia. The second measure used to disambiguate a candidate is called relatedness. It describes how a given candidate's word sense relates to its surrounding context; the relatedness of a candidate word sense is a weighted average of its relatedness to each context article.
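A sketch of the commonness measure, assuming we have pre-computed counts of how often a given anchor text links to each Wikipedia article (the counts below are invented):

# Sketch: commonness of a sense = fraction of times the mention links to it.
from collections import Counter

# Invented counts: how often the anchor text "Java" points to each article.
link_counts = Counter({"Java_(programming_language)": 9000,
                       "Java_(island)": 800,
                       "Java_(coffee)": 200})

def commonness(sense, counts):
    return counts[sense] / sum(counts.values())

print(commonness("Java_(programming_language)", link_counts))  # 0.9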

It is not merely syntactic: it starts from the syntactic match and then analyzes the terms surrounding that word/expression to understand the context.

How to obtain semantic relationships (edges) between entities = Relation extraction

How do we discover and build the relations between entities? This is what we call relation extraction: for a given sentence, how can we extract the semantic relationship between the mentioned entity pairs?

• Undefined vs pre-defined set of relations
• Binary vs multiple relations
• Supervised vs unsupervised vs distant-supervision

When can we apply the bootstrapping approach? When we do not have enough annotated text for training, but we do have some seed instances of the relation and a lot of unannotated text or documents. In this sense, the bootstrapping approach can be considered semi-supervised learning.

It uses a search engine to identify patterns between two known entities.
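A toy sketch of the bootstrapping loop (seed pairs, invented corpus sentences, and naive string patterns; purely illustrative):

# Sketch: bootstrapping relation extraction from seed pairs (toy example).
corpus = [
    "Paris is the capital of France.",
    "Tokyo is the capital of Japan.",
    "Berlin, the capital of Germany, is large.",
]
seeds = {("Paris", "France")}  # seed instances of the capital-of relation

# 1) Learn the text pattern between the seed entities.
patterns = set()
for x, y in seeds:
    for sent in corpus:
        if x in sent and y in sent:
            between = sent.split(x, 1)[1].split(y, 1)[0].strip()
            patterns.add(between)  # e.g. "is the capital of"

# 2) Apply the learned patterns to discover new entity pairs
#    (sentences that do not contain the pattern contribute nothing).
new_pairs = set()
for sent in corpus:
    for pat in patterns:
        if pat and pat in sent:
            left, right = sent.split(pat, 1)
            new_pairs.add((left.strip(" ,.").split()[-1],
                           right.strip(" ,.").split()[0]))

print(patterns)   # {'is the capital of'}
print(new_pairs)  # {('Paris', 'France'), ('Tokyo', 'Japan')}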

Supervised Relation Extraction

• Utilizing labels of relation mentions
• Traditional relation extraction datasets (ACE 2004, MUC-7, Biomedical datasets)
• Learn classifiers (SVM, Multiclass logistic regression, Naïve Bayes) from those positive and negative examples

Well-known datasets with labeled relations, used to train the algorithms/models.

Typical features
• Bags of words & bigrams between, before, and after the entities
• POS tags
• The types of the entities
• Dependency path between entities
• Distance between entities
• Tree distance between the entities
• NER tags

Pros
• Higher accuracy
• Explicit negative examples (sample 1% of unrelated pairs of entities for roughly balanced data)
Cons
• Very expensive to label data
• Doesn’t generalize well to different relations

Distantly Supervised Relation Extraction

The KB must be large/rich. Together with the KB, the unannotated text corpus should yield a good training set.

• Collecting training data: text corpus + knowledge graph (Freebase fact triples) -> training data (for an entity pair X, Y, the label is the KB relation between X and Y; the features come from the sentences mentioning X and Y)

With all this positive and negative training data collected, we can feed it into a multiclass logistic regression solver to train a relation classifier. The classifier takes entity pairs and their feature vectors as input and returns a relation name and a confidence score, i.e., the probability that the entity pair belongs to the returned relation.
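A compressed sketch of that training step using scikit-learn (the feature strings and labels below are invented; the KB alignment that would produce them is omitted):

# Sketch: training a relation classifier on distantly supervised examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Each example: the textual context between an entity pair (the "features"),
# and the relation assigned to that pair by aligning the corpus with the KB.
contexts = ["was born in", "is the capital of", "grew up in", "no relation words"]
labels   = ["born_in",     "capital_of",        "born_in",    "NONE"]

vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(contexts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Predict a relation name plus a confidence score for a new entity-pair context.
x_new = vec.transform(["was born in"])
pred = clf.predict(x_new)[0]
conf = clf.predict_proba(x_new).max()
print(pred, round(conf, 3))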

Pros
• Can scale since no supervision required
• Leverage rich and reliable data from knowledge base
• Leverage unlimited amounts of text data
• Can generalize to different domains
Cons
• Needs high quality entity matching
