edX @ From Graph to Knowledge Graph – Algorithms and Applications / Knowledge Graph Fundamentals and Construction
Microsoft: DAT278x
From Graph to Knowledge Graph – Algorithms and Applications
Module 4: Knowledge Graph Fundamentals and Construction
Knowledge graph fundamentals
A brief history of knowledge graph
If a knowledge graph is knowledge represented in graph form, the earliest, primitive form of such a representation is the existential graph (Charles Sanders Peirce, late 1800s).
"The Existential Graphs of Charles S. Perice": a book published in 1973 by Don Roberts.
Semantic network (name coined in 1956 by Richard Richens, an early researcher in computational linguistics)
The semantic network was created at that time for the task of machine translation of natural language.
An example of a semantic network is WordNet. It groups English words into sets of synonyms, each with a short general definition, and records various semantic relations between the synonym sets. Example relations include: A is part of B, B contains A, A is the same as B, or A is the opposite of B.
Psychological experiments indicate that humans organize their knowledge of concepts in a hierarchical fashion, or graph structure.
Linked Data (name coined in 2006 by Tim Berners-Lee)
It is a common framework to publish and share data across different applications and domains. Linked Data, according to Berners-Lee, follows three very simple principles:
(1) All conceptual things have names that start with HTTP.
(2) These HTTP names can be used to look up and retrieve everything related to them.
(3) The retrieved content includes not only attributes but also relations to other HTTP names.
Examples: DBpedia, WikiData
Expert systems (1970s)
In the AI domain, an expert system is a computer system that tries to emulate the decision-making ability of human experts. It is composed of two subsystems, a Knowledge Base and an Inference Engine. The Knowledge Base contains facts and rules, while the Inference Engine applies the rules to the known facts to infer, or deduce, new facts.
The KB is composed of facts (assertional knowledge) and rules; inference discovers new facts from the known facts using the defined rules.
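As a rough illustration, below is a minimal sketch of the two subsystems, with made-up facts and rules and a very simplified forward-chaining inference engine (not from the course material):

```python
# Knowledge Base: facts as (subject, predicate, object) plus simple if-then rules.
facts = {("Socrates", "is_a", "human")}
# Each rule: if some X has (predicate, object), then X also has (new_predicate, new_object).
rules = [
    (("is_a", "human"),  ("is_a", "mortal")),
    (("is_a", "mortal"), ("will", "die")),
]

def infer(facts, rules):
    """Inference Engine: forward chaining, apply rules until no new fact is deduced."""
    changed = True
    while changed:
        changed = False
        for (cond_p, cond_o), (new_p, new_o) in rules:
            for subj, pred, obj in list(facts):
                if pred == cond_p and obj == cond_o:
                    new_fact = (subj, new_p, new_o)
                    if new_fact not in facts:
                        facts.add(new_fact)
                        changed = True
    return facts

print(infer(facts, rules))
# -> also contains ('Socrates', 'is_a', 'mortal') and ('Socrates', 'will', 'die')
```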
Knowledge Graph (by Google in 2012) - Google’s knowledge base
The entity information boxes, powered by the Knowledge Graph, were added to Google's search results in 2012.
Knowledge graph - Ontology (information science)
Here, an ontology means a set of concepts and categories in a certain subject area, together with their properties and the relations between them.
The terms knowledge graph and knowledge base (KB) are mostly interchangeable in many circumstances. In this course, we use the term knowledge graph to emphasize its strong relation with graph theory and its definition as knowledge in graph form.
Knowledge graph representation
- Graph data structure: adjacency matrix (a square matrix indicating whether two nodes in the graph are connected or not) / adjacency list
- Database: relational tables with an explicit schema (SQL; the schema defines the data contents, such as column names and types) vs. schema-less (NoSQL; implicit schema)
- Resource Description Framework (RDF by W3C)
• Subject, Predicate, Object (S,P,O) = triples
• “schema” + “data” = metadata data model
Types of entities and relations are defined in machine-understandable dictionaries, called ontologies. The standard ontology language is the Web Ontology Language, or OWL.
RDF and OWL together enable an efficient and effective approach to represent knowledge. In short, RDF provides a graph-based data model to describe the objects, while OWL offers a standard way of defining vocabularies for data annotation.
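Below is a minimal sketch of RDF triples using the rdflib Python library; the example namespace, entities, and values are assumptions for illustration, not part of the course material:

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")  # hypothetical example namespace

g = Graph()
# Attribute triple: (subject, predicate, literal value)
g.add((EX.Seattle, EX.population, Literal(737015)))  # example value
# Relation triple: (subject, predicate, another entity)
g.add((EX.Seattle, EX.locatedIn, EX.Washington))
# Type information; a companion OWL ontology would define EX.City and the predicates
g.add((EX.Seattle, RDF.type, EX.City))

print(g.serialize(format="turtle"))
```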
Knowledge graph logical representation
Adjacency matrix
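A minimal Python sketch, using a hypothetical four-node graph, contrasting the adjacency matrix with the adjacency list mentioned earlier:

```python
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("C", "D")]

# Adjacency matrix: |V| x |V| matrix; entry [i][j] is 1 if nodes i and j are connected.
index = {n: i for i, n in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1  # undirected graph

# Adjacency list: each node maps to the list of its neighbors.
adj_list = {n: [] for n in nodes}
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)

print(matrix[index["A"]][index["C"]])  # 1 -> A and C are connected
print(adj_list["C"])                   # ['A', 'D']
```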
In relational model
If it is a homogeneous graph, meaning a single type of node and a single type of connection between nodes, then we have one table for the nodes and one table for the edges. When it comes to a heterogeneous graph, usually with multiple types of nodes and relations (edges), we need multiple tables to represent both the nodes and the relations. Such a representation is highly dependent on the data and applications, and it is hard to generalize the schema.
Homogeneous vs. Heterogeneous
• Homogeneous: nodes in 1 table, edges in 1 table
• Heterogeneous: nodes in multiple tables, edges in multiple tables
In RDF
A family of specifications; a metadata data model; ontologies in OWL for schema definition.
(Subject, Predicate, Object)
• Each node has a universal ID (S)
• Its attributes are represented as: (S, attributeName (P), attributeValue (O))
• An edge connecting two nodes (e.g., S1 and S2) is represented as: (S1, relationName (P), S2 (O))
Node + Edge: SINGLE table
Hence, regardless of the number of types of nodes and edges, a single table is enough to represent the knowledge graph. However, a companion ontology in OWL is required to define the types and details of the entities and relations.
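To make the single-table point concrete, here is a minimal sketch where one list of (S, P, O) rows stands in for that table; the entities and relation names are hypothetical:

```python
# One "table" of (S, P, O) rows holds attributes and edges for every node and relation type.
triples = [
    ("city:Seattle", "hasPopulation", 737015),              # attribute
    ("city:Seattle", "locatedIn",     "state:Washington"),  # edge to another node
    ("person:Alice", "bornIn",        "city:Seattle"),      # different node/relation types,
    ("person:Alice", "hasAge",        34),                  # same single table
]

# "Query": all facts about Seattle, regardless of their type.
seattle_facts = [(p, o) for s, p, o in triples if s == "city:Seattle"]
print(seattle_facts)  # [('hasPopulation', 737015), ('locatedIn', 'state:Washington')]
```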
Pros:
• Universal
• Simple
• Schema-less in form (the "schema" is defined in the ontology) = it is schema-less on the surface, meaning there is no explicit schema defining the table columns. However, as with every schema-less approach, an implicit schema is still required, and here it is defined by the ontology.
Cons:
• Can become very complex for "simple / direct" relationships
KG construction overview
Sources: human knowledge (manual), structured databases (ETL), documents (unstructured; NLP)
Challenges (extracting knowledge from documents):
• Incomplete -> the information in the documents may not be complete. It is quite likely that some hidden knowledge or hidden assumptions are not explicitly articulated in the documents.
• Inconsistent -> machines are not reading a single document; they read at the scale of thousands or even millions of documents. It is unavoidable that, across different documents, or sometimes even within the same document, there are contradictions or inconsistent statements about some fact. Which statement should we trust, and which should we ignore or discard?
• Ambiguous -> some statements in natural language are ambiguous, with multiple interpretations. Depending on the context, which interpretation should the machine choose?
We can classify the problems into supervised, semi-supervised, and unsupervised ones.
The head section of the data: human-labeled data are collected to guide the machine to learn. A lot of human effort is involved and it is slow, but the precision is very high. With more labeled data available, which is more expensive and requires more human effort, it produces very good quality, high-precision results.
The torso section of the data usually has a larger number of distinct values, with less than or comparable coverage to the head scenarios and use cases. Here it is more common to use semi-supervised learning: with some human-labeled data, the computers first learn how to label more data, and then learn from the labeled data.
The last part is the long tail, which is very sparse, has a huge number of distinct values, and covers the tail end of the use cases. Unsupervised learning is applied here, with minimal human labeling intervention: machines learn and generate with minimal supervision. It is fast and has great coverage, but tends to be very noisy and low-quality.
In reality, we need to evaluate the application scenarios and the nature of the data at hand to best design an appropriate solution for each case.
NLP Fundamentals – Information Extraction
NLP is part of Artificial Intelligence.
How can we leverage machine power and intelligence to turn such unstructured, noisy, and huge collections of documents into a more structured, clean data set that can be indexed, searched, and understood by humans easily, and that further facilitates human decision-making?
Tasks: syntax, semantics, discourse, speech
Sentence level:
1) Part-of-speech (POS) tagging: determine, for each word of a sentence, which part of speech it is. For example, is it a noun, a verb, or an adjective?
2) Named-entity recognition: identify entity mentions in text and classify them into predefined types.
3) Dependency parsing: focuses on the relationships between words in the sentence, marking things like primary objects and predicates.
Document level: understanding what a pronoun refers to, or which noun or name it matches; such a task is called coreference resolution.
"It", "he", or "she": who is being referenced in a sentence depends on the previous sentences/clauses.
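A minimal sketch of the three sentence-level tasks using the spaCy library; the en_core_web_sm model and the example sentence are assumptions, not part of the course material:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Microsoft was founded by Bill Gates in Albuquerque.")

# 1) Part-of-speech tagging: one tag per token
print([(token.text, token.pos_) for token in doc])

# 2) Named-entity recognition: detected mentions and their predicted types
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. ('Microsoft', 'ORG')

# 3) Dependency parsing: each word, its syntactic role, and its head word
print([(token.text, token.dep_, token.head.text) for token in doc])
```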
In NLP, Information Extraction (IE) is the task of automatically extracting structured information from unstructured or semi-structured documents. The main tasks involved in IE include, but are not limited to, the three below:
• Named Entity Recognition (WHO)
• Entity Linking (WHO / WHAT)
• Relation Extraction (HOW)
These three tasks are also the key pieces that are used for knowledge graph construction.
How to identify / recognize entities (nodes) = Named Entity Recognition
Named Entity Recognition, or NER, is the task of identifying entity mentions in a given text and then classifying them into a predefined set of types, such as person, organization, date, or number. In short, NER has two key steps: detect the mentions and identify the type of each mention.
Types can be persons, organizations, places, numbers, dates, etc.
Two encoding methods: IO (inside-outside) and IOB (inside-outside-beginning). A B- prefix before a tag indicates that the token is the beginning of a chunk, while an I- prefix indicates it is inside a chunk.
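A minimal sketch contrasting the two encodings on a made-up sentence:

```python
tokens = ["Bill", "Gates", "founded", "Microsoft", "in", "1975", "."]

# IO: each token is either Inside an entity of some type, or Outside any entity.
io_tags  = ["I-PER", "I-PER", "O", "I-ORG", "O", "I-DATE", "O"]

# IOB: the first token of a chunk gets a B- prefix, later tokens an I- prefix,
# which makes it possible to separate two adjacent entities of the same type.
iob_tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-DATE", "O"]

for tok, io, iob in zip(tokens, io_tags, iob_tags):
    print(f"{tok:10s} IO={io:7s} IOB={iob}")
```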
Traditional machine learning sequence models can be used to solve NER. They use various features extracted from the documents (feature engineering).
Algorithms
• Naïve Bayes (NB)
• Hidden Markov Model (HMM)
• Conditional Random Field (CRF)
Recently, with the rise in popularity of deep learning approaches, researchers have also applied LSTMs (Long Short-Term Memory networks) and CNNs (Convolutional Neural Networks) together to the NER problem.
Entity Linking
It basically tries to link free-text mentions to entities. The text can be of any format or nature: formal publications such as scientific articles, news documents, blog posts, tweets, or search engine queries. It is the natural next step after NER. The entities usually come from a knowledge base such as Freebase or Wikipedia.
• Enable Semantic Search experience
• Used for Knowledge Graph population
• Used as feature for improving:
- Classification
- Retrieval
- Question and answering
- Semantic similarity
The entity linking task usually follows the three main steps below:
• Select candidate entity links (mention detection): determine which phrases are linkable; we call such phrases mentions.
• Generate candidate entities/links for each mention; these may include NILs (null values, i.e., no target in the KB).
• Use the "context" to disambiguate/filter/improve.
There are two measures used to disambiguate entity candidates. The first is called commonness: the commonness of a sense is defined by the number of times it is used as a link destination in Wikipedia. The second measure is called relatedness: it describes how a given candidate word sense relates to its surrounding context. The relatedness of a candidate word sense is a weighted average of its relatedness to each context article.
It is not merely syntactic: it starts from the syntactic match and then analyzes the terms surrounding that word/expression in order to understand the context.
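A minimal sketch of the commonness measure, using made-up anchor-link counts and normalizing them to a fraction so candidates can be compared (the anchor text "apple" and the counts are hypothetical):

```python
from collections import Counter

# How many times the anchor text "apple" links to each Wikipedia article (assumed counts).
link_counts = Counter({
    "Apple_Inc.": 9000,
    "Apple_(fruit)": 3000,
    "Apple_Records": 500,
})
total = sum(link_counts.values())

def commonness(sense: str) -> float:
    """Fraction of times the anchor text points at this sense as its link destination."""
    return link_counts[sense] / total

for sense in link_counts:
    print(sense, round(commonness(sense), 3))
# Relatedness would then re-rank these candidates using the surrounding context articles.
```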
How to obtain semantic relationships (edges) between entities = Relation extraction
How do we discover and build the relations between entities? This is what we call relation extraction: for a given sentence, how can we extract the semantic relationship between the mentioned entity pairs?
• Undefined vs pre-defined set of relations
• Binary vs multiple relations
• Supervised vs unsupervised vs distant-supervision
When can we apply the bootstrapping approach? When we do not have enough annotated text for training, but we do have some seed instances of the relation and a lot of unannotated text or documents. In this sense, the bootstrapping approach can be considered semi-supervised learning.
It uses a search tool to identify patterns between two known entities.
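A minimal sketch of the bootstrapping loop, with a made-up corpus and a single seed pair for a capital-of relation (the pattern matching is deliberately simplistic):

```python
import re

corpus = [
    "Paris is the capital of France.",
    "Tokyo is the capital of Japan.",
    "Madrid is the capital of Spain.",
]
seed_pairs = {("Paris", "France")}      # known instances of the relation
patterns, extracted = set(), set(seed_pairs)

for _ in range(2):                      # a couple of bootstrapping iterations
    # 1) Use known pairs to discover the text patterns that connect them.
    for e1, e2 in list(extracted):
        for sent in corpus:
            if e1 in sent and e2 in sent:
                patterns.add(sent.split(e1)[1].split(e2)[0].strip())  # "is the capital of"
    # 2) Use the discovered patterns to extract new pairs from unannotated text.
    for sent in corpus:
        for pat in patterns:
            m = re.search(rf"(\w+) {re.escape(pat)} (\w+)", sent)
            if m:
                extracted.add((m.group(1), m.group(2)))

print(extracted)  # the seed plus (Tokyo, Japan) and (Madrid, Spain)
```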
Supervised Relation Extraction
• Utilizing labels of relation mentions
• Traditional relation extraction datasets (ACE 2004, MUC-7, Biomedical datasets)
• Learn classifiers (SVM, multiclass logistic regression, Naïve Bayes) from those positive and negative examples (see the sketch after the pros and cons below)
Well-known datasets with labeled relations are used to train the algorithms/models.
Typical features
• Bags of words & bigrams between, before, and after the entities
• POS tags
• The types of the entities
• Dependency path between entities
• Distance between entities
• Tree distance between the entities
• NER tags
Pros
• Higher accuracy
• Explicit negative examples (sample 1% of unrelated pairs of entities for roughly balanced data)
Cons
• Very expensive to label data
• Doesn’t generalize well to different relations
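A minimal sketch of such a supervised relation classifier, using a handful of the typical features above with scikit-learn; the tiny training set and relation names are made up for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each example: hand-crafted features for one entity pair in one sentence, plus its relation label.
examples = [
    ({"words_between": "founded", "e1_type": "PER", "e2_type": "ORG", "distance": 1}, "founder_of"),
    ({"words_between": "was born in", "e1_type": "PER", "e2_type": "LOC", "distance": 3}, "born_in"),
    ({"words_between": "met with", "e1_type": "PER", "e2_type": "PER", "distance": 2}, "no_relation"),
]

features, labels = zip(*examples)
vec = DictVectorizer()                  # one-hot encodes the string-valued features
X = vec.fit_transform(features)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Classify a new entity pair from its extracted features.
test = vec.transform([{"words_between": "founded", "e1_type": "PER", "e2_type": "ORG", "distance": 2}])
print(clf.predict(test))        # e.g. ['founder_of']
print(clf.predict_proba(test))  # confidence score per relation
```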
Distantly Supervised Relation Extraction
The KB must be large and rich. Together with the KB, the unannotated text corpus should generate a good training base.
• Collecting training data: text corpus + knowledge graph (Freebase fact triples) -> training data (the label is the KB relation between X and Y; the features come from the sentences that mention both X and Y)
With all of this positive and negative training data collected, we can feed it into a multiclass logistic regression solver to train a relation classifier. The classifier takes entity pairs and their feature vectors as input, and returns a relation name and a confidence score, i.e., the probability that the entity pair belongs to the returned relation.
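A minimal sketch of the distant-supervision data collection step, with a made-up KB and corpus standing in for Freebase and a real text collection:

```python
kb_triples = {
    ("Bill Gates", "founder_of", "Microsoft"),
    ("Barack Obama", "born_in", "Honolulu"),
}
corpus = [
    "Bill Gates started Microsoft in his garage.",
    "Barack Obama was born in Honolulu in 1961.",
    "Bill Gates visited Honolulu last summer.",   # pair has no KB relation -> candidate negative
]

training_data = []
for sentence in corpus:
    for e1, relation, e2 in kb_triples:
        if e1 in sentence and e2 in sentence:
            # Any sentence mentioning both entities of a KB triple is labeled with that relation.
            training_data.append((sentence, e1, e2, relation))

# Sentences whose entity pair has no KB relation can be sampled as "no_relation" negatives.
for row in training_data:
    print(row)
```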
Pros
• Can scale since no supervision required
• Leverage rich and reliable data from knowledge base
• Leverage unlimited amounts of text data
• Can generalize to different domains
Cons
• Needs high quality entity matching