
Knowledge graphs: Introduction, history, and perspectives - Article Reading

Chaudhri, V. K., C. Baru, N. Chittar, X. L. Dong, M. Genesereth, J. Hendler, A. Kalyanpur, D. Lenat, J. Sequeda, D. Vrandečić, and K. Wang. 2022. “Knowledge graphs: Introduction, history, and perspectives.” AI Magazine 43: 17–29. https://doi.org/10.1002/aaai.12033

Knowledge graphs (KGs) have emerged as a compelling abstraction for organizing the world’s structured knowledge and for integrating information extracted from multiple data sources.

KNOWLEDGE GRAPH DEFINITION

A KG is a directed labeled graph in which domain-specific meanings are associated with nodes and edges.

[Definition focused on HOW to represent, unlike KBs]

There are multiple approaches for associating meanings with the nodes and edges. At the simplest level, the meanings could be stated as documentation strings expressed in a human understandable language such as English. At a computational level, the meanings can be expressed in a formal specification language such as first-order logic. An active area of current research is to automatically compute the meanings captured in a vector consisting of a sequence of numbers.

[Natural language labels allow search with natural language queries. Could the semantics be represented by GQL queries?]
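
To make the three levels concrete, here is a minimal Python sketch (all field names and values are illustrative, not a standard API) of a node carrying a documentation string, a formal axiom, and a learned vector:

    # Three levels of meaning for a node, as a plain Python structure.
    node = {
        "id": "ex:Winterthur",               # illustrative identifier
        "label": "Winterthur",
        # 1. Documentation string in a human-understandable language:
        "description": "A city in the canton of Zurich, Switzerland.",
        # 2. Formal specification, e.g. a first-order logic axiom as text:
        "axiom": "City(Winterthur) and locatedIn(Winterthur, Zurich)",
        # 3. Automatically computed vector representation (embedding):
        "embedding": [0.12, -0.48, 0.33, 0.91],  # toy 4-dimensional vector
    }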

Information can be added to a KG via a combination of human-driven, semiautomated, and/or fully automated methods. Regardless of the method, it is expected that the recorded information can be easily understood and verified by humans.

[Understandable by humans as well as by machines]

Directed labeled graph representation and graph algorithms are effective for several classes of problems. They are, however, insufficient to capture all inferences of interest. We will discuss this in more detail in a later section on big semantics versus little semantics.

[Why is the RDF model insufficient?]

Edge properties can be used for a variety of purposes: to represent facts that are in dispute (for example, a country in which Kashmir resides); highly time-dependent information (for example, the president of USA); or genuine diversities (for example, user behaviors). With the recent emphasis on responsible AI, annotating the edges with information on how they were obtained plays a key role in explaining inferences based on the KG. For example, an edge property of confidence could be used to represent the probability with which that relationship is known to be true.

[Context]
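
A small Python sketch of such edge properties, assuming a property-graph style of annotation via networkx edge attributes (the sources and confidence values are invented for illustration):

    import networkx as nx

    # MultiDiGraph allows several labeled edges between the same pair of nodes.
    kg = nx.MultiDiGraph()

    # Edge properties record provenance and confidence (values illustrative).
    kg.add_edge("Kashmir", "India", relation="locatedIn",
                source="provider_A", confidence=0.6)
    kg.add_edge("Kashmir", "Pakistan", relation="locatedIn",
                source="provider_B", confidence=0.4)

    # Inferences can be explained and filtered using these annotations.
    for _, country, props in kg.out_edges("Kashmir", data=True):
        if props["confidence"] >= 0.5:
            print(country, props)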

Organizing open information

Wikidata includes information from several independent providers including, for example, the Library of Congress. By using unique internal identifiers for distinct entities, for example, Winterthur, from a variety of sources, such as, the Library of Congress and others, the information about an entity can be easily linked together. Wikidata makes it easy to integrate the different data sources by publishing a mapping of the Wikidata relations to the schema.org ontology.

[Ontology mapping]
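
A toy Python sketch of applying such a published mapping (the particular property correspondences shown are assumptions for illustration):

    # Assumed correspondences for illustration; Wikidata publishes the
    # actual mapping of its relations to the schema.org ontology.
    wikidata_to_schema_org = {
        "P569": "schema:birthDate",
        "P571": "schema:foundingDate",
        "P856": "schema:url",
    }

    triples = [("wd:Q42", "P569", "1952-03-11")]

    # Rewrite Wikidata relations into the shared schema.org vocabulary.
    mapped = [(s, wikidata_to_schema_org.get(p, p), o) for s, p, o in triples]
    print(mapped)  # [('wd:Q42', 'schema:birthDate', '1952-03-11')]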

Search engines are routinely using the results of such queries to enhance their results ... As per a recent estimate, 31% of all websites and over 12 million data providers are currently using the vocabulary of schema.org to publish annotations to their web pages.

Second, even though it is manually curated, the cost of curation is shared by a community of contributors. Third, while some of the data in Wikidata may be automatically extracted from sources (Wu, Hoffmann, and Weld 2008), all information is required to be easily understandable and verifiable as per the Wikidata editorial policies. Lastly, and importantly, there is a commitment to providing semantic definitions of relation names through the vocabulary in schema.org.

A recent example of another openly accessible KG is from the Data Commons effort whose goal is to make publicly available data readily accessible and usable. Data Commons performs the necessary cleaning and joining of data from a variety of publicly available government and other authoritative data sources and provides access to the resulting KG. It currently incorporates data on demographics (US Census, Eurostat), economics (World Bank, Bureau of Labor Statistics, Bureau of Economic Analysis), health (World Health Organization, Center for Disease Control), climate (Intergovernmental Panel on Climate Change, National Oceanic and Atmospheric Administration), and sustainability.

[Integration of open public data]

Organizing enterprise information

A “360-degree view” of a customer of a company includes the data about that customer from within the company and the data about the customer from sources outside the company.

This often requires solving the entity disambiguation problem to uniquely identify the entities in question.

[How to disambiguate entities that exist in relational DBs and in open KBs]
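
A minimal Python sketch of the matching step behind entity disambiguation (records, weights, and threshold are invented for illustration), combining name similarity with agreement on a strong attribute:

    from difflib import SequenceMatcher

    def same_entity(rec_a, rec_b, threshold=0.8):
        """Score two customer records; weights are illustrative."""
        name_sim = SequenceMatcher(None, rec_a["name"].lower(),
                                   rec_b["name"].lower()).ratio()
        # Exact match on a strong attribute (e.g., email) boosts the score.
        email_match = 1.0 if rec_a.get("email") == rec_b.get("email") else 0.0
        score = 0.6 * name_sim + 0.4 * email_match
        return score >= threshold

    crm = {"name": "J. Smith", "email": "jsmith@example.com"}
    web = {"name": "John Smith", "email": "jsmith@example.com"}
    print(same_entity(crm, web))  # True: both records denote one entity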

the visual nature of the graph-oriented KG schemas facilitates whiteboarding of the schemas by the business users and subject matter experts in specifying their requirements. Next, the KG schema needs to be mapped to the schemas of the underlying sources so that the respective data can be loaded into the KG engine. 

The meaning of the data stored in enterprise databases is hidden in logic embedded in queries, data models, application code, written documentation, or simply in the minds of subject matter experts, requiring both human and machine effort in the mapping process (Sequeda and Lassila 2021).

[The semantics are not only in the DB and in the data]

First, the integrated information may come from text and other unstructured sources (for example, news, social media, and others) as well as structured data sources (for example, relational databases). As many information extraction systems already output information in triples, using a generic schema of triples substantially reduces the cost of starting such data integration projects. Second, it can be easier to adapt a triple-based schema in response to changes than the comparable effort required to adapt a traditional relational database.

[Schema-later vs. schema-a-priori]
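
A Python sketch of the schema-later flexibility described above (data invented): a new kind of fact is just one more triple, where a relational store would need a schema migration first.

    # One generic table of triples covers facts extracted from text and
    # facts loaded from relational sources alike.
    triples = [
        ("acme", "hasCustomer", "john_smith"),       # from a relational CRM
        ("john_smith", "mentionedIn", "news_item"),  # extracted from text
    ]

    # Schema-later: a brand-new relation needs no migration, just a new triple.
    triples.append(("john_smith", "prefersChannel", "email"))

    # The relational alternative would require something like:
    #   ALTER TABLE customers ADD COLUMN preferred_channel VARCHAR(32);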

All these computations leverage domain independent graph algorithms such as centrality detection and community detection.

[Domain-independent algorithms]
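
A short Python sketch running both families of domain-independent algorithms with networkx on a built-in toy graph:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    g = nx.karate_club_graph()  # classic toy social graph shipped with networkx

    # Centrality detection: which nodes are most influential?
    pagerank = nx.pagerank(g)
    print("most central node:", max(pagerank, key=pagerank.get))

    # Community detection: which nodes cluster together?
    communities = greedy_modularity_communities(g)
    print("number of communities:", len(communities))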

Once this snippet of knowledge is incorporated into a larger KG, we can use logical inference to derive additional links ...

[A graph snippet is a subgraph; a text snippet is a paragraph of text generated by lexicalizing a subgraph, edge by edge]
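
A minimal Python sketch of such edge-by-edge lexicalization (the templates and the graph snippet are invented):

    # One natural-language template per relation (illustrative).
    templates = {
        "capitalOf": "{s} is the capital of {o}.",
        "locatedIn": "{s} is located in {o}.",
    }

    snippet = [("Bern", "capitalOf", "Switzerland"),
               ("Winterthur", "locatedIn", "Switzerland")]

    # Lexicalize the subgraph edge by edge into a paragraph of text.
    text = " ".join(templates[r].format(s=s, o=o) for s, r, o in snippet)
    print(text)  # Bern is the capital of Switzerland. Winterthur is ...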

We next take the example of a specific kind of commonsense reasoning known as cause-and-effect reasoning ... A general strategy to program such reasoning is to first curate a KG manually and then use it in conjunction with a machine learning algorithm to predict the effects for events that do not exist in the KG.

Indeed, choosing representations that allow agents to store information and derive new conclusions is a problem that is central to AI. The earliest research in AI used frame representations, known as semantic networks, which were directed labeled graphs (Woods 1975). This directed labeled graph representation has been adapted depending on the needs of a given application. ... A directed labeled graph containing data and taxonomy is often referred to as an ontology.

[KG vs. KB, semantic networks, and ontologies]

While some researchers used first-order logic (FOL) to computationally understand semantic networks (Hayes 1981), others advocated that FOL was required to represent the knowledge needed for AI agents (McCarthy 1989). Because of the computational difficulty of reasoning with FOL, different subsets of FOL, such as description logics (Brachman and Levesque 1984) and logic programs (Kowalski 2014), were investigated. There was an analogous development in databases where the initial data systems were based on a network data model (Taylor and Frank 1976), but a desire to achieve independence between the data model and the query processing eventually led to the development of relational data model (Codd 1982), which shares its mathematical core with logic programming.

[From first-order logic to relational DBs]

A need to handle semistructured data (Buneman 1997) inspired the investigation of “schema-free” systems or triple stores that capture an important class of problems addressed by modern KG systems.

[Flexible model]

CONTRASTING PERSPECTIVES

symbolic representation versus vector representation, human curation versus machine curation, and “little semantics” versus “big semantics.” There are spirited debates in the community about the effectiveness and efficacy ...

A commonly used vector representation in NLP is word embedding. For example, given a corpus of text, one can count how often a word appears next to every other word, resulting in a vector of numbers. Sophisticated algorithms are available for reducing the dimensions of the vectors to calculate a more compact vector, known as a word embedding (Mikolov et al. 2013). Word embeddings capture the semantic meaning of the word in a way that can be computationally leveraged in tasks such as word similarity calculation, entity extraction, and relation extraction. Analogously, the CV algorithms operate on vector representation of images. Graph embedding is a generalization of word embedding, but for graph-structured input.

[Vector representation: embeddings for text, images, and graphs]
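
A compact Python sketch of the counting-plus-reduction recipe described above, using a toy corpus and truncated SVD in place of the more sophisticated algorithms:

    import numpy as np

    corpus = "the king rules the land the queen rules the land".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}

    # Count how often each word appears next to every other word.
    counts = np.zeros((len(vocab), len(vocab)))
    for a, b in zip(corpus, corpus[1:]):
        counts[idx[a], idx[b]] += 1
        counts[idx[b], idx[a]] += 1

    # Reduce dimensions (truncated SVD) to get compact word embeddings.
    u, s, _ = np.linalg.svd(counts)
    embeddings = u[:, :2] * s[:2]      # toy 2-dimensional embeddings
    print(vocab)
    print(embeddings.round(2))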

Algorithms using vector representations have excelled at many tasks, for example, web search and image recognition. Using web search of today, we can answer questions such as: Who was the prime minister of the UK in October of 1956? But the search fails if the question is modified to an unusual combination of inference steps, for example, Who was the prime minister of the UK when Theresa May was born? Humans have little difficulty in understanding such questions (Lenat 2019a; Lenat 2019b). 

[An interesting example that is hard to solve; taking the temporal context into account and using a context comparison may make it feasible]
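
A Python sketch of how a KG with time-qualified edges supports exactly this composition of inference steps; the facts encoded are the real ones behind the example (Theresa May was born on 1 October 1956, during Anthony Eden's term):

    from datetime import date

    birth_dates = {"Theresa May": date(1956, 10, 1)}

    # Time-qualified "prime minister of the UK" edges (a small excerpt).
    pm_terms = [
        ("Winston Churchill", date(1951, 10, 26), date(1955, 4, 5)),
        ("Anthony Eden",      date(1955, 4, 6),  date(1957, 1, 9)),
        ("Harold Macmillan",  date(1957, 1, 10), date(1963, 10, 18)),
    ]

    def pm_when_born(person):
        """Compose two inference steps: birth date, then interval lookup."""
        born = birth_dates[person]
        for pm, start, end in pm_terms:
            if start <= born <= end:
                return pm

    print(pm_when_born("Theresa May"))  # Anthony Eden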

The limitations of vector representations can be addressed by encoding the information extracted from text and images into a KG ...

Human curation versus machine curation

The MAG team consequently leveraged machine curation by identifying a publication by its contents and disambiguating authors based on their field(s) of research, affiliation(s), coauthor(s), and other factors that are more natural to humans.

Machine curation techniques were leveraged at different levels of scaling. To get the project off the ground, highly accurate automated knowledge extraction models were created to generate trustworthy data on a small scope of products, where each model extracted knowledge for a single attribute from a single product domain (Zheng et al. 2018). Even though neural networks were explored to automate the process, tremendous manual work was involved to create training data, conduct human evaluation, and to identify postprocessing rules to remove extraction noise.

[MAG had gaps, even with the automated approach]

[Amazon Product Graph (APG)]

Unlike MAG and APG, Wikidata allows conflicting data to coexist and provides mechanisms to organize this plurality in values. Checking, verifying, and allowing such a plurality of data is something the Wikipedia community has been doing for years. Wikidata’s human curation effort involves a community of over 400,000 editors, with over 20,000 active editors. In this process, Wikidata has leveraged standard published identifiers, including the International Standard Name Identifier (ISNI), China Academic Library and Information System (CALIS), International Air Transport Association (IATA), MusicBrainz for albums and performers, and North Atlantic Basin’s Hurricane Database (HURDAT). Wikidata itself publishes a list of standard identifiers for items that appear in its corpus, which are now increasingly being used in commercial KGs.

[Beyond ontology mapping, Wikidata carries identifiers from different sources/databases]

Little semantics versus big semantics

The big semantics perspective may be viewed as one that advocates for capturing more meaning about concepts, whereas the little semantics perspective is focused on capturing/recording the basic facts and not so much the concept meanings. A KG defined as a directed labeled graph is a representative technique of the little semantics approach. The representation language CycL is a representative technique of the big semantics approach.

[Would hyper-relational models be big semantics?]

Using only directed labeled graph representation for KGs has its inherent limitations. A simple example of such a limitation is in representing the statement: Los Angeles is between San Diego and San Jose along US 101. ... The statement can be captured directly if we allow four-place predicates, which are not supported in directed graphs—although many implementations of graph and semantic web databases do include this capability. ... Use of triples and reification makes downstream tasks such as natural language generation more difficult as they must now assemble information spread across multiple triples.

[Reification makes NLP tasks harder]
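
A Python sketch of the reification the quote refers to (the statement node and property names are invented): the four-place fact is split across several triples that downstream tasks must reassemble.

    # The four-place fact cannot be one labeled edge, so it is reified:
    # an intermediate statement node carries one triple per argument.
    reified = [
        ("stmt1", "type",  "BetweenStatement"),
        ("stmt1", "city",  "Los Angeles"),
        ("stmt1", "south", "San Diego"),
        ("stmt1", "north", "San Jose"),
        ("stmt1", "road",  "US 101"),
    ]

    # Natural language generation must now gather the arguments spread
    # across multiple triples before it can produce a single sentence.
    args = {p: o for _, p, o in reified if p != "type"}
    print(f"{args['city']} is between {args['south']} and {args['north']}"
          f" along {args['road']}.")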

As a more involved example, consider the statements Every Swede has a King, and Every Swede has a mother, which are syntactically similar in English, and many KGs would represent them identically, but these statements have very different computational meanings. It is possible to extend the directed graphs in a variety of ways to correctly capture the semantics of the example considered in (Chaudhri et al. 2004; Sowa 2008), but such extensions lose the simplicity offered by the triple representation. Not surprisingly, similar efforts are underway for machine learning of non-binary relationships as well (Fatemi et al. 2019).


[There exists one king y for all Swedes (1:n), which is different from: for every Swede x there exists a mother y (1:1)]
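
The difference is one of quantifier scope, sketched here in first-order logic (LaTeX notation; predicate names are illustrative):

    % "Every Swede has a King": one shared king; the existential takes wide scope.
    \exists y \, \forall x \, \bigl( \mathit{Swede}(x) \rightarrow \mathit{hasKing}(x, y) \bigr)

    % "Every Swede has a mother": a possibly different y for each x.
    \forall x \, \bigl( \mathit{Swede}(x) \rightarrow \exists y \, \mathit{hasMother}(x, y) \bigr)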

What common naming conventions will allow users to interact with multiple existing KGs and create their own combined products, which in turn can be used by others and combined still further, ad infinitum?

A KG also serves the purpose of capturing knowledge learned and used by modern machine learning methods. The most notable uses of directed labeled graphs in AI and databases (data modeling) have taken the form of data graphs, taxonomies, and ontologies.

Even though a directed labeled graph is a common thread linking present day KGs with the early semantic networks in AI, there are some important differences in the research methodology and technical problems addressed. Early semantic networks were created by top-down design methods and manual knowledge engineering processes. They never reached the size and scale of today’s KGs. In contrast, modern KGs tend to be large in scale; employ bottom-up development techniques; and employ manual as well as automated strategies for their construction. 

[Scale of KB vs. KG, and the construction approach as a function of that scale]

[Logical rules for KBs; graph and embedding algorithms for KGs]

[KBs were built only by domain experts, whereas KGs can be created by ML and by groups of users]

The emphasis in the early AI semantic networks was on complex logical inferencing, in contrast to the focus on supporting analytics operations in modern KGs. Furthermore, vast proliferation of available data, difficulty in arriving at a top-down schema design for data integration, and the data-driven nature of machine learning have all led to a bottom-up methodology for creating KGs. Contemporary KGs are also supplementing manual knowledge engineering techniques with crowdsourcing and significant automation that is now possible through progress in machine learning.

