Neves, A. B., Leme, L. A. P. P., Izquierdo, Y. T., & Casanova, M. A. (2021). Automatic Construction of Benchmarks for RDF Keyword Search Systems Evaluation. In Proceedings of the 23rd International Conference on Enterprise Information Systems (pp. 126–137).
Video -> https://youtu.be/JZd5DS2hW-c
Abstract: Keyword search systems provide users with a friendly alternative to access Resource Description Framework (RDF) datasets. The evaluation of such systems requires adequate benchmarks, consisting of RDF datasets and keyword queries, with their correct answers. However, the sets of correct answers such benchmarks provide for each query are often incomplete, mostly because they are manually built with experts’ help. The central contribution of this paper is an offline method that helps build RDF keyword search benchmarks automatically, leading to more complete sets of correct answers, called solution generators. The paper focuses on computing sets of generators and describes heuristics that circumvent the combinatorial nature of the problem. The paper then describes five benchmarks, constructed with the proposed method and based on three real datasets, DBpedia, IMDb, and Mondial, and two synthetic datasets, LUBM and BSBM. Finally, the paper compares the constructed benchmarks with keyword search benchmarks published in the literature.
1 Introduction
RDF-KwS tools have three main tasks: (1) retrieve the nodes in the RDF graph that the keywords specify; (2) discover how these nodes are interrelated, to compose complete answers; and (3) rank the answers (Menendez et al., 2019).
.. answers for keyword searches over RDF datasets are .. subgraphs of the dataset.
Despite the many existing benchmarks for structured data (Coffman and Weaver, 2010; Dosso and Silvello, 2020; Dubey et al., 2019; Trivedi et al., 2017), these benchmarks have at least three limitations when it comes to RDF-KwS: (1) they are frequently built for relational data; (2) they are incomplete, in the sense that they do not cover many reasonable answers; and (3) they are not always publicly available.
To remedy the first limitation, some authors adapted benchmarks originally developed for relational databases. However, the adaptation requires triplifying the relational databases and the benchmark data, which creates problems when comparing systems that use different triplifications.
Incompleteness in this sense is a serious problem that is difficult to overcome in manually constructed benchmarks. Computing the set of all correct answers for a given keyword query over an RDF dataset leads to a combinatorial explosion.
8 Conclusions
One of the main issues in the development of RDF keyword search algorithms is the lack of appropriate benchmarks.
The proposed method circumvents the combinatorial nature of generating correct answers by pruning the search space, following four heuristics based on the concepts of seeds and solution generators. These heuristics introduce flexibility points in the method that enable constructing different benchmarks, according to their purpose.
The paper then described synthetic benchmarks for IMDb, Mondial, BSBM, LUBM, and DBpedia. The experiments compared the synthetic benchmarks with baseline benchmarks and showed that, for the real datasets (IMDb, Mondial, and DBpedia), the solution generators obtained express the majority of the correct answers defined in the baseline benchmarks while also expressing many more answers, and that, for the synthetic datasets (BSBM and LUBM), they express exactly the answers defined in the baselines.
2 Related Work
For IMDb, Dosso and Silvello (2020) defined 50 keyword queries and their correct translations to SPARQL queries.
For DBpedia, the authors considered 50 topics from the classes QALD2_te and QALD2_tr of the Question Answering over Linked Data (QALD) campaigns (http://qald.aksw.org).
For the synthetic databases, they used 14 SPARQL queries for LUBM and 13 SPARQL queries for BSBM.
Dosso and Silvello mapped all original SELECT queries from these datasets to SPARQL CONSTRUCT queries and produced their equivalent keyword queries.
3 Definitions
Finally, we can informally state the RDF KwS-Problem as: “Given an RDF dataset T and a keyword query K, find an answer A for K over T, preferably with as many keyword matches as possible and with as few triples as possible”.
4 Overview of the Method
The neighborhood graph is composed of the nodes and edges visited by a breadth-first walk of distance d through the paths in EGT, starting from i (an inducer), where d is a user-defined parameter.
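To make the neighborhood-graph step concrete, here is a minimal sketch in Python, assuming the RDF dataset has already been loaded into a networkx graph whose nodes are resources and literals; the names G, inducer, and d are illustrative and not taken from the paper's implementation.

```python
import networkx as nx

def neighborhood_graph(G: nx.Graph, inducer, d: int) -> nx.Graph:
    # Bounded breadth-first expansion: ego_graph keeps exactly the nodes and
    # edges reachable from the inducer within d hops.
    return nx.ego_graph(G, inducer, radius=d)
```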
Fortunately, and unlike traditional Information Retrieval (IR) systems, the RDF KwS-Problem has some peculiarities that can be exploited to define optimization heuristics that can reduce the computational cost and also help rank solution generators.
The differences between traditional IR systems and RDF-KwS systems stem from the fact that keywords that co-occur in a document do not necessarily indicate that they are correlated. By contrast, an RDF subgraph connecting resources linked to these keywords is much more likely to be related to the intended meaning of the keyword query, because resources are not connected by chance in an RDF graph, as may be the case in a text document.
Given these differences, one can define some characteristics of good answers: the keywords must be connected as closely as possible, and answers must contain as many keywords as possible.
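As a toy illustration of these two characteristics (keyword coverage and compactness), one could score candidate answers as below; the weighting and the names are assumptions for illustration only, not the paper's scoring function.

```python
def score_answer(answer_triples, matched_keywords, query_keywords):
    # Favor answers that match more of the query keywords...
    coverage = len(set(matched_keywords) & set(query_keywords)) / len(query_keywords)
    # ...and that connect them with as few triples as possible.
    compactness = 1.0 / (1 + len(answer_triples))
    return coverage * compactness
```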
5 Generating Keyword Queries
1. Select user-relevant entities (inducers) I = {i1, i2, ..., in}.
2. Compute the neighborhood graph of each inducer in I, via a BFS of distance d starting from it.
3. Randomly extract sets of keywords from the textual properties of the nodes and edges of the neighborhood graphs.
The relevance scores typically express users’ preferences and can be used to select entities and, consequently, induce relevant sets of keywords and their respective answers.
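A rough sketch of this query-generation pipeline follows, assuming the relevance scores are available in a dictionary (e.g., precomputed from access logs or a PageRank-style measure) and that each node stores its textual property values under a "literals" attribute; all names here are illustrative assumptions, not the paper's code.

```python
import random
import networkx as nx

def generate_keyword_query(G: nx.Graph, scores: dict, d: int, n_keywords: int):
    # 1. Select a relevant entity (an inducer) by its relevance score.
    inducer = max(scores, key=scores.get)
    # 2. Compute its neighborhood graph with a BFS of distance d.
    ng = nx.ego_graph(G, inducer, radius=d)
    # 3. Randomly extract keywords from the textual properties of the visited nodes.
    texts = [t for _, data in ng.nodes(data=True) for t in data.get("literals", [])]
    words = [w for t in texts for w in t.split()]
    return random.sample(words, min(n_keywords, len(words)))
```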
6 Computing Solution Generators
1. Compute the set of seeds S = {s1, s2, ..., sj}. A seed is any node whose property values match at least one keyword.
2. Compute the power set of the set of seeds, PS = {S1, S2, ..., S(2^j - 1)}, excluding the empty set {}. The power set of a set is the set of all its subsets; if the original set has j members, its power set has 2^j members (2^j - 1 after removing the empty set).
3. For each subset Si in PS, compute all paths between pairs of seeds in Si. The set of triples obtained is a solution generator.
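The sketch below shows a naive (pre-heuristics) version of these three steps, assuming the dataset is represented as a networkx graph with node text stored under a "text" attribute; the function names and the substring match are illustrative assumptions.

```python
from itertools import chain, combinations
import networkx as nx

def seeds(G: nx.Graph, keywords):
    # 1. A seed is any node whose property values match at least one keyword.
    return [n for n, data in G.nodes(data=True)
            if any(k.lower() in str(data.get("text", "")).lower() for k in keywords)]

def nonempty_subsets(S):
    # 2. Power set of the seeds without the empty set: 2^|S| - 1 subsets.
    return chain.from_iterable(combinations(S, r) for r in range(1, len(S) + 1))

def solution_generator(G: nx.Graph, subset, max_len=4):
    # 3. Union of the edges (triples) on paths between every pair of seeds in the subset.
    edges = set()
    for s, t in combinations(subset, 2):
        for path in nx.all_simple_paths(G, s, t, cutoff=max_len):
            edges.update(zip(path, path[1:]))
    return edges
```

The exponential number of subsets and the potentially huge number of paths in this naive version are exactly what the heuristics discussed next are meant to prune.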
From Section 4:
However, computing all solution generators can be expensive, depending on the cardinality of SK and the number of paths between the seeds. We then propose to compute solution generators that capture only the most relevant answers. ...
The paper presents four heuristics to work around the complexity of computing solution generators:
The first heuristic refers to the selection of the most relevant seeds: only the top-K seeds are kept.
The second heuristic refers to the selection of the sets in 2^SK for which one would compute the solution generators. This is done by scoring each R ∈ 2^SK and selecting the top-ranked ones, based on four principles.
The third heuristic is to consider only paths between seeds with length less than or equal to a given limit L, say L = 4, to compute solution generators. In fact, as argued in Nunes et al. (2014), paths longer than 4 would express unusual relationships, which might be misinterpreted by users.
The fourth heuristic is to rank the solution generators in PK according to their scores, following user needs.
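A sketch of how the four heuristics could be chained is given below; seed_score, subset_score, and generator_score stand for user-supplied relevance measures and are assumptions for illustration, not the paper's concrete formulas.

```python
def top_k_seeds(seed_list, seed_score, k):
    # Heuristic 1: keep only the K most relevant seeds.
    return sorted(seed_list, key=seed_score, reverse=True)[:k]

def top_subsets(subsets, subset_score, m):
    # Heuristic 2: score each subset of seeds and keep only the top-ranked ones.
    return sorted(subsets, key=subset_score, reverse=True)[:m]

# Heuristic 3: bound the path length (e.g., L = 4) when connecting pairs of seeds;
# in the previous sketch this is the max_len / cutoff parameter.

def rank_generators(generators, generator_score):
    # Heuristic 4: rank the computed solution generators by their scores.
    return sorted(generators, key=generator_score, reverse=True)
```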
7 Evaluation of the Benchmark Generation Method
Let b = (D, Qb, Ab) be a baseline benchmark, where D is an RDF dataset, Qb is a set of keyword queries, and Ab defines the correct answers for the queries in Qb. The strategy is to construct a synthetic benchmark s = (D, Qs, As) for the same dataset D, using the proposed method, where Qb ∩ Qs ≠ ∅ and As contains, for each keyword query in Qs, a list of solution generators over D.
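The comparison itself can be pictured with a small coverage check, assuming both the baseline answers and the solution generators are represented as sets of triples; these helper names are hypothetical.

```python
def covered_by_generators(baseline_answer, generators):
    # True when some solution generator contains every triple of the baseline answer.
    return any(set(baseline_answer) <= set(g) for g in generators)

def coverage(baseline_answers: dict, synthetic_answers: dict) -> float:
    # Fraction of baseline answers that the synthetic benchmark's generators express.
    hits = sum(covered_by_generators(a, synthetic_answers.get(q, []))
               for q, a in baseline_answers.items())
    return hits / len(baseline_answers)
```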
Coffman’s benchmark was created to evaluate keyword search systems over relational databases, and is based on data and schemas of relational samples of IMDb, Mondial, and Wikipedia. For each database, it has 50 keyword queries and their correct answers. Quite a few keyword queries in Coffman’s benchmark are simple queries in that their expected answers are single entities.
Dosso and Silvello (2020) used three real RDF datasets, LinkedMDB, IMDb, and DBpedia, and two synthetic RDF datasets, the Lehigh University Benchmark (LUBM) and the Berlin SPARQL Benchmark (BSBM).
Results: five synthetic benchmarks, for IMDb, Mondial, BSBM, LUBM, and DBpedia. The RDF datasets are available at https://doi.org/10.6084/m9.figshare.11347676.v3, and the implementation of Algorithm 1, the keyword queries, the solution generators, and statistics are available at https://doi.org/10.6084/m9.figshare.9943655.v12
Feel free to comment. Constructive criticism is always welcome.