
Data-driven Computational Social Science: A Survey - Reading Notes

Abstract
[Context] Social science concerns issues on individuals, relationships, and the whole society. The complexity of research topics in social science makes it an amalgamation of multiple disciplines, such as economics, political science, and sociology. For centuries, scientists have conducted many studies to understand the mechanisms of society.

[Problem] However, due to the limitations of traditional research methods, many critical social issues remain to be explored.

[Proposed solution: a new interdisciplinary research area] To solve those issues, computational social science (CSS) emerges thanks to the rapid advancement of computation technologies and profound studies in social science. With the aid of advanced research techniques, various kinds of data from diverse areas can now be acquired, and they can help us look into social problems with new eyes. As a result, utilizing various data to reveal issues in the computational social science area has attracted more and more attention.

[State of the art] In this paper, to the best of our knowledge, we present the first survey on data-driven computational social science, which primarily focuses on reviewing application domains involving human dynamics. The state-of-the-art research on human dynamics is reviewed from three aspects: individuals, relationships, and collectives. Specifically, the research methodologies used to address research challenges in the aforementioned application domains are summarized. In addition, some important open challenges with respect to both emerging research topics and research methods are discussed.

1. INTRODUCTION
 
... Thanks to the efforts of scientists and advanced processing technologies, detailed personal data, such as relationships, GPS coordinates, community memberships, and contact frequency, are now available. Exploiting such data can provide a new perspective for garnering invaluable insights into human social phenomena ... The scale of the problems that social scientists study ranges from the micro to the macro level. For instance, economists can explore individuals’ investment behaviors or predict global economic changes by using financial datasets or data collected from mobile phones.
 
[Data availability arising from the use of technology: devices and social networks]

Traditional methods of retrieving empirical data for the analysis of issues in the domain of social science are founded on the principles of social investigations. ... Therefore, Computational Social Science (CSS) emerges as the times require, taking advantage of mathematical theories together with data processing and analysis technologies from computational science to tackle those social issues.

[A shift in how data is obtained for the social sciences]

... By utilizing such data, traditional social issues can be investigated from a new perspective, and more social phenomena can be discovered. Meanwhile, new research topics and methods emerge because of the availability of data.

[A shift in the topics to be investigated, driven by the available data]

Human dynamics refers to a branch of complex systems research in statistical physics covering, for example, the movement of crowds and queues, as well as other systems of complex human interactions, including the statistical modelling of human networks and of interactions over communication networks.

[Human-centered research perspectives: (1) the individual, with their attributes and behaviors; (2) relationships and their dynamics across time, space, and other contexts; and (3) social groups based on shared characteristics or relationships]

Data collection and pre-processing are the initial steps in social science research. Then, we mainly focus on the procedure of data analysis, combining traditional statistical methods with the most widely employed machine learning methods. At the final stage, a validation process is needed in order to ensure correctness and accuracy. Following the above-mentioned steps of our research methodology, namely data collection, data pre-processing, data analysis, and data validation, we then expound the most frequently used approaches at each stage in detail.

[Research stages (and their techniques and technologies): collect data, pre-process, analyze, and validate]


6. CONCLUSION

CSS has been a topic of research interest in recent years due to an increasing realization of the enormous potential of its data-driven capabilities.
...
In view of this, it is apparent that the analysis of data on the daily activities of individuals will certainly contribute to the understanding of human dynamics mechanisms.

[CSS is recent and emerged from the availability of technology to unite the social sciences and their study domains with computing and its data methods and techniques]


2. COMPUTATIONAL SOCIAL SCIENCE AND RELATED AREAS

...Traditional social science mainly consists of social psychology, anthropology, economics, political science, and sociology, which are the so-called “Big Five”. ...
The origin of CSS can be dated to the 1960s, when social scientists began using computers for analyzing statistical data. ... Accordingly, CSS is by all means an instrument-enabled scientific discipline. ... For the first time, social scientists (or computational social scientists) are able to analyze large volumes of data in order to test previous scientific social hypotheses.

[Social sciences such as psychology, anthropology, economics, political science, and sociology (the "Big Five") began to benefit in the 1960s from the use of computers for statistical analysis]

Recently, the capacity to collect and analyze massive amounts of data has promoted the evolution of CSS and expedited the emergence of data-driven CSS. ... we now live our lives leaving many digital traces. ... CSS can take full advantage of all these digital traces to better understand individual or collective behaviors and, furthermore, to better understand our society and solve social problems.

[Data is generated through everyday use of technology for communication, interaction, mobility, etc. Previously, experiments had to be designed for such collection, which could carry the bias of both the observer and the observed]

An important feature of CSS is its interdisciplinary nature. CSS is an integrated and interdisciplinary new area, which aims to analyze past and present social issues and human behaviors with an emphasis on information processing through advanced computation methods. ... CSS mainly focuses on analyzing the issues studied by social scientists with the computational methods developed by computational scientists. The field is inherently interdisciplinary. Social scientists provide insight into pertinent research issues such as economics, politics, and the environment, while computer scientists contribute expertise in developing mathematical methods such as social network analysis, data mining, and machine learning.

[Each discipline contributes on its own to the interdisciplinary research. For computing, it amounts to the practical application, in a real environment, of techniques and theories tested in synthetic environments. For the social sciences, it is the opportunity to validate (or refute) theories and hypotheses using data]

3. RESEARCH TOPICS

A. Individuals

... people have their personal attributes, such as age, gender, interests, and personality. Meanwhile, they also perform a series of activities, e.g., driving, shopping, and sending emails to others. In order to better understand human beings at the individual level, we must take both individual attributes and individual behaviors into account.


1) Individual Attributes

These personal features can be found by mining the relevant data. On the other hand, people influence each other as part of society. Influence, which measures individuals’ effects on others and takes the whole network into consideration, is an important social attribute of individuals.

a. Personal Features

Information gathered from online social media can be used to analyze the physiological features of individuals, including gender, age, race, etc. Park et al. [12] analyzed 10 million messages from over 52,000 Facebook users to explore the differences in language use across gender, affiliation, and assertiveness. Some other highly subtle personal attributes, such as ethnicity, religious and political views, and even sexual orientation, can also be accurately predicted from social media records.

[Gender-neutral language and gender identification]

Personality can be reliably evaluated by combining the digital trail with the Five-Factor Model [13], which has a distinguished reputation as the most accurate predictor of personality traits.

[The Big Five and the Facebook / US elections case - Cambridge Analytica]

Five-Factor Model of Personality

b. Individual Influence

Influence is an important measurement to indicate individuals’ positions in the community. Identifying the most influential individuals (nodes) is critical in understanding the dynamics of a social network. ...
The main method to quantify individuals’ influence is through network analysis. Centrality [18], [19], which is a traditional network analysis method, is a measure to assess the importance of nodes in the network.

[Path-based centrality, PageRank for homophily (from the Greek "homophylía", meaning "identity of groups or races": "homo" - same; "phylía" - affiliation, kinship), graph algorithms]
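As a toy illustration of these centrality measures, a minimal sketch using the networkx library; the friendship edge list is hypothetical, invented just for this example:

```python
import networkx as nx

# Hypothetical friendship network for illustration only
G = nx.Graph([("ana", "bia"), ("ana", "carlos"), ("bia", "carlos"),
              ("carlos", "davi"), ("davi", "eva")])

degree = nx.degree_centrality(G)            # share of direct ties
betweenness = nx.betweenness_centrality(G)  # brokerage between groups
pagerank = nx.pagerank(G)                   # recursive importance of a node

for node in G:
    print(f"{node}: degree={degree[node]:.2f} "
          f"betweenness={betweenness[node]:.2f} pagerank={pagerank[node]:.2f}")
```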

2) Individual Behaviors

a. Human Actions

Sentiment analysis aims to extract individuals’ subjective information, such as evaluation of something, emotion states, and polarity of attitude. ... combining lexical-based and machine-learning techniques for the extraction of user positive/neutral/negative sentiments and the detection of user sentiment changes.
...
Routine behaviors can also be investigated to some extent, which may provide further insight into human behaviors. Barabasi [31] showed that the timing of many human actions follows a heavy-tailed distribution, based on an e-mail dataset capturing the sender, recipient, time, and size of each e-mail.

[Product and service reviews. Routine behavior patterns.]
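As a hedged illustration of the lexicon-based side of sentiment extraction, a minimal sketch with NLTK's VADER analyzer — one of many lexicon tools, not necessarily the one used in the surveyed works; the review strings are invented:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # lexicon VADER relies on

sia = SentimentIntensityAnalyzer()
reviews = ["The product arrived fast and works great!",
           "Terrible support, I want my money back."]

for text in reviews:
    scores = sia.polarity_scores(text)  # neg/neu/pos plus compound in [-1, 1]
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05 else "neutral")
    print(label, scores["compound"], "->", text)
```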

b. Influence Factors

Researchers aim to discover these determining factors, such as individual characteristics, sites, or environmental aspects, which are related to individual behaviors and behavior intensity.... Tsou et al. [33] analyzed the commenting behavior of users for two video sharing websites, i.e., YouTube and TED (Technology, Entertainment, Design).

[Behavioral differences across platforms: Facebook vs. LinkedIn]

Besides investigating factors that impact individual behaviors, scientists also explore the external causes of people's various sentiments.

[Behavioral differences as a function of environmental and cultural factors]

c. Behavior Prediction

The accessibility of huge datasets and advanced CSS techniques allow for greater flexibility and accuracy when predicting human behaviors. ... By this measurement, Song et al. [39] explored the limits of predictability in human mobility and found an exceptionally high value of potential predictability using a trajectory dataset captured from mobile phones.

[Mobility, Waze]
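Song et al.'s predictability bound is derived from the entropy of location sequences. Below is a minimal sketch of the simplest variant, the temporal-uncorrelated entropy based only on visitation frequencies; the trajectory is hypothetical, and the full result in the paper also uses the sequence-dependent entropy and Fano's inequality:

```python
import math
from collections import Counter

# Hypothetical sequence of visited locations for one user
trajectory = ["home", "work", "work", "home", "gym", "home", "work", "home"]

counts = Counter(trajectory)
n = len(trajectory)

# Temporal-uncorrelated entropy: uncertainty given only visit frequencies
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
print(f"{entropy:.2f} bits over {len(counts)} distinct locations")
```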


[I did not see examples of Web search patterns. There are plenty of examples with Twitter]

B. Relationships

... These relationships can be formed through online communication or offline interactions. Relationships between individuals are constantly evolving due to continuous changes in interactions and behavior. By analyzing and partitioning socially relevant big data, researchers can glean a moment-by-moment representation of both the structure and the content of relationships.


1) Relationship Identification

... A crucial and broad issue for the analysis of social networks is to leverage features in the available data to mine the relationship semantics and identify the type of relationships.
...
It was found that residence proximity and duration of weekend night interactions may help to explain close friendship ties. ... Choi et al. [49] found indicators from communication patterns including co-location data and instant messenger data to infer social relationship types, which were divided into formal and informal. ... Mining the semantics of relationship multiplexity makes social network analysis plentiful and closer to our real physical social networks.

[Online and offline behavior makes it possible to infer the semantics of a relationship]

2) Relationship Prediction

... Hence, the majority of researchers have dived into friendship prediction employing machine learning and statistical methodologies. At the same time, other relationships have also been studied, such as co-authorship and trust relationships. ...
In friendship prediction, homophily is an emerging research hotspot; traditionally, it accounts for the tendency of people to bond with others who are like themselves.
...
On the other hand, location proximity and simultaneous occurrence at one place may not possess a good predictive power for the majority of friendship prediction.

[Predicting relationships from shared interests is more effective than from location proximity]

The prediction of other types of relationships (such as trust and co-authorship relations) can be exploited in relationship dynamics. ... Along this line, co-author relationship prediction has been investigated. Mostly, authors are considered to belong to a homogeneous network, which means that only one type of object (authors) and one type of link (co-authorship) exist in the network. ... However, bibliographic networks are heterogeneous in reality, such that there are multiple types of objects (e.g., venues, papers, topics) and multiple links among these objects. The prediction of various types of relationships in different social networks deserves further study, as it may be conducive to other applications, such as community detection [71], influence analysis, and link recommendation [72].

[Prediction of co-authorship and other relationships among researchers]

3) Relationship Evolution

However, tracing a relationship’s temporal evolution appears to require multiple observations of the social network over time. These expensive and repeated crawls can only answer questions from observation to observation, not about what happened before or between network snapshots. Therefore, numerous researchers concentrate on the influential factors of the relationship evolution process and ignore the entire relationship lifecycle.

[Predicting changes in relationships and which factors influence them]


C. Collectives

... First of all, people are more likely to interact with others who have similar interests, and they will form a group. We are interested in identifying these groups for specific purposes, for example, advertisement recommendation. Secondly, the community provides a global view of human dynamics [81], [82]. Collective behavior is more stable, whereas individual behavior is noisy and easy to change. Finally, some human behavior can only be observed in a community setting and not at the individual level.

[Behavior within a group can differ from the individual's behavior ... herd effect]

1) Community Detection

... the problem of identification of communities has been the focus for many years [87]. Depending on the underlying methodological rule as well as the different features of communities, we introduce four common community detection methods:

a. Modularity-based Methods

... Many approaches have been proposed to solve this problem, including greedy agglomeration [89], mathematical programming [90], and sampling techniques [91]. One of the most famous algorithms devised to maximize modularity is a greedy method by Newman [89]. He proposed a fast clustering method that uses a greedy strategy to obtain a maximum ΔQ by merging pairs of nodes iteratively.
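A minimal sketch of modularity-based detection using networkx, whose greedy_modularity_communities implements Clauset-Newman-Moore greedy merging in the spirit of Newman's fast method; the toy graph is hypothetical:

```python
import networkx as nx
from networkx.algorithms.community import (greedy_modularity_communities,
                                           modularity)

# Toy graph with two dense groups joined by a single bridge edge
G = nx.Graph([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])

communities = greedy_modularity_communities(G)  # greedy merging of node pairs
print([sorted(c) for c in communities])
print("Q =", round(modularity(G, communities), 3))
```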

b. Divisive Algorithms

These methods identify and remove nodes or edges between communities via measures such as betweenness centrality. An edge has high betweenness if many shortest paths go through it. One can detect communities through a process of ranking all the edges based on betweenness. Girvan and Newman [93] proposed the GN algorithm, a typical divisive algorithm. The GN algorithm progressively computes betweenness for all edges and removes the edges of the network with the highest score. They also considered three definitions of betweenness: geodesic edge betweenness, random-walk edge betweenness, and current-flow edge betweenness.
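A minimal sketch of the GN algorithm via networkx's girvan_newman, run on the classic Zachary karate club benchmark; each iteration removes the highest-betweenness edges and yields a finer partition:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()  # classic benchmark social network

partitions = girvan_newman(G)   # iterator over successively finer partitions
first_split = next(partitions)  # communities after the first removal pass
print([sorted(c) for c in first_split])
```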

c. Social Interaction-based methods

... Problems occur when we try to identify a hidden community with missing links or with multi relationships.

[When the connection graph is incomplete, some connections must be inferred in order to detect the communities]

d. Overlap Community Detection

[Communities can overlap, having intersections]

2) Community Evolution

Community detection methods introduced above assume that networks are static, which means that the nodes and links among the communities are stable and do not change over time. In reality, human society is a temporal social network (TSN).

[Relationships, like communities/groups, change over time, since the individual and the environment also change. Understanding these dynamics makes it possible to predict whether new communities will be born, whether they will die out, grow or shrink, split or merge with others]

3) Community Behaviors

... Among the many aspects of collective behavior analysis, we are interested in the following topics: collective decision-making, cooperation and contagion among collectives, human mobility, and collective behavior prediction.

a. Collective decision-making

Individuals with similar attributes or purposes are often deemed to be in the same communities. Although members of the same community share some common characteristics, they may hold different opinions about different alternatives. Therefore, when collectives have to choose among diverse options, it is fairly difficult to reach a common decision accepted by all the group members. ...
Social choice theory is concerned with the analysis of collective decision making [106].

[Various collective decision-making scenarios, democracy]

• Preference aggregation.

• Ranking and Internet search engines.
...
Beyond the above applications to ranking and searching methods, there are other uses of this family of methods that involve human dynamics, for instance, recommendation systems [112] and the evaluation of researchers' impact.

• Resource allocation. How to divide a cake seems to be a simple question, yet it has been studied by scientists for a long time. It requires distributing finite resources among a set of individuals. ... There are two fundamental criteria that should be considered when allocating resources: fairness and efficiency. Based on these two crucial factors, several allocation mechanisms have been explored.

One application of these methods involving human dynamics is, for instance, ranking the sequence of authors. Schall [115] indicated that authors' different goals, such as career requirements for promotion, should be considered when ranking the sequence. Ackerman et al. [116] proposed a game-theoretic model to study the allocation of credit to authors, and the results show that alphabetical ordering can lead to higher research quality, while ordering by contribution results in a denser collaboration network and more publications.

• Voting.

b. Cooperation and Altruism

Cooperation has always been a core problem involving the interests of individuals and collectives, and it attracts researchers from diverse areas. Cooperation exists universally in our daily lives; however, human beings are both rational and emotional when making choices related to their own benefit. Sometimes they are selfish, and sometimes they choose to cooperate with others, even bearing costs, instead of maximizing their own profit over the whole society.

An important mechanism that affects the evolution of cooperation is altruism or reciprocity [123]. People are sometimes willing to help others, for complex reasons, even under circumstances in which they have to pay a cost; this is what we call altruism.

[Organizing volunteers and donations for Petrópolis]

There are two general directions for investigating the mechanism of scientific collaboration using scientific data sources. From the perspective of network analysis, Newman [128] explored the structure of scientific collaboration or coauthorship networks, in which scientists are connected if they have coauthored papers, using data collected from various databases. His work indicated that scientific collaboration networks tend to form "small worlds" [129] and discovered a variety of attributes of collaboration patterns. Following this pioneering work, researchers then used diverse databases across various disciplines to establish collaboration networks and investigate the impacts of the structures on the collaboration patterns. Another type of network that scientists explore using scholarly data is the citation network. Researchers in [130] found that, both in coauthorship and in citation networks, the nodes and the structure can have impacts on each other, and also that the comprehensiveness of scholarly data sources can have a significant influence on the network structures and their attributes.

[Co-authorship network and citation network. Choosing an examination committee.]

c. Contagion

Another problem that involves the vast majority of people is social contagion, which makes use of the relationships among individuals to diffuse. Any goods, emotions, behaviors, or viruses can propagate through them. Each of these diffusion processes spreads through the population based on specific mechanisms, exploiting interactions among individuals.

[Information propagation, fake news, WhatsApp]

Most of these methods hypothesize that network structures are homogeneous, while humans are connected via different types of relationships in the real world. Gui et al. [143] proposed an information diffusion model for multi-relational networks, distinguishing the power of different types of relationships in passing information around. In addition, they applied the model to the DBLP and APS networks and experimentally demonstrated the effectiveness of their methods compared with single-relational diffusion models.

The closed-world assumption was proven incorrect in recent work on Twitter by Myers et al. [145], in which the authors observe that information tends to jump across the network. Consequently, they provide a model capable of quantifying the level of external exposure and influence using hazard functions.
In 2010, Damon Centola conducted an online experiment showing that information diffusion is different from behavior spreading: individual adoption was much more likely when participants received social reinforcement from multiple neighbors in the social network [146]. This indicated that information diffuses through weak ties, while strong ties can enhance behavior diffusion.
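To make the diffusion mechanics concrete, a minimal sketch of the standard independent cascade model on a single-relational random graph — a textbook baseline, not Gui et al.'s multi-relational model; the graph and parameters are hypothetical:

```python
import random
import networkx as nx

def independent_cascade(G, seeds, p, rng):
    """One simulated cascade: each new adopter gets one chance per neighbor."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        new = []
        for u in frontier:
            for v in G.neighbors(u):
                if v not in active and rng.random() < p:
                    active.add(v)
                    new.append(v)
        frontier = new
    return active

G = nx.erdos_renyi_graph(100, 0.05, seed=1)  # random contact network
spread = independent_cascade(G, seeds=[0], p=0.2, rng=random.Random(42))
print(f"{len(spread)} of {G.number_of_nodes()} nodes reached")
```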

d. Human Mobility

Uncovering human mobility behavior is crucial for predicting and controlling spatially embedded events, with applications such as disease control [149] and traffic forecasting [150]. ... Most research on human mobility draws on location-based devices such as cell phones, portable computers, and GPS localizers. These devices record digital footprints of human activities which can reflect the social interactions of a community.

[GPS data from cell phones and from photos. Traffic flow prediction]

e. Collective Behavior Prediction

For example, Salehan et al. [163] investigated the effect of review sentiment on readership and helpfulness of online consumer reviews. Furthermore, Beauchamp [164] combined 1,200 state-level polls during the 2012 presidential campaign with over 100 million state-located political tweets to predict opinion polls using a new linear regularization feature-selection method.

[Collective behavior in politics]


4. RESEARCH METHODOLOGY

The research methods in CSS follow several general steps: data collection, data pre-processing, data analysis, and data validation...

A. Data Collection

Once the research topics are confirmed, the investigation focuses preliminarily on the selection and collection of data such as the content or the scale of the data.

[Define the problem in order to identify the data sources that will support the analysis]

1) Web Resources

[When no APIs or other means of making the data available for consumption exist, the development of Web crawlers to capture data from social networks and Web pages]

2) Sensing Data

[Collecting cell phone data, mobility patterns during COVID]

3) Data from Self-designed Experiments

[Online questionnaires, platforms such as Mechanical Turk]

B. Data Pre-processing

Before analysis, the data needs to be processed into recognizable formats, which requires cleaning and reorganization. Pre-processing the raw data can greatly improve the quality of the data and significantly enhance the efficiency of the analysis process.

• Data cleaning. Initially, the collected data may be incomplete, inconsistent, noisy, or redundant. ... There are several approaches to handling raw data. For incomplete tuples, we can either ignore them or fill in the missing values according to specific rules, such as removing instances with missing values, replacing missing values with the most common constant, or replacing them with values estimated from other features of the same instance. When the datasets are noisy, we can apply several smoothing techniques such as binning, regression, and outlier analysis.

[Cleaning stage: fill in missing data or remove records with missing data, smooth out noise in the data, remove outliers, apply rules to detect and correct inconsistencies, apply rules to detect and merge redundant data]
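A minimal pandas sketch of these cleaning rules on a hypothetical table: duplicate removal, imputation with the mean and the mode, and a rule-based outlier filter:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({  # hypothetical records with the usual defects
    "age": [23, np.nan, 31, 31, 120],
    "city": ["Rio", "Rio", None, None, "SP"],
})

df = raw.drop_duplicates()                            # merge redundant rows
df["age"] = df["age"].fillna(df["age"].mean())        # estimate missing values
df["city"] = df["city"].fillna(df["city"].mode()[0])  # most common constant
df = df[df["age"].between(0, 110)]                    # rule-based outlier removal
print(df)
```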

• Data reduction. ... to obtain a reduced representation of the original data sets while still getting almost the same analytical results. Generally, there are two directions for reducing the data scale: dimensionality reduction and numerosity reduction.
Dimensionality reduction methods mainly focus on transforming the original data into a smaller space... Feature extraction is also used in dimensionality reduction.
Numerosity reduction approaches, in turn, use different representation methods to store the data and decrease its volume, including parametric and nonparametric methods. In addition, sampling methods are used to reduce the size of the data and improve processing efficiency.

[Dimensionality reduction and numerosity reduction (?), techniques such as singular-value decomposition (SVD) and sampling. The DFCris case relative to BrCris]
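A minimal sketch of both directions with scikit-learn and NumPy: truncated SVD for dimensionality reduction and random sampling for numerosity reduction; the feature matrix is synthetic:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))  # synthetic 50-feature behavioral matrix

# Dimensionality reduction: keep the top 5 singular directions
svd = TruncatedSVD(n_components=5, random_state=0)
X_small = svd.fit_transform(X)
print(X_small.shape, round(svd.explained_variance_ratio_.sum(), 3))

# Numerosity reduction: a 10% random sample of the rows
sample = X[rng.choice(len(X), size=100, replace=False)]
print(sample.shape)
```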

• Data reformation. ... The features of the data may also require aggregation or discretization analysis, providing new features that can serve the research task better. Strategies of data reformation include smoothing, attribute construction, aggregation, normalization, discretization, and concept hierarchy generation for nominal data... Another issue to be considered in the data pre-processing procedure is multi-source data fusion, because a single data input may not be adequate in some situations. Therefore, utilizing multi-source data has become increasingly important, and some achievements have been made.

[Fusion of data from distinct sources]
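A minimal sketch of two reformation strategies on a hypothetical column: min-max normalization and discretization into nominal bands with pandas:

```python
import pandas as pd

df = pd.DataFrame({"income": [1200, 3500, 8000, 15000]})  # hypothetical values

# Min-max normalization to the [0, 1] interval
span = df["income"].max() - df["income"].min()
df["income_norm"] = (df["income"] - df["income"].min()) / span

# Discretization into named bins (a simple concept hierarchy)
df["income_band"] = pd.cut(df["income"], bins=[0, 2000, 10000, float("inf")],
                           labels=["low", "middle", "high"])
print(df)
```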

C. Data Analysis

[Various applications such as friendship/collaboration prediction, personality classification, sentiment analysis, identification of mobility patterns, etc.]


The data collected from Internet social network and interaction services, infrastructure-bound sensor devices, and mobile and wearable sensor devices can include text, images, graphs, or other forms after pre-processing. How, then, can useful information be extracted from these diverse forms of data?

[Variety of sources and formats]

Traditional data analysis means using classical statistical methods to analyze massive first-hand data in order to concentrate, extract, and refine the useful portion of the raw data.

[Traditional statistical analyses such as probability calculations are costly in a Big Data environment]

Machine learning can be defined as a set of methods that can automatically detect rules in data and then use them to predict future data or perform other kinds of decision making under uncertainty.

[Machine learning is more efficient for detecting patterns and making predictions]

1) Supervised Learning
Supervised learning needs a labeled training dataset, consisting of a set of training examples. ... A supervised learning algorithm analyzes the training data, producing an inferred function. This function is used to map new input examples to an output class. Supervised learning can be divided into classification and regression according to the type of the output class:
when the output class attribute is discrete, it is classification; when the class attribute is continuous, it is regression.

[Supervised learning requires a labeled dataset in order to infer a classification function. If the function's output is discrete it is properly called classification, but if it is continuous, it is regression. Classification assigns class labels to new records.]

a. Classification

There are many kinds of classifiers, such as Decision Tree learning, the Bayesian classifier, the K-nearest Neighbor classifier, the Support Vector Machine (SVM), and Artificial Neural Network (ANN) classifiers. Based on neural network technologies, deep learning algorithms have recently been widely developed and utilized [180]. Researchers have proposed many improved deep learning algorithms for solving issues such as image recognition, academic recommendation, and decision making.

Apart from deep learning methods, the commonly used classifiers in CSS are the following (see the sketch after this list) ...

[Decision Tree, Bayesian classifier, SVM. Each has its strengths and weaknesses, which define its applicability]

• Decision Tree. ... In a Decision Tree, each internal node contains a test on an attribute that separates records with different characteristics, each branch represents an outcome of the test, and each leaf node is assigned a class label. When classifying a test record, we apply the test conditions to the record and follow the appropriate branch until we reach a leaf node, whose label is the final class.

• Bayesian classifier. Bayesian classification is based on Bayes’ theorem. Bayesian classifiers are statistical classifiers that predict the probability that a given test record belongs to a certain class. They select the class with the maximum probability as the final class for the test record.

• SVM. Belonging to the family of linear classifiers, SVM searches for the linear optimal separating hyperplane to separate data from different classes. When the training data is not linearly separable, SVM uses a nonlinear mapping to transform the original training data into a higher dimension, and then constructs a hyperplane or a set of hyperplanes in the new space. Thus SVM can be used for both linear and nonlinear data.
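A minimal scikit-learn sketch comparing the three classifiers above on a synthetic labeled dataset; the data is a stand-in, and real CSS work would use, e.g., user-attribute records:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for clf in (DecisionTreeClassifier(random_state=0),  # decision tree
            GaussianNB(),                            # Bayesian classifier
            SVC(kernel="rbf")):                      # SVM, nonlinear kernel
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, round(clf.score(X_te, y_te), 3))
```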

b. Regression

Regression analysis is a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones. ...
Regression analysis is widely used for prediction and forecasting, where its use is similar to that of machine learning. It can also be used to find out which of the independent variables are related to the dependent variable. ...
Linear regression requires the model to be linear in the regression parameters. In linear regression, data are modeled using a linear predictor function, with the unknown model parameters estimated from the data. Specifically, linear regression refers to a model of the conditional mean of y given the value of X. ... In their model, the independent variable X consists of easily accessible digital records, for example Facebook Likes, and the dependent variable y is a range of highly sensitive human attributes including sexual orientation, political views, personality traits, happiness, age, gender, etc.

[Prediction and forecasting]
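A minimal sketch of linear regression in the spirit described above, with synthetic predictors standing in for digital records (e.g., counts of different kinds of Likes) and a continuous target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic predictors (e.g., Like counts)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)  # estimates the linear parameters
print(model.coef_.round(2), round(model.intercept_, 2))
print(model.predict(X[:3]).round(2))  # conditional mean of y given X
```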

2) Unsupervised Learning

In machine learning, unsupervised learning is a method that tries to find hidden, intrinsic structure in data which is unlabeled and has no target attribute. ... Furthermore, unsupervised learning is arguably more typical of human learning. Since it does not require a human expert to manually label the data, it can be more widely applied than supervised learning.

[Does not require labeled data]

For the research on human dynamics in data-driven CSS, the majority of the unsupervised learning algorithms employed can be considered to belong to two groups: clustering and ranking.

Cluster analysis, or simply clustering, is the process of partitioning a set of data objects into subsets, each of which is a cluster. The objects in a cluster are similar to one another, yet dissimilar to objects in other clusters. ...

[Clustering: Topic Models (LDA), Latent Variable Models, and Matrix Factorization]

Topic models regard each document in a corpus as a distribution over topics, and each topic as a distribution over words. ...
The most widely used topic model is LDA, ...
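A minimal LDA sketch with scikit-learn on an invented four-document corpus; each topic comes out as a distribution over words, as described above:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["election vote party candidate",   # invented mini-corpus
        "party candidate campaign vote",
        "goal match team player",
        "team player league match"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # document-word count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [words[i] for i in topic.argsort()[-3:]])
```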

Latent variable models provide an approach to finding patterns in data with the help of hidden variables. In statistics, latent (hidden) variables are not observed directly but rather inferred from other observed variables.

Matrix factorization is the decomposition of a matrix into a product of matrices, which allows us to discover the hidden patterns underlying some data. There are many different matrix factorizations, each of which is often applied to a specific class of problems.

Some ranking-related algorithms include Random Walk, PageRank as well as Collaborative Filtering.

A random walk is a process in which randomly-moving objects wander away from the point of origin. ...
PageRank is a link analysis algorithm, which assigns a numerical weight to each element of a hyper-linked set of documents, with the purpose of ranking its relative importance within the set. It is now regularly applied in many different research fields, such as bibliometrics and social network analysis, link prediction, and recommendation. ...
Additionally, another popular ranking algorithm is Collaborative Filtering, which is a process of filtering for information or patterns with collaboration among multiple agents or data sources.

[Each has its strengths and weaknesses, which define its applicability]
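To make the ranking idea concrete, a from-scratch power-iteration sketch of PageRank on a hypothetical four-page link matrix; library implementations such as networkx's pagerank do the same with more care for dangling nodes:

```python
import numpy as np

# Hypothetical links: A[i, j] = 1 if page j links to page i
A = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 1, 0, 0]], dtype=float)

M = A / A.sum(axis=0)   # column-stochastic transition matrix
n, d = len(M), 0.85     # d is the usual damping factor

r = np.full(n, 1 / n)
for _ in range(100):    # power iteration: follow links with probability d
    r = (1 - d) / n + d * M @ r
print(r.round(3))       # relative importance of each page
```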

3) Semi-supervised Learning

Semi-supervised learning lies between unsupervised and supervised learning. It refers to learning problems in which a small amount of labeled data and a large amount of unlabeled data exist. ... In this situation, semi-supervised learning can be of great practical value, but it has not been widely used in the field of CSS. ... During the past couple of years, the most active area of research in semi-supervised learning has been graph-based methods. These graph-based methods can properly depict the structural characteristics of the data itself; however, they require a great amount of computation, which may become a barrier to practical application.

[Compensates for the weak points and exploits the strong points of the previous approaches]

Various graph-based algorithms for semi-supervised learning have been proposed in the recent literature in the field of CSS. Graph-based semi-supervised learning starts by constructing a graph, where nodes are instances of the labeled and unlabeled data, and edges represent similarities between instances. Known labels are used to propagate information through the graph in order to label all nodes. Many researchers employed graph-based semi-supervised learning algorithms to achieve higher accuracy in terms of prediction and identification. 

[Under active development; arguably the state of the art for this kind of application]
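A minimal sketch of graph-based semi-supervised learning with scikit-learn's LabelSpreading: known labels propagate along a k-NN similarity graph built over labeled and unlabeled points alike. The dataset is synthetic, and -1 marks unlabeled samples:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=200, noise=0.1, random_state=0)
y = np.full_like(y_true, -1)  # -1 marks unlabeled points
y[:10] = y_true[:10]          # only 10 of 200 points keep their labels

# Known labels propagate through a k-NN graph to label every node
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
acc = (model.transduction_[10:] == y_true[10:]).mean()
print(f"accuracy on the unlabeled points: {acc:.2f}")
```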

D. Validation

... In the validation process, the model induced from the collected data should be substantiated as to whether it exactly solves the issue and how efficient it is. ... In other words, validation is the last step of the research procedure and the guarantee of a well-performing model. Validation can be conducted from both an internal and an external perspective. Internal validation refers to how well an experiment is done, especially whether it avoids confounding, while external validity refers to how well data and theories from one setting apply to another.

[Internal validation concerns scientific rigor in applying the method; external validation concerns the model's capacity for generalization]

Here, some usual benchmarks and methods of external validation are introduced.

a. Evaluation of Classifiers
In other words, the performance of classifiers is crucial to the whole research work. Therefore, several methods for evaluating classifiers are introduced.

[Evaluate the accuracy of the model generated by the classifier using training and test sets. Use a random sample of the data for validation. Alternate the training and validation sets.]
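A minimal sketch of the alternating train/test evaluation described above, using k-fold cross-validation in scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # synthetic data

# 5-fold cross-validation: each fold takes a turn as the held-out test set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.round(3), "mean =", round(scores.mean(), 3))
```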

b. Comparison of Classifiers

Generally, we can use three statistical methods to compare the performance of different classifiers: estimating a confidence interval for accuracy, comparing the performance between models ...
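For the first of these, a minimal sketch of the standard normal-approximation confidence interval for a measured accuracy; the accuracy value and test-set size are hypothetical:

```python
import math

acc, n = 0.84, 500  # hypothetical observed accuracy on n test records
z = 1.96            # z-value for a 95% confidence level

half = z * math.sqrt(acc * (1 - acc) / n)  # binomial normal approximation
print(f"95% CI for accuracy: [{acc - half:.3f}, {acc + half:.3f}]")
```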

c. Predictive Validation

If ... the purpose of the model is to predict future behavior or future events, then comparisons can be made between the tracking data and the model's forecast.

d. Participatory Approaches

Participants who produced the data or stakeholders such as model users are best qualified to speak on the effectiveness and veracity of the model.

[Expert opinion]

5. OPEN CHALLENGES AND FUTURE TRENDS

A. Interactions Between Online and Offline

[Online and offline influence would be bidirectional. Different behavior on different social networks (Facebook vs. LinkedIn). Zoom fatigue during the pandemic's remote mode. Elections and fake news.]

B. Data Quality and Analysis Validation

The primary data-related issue is how to encourage people to share their data. The data used in CSS research is highly associated with people’s private lives.

[Privacy aspects. How to encourage the sharing of personal data?]

The first one is avoiding mistakes and ensuring accurate statistics during data collection. The second is extracting high-level intelligence from large-scale, low-value sensing data. And the last is managing heterogeneous data sources and effectively combining data covering different aspects.

[Quality. How to guarantee quality data collection (accuracy, freshness, variety of sources)?]

This error alerts us to two significant aspects we need to consider when solving problems in the future. The first is whether the data we plan to use conforms to our designed experiments or instruments, since using an estimated data source can lead to unexpected results. The other factor is the dynamics of the algorithms, whose stability and comparability may have a crucial influence on the final outcomes.

[Quality. How to guarantee that data-based prediction is accurate? Data sources, algorithms]

C. Relationship Evolution

[How do relationships change over time, and how does technology influence this? Bringing together geographically dispersed people.]

D. Social Dynamics of Science

[How to analyze the dynamics of production and interaction within academia. Production and dissemination of material for online classes. Online congresses and events generating videos. Does this contribute to opening up science? Research trends on certain topics: the influence of funding and of local issues of culture and government]
