
YouTube Video - 10 Steps for Verifying Statements in Knowledge Graphs

Video: https://youtu.be/jq9o4mYEFLo?si=6WCtMd8TbFUWziTq

Flow diagram: https://drive.google.com/file/d/1t0EH2nGZNZX6cYBqktmuvrGl2iDpkJpY/view

Source curation: select reliable sources

Data integration techniques

Step 1: Data, Content, and Source Audit (structured or unstructured)

Step 2: Source Tier Logic

Step 3: Statement Similarity Assessment

Step 4: Identification of duplicates, clusters, or similar statements

Step 5: Weighting Assessment

Step 6: Temporal data Flow

Step 7: Final score calculation

Step 8: Statement router

Step 9: Human Verification

Step 10: Adding verification tags and verified statements to the graph


Hey everybody, so today we are going to go over how you do statement validation for your knowledge graph. There are a lot of reasons you might want to do this. If you are one of the folks who thinks that, because you're sourcing data for your knowledge graph from internal sources or licensed data sets, it should all be correct: that's not true. And if you are using this for anything critical, like reporting, or maybe feeding this information to an LLM because you want help with the hallucination problem (which is another reason people are really looking at this specific topic), this is something we're going to walk through today.

A lot of what we're walking through today was founded a long, long time ago in regular scholarship. When you're doing research you want to be evidence-based, right? So when you look for evidence, you're looking for other authors or other projects that support the claims you are making, or that have claims that can also be reproduced or supported. That's essentially the same methodology we're going to use today. I'm going to walk through the exact process that I have used in many organizations. It has worked really well for verifying what is coming out of a graph, and for helping the LLM produce fewer hallucinations. It is also something I've used to verify different news sources coming in, to see whether a claim from a news source can be corroborated with other data points. So these are all areas where this process should be helpful. All right, if this sounds interesting to you, let's get started.

We are going to go through this step by step, and I'm going to explain each step so that you understand what each one is doing. I'm also going to put a link in the description bar below to the whole diagram that I'm going through here, so don't worry about taking screenshots as you go; the whole thing is down below if you want to go check that out.

Okay, so the first area is really, really important, and that is doing a source audit. Whether it comes from structured data sources or unstructured content, you need to do an audit of everything that is going to be ingested and used in your knowledge graph. Even if you don't do anything else, you should do that anyway. It is a really important step so that you have the provenance of all of the information in your knowledge graph, and it is then used to build your source tier list.

So what you're doing here is, first, looking at your unstructured sources. These are supposed to have some kind of metadata associated with them, and if they don't, you're going to have to add some. And if you missed my "why is taxonomy useful in the AI space" video (link up above), this is a really big part of that: you need to be able to identify what your coverage is of any given topic or thing in your graph, and a big way to do that is to be able to look at the data sources, or the documents if it's unstructured content, that you're using in your knowledge graph. That also helps you in the LLM space, because if your LLM is failing in a lot of medical areas, you can now go back to your sources and see how much medical content, or how many medical statements, you actually have. Again, a statement is something like Tom Cruise's date of birth; that is a statement. How much of that do you have in the medical space? Maybe that's why you're having hallucinations: you just don't have enough data to help it.

So this is a big, important part. Maybe these unstructured data sources are coming out of machine learning models. If so, it would be really helpful to have not just the confidence score, which is what the model thinks it did on its test (did I get the right answer for whatever the output is), but also the accuracy score: do the humans agree with that confidence? Some of that information is really helpful to have, because you can use it to identify how trustworthy some of these sources are when you get further into this process.

Then, on the dataset side, there are a whole lot of potential landmines in your ETL. This is one of those things I have seen so many times. People come back and say (and I've used this process, what we're going through here, multiple times, and it almost always happens): "Why am I getting so many errors? I don't know why; this data source is really high quality; what's happening here?" So if you think that because you're licensing data it is all good, factual, and accurate, it's not. It's not. It is much higher quality than something you would just scrape from the web, but chances are, if you go in and look at how the ETL was done, you don't have documentation on the decisions that were made, and those decisions are not queryable or accessible in a way that lets you understand, on a larger scale across all your sources, what decisions were made. For instance, were certain fields joined together? Maybe that made perfect sense for the knowledge graph, but now that you're trying to understand the trustworthiness of that statement, it's not totally trustworthy anymore, because there's something off about it. A good example of this is date formats. There are so many different date formats. Maybe someone took legacy data that was dated "the year of the cherry blossom" (a true story from my past) and they had to make a decision about what that means as a date, and maybe they guessed wrong. So if you're finding a lot of invalid statements in your graph, and the ones going into the LLM, and the responses you're seeing, are not as trustworthy, go check your ETL, or the people doing the ETL, because oftentimes this stuff isn't written down. That is my word to the wise: go in and look at that.

Another area you're going to see here is the source ID. You really want to make sure you have that audit so that you have a source ID for every source you're using, along with how often it is verified, how often it is updated, and whether it is more of an opinion piece (more like magazines and the opinions of fashion designers) or from an authoritative source on, say, fashion design education. Those are a little different. Making sure you have those source IDs, and the information about those sources, will help you with provenance and with making sure you can reproduce, or at least backtrack to, where these statements came from and how out of date they are. Date- and time-sensitive things are part of this as well, and we'll get into that a little later.

Last, if there are IDs from the original data source, make sure you keep them, again so you can backtrack to the source, and if there are issues (this comes back in the last part of this process) you can push them back. You don't have to do this part, but we do not want to continue to propagate bad data, right? Let's all do ourselves a favor: if you see bad data, see something, say something. If you have those IDs and you know which source they came from, you can then contact the source, the data provider, whoever they might be, and say, "Hey, we found all these errors on these statements, or in this data that you have in your data source; can you go fix it?" The reason you want to do that is not only to avoid propagating the bad data, but also because you don't want to have to fix the same mistake over and over again. If the original data source is corrected, or at least those things are taken out of it, you don't have to worry about them anymore.
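To make that audit concrete, here is a minimal sketch of what a source registry entry could capture. The schema and field names (`source_id`, `tier`, `refresh_cadence_days`, and so on) are my own illustration of the ideas above, not a format prescribed in the video.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class SourceRecord:
    """Illustrative provenance record for one ingested source (hypothetical schema)."""
    source_id: str                            # stable internal ID assigned during the audit
    name: str                                 # human-readable name of the dataset/document
    structured: bool                          # True for datasets, False for documents
    tier: Optional[int] = None                # 1..3, filled in during the source-tier step
    topics: list[str] = field(default_factory=list)  # taxonomy terms for coverage/gap analysis
    refresh_cadence_days: Optional[int] = None        # how often the provider updates it
    last_verified: Optional[date] = None              # when we last checked it
    original_ids_kept: bool = True            # keep provider IDs so errors can be pushed back
    model_confidence: Optional[float] = None  # if ML-generated: model's own confidence
    human_accuracy: Optional[float] = None    # if ML-generated: human-judged accuracy
    etl_notes: str = ""                       # documented ETL decisions (joins, date handling, ...)

# Example entry for a licensed dataset (made-up values)
example = SourceRecord(
    source_id="SRC-0007",
    name="Licensed clinical trials dataset",
    structured=True,
    topics=["medical", "clinical-trials"],
    refresh_cadence_days=30,
    etl_notes="Dates normalized to ISO 8601; 'year of the cherry blossom' rows flagged for review.",
)
```

Keeping the ETL decisions and provider IDs on the record is what makes the later "push it back to the source" step possible.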

Okay, now on to the source tier list. This is really important. If you were following along, or are following along, with the Times copyright-and-LLMs case: one, you should not be using data that you do not have the rights to use; that's number one, first and foremost. Number two, one of the things they were using in their court case was the finding that the Times' information was actually ranked and weighted higher in the trustworthiness scoring of the LLM that is under fire for it. So this is what a lot of folks do when they are doing verification on statements, not just within their knowledge graph but in the LLM space, even if they're not using a knowledge graph.

This tiered list goes back to the beginning of scholarship. You have something called an authoritative source; we are all taught this in our research. When you're doing research, you're probably not going to want to cite something that doesn't have peer review, and you're not going to want to cite something that cannot be verified; it needs to be evidence-based. There are lists we are taught: things coming from government resources, things coming from citable journals, and that sort of thing are more authoritative than not. That doesn't mean they're all factual, and it doesn't mean all the statements coming from them are going to be good quality (again, that's why this whole process exists). What it does mean is that there is a higher likelihood you're going to get better data from something that has a reputation for being factual, that applies a high degree of scrutiny, and that really cares about the data it is providing. That stuff goes into tier one.

Also, to give yourself a head start on this if you've never done it (let's go down to my little note here), you can go back to what scholars use, which is the h-index, or how well regarded a certain journal is. There is some controversy in that too, which is why I also suggest using something called Altmetric. Some journals are really well done and maybe just don't have the money to publish and get certified into the indexes that other publishers are in. So look at how trustworthy the scholarly community considers the thing. Even if you're using arXiv, which is not technically a journal, where papers are not necessarily peer-reviewed yet and sometimes are not the finished publication, you can still see how many people are using a paper and how people are talking about it. That doesn't necessarily mean you trust it wholesale, but it is a good indicator that others who know the area are talking about it in a positive way.

Now, if that doesn't apply at all, if the sources you're looking at don't show up in those indexes, one way to get your tiered list going is to look at a sample of statements, from either unstructured or structured sources, and see how they hold up across all of the data sources you're using. Or, if you can't find the information elsewhere in other sources, you can use SMEs, or human verification, which is a different process than just SMEs; we'll get to that in the latter part of this process too.

They need to be able to say, "Yeah, I know this author, I know this source, it's really good; I looked at these statements and they are accurate." All of that goes into how you set up your tiered list. The other thing is that you might have many tiered lists, and the reason is different use cases. If your knowledge graph is being used for, let's say, a medical product, and you have another one for medical products meaning, I don't know, vitamins or something, the trustworthiness of different sources for those two use cases, even if they are all medical sources, is going to be different.

So what do you do with this? I've color-coded the diagram so you can see it as we go. Tier one: you're probably not going to have a whole lot of things in tier one, only because those are the top-shelf sources you really know are very trustworthy, and out of the gate they're going to have higher quality. By the way, data quality is not the same thing as what we're talking about here. As I said in the intro, you can have SHACL shapes and other checks to say whether the constraints are being met; that does not mean the same thing as "this is an accurate statement," so keep that in mind. Things that mix opinion pieces, letters to the editor, and that kind of content in with, say, investigative journalism might be tier two, not because they're not trustworthy but because they mix pure opinion with more evidence-based material. And then tier three is for sources that maybe have a few journal articles, or a few editors or authors, that are really top-notch and doing great work, alongside a bunch of others that are not so trustworthy or not doing the best kind of scholarship or evidence-based research. That's where tier three comes in, and you can decide you don't want tier three at all. This is where the tier list comes in, and mapping it to your use case.
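As a minimal sketch of keeping separate tier lists per use case (the dictionary layout, tier numbers, and source IDs are my own illustration, not the video's format):

```python
# Hypothetical per-use-case source tier lists; the tiers and source IDs are made up.
SOURCE_TIERS: dict[str, dict[str, int]] = {
    # Use case: a medical product
    "medical_product": {
        "SRC-0001": 1,   # peer-reviewed journal with a strong reputation
        "SRC-0007": 2,   # licensed dataset that mixes trial data with commentary
        "SRC-0014": 3,   # aggregator with a few strong authors among many weak ones
    },
    # Use case: consumer supplements / vitamins content
    "vitamins": {
        "SRC-0001": 2,   # same journal, but less directly relevant here
        "SRC-0014": 2,
        "SRC-0021": 3,
    },
}

def tier_for(source_id: str, use_case: str, default: int = 3) -> int:
    """Look up a source's tier for a given use case, defaulting to the lowest tier."""
    return SOURCE_TIERS.get(use_case, {}).get(source_id, default)
```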

So once you get the statements from all of these sources, you're then going to do a similarity assessment of those statements. Here we have statement one, and we can see that sources one, three, and four are contributing. What does that mean? It means that this statement, or something very similar, is coming from those three different sources. That is what you're trying to do here; this goes back again to triangulation and being evidence-based: do you have other authors, other sources, that agree or nearly agree with that statement? Does the statement show up in other resources? Because if it does, it's likely (it doesn't mean it's true) that it is more trustworthy than something you can only find in one place. And again, that doesn't mean the single-source statement is untrustworthy either, and we'll get into how this is reciprocal: you can go back and look at statements and see if they pick up more evidence, and so on and so forth, again just like scholarship.

Then you want to make sure you do some deduplication, because the similarity step is just looking at statements across sources that are similar; it doesn't look at whether the same statement is being given to you over and over again, which would add more weight than it should. You can see the green and the orange showing this for statement one and statement three. Are these similar? We'll say we are 95% confident that they are similar statements, therefore you don't need to carry statement three forward, because it's just a slight variation of statement one. So that one drops out as we move forward.
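Here is a minimal sketch of the similarity and deduplication steps, using TF-IDF cosine similarity as one possible measure (embeddings would work just as well). The 0.95 cut-off mirrors the 95% example above; the lower corroboration threshold, the statements, and the source IDs are all illustrative assumptions.

```python
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (statement_id, source_id, text) tuples; entirely illustrative data
statements = [
    ("st1", "SRC-0001", "Tom Cruise was born on July 3, 1962."),
    ("st2", "SRC-0007", "High doses of vitamin C may cause back pain."),
    ("st3", "SRC-0014", "Tom Cruise was born on 3 July 1962."),
    ("st4", "SRC-0021", "Vitamin C at high doses can cause back pain."),
]

texts = [text for _, _, text in statements]
tfidf = TfidfVectorizer().fit_transform(texts)
sim = cosine_similarity(tfidf)              # pairwise similarity matrix

DUPLICATE_THRESHOLD = 0.95   # near-identical statements collapse into one
SUPPORT_THRESHOLD = 0.60     # illustrative: similar statements from other sources count as corroboration

duplicates, corroboration = [], []
for i, j in combinations(range(len(statements)), 2):
    score = float(sim[i, j])
    if score >= DUPLICATE_THRESHOLD:
        duplicates.append((statements[i][0], statements[j][0], round(score, 2)))
    elif score >= SUPPORT_THRESHOLD and statements[i][1] != statements[j][1]:
        corroboration.append((statements[i][0], statements[j][0], round(score, 2)))

# With this toy data, st1 and st3 collapse as near-duplicates of the same statement.
print("near-duplicates to collapse:", duplicates)
print("cross-source corroboration:", corroboration)
```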

Now we move to the confidence and trustworthiness weighting, which is really important here. You can see that the higher tiers get a higher weight, the middle tiers get an even weight, and the lower tiers actually get a negative weight. Again, you get to decide the weights you use; this is just for demonstration purposes. And it gives you a score. So this one is a score of three. This next statement (again, these are different statements going through the process), statement two, has a much higher score because it has two of the top tiers, so it gets a score of four. And then down here you can see these are all very low-tier sources, so this is a negative one. That one is very risky to put into your graph, or to continue to use in your graph if it's already there.
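A minimal sketch of that weighting, assuming purely illustrative tier weights (+2 for tier 1, +1 for tier 2, -1 for tier 3); as the video says, the actual weights are yours to choose.

```python
# Illustrative tier weights; tune these to your own use case.
TIER_WEIGHTS = {1: 2, 2: 1, 3: -1}

def statement_weight(supporting_tiers: list[int]) -> int:
    """Sum the tier weights of every source that supports (or nearly supports) a statement."""
    return sum(TIER_WEIGHTS.get(tier, -1) for tier in supporting_tiers)

# For example: one tier-1 and two tier-2 sources -> 2 + 1 + 1 = 4
print(statement_weight([1, 2, 2]))   # 4
print(statement_weight([1, 2]))      # 3
print(statement_weight([3]))         # -1
```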

So let's look at this. The confidence weighting is based on the source tiers (that's why that area is so important), and especially on the confidence scores you already have. Then you base the weight on the number of valid facts; "valid" here means they all agree and the statement is evidence-based. We're not saying it's trustworthy quite yet; we're saying it's evidence-based. Next is the rarity of the fact, which is not necessarily a negative. As I mentioned earlier, maybe there is a groundbreaking statement found in a certain research article. That doesn't mean it's untrustworthy; it just means it needs more evidence to be corroborated by others, which could be others citing that article or talking about it. It's a rarity, and that isn't bad. It actually means that if you have that statement and nobody else who is also using LLMs has it, your LLM or your graph has statements nobody else might have. That's not a bad thing; you just need to make sure you can trust it. Then you're also going to look at the saturation in other sources: how often does the statement show up across all the other sources? That really is the similarity work we were just talking about, based on all the different sources and the statements coming from them.

You're also going to want to look at the need for the fact, and that is your gap analysis. Remember, sources have those taxonomies associated with them, so you can understand whether you have gaps. If you have a gap, you might be more willing to let something in with a lower confidence (not too low; you have to set your own thresholds), because you really need more data in that space. Or maybe you go talk to your sourcing folks and get more data sources that will support that area. That information is helpful to know when you are deciding whether to let something in, because maybe you just don't have a lot of data on that topic yet, which means there will be fewer sources corroborating it, but that's not necessarily because it's untrustworthy; it's because you just don't have enough data for it.

Then there is refresh speed, which is important for temporal facts. LLMs struggle with anything time-sensitive because they are trained on a point in time, which is why they have to rely on all these other resources to keep feeding them data, and to supplement them when they don't have the most up-to-date information. So if you're going through this, having that source information about how often a source gets refreshed, and how quickly it picks up, say, the next thing in finance that everybody's talking about, is really important to keep in mind. There is also a special path in here that we'll go over for temporal data. You might not be able to use the same process for temporal data; you might need two pipelines, or two algorithms running on your data, because temporal means, first of all, there might not be data to support the statement yet. So you need a special cadence for temporal data: you need to get it in fast, but you also don't have enough data to support it yet, and you have to think that through when you're doing this.

Then there is the impact on the graph itself: if you add something in, is it going to make some of the algorithms you run really gnarly? That's something you might want to think about too. New and valid entities and statements are weighted higher, because again you want the most up-to-date material coming in. And then anything that is erroneous, disputed, or opinion-based gets looked at from an over-time perspective. That means that after you've done this a few times, you'll start to see that some sources or some statements show up with more errors, or show up as disputed more often (disputed meaning there is no definitive yes or no on the statement), or as opinions that are not really corroborated with evidence. You can see that over time, and you want to factor it into your trustworthiness weighting as well.

Let me talk about the temporal piece for a second, since we were touching on it. If the confidence is not high enough, or there isn't enough data to support a statement, it goes back into the queue for re-evaluation, and you want to re-evaluate it at least every one to three months. For evergreen statements, the ones that don't change very often, that might be every 6 to 12 months. Again, it depends on your use case. If you need information really fast, and you need that temporal information really fast, maybe it's an hourly or daily refresh to ask, "Who else is saying this?" Or maybe you have to skip some of these steps and put a notice into your UI so that folks go in with eyes wide open: this has not been verified, or we don't have enough evidence to say whether it is accurate. Make sure that when you're doing this, the end consumers are aware of what they can and cannot trust.
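A minimal sketch of that re-evaluation queue logic; the cadences echo the 1-3 month and 6-12 month ranges above, while the category names and the function itself are my own illustration.

```python
from datetime import date, timedelta
from typing import Optional

# Illustrative re-evaluation cadences, mirroring the ranges discussed above.
REEVALUATION_CADENCE = {
    "temporal": timedelta(days=1),      # fast-moving facts: daily (or even hourly)
    "unverified": timedelta(days=90),   # low-confidence statements: roughly every 1-3 months
    "evergreen": timedelta(days=365),   # stable facts: roughly every 6-12 months
}

def due_for_reevaluation(category: str, last_checked: date, today: Optional[date] = None) -> bool:
    """Return True if a statement in the given category should go back into the queue."""
    today = today or date.today()
    cadence = REEVALUATION_CADENCE.get(category, timedelta(days=90))
    return today - last_checked >= cadence

# Example: an evergreen statement last checked 400 days ago is due again.
print(due_for_reevaluation("evergreen", date.today() - timedelta(days=400)))  # True
```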

So all of the additional information we went over, like the need for the fact, the refresh speed, and which sources are giving you more error-prone material, goes into your final score, which you can see over here. The earlier part was just the source weighting, but then you run a separate assessment on top of it, weighting the actual statements, to produce a final score that says whether the statement can be deemed trustworthy or not. You can see here that this first statement got an 81; the second statement was very high (remember, because it had a lot of high-tiered sources), so it has a 93; and then this poor statement only has a 32.

And so this router takes the calculation that happened in the trustworthiness box and routes it to the appropriate place. If something was given a very poor verification score (you set the threshold for that), it gets sent to a different process where it's flagged as erroneous, or "do not use," or something similar. And then, again to avoid propagating bad data, you want to send it back to the original data source so they know it's something we don't have high confidence in. It's up to you whether you share with them all the other ways you were verifying it, but at the very least you need to flag it against the original data source as it comes in, so you don't keep trying to reprocess that statement.
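A minimal sketch of the statement router; the cut-off value and the labels are my own assumptions (the video leaves the thresholds up to you), and the scores 81, 93, and 32 echo the examples above.

```python
# Hypothetical threshold; set it to match your own use case and risk tolerance.
REJECT_BELOW = 40

def route_statement(statement_id: str, final_score: float) -> str:
    """Route a scored statement either out of the pipeline or on to the next checks."""
    if final_score < REJECT_BELOW:
        # Flag as erroneous / do-not-use and push feedback to the original data source.
        return f"{statement_id}: rejected (score {final_score}); notify source, do not reprocess"
    # High-confidence statements continue to the sensitivity / error / disputed / opinion checks.
    return f"{statement_id}: forwarded to sensitivity and dispute checks (score {final_score})"

for sid, score in [("st1", 81), ("st2", 93), ("st3", 32)]:
    print(route_statement(sid, score))
```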

The high-confidence statements go into the next piece, which is really important: the sensitivity check, error check, opinion check, disputed check, all of that. What this means is using the behavioral data you've seen in the past. You've done this a few times and you're saying: okay, even though this statement is high confidence, we've seen a lot of opinions start to show up, or we've noticed that, weirdly, some sensitive data is coming in from this source, or for some reason we got an outside tip that this source is no longer giving the accuracy we were anticipating. This is kind of your catch-all: if you need to put a whole lot more checks and balances in, this is the place to do it. And if you're seeing things that constantly show up that are not very good, or are disputed, again you want to send that to the data source, or at least to the folks who are ingesting the data source, so they can flag it on their end.

Okay, so when you send things to be human verified, it's a separate pipeline, and if you are interested in my human verification pipeline for machine learning and AI projects, I will make a video on that. What you do is send it to something like Mechanical Turk as a survey, where you present at least two statements, either the disputed statements or statements that don't agree on something. Maybe somebody said in an article that a certain level of vitamin C causes back pain, but a different article says it's a different level, or says it doesn't cause back pain at all. Those conflicting statements are the things you would want to send through to human verification. Now, with something like the example I just used, which is medical, you might want SMEs instead, so that would be a different type of pipeline: you can still use the same human verification pipeline, but you would have medical SMEs specifically looking at it.

But humans make lots of errors too, and especially if you're using Mechanical Turk, people want to get paid, so they may just answer anything, which doesn't really help you. One way to avoid that is to ask them to give the source where they found the answer, which could be a citation (if they're using Google Scholar or something, "get citation" is quick) or the link to the article or website they used to verify it. That helps you in several ways: one, it helps you identify new sources if you don't already have them; if you do have them, you already have a trustworthiness score, so you can tell whether the human who has been answering was looking at an authoritative source or not. And it also helps you identify bad actors in your human verification loop.
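Here is a minimal sketch of how a human-verification task for a pair of conflicting statements might be assembled. The task structure, field names, and example statements are hypothetical illustrations; this is not a Mechanical Turk API call.

```python
from dataclasses import dataclass, field

@dataclass
class VerificationTask:
    """One survey item asking a human which of two conflicting statements is supported."""
    task_id: str
    statement_a: str
    statement_b: str
    question: str = "Which statement (A, B, both, or neither) is supported by evidence?"
    require_source: bool = True              # ask for a citation or URL so answers can be checked
    answers: list[dict] = field(default_factory=list)

# Example: conflicting claims about vitamin C (made-up statements)
task = VerificationTask(
    task_id="HV-001",
    statement_a="High doses of vitamin C cause back pain.",
    statement_b="Vitamin C does not cause back pain.",
)

# A worker's answer includes the source they used, so it can be scored against the tier list
# and used to spot bad actors who never cite anything authoritative.
task.answers.append({
    "worker_id": "W-42",
    "choice": "B",
    "source_url": "https://example.org/some-review-article",
})
```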

Now, obviously, you're not going to send sensitive data to the human verifiers, so if something is deemed sensitive, like I said, it gets routed out: it gets taken out of the data sources and so on. And then, if one of the statements is deemed accurate, or at least evidence-based (there are others that can verify it is accurate), the statement can finally be deemed verified, and it gets a verification flag with a date and the sources. We want to be able to backtrack everything we do here. Then it can go into your knowledge graph, and it can live and breathe and have a good time, and get verified again on an annual basis, or whatever cadence you need to meet your use case's trustworthiness requirements. And of course, if you start to see errors coming from downstream applications, whether that's the LLM, recommendations, reports, or whatever, then this process kicks off all over again.
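As a minimal sketch of that last step, here is one way a verified statement could carry its verification tag, date, score, and supporting sources into the graph. I'm using a networkx multigraph as a stand-in for whatever graph store you actually use; the property names and values are illustrative assumptions.

```python
from datetime import date
import networkx as nx

# A tiny property-graph stand-in for the knowledge graph (illustrative only).
kg = nx.MultiDiGraph()

def add_verified_statement(graph, subject, predicate, obj, score, sources):
    """Add a statement edge carrying its verification tag, date, final score, and supporting sources."""
    graph.add_edge(
        subject,
        obj,
        key=predicate,
        predicate=predicate,
        verified=True,
        verification_date=date.today().isoformat(),
        final_score=score,
        sources=sources,   # source IDs so the statement can be backtracked later
    )

# Example: the Tom Cruise date-of-birth statement used earlier (score and source IDs are made up)
add_verified_statement(kg, "Tom Cruise", "dateOfBirth", "1962-07-03",
                       score=93, sources=["SRC-0001", "SRC-0014"])

print(kg.get_edge_data("Tom Cruise", "1962-07-03", key="dateOfBirth"))
```

Keeping the date and source IDs on the statement itself is what lets the annual (or faster) re-verification and the downstream error feedback loop find their way back here.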

All right, I know that was a whirlwind. I hope this has been helpful. If there was anything I went over too quickly, because there was a lot to unpack in this video, please leave your questions down below and I will be sure to answer them; I do regularly check the comments. And if you have any additions to this process that you have found helpful, please let me know. With that, I want to thank you very much, and I'll catch you next time.

 

 
