Video: https://youtu.be/jq9o4mYEFLo?si=6WCtMd8TbFUWziTq
Flow diagram: https://drive.google.com/file/d/1t0EH2nGZNZX6cYBqktmuvrGl2iDpkJpY/view
Source curation: selecting reliable sources
Data integration techniques
Step 1: Data, Content, and Source Audit (structured or unstructured)
Step 2: Source Tier Logic
Step 3: Statement Similarity Assessment
Step 4: Identification of Duplicate, Clustered, or Similar Statements
Step 5: Weighting Assessment
Step 6: Temporal Data Flow
Step 7: Final Score Calculation
Step 8: Statement Router
Step 9: Human Verification
Step 10: Adding Verification Tags and Verified Statements to the Graph
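The ten steps above can be sketched as a pipeline skeleton. Every function below is a no-op placeholder standing in for the logic the video describes; the names are illustrative, not part of the original process:

```python
# Skeleton of the ten-step validation flow. Each step is a placeholder;
# real implementations are discussed throughout the video.
def _identity(statements):
    return statements

audit_sources = apply_tier_logic = assess_similarity = deduplicate = _identity
weight = handle_temporal = score = route = verify_by_humans = _identity

def tag_and_load(statements):
    # Step 10: attach a verification flag before loading into the graph.
    return [dict(s, verified=True) for s in statements]

def run_pipeline(statements):
    steps = [
        audit_sources,        # Step 1: data, content, source audit
        apply_tier_logic,     # Step 2: source tier logic
        assess_similarity,    # Step 3: statement similarity assessment
        deduplicate,          # Step 4: duplicates / clusters / similar statements
        weight,               # Step 5: weighting assessment
        handle_temporal,      # Step 6: temporal data flow
        score,                # Step 7: final score calculation
        route,                # Step 8: statement router
        verify_by_humans,     # Step 9: human verification
        tag_and_load,         # Step 10: tags + verified statements to the graph
    ]
    for step in steps:
        statements = step(statements)
    return statements
```

Each placeholder gets filled in as the corresponding step is explained below.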
Hey everybody, today we're going to go over how to do statement validation for your knowledge graph. There are a lot of reasons you might want to do this. If you're one of the folks who thinks that because you're sourcing data for your knowledge graph from internal sources or licensed data sets it must all be correct — that's not true. And if you're using the graph for anything critical, like reporting, or feeding its information to an LLM because you want help with the hallucination problem (which is another reason people are really looking at this topic), this is something we're going to walk through today.

A lot of what we're walking through was founded long, long ago in regular scholarship. When you're doing research, you want to be evidence-based: you look for other authors or other projects that support the claims you're making, or whose claims can themselves be reproduced or supported. That's essentially the methodology we're using today. I'm going to walk through the exact process I've used in many organizations. It has worked really well for verifying what comes out of a graph, for helping the LLM hallucinate less, and for checking whether a claim from a news source can be corroborated with other data points. These are all areas this process should help with. If that sounds interesting, let's get started.
All right, we're going to go through this step by step, and I'll explain what each step is doing so you understand it. I'm also putting a link in the description below to the whole diagram I'm walking through, so don't worry about taking screenshots as you go — the whole thing is down below if you want to check it out.

Okay, the first area is really important, and that is doing a source audit — of your data sources, structured or unstructured. You need an audit of everything that's going to be ingested and used in your knowledge graph. Even if you do nothing else, do that. It gives you the provenance of all the information in your knowledge graph, and it's what you'll use to build your source tier list.

Start with your unstructured sources. These are supposed to have some kind of metadata associated with them, and if they don't, you're going to have to add some. If you missed my video on why taxonomy is useful in the AI space (linked up above), this is a really big part of it: you need to be able to identify your coverage of any given topic or thing in your graph, and a big way to do that is to look at the data sources — or the documents, if they're unstructured — that you're using. That also helps you in the LLM space. If your LLM is failing in a lot of medical areas, you can now go back to your sources and see how much medical content, or how many medical statements, you actually have. (By "statement" I mean something like "Tom Cruise — date of birth"; that is a statement.) Maybe that's why you're seeing hallucinations: you just don't have enough data to help it.

Maybe some of these unstructured sources come from machine learning models that you've trained on that data. If so, it's really helpful to have not just the confidence score — what the model thinks it did on its test, i.e., did it get the right answer for the output — but also the accuracy score: do the humans agree with that confidence? That information helps you identify how trustworthy these sources are when you get further into this process.

On the structured data set side, there are a whole lot of potential landmines in your ETL.
This is something I've seen so many times. People come back — and I've used this process multiple times — and say, "Why am I getting so many errors? This data source is really high quality; what's happening here?" If you think that because you're licensing data it's all good, factual, and accurate — it's not. It's much higher quality than something you'd just scrape from the web, but chances are, if you go look at how the ETL was done, you won't have documentation on the decisions that were made, and those decisions won't be queryable or accessible in a way that lets you understand, at a larger scale across all your sources, what was decided. For instance, maybe certain fields were joined together. That may have made perfect sense for the knowledge graph, but now, when you're assessing the trustworthiness of a statement, it's not totally trustworthy, because something is off about it. A good example is date formats — there are so many. Maybe someone took legacy data dated "the year of the cherry blossom" (a true story from my past) and had to make a judgment call on what that means as a date. Maybe they guessed wrong. So if you're seeing a lot of invalid statements in your graph — including the ones going into the LLM, whose responses then aren't as trustworthy — go check your ETL, or the people doing the ETL, because oftentimes this stuff isn't written down. That's my word to the wise.
Another area you'll see here is the source ID. Through that audit, make sure you have a source ID for every source you use, along with how often it's verified, how often it's updated, and whether it's more of an opinion piece — think magazines and the opinions of fashion designers — or an authoritative source on, say, fashion design education. Those are a little different. Having those source IDs and that information about each source helps with provenance, so you can reproduce, or at least backtrack to, where these statements came from and how out of date they are. Date- and time-sensitive material is part of this as well; we'll get to that a little later.

Last: if there are IDs from the original data source, keep them — again, so you can backtrack to the source. And if there are issues (this comes back in the last part of the process), you can push them back. You don't have to do this part, but we do not want to keep propagating bad data. Let's all do ourselves a favor: if you see bad data — see something, say something. If you have those IDs and know which source a statement came from, you can contact the data provider and say, "Hey, we found errors on these statements in your data source — can you go fix them?" The reason to do that is not only to stop propagating bad data, but also so you don't have to fix the same mistake over and over again. If the original data source is corrected, or at least those records are removed, you don't have to worry about them anymore.
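The audit described above can be captured in a small registry record. Every field name here is illustrative — an assumption about what such a record might hold, not something prescribed in the video:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SourceRecord:
    """One entry in a source audit registry (hypothetical shape)."""
    source_id: str                               # stable internal ID
    original_id: Optional[str]                   # provider's ID, kept for backtracking
    structured: bool                             # structured data set vs. unstructured docs
    topics: list = field(default_factory=list)   # taxonomy terms, for coverage/gap analysis
    refresh_cadence_days: Optional[int] = None   # how often the provider updates it
    model_confidence: Optional[float] = None     # model's self-reported confidence, if ML-derived
    human_accuracy: Optional[float] = None       # human-audited accuracy, if available
    tier: Optional[int] = None                   # filled in later by the source tier logic

# Example registry keyed by source ID ("src-001" and its fields are made up).
registry = {
    "src-001": SourceRecord("src-001", "PUBMED-XYZ", structured=True,
                            topics=["medical"], refresh_cadence_days=30,
                            human_accuracy=0.97, tier=1),
}
```

Keeping `original_id` is what lets you push errors back to the provider later.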
Okay, now on to the source tier list. This is really important. If you've been following along with the Times copyright-and-LLMs case: number one, first and foremost, you should not be using data you don't have the rights to use. Number two, one of the things raised in that court case was the finding that the Times's information was actually ranked and weighted higher in the trustworthiness scoring of the LLM under fire. This is what a lot of folks do when verifying statements — not just in their knowledge graph, but in the LLM space even without a knowledge graph.

This tiered list goes back to the beginning of scholarship. You have something called an authoritative source; we're all taught this in our research. When doing research, you probably don't want to cite something that hasn't been peer-reviewed, and you may not want to cite something that can't be verified — it needs to be evidence-based. We're taught lists: things coming from government resources, citable journals, and the like are more authoritative than not. That doesn't mean they're all factual, and it doesn't mean every statement coming from them will be good quality — again, that's why this whole process exists — but it does mean there's a higher likelihood of getting better data from something with a reputation for being factual, for applying a high degree of scrutiny, and for really caring about the data it provides. That stuff goes into tier one.
Also — let's go down to my little note here — to give yourself a head start if you've never done this, you can go back to what scholars use, which is the h-index: how well regarded a given journal is. There's some controversy in that too, which is why I also suggest something called Altmetric. Some journals are really well done and just don't have the money to publish and get the certification of being in these indexes that other publishers have. So look at how trustworthy the scholarly community considers the thing. Even if you're using arXiv — which is not technically a journal, where papers aren't necessarily peer-reviewed yet and are sometimes not the finished publication — you can still see how many people are using a paper and how people are talking about it. That doesn't necessarily mean you trust it wholesale, but it's a good indicator that others who know the area are talking about it in a positive way.

Now, if your sources don't show up in those indexes at all, another way to get your tier list going is to look at a sample of statements, from either unstructured or structured sources, and compare across all the data sources you're using. Or, if you can't find the source referenced anywhere else, have SMEs weigh in — which is a different process from full human verification; we'll get to that in the latter part of this process, too.
Those SMEs need to be able to say, "I know this author, I know this source, it's really good; I looked at these statements and they're accurate." All of that goes into how you set up the tiered list. The other thing: you might have many tiered lists. The reason is different use cases. If your knowledge graph is used for, say, a medical product, and you have another use case for medical products in the consumer sense — vitamins or something — the trustworthiness of different sources for those two use cases is going to differ, even if all your sources are medical.

So what do you do with this? I've color-coded the diagram so you can follow as we go. Tier one probably won't contain a whole lot, because those are top-shelf sources you really know are very trustworthy; out of the gate they'll have higher quality. By the way, data quality is not the same thing as what we're talking about here. As I said in the intro, you can have SHACL shapes and other checks confirming that constraints are being met; that does not mean a statement is accurate. Keep that in mind. Things that mix opinion pieces — letters to the editor and that kind of content — with, say, investigative journalism might be tier two: not because they're untrustworthy, but because they mix pure opinion with more evidence-based work. Tier three is for sources that have a few journal articles, editors, or authors who are really top-notch doing so many good things, mixed with a bunch of others that are less trustworthy, or not doing the best kind of scholarship or evidence-based research. And you can decide you don't want tier three at all — that's the point of the tier list and tailoring it to your use case.
Once you get the statements from all of these sources, you're going to run a similarity assessment across them. Here we have statement one, and we can see that sources one, three, and four are contributing. What does that mean? It means this statement — or very similar statements — is coming from those three different sources. This goes back to triangulation and being evidence-based: do you have other authors, other sources, agreeing or nearly agreeing with the statement? Does the statement show up in other resources? If it does, that doesn't mean it's true, but it's likely more trustworthy than something you can only find in one place. That doesn't mean the one-source statement is untrustworthy either — this is reciprocal; you can come back to statements later and see whether they've gained more evidence, and so on, just like scholarship.

Then you want to do some deduplication, because the similarity step only compares statements across sources; it doesn't check whether the same source gives you the same statement over and over, which would add extra weight. You can see the green and the orange marking statement one and statement three: are these similar? Say we're 95% confident they are. Then you don't need to carry statement three forward, because it's just a slight variation of statement one — it drops out as we move forward.
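The 95%-similarity check can be sketched with a simple lexical ratio. Real pipelines typically use embedding-based cosine similarity instead, so treat this purely as a stand-in; the threshold and statements are illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough lexical similarity in [0, 1]; a proxy for an embedding comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedupe(statements: list[str], threshold: float = 0.95) -> list[str]:
    """Keep the first statement of each cluster of near-identical statements."""
    kept: list[str] = []
    for s in statements:
        if all(similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

stmts = [
    "Tom Cruise was born on July 3, 1962.",
    "Tom Cruise was born on July 3, 1962",   # trivial variant — drops out
    "Tom Cruise starred in Top Gun.",
]
unique = dedupe(stmts)
```

The variant that differs only by punctuation scores above the threshold and is dropped, just like statement three in the diagram.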
Now we move to the confidence and trustworthiness weighting, which is really important. You can see that the higher tiers get a higher weight, the middle tiers an even weight, and the lower tiers actually a negative weight. Again, you decide the weights you use; these are just for demonstration purposes. The weighting gives you a score — here, a score of three. Statement two (these are different statements going through the process) gets a much higher score, four, because it has two top-tier sources. And down here, these are all very low-level sources, so the score is negative one. That statement is very risky to put into your graph — or to keep using if it's already there.
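The weighting sums to the scores shown in the diagram. The tier weights below are hypothetical values chosen only to reproduce the video's example numbers — the video stresses that you pick your own:

```python
# Illustrative tier weights: top tier positive, middle even, low tier negative.
TIER_WEIGHTS = {1: 2, 2: 1, 3: -1}

def statement_score(source_tiers: list[int]) -> int:
    """Sum the tier weights of every source contributing the statement."""
    return sum(TIER_WEIGHTS[t] for t in source_tiers)

s1 = statement_score([1, 1, 3])  # like statement one: mixed tiers -> 3
s2 = statement_score([1, 1])     # two top-tier sources -> 4
s3 = statement_score([3])        # only a low-tier source -> -1
```

A statement backed only by low-tier sources goes negative, flagging it as risky before it ever reaches the graph.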
So let's look at this. The confidence weighting is based on the source tiers — that's why that area is so important — and especially on the confidence scores you already have. Then you base the weight on the number of valid facts; "valid" here means they all agree, it's evidence-based. We're not saying it's trustworthy quite yet; we're saying it's evidence-based. Then there's the rarity of the fact. Rarity isn't necessarily negative, as I mentioned earlier: maybe a groundbreaking statement is found in a certain research article. That doesn't mean it's untrustworthy; it just means it needs more evidence to be corroborated by others — people citing that article, people talking about it. In fact, if you have a statement that nobody else using LLMs has, that's not a bad thing: your LLM or your graph has statements nobody else might have. You just need to make sure you can trust them. You'll also look at saturation in other sources — how often the statement shows up across all the other sources, which is really the similarity work we were talking about, based on all the different sources and the statements coming from them.
You're also going to want to look at the need for the fact — your gap analysis. Remember, sources have taxonomies associated with them, so you can see whether you have gaps. If you have a gap, you might be more willing to let in something with a lower confidence (not too low — you have to set your own thresholds), because you really need more data in that space. Or maybe you go talk to your sourcing folks and get more data sources to support that area. That information is helpful when you're deciding whether to let something in: maybe you just don't have much data on the topic yet, which means fewer sources corroborating it — not because it's untrustworthy, but because you don't have enough data for it yet.
Refresh speed is important for temporal facts. LLMs are bad at anything time-sensitive because they're trained at a point in time, which is why they need all these other resources to keep feeding them data, and to supplement when they might not have the most up-to-date information. So if you're going through this — getting, say, the next big thing in finance that everybody's talking about — having that source information on how often each source is refreshed, and how fast, really matters. There's also a special process here for temporal data, which we'll go over. You might not be able to run temporal data through the same process; you might need two pipelines like this, or two algorithms running on your data. Temporal means there might not be data to support the statement yet, so you need a special cadence for it: you need to get it in fast, but you also don't have enough data to support it. You have to think that through when you're doing this.
Then there's the impact on the graph itself. If you add something, is it going to make some of the algorithms you run really gnarly? That's something you might want to think about too. New, valid entities and statements are weighted higher, because again you want the most up-to-date material coming in. And anything that is error-prone, disputed, or opinion-based you look at from an over-time perspective. After you've done this a few times, you'll start to notice that some sources or statements show up with more errors, or more often as disputed — disputed meaning there's no definitive yes/no on the statement — or as opinions not really corroborated by evidence. You can see that over time, and you want to factor it into your trustworthiness weighting as well.
Let me talk about the temporal piece for a second, since we were touching on it. If the confidence isn't high enough, or there isn't enough data to support a statement yet, it goes back into the queue for re-evaluation. You want to re-evaluate roughly every one to three months, and for evergreen material — things that don't change very often — maybe every six to twelve months. Again, it depends on your use case. If you need temporal information really fast, maybe it's an hourly or daily refresh to check who else is saying this. Or maybe you skip some of these steps and put a notice into your UI so folks are eyes-wide-open: "this has not been verified," or "we don't have enough evidence to say whether this is accurate." Make sure the end consumers are aware of what they can and cannot trust.
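The re-evaluation cadences just described can be sketched as a small scheduler. The interval values follow the video's rough guidance, but the exact numbers and category names are assumptions you'd tune per use case:

```python
from datetime import date, timedelta

# Hypothetical cadences: temporal facts fast, unresolved statements 1-3 months,
# evergreen material every 6-12 months (here: the long end collapsed to one value).
REVIEW_INTERVALS = {
    "temporal": timedelta(days=1),
    "unresolved": timedelta(days=90),
    "evergreen": timedelta(days=365),
}

def next_review(last_review: date, kind: str) -> date:
    """When a statement should re-enter the re-evaluation queue."""
    return last_review + REVIEW_INTERVALS[kind]

due = next_review(date(2024, 1, 1), "unresolved")
```

A statement that fails its confidence threshold just gets a `next_review` date instead of being discarded, so it can gain evidence over time.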
All of the additional information we went over — the need for the fact, refresh speed, which sources tend to give you error-prone statements — goes into your final score, which you can see over here. The earlier part was just the source weighting; then you have a separate assessment that weights the actual statements to give you the final score determining whether a statement can be deemed trustworthy or not. Here, the first statement got an 81. The second statement was very high — remember, it had a lot of high-tier sources — so it has a 93. And this poor statement only has a 32.
The router takes the calculation that happened in the trustworthiness box and routes each statement to the appropriate place. If something was given a very poor verification score — and you set the threshold for that — it's sent to a different process where it's flagged as erroneous, do-not-use, or similar. And again, to avoid propagating bad data, you want to send it back to the original data source so they know it doesn't have high confidence. It's up to you whether you share with them all the other ways you verified it, but at the very least flag it where the data comes in, so you don't keep trying to reprocess that statement. The high-confidence statements go into the next piece, which is really important.
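The router itself is just thresholding. The cutoff values below are hypothetical — the video is explicit that you set your own — and the route names describe the destinations discussed here:

```python
# Hypothetical thresholds; the video's example statements scored 81, 93, and 32.
def route(statement_id: str, score: float,
          verify_above: float = 75, reject_below: float = 40) -> str:
    """Send each scored statement down the appropriate path."""
    if score >= verify_above:
        return "sensitivity-and-error-check"       # then human verification if disputed
    if score < reject_below:
        return "flag-erroneous-and-notify-source"  # don't propagate bad data
    return "re-evaluation-queue"                   # not enough evidence yet

routes = {sid: route(sid, s) for sid, s in
          [("s1", 81), ("s2", 93), ("s3", 32)]}
```

With these cutoffs, statements one and two move forward while the 32-scoring statement is flagged back to its source.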
That is the sensitivity check — error check, opinion, disputed, all of that. What this means is using behavioral data you've seen in the past. You've run this a few times, and you're saying: okay, even though this statement is high confidence, we've seen a lot of opinions start to show up; or we've noticed, weirdly, some sensitive data coming in from this source; or for some reason we got an outside tip that this source is no longer giving the accuracy we were anticipating. This is your catch-all: if you need to put in a whole lot more checks and balances, this is the place to do it. And if you're seeing things that constantly show up as not very good or disputed, again, send that to the data source — or at least to the folks ingesting it — so they can flag it on their end.
Okay — when you send things to be human-verified, that's a separate pipeline. (If you're interested in my human-verification pipeline for machine learning and AI projects, I will make a video on that.) What you do is send the work to something like Mechanical Turk as a survey, where you present at least two statements: either the disputed statements, or statements that don't agree with each other. Maybe somebody said in an article that a certain level of vitamin C causes back pain, but a different article says it's a different level, or says it doesn't cause back pain at all. Those conflicting statements are what you send through to human verification. Now, for something like the example I just used, which is medical, you might want SMEs — that's a different flavor of the same human-verification pipeline, with medical SMEs looking at these statements specifically.

But humans make lots of errors too, and especially on Mechanical Turk, people want to get paid and will answer just to answer, which doesn't really help you. One way to avoid that is to ask them to provide the source where they verified the statement — a citation (if they're using Google Scholar or similar, "get citation" is quick) or a link to the article or website they used. That helps you in two ways. One, it helps you identify new sources you don't already have; and if you do have them, you already have a trustworthiness score, so you can tell whether this human was looking at an authoritative source or not. Two, it helps you identify bad actors in your human-verification loop.
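One way the cited sources can expose bad actors is to check each worker's citations against the tier list you already maintain. This heuristic, the trusted-domain set, and the ratio cutoff are all assumptions for illustration — the video only says the citations "help you identify bad actors":

```python
# Hypothetical bad-actor heuristic: flag workers who rarely cite
# domains already present in your trusted (high-tier) source registry.
TRUSTED = {"pubmed.gov", "who.int"}   # illustrative trusted domains

def flag_bad_actors(answers: dict[str, list[str]],
                    min_trusted_ratio: float = 0.5) -> set[str]:
    """answers maps worker_id -> list of cited source domains."""
    flagged = set()
    for worker, cited in answers.items():
        trusted = sum(1 for d in cited if d in TRUSTED)
        if cited and trusted / len(cited) < min_trusted_ratio:
            flagged.add(worker)
    return flagged

suspects = flag_bad_actors({
    "w1": ["pubmed.gov", "who.int", "pubmed.gov"],
    "w2": ["random-blog.example", "link-farm.example"],
})
```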
Now, obviously, you're not going to send sensitive data to the human verifiers. If something is deemed sensitive, as I said, it gets routed out — taken out of the data sources and all of that. And if one of the statements is deemed accurate, or at least evidence-based — others can verify it's accurate — then the statement can finally be deemed verified, and it gets a verification flag with a date and the sources. We want to be able to backtrack everything we do here. Then it can go into your knowledge graph, where it can live and breathe and have a good time, and get verified again on an annual basis — or whatever basis you need to meet your use case's trustworthiness requirements. And of course, if you start to see errors from downstream applications — LLMs, recommendations, reports, whatever — this process will kick off all over again.
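The verification flag with its date and sources might look like the record below. The field names and values are illustrative — the video only specifies that the flag carries a verification date and the supporting sources so everything stays backtrackable:

```python
from datetime import date

# Hypothetical shape of the tag attached to a verified statement
# before it is written to the knowledge graph.
verification_tag = {
    "statement_id": "s2",
    "verified": True,
    "verified_on": date(2024, 6, 1).isoformat(),
    "supporting_sources": ["src-001", "src-004"],  # backtrackable source IDs
    "final_score": 93,
    "next_review": "2025-06-01",  # e.g. annual re-verification
}
```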
All right, I know that was a whirlwind, but I hope it's been helpful. There was a lot to unpack in this video, so if there was anything I went over too quickly, please leave questions down below and I'll be sure to answer them — I do regularly check the comments. And if you have any additions to this process that you've found helpful, please let me know. With that, thank you very much, and I'll catch you next time.