Is there another side to this?
Identifying Disputed Information on the Web
Rob Ennals, Intel Research Berkeley - rob@ennals.org
Work done in collaboration with: John Mark Agosta, Dan Byler, Beth Trushkowsky, Barbara Rosario, Tad Hirsch, Tye Rattenbury

About Me: Rob Ennals
Senior Research Scientist at Intel Research.
Represent Intel at W3C for HTML and Web Apps.
PhD from University of Cambridge (advised by Simon Peyton Jones, Microsoft Research).
Diverse interests: PL, Concurrency, Systems, Web, Mashups, HCI, NLP, Politics, etc.

Not everything on the web is true, balanced, and objective

People increasingly rely on the web for information
Source: Pew Research

Old Model: small number of known sources
    TV, Radio, Newspaper, Book Publishers
New Model: huge number of unknown sources
    Blogs, random websites, foreign newspapers
Not just an issue of source credibility. If we ignore untrusted sources then we ignore a lot of the information on the web.

Dispute Finder
Inform users when information that they encounter in their lives is disputed by a source that they might trust.

Browser extension
Firefox extension examines every page you browse (including email, intranet pages, etc).
Highlights claims that are disputed.

Click a dispute for more information
Show sources that support or oppose the claim.

Search Engine Front-End
Built with Yahoo BOSS.
Examines text on all linked pages.

Early Work: Mobile Voice Interface
Currently an early prototype, running on a laptop, based on Dragon NaturallySpeaking.
Listen to everything people say around you. Keep a list of disputed things you may have heard. Vibrate when you hear something disputed.

Future: Disputed Claims on TV

Future: Mail, Books, News, etc.

People seem to like it
Covered by: NPR, New Scientist, Fast Company, Christian Science Monitor, Wall Street Journal, NY Times Bay Area, San Jose Mercury, SF Chronicle, The Guardian, ACM TechNews, CBC (Canadian Public Radio), Cnet, Sacramento Bee, + many others.
TG Daily: “This is hands down, the most amazing idea I’ve ever heard of when it comes to using the web”
Paper accepted for WWW 2010 + WICOW 2010.

Overall structure:

Related Work: Social Annotation
Videolyzer, Diigo, SpinSpotter
Need to mark every instance individually.

Related Work: Fact Checker Sites
Need to suspect something may be disputed.

Related Work: Source Rating
Automatic quality metrics.
But: Non-credible sources still have useful information.
But: Credible sources still get stuff wrong.

Related Work: Wiki Source Tracking
WikiTrust, WikiScanner
Who wrote this, and are they credible/biased?
Great if your content is on Wikipedia.

Overall structure:

Compare Observed Text to Known Disputes
Glenn Beck falsely claimed that the moon is made of cheese, despite clear evidence to the contrary.
False claim: "the moon is made of cheese"
Disputed by: Huffington Post, New York Times
Context: ...
Entailment: "We should mine the moon because it is made of cheese"

Contradiction detection via dispute detection

Contradiction detection vs Dispute Detection
Contradiction detection:
    Does statement X logically contradict statement Y?
    Hard: need lots of real-world knowledge.
Dispute detection:
    Does author A believe that statement X is disputed or misleading?
    Humans determine what is actually disputed.
    Humans determine which disputes are interesting.
    Only detects contradictions that humans find.
    Detects statements that are misleading without being wrong.
Once we have determined that a dispute is real, could use contradiction detection and sentiment analysis to see who is on each side.

A statement can be misleading without being wrong
    GM's misleading claim that the Chevrolet Volt gets 230 miles per gallon
    deceptively claimed that fast food could be nutritious
Logical truth isn't all that interesting.
We want to know if there is a different way of looking at the subject. A different frame.

Mining claims from the web

Use Patterns to Find Disputed Claims
    the false claim that Himalayan glaciers could melt away by 2035
    it is not true that anyone aged over 59 cannot receive heart repairs
    the misconception that everyone in the south are stupid
    the delusion that scientists in different countries do science differently
    into believing that Van Morrison had a new baby
    the myth that we can't afford good working conditions for everyone
    misleadingly claimed that unemployment is lower than the '70s
We built a simple grammar for such prefixes.
Currently 1293 patterns, identified on ~35 million web pages, of which we have downloaded and processed 2 million.
Restricting to prefixes allows us to search for them using Yahoo BOSS.
Future: automatically infer a larger grammar of patterns
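
The prefix idea above can be sketched in a few lines of Python. This is a minimal illustration, not the Dispute Finder grammar: the list holds only the prefixes quoted on the slides, and the names `PREFIXES` and `extract_claims`, as well as the rule "the claim runs to the end of the sentence", are assumptions made for the example.

```python
import re

# Illustrative subset of dispute prefixes quoted on the slides.
# The real grammar reportedly has 1293 patterns; this list is an assumption.
PREFIXES = [
    "the false claim that",
    "it is not true that",
    "the misconception that",
    "the delusion that",
    "into believing that",
    "the myth that",
    "misleadingly claimed that",
    "falsely claimed that",
]

# Try longer prefixes first so a longer pattern wins over a shorter one.
_ALTERNATIVES = "|".join(
    re.escape(p) for p in sorted(PREFIXES, key=len, reverse=True)
)
# Assumption: the claim is whatever follows the prefix up to sentence-ending
# punctuation. The slides do not say where the real extractor stops.
_PATTERN = re.compile(r"(?:" + _ALTERNATIVES + r")\s+([^.!?;]+)", re.IGNORECASE)


def extract_claims(text):
    """Return the text following each dispute prefix found in `text`."""
    return [match.group(1).strip() for match in _PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = ("Critics repeated the false claim that Himalayan glaciers "
              "could melt away by 2035. Nobody agreed.")
    print(extract_claims(sample))
```

Raw extractions like these still contain fragments and ambiguous text, which is why a filtering step is needed afterwards.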

Some Disputes I Wasn’t Aware of
    The Niger-Iraq Uranium connection has been discredited
    Medieval Europeans thought the world was flat
    Dinosaurs looked sleek and reptilian.
    Dietary Cholesterol is a problem.
    “Wear and Tear” causes arthritis
    Specific foods cause ulcers
Estimates from Yahoo BOSS. Not all URLs downloaded.

Most Disputed Nouns
    1. God
    2. Iraq
    3. Government
    4. Obama
    5. War
    6. Israel
    7. President
    8. Women
    9. Money
    10. Jesus

Search for all patterns on Yahoo BOSS
Yahoo BOSS is an API for Yahoo search.
BOSS API has a limit of 1000 hits per query, so salt with year and month.
    +"falsely claimed that" +2010
    +"falsely claimed that" -2010 +2009
    +"falsely claimed that" -2010 -2009 +2008
    +"falsely claimed that" -2010 -2009 -2008 +2007
Needed for 197 patterns.
We talked to Yahoo first...
Future: get direct access to complete results for a pattern
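
The salting scheme shown above can be generated mechanically. The sketch below only builds query strings in the same shape as the examples on the slide; it does not call Yahoo BOSS, and the function name `salted_queries` is made up for illustration. Salting by month is not shown.

```python
def salted_queries(pattern, years):
    """Build one query per year, each excluding the years already covered.

    Every query requires the quoted pattern plus one year term, and excludes
    all earlier year terms, so the result sets do not overlap and each stays
    under the per-query hit limit more easily.
    """
    queries = []
    for i, year in enumerate(years):
        excluded = "".join(f" -{earlier}" for earlier in years[:i])
        queries.append(f'+"{pattern}"{excluded} +{year}')
    return queries


if __name__ == "__main__":
    for query in salted_queries("falsely claimed that", [2010, 2009, 2008, 2007]):
        print(query)
```

Pages that mention none of the salt terms are missed by this scheme, which is one reason direct access to complete results would be preferable.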

Claims need to be filtered
    the false claim that won't go away
    falsely claimed that he didn't do it
    wrongly think that the bill will pass
    wrongly think that Great Britain doesn't
    the myth that Elvis is alive has a long history
    falsely claim that full commentary below
Problems: fragment, ambiguous, suffix, extraction error

Labeled data from Mechanical Turk
$0.04 to label 10 claims, two of which are known.
If a turker gets known items wrong, reject their work.
Each claim labeled by two turkers.
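
The quality-control rule above (known items mixed into each batch, two labels per claim) might look like the following sketch. It works on plain dictionaries and does not use the Mechanical Turk API. The function names are hypothetical, and keeping only the claims on which both workers agree is an assumption: the slides do not say how disagreements were resolved.

```python
def accept_submission(answers, known):
    """Accept a worker's batch only if every known item is labeled correctly.

    `answers` maps claim text to the worker's label; `known` maps the
    pre-labeled claims hidden in the batch to their correct labels.
    """
    return all(answers.get(claim) == label for claim, label in known.items())


def agreed_labels(labels_a, labels_b):
    """Keep only claims that both workers labeled identically (assumption)."""
    return {
        claim: label
        for claim, label in labels_a.items()
        if labels_b.get(claim) == label
    }


if __name__ == "__main__":
    known = {"won't go away": "bad", "Elvis is alive": "good"}
    worker = {"won't go away": "bad", "Elvis is alive": "good",
              "he didn't do it": "bad"}
    print(accept_submission(worker, known))
```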

Problem: text may not be a statement
    the false claim that won't go away
    the belief that works best
    the lie that people fell for
Current approach: Is the first word a verb?
    finds 71% of bad claims
    mistakenly drops 2% of good claims
Works for first two, but not last.
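
A rough sketch of the first-word test follows. The slides do not say how verbs are detected, so the small hand-written word set here is purely a stand-in for a part-of-speech tagger, and `LEADING_VERBS` and `looks_like_fragment` are names invented for the example.

```python
# Stand-in for a part-of-speech tagger: a tiny hand-written set of verb forms.
# This list is an assumption made for illustration only.
LEADING_VERBS = {
    "won't", "will", "works", "is", "was", "are", "has", "had",
    "can", "can't", "does", "doesn't", "did", "didn't",
}


def looks_like_fragment(claim):
    """Flag extracted text whose first word is a verb.

    Text such as "won't go away" describes the claim rather than stating it,
    so it is dropped. Text starting with a noun passes, even when it is still
    not a statement.
    """
    words = claim.lower().split()
    return not words or words[0] in LEADING_VERBS


if __name__ == "__main__":
    for text in ["won't go away", "works best", "people fell for"]:
        print(text, "->", looks_like_fragment(text))
```

As on the slide, the first two examples are caught and the last one ("people fell for") slips through, because its first word is a noun.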

Problem: ambiguous claims
    he didn't do it
    the union was a party in the proceedings
    the other parent is abusive
    our troops have committed atrocities
    property taxes are regressive
    Obama is a communist
Bad / Maybe / Good
If two pages say X, do they mean the same thing?
Turk: 61.9% agreement - often very subjective
Future: associate claim with page topic

Wikipedia links tell us what is unambiguous
    property taxes are regressive
    Obama is a communist
Is this word always linked to the same thing?
Precision: 73%, Recall: 73% (vs gold data + word features)
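
One way to read "is this word always linked to the same thing?" is as a consistency score over Wikipedia anchor texts. The sketch below assumes the links have already been extracted as (anchor text, target page) pairs; the function name and the toy pairs are invented for illustration, and the slides do not describe how the real feature was computed.

```python
from collections import Counter, defaultdict


def link_consistency(links):
    """Map each anchor text to the share of its links that point at its
    most common target. A value of 1.0 means the text always links to the
    same page; lower values suggest the text is ambiguous.
    """
    targets = defaultdict(Counter)
    for anchor, target in links:
        targets[anchor.lower()][target] += 1
    return {
        anchor: max(counts.values()) / sum(counts.values())
        for anchor, counts in targets.items()
    }


if __name__ == "__main__":
    # Toy (anchor, target) pairs, made up for this example.
    toy_links = [
        ("Obama", "Barack_Obama"),
        ("Obama", "Barack_Obama"),
        ("the union", "European_Union"),
        ("the union", "Trade_union"),
    ]
    print(link_consistency(toy_links))
```

A score like this could serve as one feature for an ambiguity classifier alongside word features.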

Overall structure:

Users enter claims that they disagree with

Users add paraphrases for claims
Alternative ways to phrase the same claim.

Teach Dispute Finder to recognize claims

Users add evidence to support claims
A claim will not be shown to others unless the user finds a source that argues against it.

Users identify a disputed claim on a page
Define a new disputed claim, or add a paraphrase for an existing disputed claim.

User Study Results
Frustrated by low number of claims that were highlighted
    - motivated text mining approach
Did not appreciate that a claim should apply to multiple pages
    - particularly when using context menu approach
Confused about how specific a claim should be
    E.g. “Global temperatures will rise by X degrees”
Users created claims with ambiguous meanings
    E.g. saying “wood” to mean “Ronnie Wood”
Confused by double-negatives when adding evidence
    E.g. opposes “global warming does not exist”
Future: use users to improve mined claims

Entailment

Entailment is resource constrained
Must compare many sentences against a huge number of claims in a fraction of a second.

Simple lexical entailment
    I think that global warming is just a hoax
    global warming is a hoax
All non-stopwords present, and in the correct order.
Very simple but:
    it can be done very efficiently
    if you have a big enough corpus then it works ok
Future: better entailment that still scales
Future: look at context, and other places same text appears
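
The rule "all non-stopwords present, and in the correct order" amounts to a subsequence check over content words. Here is a minimal sketch; the stopword list is a small assumption chosen to cover the slide examples, and the names `content_words` and `entails` are illustrative rather than taken from Dispute Finder.

```python
import re

# Small illustrative stopword list (an assumption, not the one actually used).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "that", "of", "it", "i", "just"}


def content_words(text):
    """Lowercase the text and return its words with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def entails(sentence, claim):
    """True if every non-stopword of `claim` occurs in `sentence`, in order."""
    wanted = content_words(claim)
    remaining = iter(content_words(sentence))
    # `word in remaining` advances the iterator past the match, so each later
    # word must appear after the previous one: a subsequence test.
    return bool(wanted) and all(word in remaining for word in wanted)


if __name__ == "__main__":
    print(entails("I think that global warming is just a hoax",
                  "global warming is a hoax"))
```

The check is a single pass over the sentence for each candidate claim, which fits the requirement that matching be very cheap; in practice an index from words to claims would presumably be needed to avoid testing every claim.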

What is Disputed?
Anything disputed by anyone?
    - we get overwhelmed with claims disputed by nutcases
Anything disputed by a “reliable source”?
    - what is a “reliable source”? (Wikipedia rules?)
    - do we end up enforcing “orthodox” beliefs and stifling debate?
Anything disputed by a source that I would trust?
    - we reinforce existing echo-chamber problem
Anything disputed by my friends?
    - do I agree with my friends?
    - should I be encouraged to agree with them?
Future: learn what to show a user by analyzing their behavior

Interviews: Do people want this?
Hard to change established opinions
    They think they already understand the issue.
    They would have to publicly back down.
    So focus on issues they don’t yet have an opinion on?
Hard to make someone accept the other side
    Social identity in “us” vs “them”
    Not willing to listen to “other side”
    So give sources from their “own” side?
Sometimes people may not care
    Reading just for entertainment and conversation material
    Don’t care much if they are wrong
    Not interested in challenging opinions of others
    Focus on issues that affect them personally
Dispute Finder probably isn’t for everyone