SemEval-2017 Task 8: RumourEval: Determining rumour veracity and
support for rumours
Leon Derczynski♥∗, Kalina Bontcheva♥, Maria Liakata♣, Rob Procter♣,
Geraldine Wong Sak Hoi♦ and Arkaitz Zubiaga♣
♥: Department of Computer Science, University of Sheffield, S1 4DP, UK
♣: Department of Computer Science, University of Warwick, CV4 7AL, UK
♦: swissinfo.ch, Bern, Switzerland
∗: leon.d@shef.ac.uk
Abstract
Media is full of false claims. Even Oxford Dictionaries named “post-truth” as the word of 2016. This makes it more important than ever to build systems that can identify the veracity of a story, and the nature of the discourse around it. RumourEval is a SemEval shared task that aims to identify and handle rumours and reactions to them, in text. We present an annotation scheme, a large dataset covering multiple topics – each having its own families of claims and replies – and use these to pose two concrete challenges; we also report the results achieved by participants on these challenges.
1 Introduction and Motivation
Rumours are rife on the web. False claims affect
people’s perceptions of events and their behaviour,
sometimes in harmful ways. With the increasing
reliance on the Web – social media, in particular –
as a source of information and news updates by in-
dividuals, news professionals, and automated sys-
tems, the potential disruptive impact of rumours is
further accentuated.
The task of analysing and determining veracity
of social media content has been of recent interest
to the field of natural language processing. After
initial work (Qazvinian et al., 2011), increasingly
advanced systems and annotation schemas have
been developed to support the analysis of rumour
and misinformation in text (Kumar and Geethaku-
mari, 2014; Zhang et al., 2015; Shao et al., 2016;
Zubiaga et al., 2016b). Veracity judgment can
be decomposed intuitively in terms of a compar-
ison between assertions made in – and entailments
from – a candidate text, and external world knowl-
edge. Intermediate linguistic cues have also been shown to play a role. Critically, recent work suggests that the task is deeply nuanced and very challenging, while having important applications in, for example, journalism and disaster mitigation (Hermida, 2012; Procter et al., 2013a; Veil et al., 2011).
We propose a shared task where participants
analyse rumours in the form of claims made in
user-generated content, and where users respond
to one another within conversations attempting to
resolve the veracity of the rumour. We define a ru-
mour as a “circulating story of questionable verac-
ity, which is apparently credible but hard to verify,
and produces sufficient scepticism and/or anxiety
so as to motivate finding out the actual truth” (Zu-
biaga et al., 2015b). As breaking news unfolds, gathering opinions and evidence from as many sources as possible while communities react becomes crucial for determining the veracity of rumours and, consequently, for reducing the impact of the spread of misinformation.
Within this scenario where one needs to listen
to, and assess the testimony of, different sources
to make a final decision with respect to a rumour’s
veracity, we ran a task in SemEval consisting of
two subtasks: (a) stance classification towards ru-
mours, and (b) veracity classification. Subtask A
corresponds to the core problem in crowd response
analysis when using discourse around claims to
verify or disprove them. Subtask B corresponds
to the AI-hard task of assessing directly whether
or not a claim is false.
1.1 Subtask A - SDQC Support / Rumour stance classification
Related to the objective of predicting a rumour’s
veracity, Subtask A deals with the complementary
objective of tracking how other sources orient to
the accuracy of the rumourous story. A key step in the analysis of the surrounding discourse is to determine how other users in social media regard the rumour (Procter et al., 2013b).

SDQC support classification. Example 1:
u1: We understand there are two gunmen and up to a dozen hostages inside the cafe under siege at Sydney.. ISIS flags
remain on display #7News [support]
u2: @u1 not ISIS flags [deny]
u3: @u1 sorry - how do you know it’s an ISIS flag? Can you actually confirm that? [query]
u4: @u3 no she can’t cos it’s actually not [deny]
u5: @u1 More on situation at Martin Place in Sydney, AU –LINK– [comment]
u6: @u1 Have you actually confirmed its an ISIS flag or are you talking shit [query]
SDQC support classification. Example 2:
u1: These are not timid colours; soldiers back guarding Tomb of Unknown Soldier after today’s shooting #StandforCanada
–PICTURE– [support]
u2: @u1 Apparently a hoax. Best to take Tweet down. [deny]
u3: @u1 This photo was taken this morning, before the shooting. [deny]
u4: @u1 I don’t believe there are soldiers guarding this area right now. [deny]
u5: @u4 wondered as well. I’ve reached out to someone who would know just to confirm that. Hopefully get
response soon. [comment]
u4: @u5 ok, thanks. [comment]
Figure 1: Examples of tree-structured threads discussing the veracity of a rumour, where the label asso-
ciated with each tweet is the target of the SDQC support classification task.
We propose
to tackle this analysis by looking at the conversa-
tion stemming from direct and nested replies to the
tweet originating the rumour (source tweet).
To this effect RumourEval provided partici-
pants with a tree-structured conversation formed
of tweets replying to the originating rumourous
tweet, directly or indirectly. Each tweet presents
its own type of support with respect to the rumour
(see Figure 1). We frame this in terms of support-
ing, denying, querying or commenting on (SDQC)
the original rumour (Zubiaga et al., 2016b). There-
fore, we introduce a subtask where the goal is to
label the type of interaction between a given state-
ment (rumourous tweet) and a reply tweet (the latter may be a direct or a nested reply).
We note that superficially this subtask may bear
similarity to SemEval-2016 Task 6 on stance de-
tection from tweets (Mohammad et al., 2016),
where participants are asked to determine whether
a tweet is in favour, against or neither, of a given
target entity (e.g. Hillary Clinton) or topic (e.g.
climate change). Our SDQC subtask differs in two aspects. Firstly, participants need to determine
the objective support towards a rumour, an entire
statement, rather than individual target concepts.
Moreover, they are asked to determine additional
response types to the rumourous tweet that are rel-
evant to the discourse, such as requesting more information (questioning, Q) and making a comment (C), where the latter doesn’t directly address
support or denial towards the rumour, but pro-
vides an indication of the conversational context
surrounding rumours. For example, certain pat-
terns of comments and questions can be indicative
of false rumours and others indicative of rumours
that turn out to be true.
Secondly, participants need to determine the
type of response towards a rumourous tweet from
a tree-structured conversation, where each tweet is
not necessarily sufficiently descriptive on its own,
but needs to be viewed in the context of an aggre-
gate discussion consisting of tweets preceding it
in the thread. This is more closely aligned with
stance classification as defined in other domains,
such as public debates (Anand et al., 2011). The
latter also relates somewhat to the SemEval-2015
Task 3 on Answer Selection in Community Ques-
tion Answering (Moschitti et al., 2015), where the
task was to determine the quality of responses in
tree-structured threads in CQA platforms. Re-
sponses to questions are classified as ‘good’, ‘po-
tential’ or ‘bad’. Both tasks are related to tex-
tual entailment and textual similarity. However, SemEval-2015 Task 3 is clearly a question answering task, with the platform itself supporting a QA format, in contrast with the more free-form format of conversations on Twitter. Moreover, as a question answering task, SemEval-2015 Task 3 is more concerned with relevance and retrieval, whereas the task we propose here is about whether support or denial can be inferred towards the original statement (source tweet) from the reply tweets.
Each tweet in the tree-structured thread is cate-
gorised into one of the following four categories,
following Procter et al. (2013b):
• Support: the author of the response supports
the veracity of the rumour.
• Deny: the author of the response denies the
veracity of the rumour.
• Query: the author of the response asks for
additional evidence in relation to the veracity
of the rumour.
• Comment: the author of the response makes
their own comment without a clear contribu-
tion to assessing the veracity of the rumour.
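To make the label set concrete, here is a deliberately naive, rule-based SDQC labeller in Python. This is purely illustrative: the cue words and the function itself are our invention, not any participant's system, and a real classifier would use the conversational context discussed above rather than isolated keyword matching.

```python
def sdqc_label(reply_text):
    """Naive cue-word stance labeller for a single reply tweet.

    Illustrative only: the cue words are invented, and real systems
    use the full conversational context rather than keyword matching.
    """
    text = reply_text.lower()
    # Queries: requests for more evidence.
    if "?" in text or "confirm" in text:
        return "query"
    # Denials: explicit disputing of the claim.
    if any(w in text for w in ("not true", "hoax", "fake", "false", "debunk")):
        return "deny"
    # Support: explicit agreement or corroboration.
    if any(w in text for w in ("confirmed", "true", "exactly")):
        return "support"
    # Everything else: commentary with no clear stance.
    return "comment"

print(sdqc_label("@u1 Apparently a hoax. Best to take Tweet down."))  # deny
```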
Prior work in the area has found the task dif-
ficult, compounded by the variety present in lan-
guage use between different stories (Lukasik et al.,
2015; Zubiaga et al., 2017). This indicates it is
challenging enough to make for an interesting Se-
mEval shared task.
1.2 Subtask B - Veracity prediction
The goal of this subtask is to predict the verac-
ity of a given rumour. The rumour is presented
as a tweet, reporting an update associated with a
newsworthy event, but deemed unsubstantiated at
the time of release. Given such a tweet/claim, and
a set of other resources provided, systems should
return a label describing the anticipated veracity of
the rumour as true or false – see Figure 2.
The ground truth of this task has been manually
established by journalist members of the team who
identified official statements or other trustworthy
sources of evidence that resolved the veracity of
the given rumour. Examples of tweets annotated
for veracity are shown in Figure 2.
The participants in this subtask chose between
two variants. In the first case – the closed vari-
ant – the veracity of a rumour had to be predicted
solely from the tweet itself (for example, Liu et al. (2015) rely only on the content of tweets to assess
the veracity of tweets in real time, while systems
such as Tweet-Cred (Gupta et al., 2014) follow a
tweet level analysis for a similar task where the
credibility of a tweet is predicted). In the second
case – the open variant – additional context was
provided as input to veracity prediction systems;
this context consists of a Wikipedia dump. Criti-
cally, no external resources could be used that con-
tained information from after the rumour’s resolu-
tion. To control this, we specified precise versions
of external information that participants could use.
This was important to make sure we introduced
time sensitivity into the task of veracity prediction.
In a practical system, the classified conversation
threads from Subtask A could be used as context.
We take a simple approach to this task, us-
ing only true/false labels for rumours. In prac-
tice, however, many claims are hard to verify;
for example, there were many rumours concern-
ing Vladimir Putin’s activities in early 2015, many
wholly unsubstantiable. Therefore, we also expect
systems to return a confidence value in the range
of 0-1 for each rumour; if the rumour is unverifi-
able, a confidence of 0 should be returned.
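For concreteness, a Subtask B system's output for two rumours might look like the sketch below. The field names and tweet IDs are hypothetical; the actual submission format was specified separately by the organisers.

```python
# Hypothetical field names and tweet IDs, for illustration only:
# one prediction per source tweet, with a confidence in [0, 1].
predictions = {
    "544286419837685760": {"label": "true", "confidence": 0.85},
    "544280000000000001": {"label": "false", "confidence": 0.0},  # unverifiable
}
```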
1.3 Impact
Identifying the veracity of claims made on the web
is an increasingly important task (Zubiaga et al.,
2015b). Decision support, digital journalism and
disaster response already rely on picking out such
claims (Procter et al., 2013b). Additionally, web
and social media are a more challenging environ-
ment than e.g. newswire, which has traditionally
provided the mainstay of similar tasks (such as
RTE (Bentivogli et al., 2011)). Last year we ran
a workshop at WWW 2015, Rumors and Decep-
tion in Social Media: Detection, Tracking, and
Visualization (RDSM 2015)1 which garnered in-
terest from researchers coming from a variety of
backgrounds, including natural language process-
ing, web science and computational journalism.
2 Data & Resources
To capture web claims and the community reac-
tion around them, we take data from the “model
organism” of social media, Twitter (Tufekci,
2014). Data for the task is available in the form
of online discussion threads, each pertaining to a
particular event and the rumours around it. These
threads form a tree, where each tweet has a par-
ent tweet it responds to. Together these form a
conversation, initiated by a source tweet (see Fig-
ure 1). The data has already been annotated for
veracity and SDQC following a published anno-
tation scheme (Zubiaga et al., 2016b), as part of
the PHEME project (Derczynski and Bontcheva,
2014), in which the task organisers are partners.
1 http://www.pheme.eu/events/rdsm2015/
Veracity prediction examples:
u1: Hostage-taker in supermarket siege killed, reports say. #ParisAttacks –LINK– [true]
u1: OMG. #Prince rumoured to be performing in Toronto today. Exciting! [false]
Figure 2: Examples of source tweets with a veracity value, which has to be predicted in the veracity
prediction task.
Subtask A
          S     D     Q     C
  Train   910   344   358   2,907
  Test    94    71    106   778

Subtask B
          T     F     U
  Train   137   62    98
  Test    8     12    8

Table 1: Label distribution of training and test datasets.
2.1 Training Data
Our training dataset comprises 297 rumourous
threads collected for 8 events in total, which in-
clude 297 source and 4,222 reply tweets, amount-
ing to 4,519 tweets in total. These events include
well-known breaking news such as the Charlie
Hebdo shooting in Paris, the Ferguson unrest in
the US, and the Germanwings plane crash in the
French Alps. The size of the dataset means it can
be distributed without modifications, according to
Twitter’s current data usage policy, as JSON files.
This dataset is already publicly available (Zubi-
aga et al., 2016a) and constitutes the training and
development data.
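To give a feel for working with the release, the sketch below loads one thread into memory. It assumes the directory layout of the public PHEME-style release (a source-tweets/ folder, a reactions/ folder and a structure.json reply graph per thread); these names are from our recollection of that release and should be checked against the actual download.

```python
import json
from pathlib import Path

def load_thread(thread_dir):
    """Load one rumour thread from disk.

    Assumes the PHEME-style layout (source-tweets/, reactions/,
    structure.json per thread); verify against the actual release.
    """
    d = Path(thread_dir)
    source = json.loads(next(d.glob("source-tweets/*.json")).read_text())
    replies = [json.loads(p.read_text()) for p in sorted(d.glob("reactions/*.json"))]
    # structure.json encodes the reply tree as nested {tweet_id: {child_id: ...}}.
    structure = json.loads((d / "structure.json").read_text())
    return {"source": source, "replies": replies, "structure": structure}
```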
2.2 Test Data
For the test data, we annotated 28 additional
threads. These include 20 threads extracted from
the same events as the training set, and 8 threads
from two newly collected events: (1) a rumour
that Hillary Clinton was diagnosed with pneumo-
nia during the 2016 US election campaign, and
(2) a rumour that YouTuber Marina Joyce had been
kidnapped.
The test dataset includes, in total, 1,080 tweets,
28 of which are source tweets and 1,052 replies.
The distribution of labels in the training and test
datasets is summarised in Table 1.
2.3 Context Data
Along with the tweet threads, we also provided ad-
ditional context that participants could make use
of. The context we provided was two-fold: (1)
Wikipedia articles associated with the event in
question. We provided the last revision of the ar-
ticle prior to the source tweet being posted, and
(2) content of linked URLs, using the Internet
Archive to retrieve the latest revision prior to the
link being tweeted, where available.
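As an illustration of how such snapshots can be retrieved, the sketch below queries the Internet Archive's public Wayback availability endpoint for the capture closest to a given time. This is our illustration of the approach, not necessarily the tooling used to build the dataset.

```python
import json
import urllib.parse
import urllib.request

def snapshot_before(url, timestamp):
    """Return the Wayback Machine capture closest to `timestamp`
    (YYYYMMDDhhmmss format) for `url`, or None if nothing is archived.

    Caveat: the availability API returns the *closest* capture, which
    can postdate the timestamp; a stricter tool would verify this.
    """
    api = "https://archive.org/wayback/available?" + urllib.parse.urlencode(
        {"url": url, "timestamp": timestamp}
    )
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

# e.g. snapshot_before("bbc.co.uk/news", "20141215000000")
```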
2.4 Data Annotation
The annotation of rumours and their subsequent
interactions was performed in two steps. In the
first step, we sampled a subset of likely rumourous
tweets from all the tweets associated with the
event in question, where we used the high num-
ber of retweets as an indication of a tweet be-
ing potentially rumourous. These sampled tweets
were fed to an annotation tool, by means of which
our expert journalist annotators manually identified the ones that did indeed report unverified updates and were considered to be rumours. Whenever possible, they also annotated
rumours that had ultimately been proven true or
the ones that had been debunked as false stories;
the rest were annotated as “unverified”. In the
second step, we collected conversations associ-
ated with those rumourous tweets, which included
all replies succeeding a rumourous source tweet.
The type of support (SDQC) expressed by each
participant in the conversation was then annotated
through crowdsourcing. The methodology for per-
forming this crowdsourced annotation process has
been previously assessed and validated (Zubiaga
et al., 2015a), and is further detailed in (Zubiaga
et al., 2016b). The overall inter-annotator agree-
ment rate of 63.7% showed the task to be chal-
lenging, and easier for source tweets (81.1%) than
for replying tweets (62.2%).
The evaluation data was not available to those
participating in any way in the task, and selec-
tion decisions were taken only by organisers not
connected with any submission, to retain fairness
across submissions.
Figure 1 shows an example of what a data in-
stance looks like, where the source tweet in the
tree presents a rumourous statement that is sup-
ported, denied, queried and commented on by oth-
ers. Note that replies are nested, where some
tweets reply directly to the source, while other
tweets reply to earlier replies, e.g., u4 and u5 en-
gage in a short conversation replying to each other
in the second example. The input to the verac-
ity prediction task is simpler than this; here par-
ticipants had to determine if a rumour was true or
false by only looking at the source tweet (see Fig-
ure 2), and optionally making use of the additional
context provided by the organisers.
To prepare the evaluation resources, we col-
lected and sampled the tweets around which there
is most interaction, placed these in an existing an-
notation tool to be annotated as rumour vs. non-
rumour, categorised them into rumour sub-stories,
and labelled them for veracity.
For Subtask A, the extra annotation for support / deny / question / comment at the tweet level within the conversations was performed through crowdsourcing, as had already been done to satisfactory quality for the existing training data (Zubiaga et al., 2015a).
3 Evaluation
The two subtasks were evaluated as follows.
SDQC stance classification: The evaluation of the SDQC subtask needed careful consideration, as the distribution of the categories is clearly skewed towards comments. Evaluation is through classification accuracy.
Veracity prediction: The evaluation of the predicted veracity, which is either true or false for each instance, was done using macroaveraged accuracy, hence measuring the ratio of instances for which a correct prediction was made. Additionally, we calculated the RMSE ρ between system and reference confidence, where correct examples contribute their confidence error and incorrect examples contribute an error of 1. This is combined with the macroaveraged accuracy to give a final score: score = (1 − ρ) · acc.
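A minimal sketch of this combined Subtask B score, under our reading of the description above (incorrect predictions contribute a confidence error of 1); the official scorer may differ in detail.

```python
import math

def task_b_score(gold, pred):
    """gold: {tweet_id: label}; pred: {tweet_id: (label, confidence)}.

    Returns (accuracy, rmse, combined), with combined = (1 - rmse) * accuracy.
    A sketch of our reading of the metric, not the official scorer.
    """
    errors, correct = [], 0
    for tid, gold_label in gold.items():
        label, conf = pred[tid]
        if label == gold_label:
            correct += 1
            errors.append((1.0 - conf) ** 2)  # confidence error on correct items
        else:
            errors.append(1.0)                # incorrect items contribute error 1
    accuracy = correct / len(gold)
    rmse = math.sqrt(sum(errors) / len(errors))
    return accuracy, rmse, (1 - rmse) * accuracy
```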
The baseline is the most common class.
Team               Score
DFKI DKT           0.635
ECNU               0.778
IITP               0.641
IKM                0.701
Mama Edha          0.749
NileTMRG           0.709
Turing             0.784
UWaterloo          0.780
Baseline (4-way)   0.741
Baseline (SDQ)     0.391

Table 2: Results for Task A: support/deny/query/comment classification.
For Task A, we also introduce a baseline excluding the common, low-impact “comment” class, considering accuracy over only support, deny and query. This is included as the SDQ baseline.
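The two Task A baselines can be sketched as follows. The 4-way baseline predicts the training majority class (“comment”) for every tweet; for the SDQ baseline the predicted class is not stated above, so the default below is an assumption that happens to reproduce the reported 0.391 on the Table 1 test counts.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """4-way baseline: predict the training majority class for every item."""
    majority = Counter(train_labels).most_common(1)[0][0]  # "comment" here
    return sum(y == majority for y in test_labels) / len(test_labels)

def sdq_baseline(test_labels, predicted="query"):
    """Accuracy over support/deny/query items only, for one fixed prediction.

    Assumption: predicting "query" reproduces the reported 0.391 given the
    test counts in Table 1 (106 of 271 non-comment items).
    """
    sdq = [y for y in test_labels if y != "comment"]
    return sum(y == predicted for y in sdq) / len(sdq)
```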
4 Participant Systems and Results
We had 13 system submissions at RumourEval: eight submissions for Subtask A
(Kochkina et al., 2017; Bahuleyan and Vech-
tomova, 2017; Srivastava et al., 2017; Wang
et al., 2017; Singh et al., 2017; Chen et al.,
2017; García Lozano et al., 2017; Enayet and El-
Beltagy, 2017), the identification of stance to-
wards rumours, and five submissions for Sub-
task B (Srivastava et al., 2017; Wang et al., 2017;
Singh et al., 2017; Chen et al., 2017; Enayet and
El-Beltagy, 2017), the rumour veracity classifi-
cation task, with participant teams coming from
four continents (Europe: Germany, Sweden, UK;
North America: Canada; Asia: China, India, Tai-
wan; Africa: Egypt), showing the global reach of
the issue of rumour veracity on social media.
Most participants tackled Subtask A, which in-
volves classifying a tweet in a conversation thread
as either supporting (S), denying (D), querying (Q)
or commenting on (C) a rumour. Results are given in Table 2. The distribution of SDQC labels in the training, development and test sets favours comments (see Table 1). Including and recognising the items that fit in this class is important for reducing noise in the other, information-bearing classifications (support, deny and query). In fact, comments often express implicit support; the absence of dispute is a soft signal of agreement.
Systems generally viewed this task as a four-way single-tweet classification task, with the exceptions of the best performing system (Turing), which addressed it as a sequential classification problem in which the SDQC label of each tweet depends on the features and labels of the previous tweets, and of the ECNU and IITP systems. The IITP system takes as input pairs of source and reply tweets, whereas the ECNU system addressed class imbalance by decomposing the problem into a two-step classification task (comment vs. non-comment, with all non-comment tweets then classified as S, D or Q; a sketch of this decomposition appears below). Half of the systems em-
ployed ensemble classifiers, where classification
was obtained through majority voting (ECNU,
MamaEdha, UWaterloo, DFKI-DKT). In some
cases the ensembles were hybrid, consisting both
of machine learning classifiers and manually cre-
ated rules, with differential weighting of classi-
fiers for different class labels (ECNU, MamaEdha,
DFKI-DKT). Three systems used deep learning: team Turing employed LSTMs for sequential classification; team IKM used convolutional neural networks (CNNs) to obtain a representation of each tweet, which is then assigned a class probability by a softmax classifier; and team Mama Edha used a CNN as one of the classifiers in their hybrid ensemble. The remaining two sys-
tems NileTMRG and IITP used support vector
machines with linear and polynomial kernel re-
spectively. Half of the systems invested in elabo-
rate feature engineering including cue words and
expressions denoting Belief, Knowledge, Doubt
and Denial (UWaterloo) as well as Tweet domain
features including meta-data about users, hash-
tags and event specific keywords (ECNU, UWa-
terloo, IITP, NileTMRG). The systems with the
least elaborate features were IKM and Mama Edha
for CNNs (word embeddings), DFKI-DKT (sparse
word vectors as input to logistic regression) and
Turing (average word vectors, punctuation, sim-
ilarity between word vectors in current tweet,
source tweet and previous tweet, presence of nega-
tion, picture, URL). Five out of the eight systems
used pre-trained word embeddings, mostly Google
News word2vec embeddings, while ECNU used
four different types of embeddings. Overall, elaborate feature engineering and a strategy for addressing class imbalance seemed to pay off, as can be seen from the high performance of the UWaterloo and ECNU systems. The suc-
cess of the best performing system (Turing) can be attributed both to the use of an LSTM to address the problem as a sequential task and to the choice of word embeddings.
Team        Score   Confidence RMSE
IITP        0.393   0.746

Table 3: Results for Task B: Rumour veracity - open variant.
Team        Score   Confidence RMSE
DFKI DKT    0.393   0.845
ECNU        0.464   0.736
IITP        0.286   0.807
IKM         0.536   0.763
NileTMRG    0.536   0.672
Baseline    0.571   –

Table 4: Results for Task B: Rumour veracity - closed variant.
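As referenced above, here is a minimal sketch of the ECNU-style two-step decomposition for class imbalance, written with scikit-learn. The features and classifiers are placeholders, not ECNU's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: comment vs. non-comment. Step 2: S/D/Q on the non-comments.
# Placeholder features and models, not ECNU's actual configuration.
step1 = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
step2 = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

def fit(texts, labels):
    step1.fit(texts, ["comment" if y == "comment" else "other" for y in labels])
    sdq = [(t, y) for t, y in zip(texts, labels) if y != "comment"]
    step2.fit([t for t, _ in sdq], [y for _, y in sdq])

def predict(texts):
    out = []
    for text, first in zip(texts, step1.predict(texts)):
        out.append("comment" if first == "comment" else step2.predict([text])[0])
    return out
```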
Subtask B, veracity classification of a
source tweet, was viewed as either a three-
way (NileTMRG, ECNU, IITP) or two-way
(IKM, DFKI-DKT) single tweet classification
task. Results are given in Table 3 for the open variant, where external resources (namely, the 20160901 English Wikipedia dump) may be used, and Table 4 for the closed variant, with no external resource use permitted. The systems used
mostly similar features and classifiers to those in
Subtask A, though some added features more spe-
cific to the distribution of SDQC labels in replies
to the source tweet (e.g. the best performing
system in this task, NileTMRG, considered the
percentage of reply tweets classified as either S,
D or Q).
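A sketch of the kind of reply-distribution feature NileTMRG used, assuming SDQC predictions for the replies are available (e.g. from a Subtask A system); this is our illustration of the idea, not their code.

```python
from collections import Counter

def sdqc_fractions(reply_labels):
    """Fractions of replies labelled as each of S, D and Q.

    reply_labels: SDQC predictions for all replies to one source tweet,
    e.g. produced by a Subtask A system.
    """
    counts = Counter(reply_labels)
    n = max(len(reply_labels), 1)
    return {c: counts[c] / n for c in ("support", "deny", "query")}

# e.g. sdqc_fractions(["support", "deny", "comment", "query"])
# -> {'support': 0.25, 'deny': 0.25, 'query': 0.25}
```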
5 Conclusion
Detecting and verifying rumours is a critical task in the current media landscape, vital to populations so that they can make decisions based on the truth. This shared task brought together many approaches to establishing veracity in real media, working through community interactions and claims made on the web. Many systems were able to achieve
good results on unravelling the argument around
various claims, finding out whether a discussion
supports, denies, questions or comments on ru-
mours.
The commentary around a story often helps de-
termine how true that story is, so this advance is
a great positive. However, accurately determining whether a story is false or true remains very difficult. Systems did not reach the most-common-class baseline, despite the data not being exceptionally skewed. Even the best systems could have the wrong level of confidence in a true/false judgment, weakly verifying stories that are true and so on. This tells us that we are making progress, but that the problem remains very hard.
RumourEval leaves behind competitive results,
a large number of approaches to be dissected by
future researchers, and a benchmark dataset of
thousands of documents and novel news stories.
This sets a good baseline for the next steps in the
area of fake news detection, as well as the mate-
rial anyone needs to get started on the problem and
evaluate and improve their systems.
Acknowledgments
This work is supported by the European Com-
mission’s 7th Framework Programme for research,
under grant No. 611223 PHEME. This work is also supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 687847 COMRADES. We are
grateful to Swissinfo.ch for their extended support
in the form of journalistic advice, keeping the task
well-grounded, and annotation and task design ef-
forts. We also extend our thanks to the SemEval
organisers for their sustained hard work, and to
our participants for bearing with us during the first
shared task of this nature and all the joy and trou-
ble that comes with it.
References
Pranav Anand, Marilyn Walker, Rob Abbott, Jean
E. Fox Tree, Robeson Bowmani, and Michael
Minor. 2011. Cats rule and dogs drool!: Clas-
sifying stance in online debate. In Proceedings
of the 2Nd Workshop on Computational Ap-
proaches to Subjectivity and Sentiment Analy-
sis. Association for Computational Linguistics,
Stroudsburg, PA, USA, WASSA ’11, pages 1–9.
http://dl.acm.org/citation.cfm?id=2107653.2107654.
Hareesh Bahuleyan and Olga Vechtomova. 2017.
UWaterloo at SemEval-2017 Task 8: Detecting
Stance towards Rumours with Topic Independent
Features. In Proceedings of SemEval. ACL.
Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa Dang,
and Danilo Giampiccolo. 2011. The seventh Pascal
Recognizing Textual Entailment challenge. In Pro-
ceedings of the Text Analysis Conference. NIST.
Yi-Chin Chen, Zhao-Yand Liu, and Hung-Yu Kao.
2017. IKM at SemEval-2017 Task 8: Convolutional
Neural Networks for Stance Detection and Rumor
Verification. In Proceedings of SemEval. ACL.
Leon Derczynski and Kalina Bontcheva. 2014. Pheme:
Veracity in digital social networks. In UMAP Work-
shops.
Omar Enayet and Samhaa R. El-Beltagy. 2017.
NileTMRG at SemEval-2017 Task 8: Determining
Rumour and Veracity Support for Rumours on Twit-
ter. In Proceedings of SemEval. ACL.
Marianela García Lozano, Hanna Lilja, Edward Tjörnhammar, and Maja Karasalo. 2017.
Mama Edha at SemEval-2017 Task 8: Stance Clas-
sification with CNN and Rules. In Proceedings of
SemEval. ACL.
Aditi Gupta, Ponnurangam Kumaraguru, Carlos
Castillo, and Patrick Meier. 2014. Tweet-
cred: Real-time credibility assessment of con-
tent on twitter. In SocInfo. pages 228–243.
https://doi.org/10.1007/978-3-319-13734-6_16.
Alfred Hermida. 2012. Tweets and truth: Journalism as
a discipline of collaborative verification. Journalism
Practice 6(5-6):659–668.
Elena Kochkina, Maria Liakata, and Isabelle Augen-
stein. 2017. Turing at SemEval-2017 Task 8: Se-
quential Approach to Rumour Stance Classification
with Branch-LSTM. In Proceedings of SemEval.
ACL.
KP Krishna Kumar and G Geethakumari. 2014. De-
tecting misinformation in online social networks us-
ing cognitive psychology. Human-centric Comput-
ing and Information Sciences 4(1):1–22.
Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, Rui
Fang, and Sameena Shah. 2015. Real-time rumor
debunking on twitter. In Proceedings of the 24th
ACM International on Conference on Information
and Knowledge Management. ACM, pages 1867–
1870.
Michal Lukasik, Trevor Cohn, and Kalina Bontcheva.
2015. Classifying tweet level judgements of ru-
mours in social media. In Proceedings of the Con-
ference on Empirical Methods in Natural Language
Processing. volume 2, pages 2590–2595.
Saif M Mohammad, Svetlana Kiritchenko, Parinaz
Sobhani, Xiaodan Zhu, and Colin Cherry. 2016.
SemEval-2016 Task 6: Detecting Stance in Tweets.
In Proceedings of the Workshop on Semantic Evalu-
ation.
Alessandro Moschitti, Preslav Nakov, Lluís Màrquez,
Walid Magdy, James Glass, and Bilal Randeree.
2015. Semeval-2015 task 3: Answer selection
in community question answering. SemEval-2015
page 269.
Rob Procter, Jeremy Crump, Susanne Karstedt, Alex
Voss, and Marta Cantijoch. 2013a. Reading the ri-
ots: What were the Police doing on Twitter? Polic-
ing and Society 23(4):413–436.
Rob Procter, Farida Vis, and Alex Voss. 2013b. Read-
ing the riots on twitter: methodological innovation
for the analysis of big data. International journal of
social research methodology 16(3):197–214.
Vahed Qazvinian, Emily Rosengren, Dragomir R
Radev, and Qiaozhu Mei. 2011. Rumor has it: Iden-
tifying misinformation in microblogs. In Proceed-
ings of the Conference on Empirical Methods in Nat-
ural Language Processing. Association for Compu-
tational Linguistics, pages 1589–1599.
Chengcheng Shao, Giovanni Luca Ciampaglia,
Alessandro Flammini, and Filippo Menczer.
2016. Hoaxy: A platform for tracking online
misinformation. arXiv preprint arXiv:1603.01511 .
Vikram Singh, Sunny Narayan, Md Shad Akhtar, Asif
Ekbal, and Pushpak Bhattacharya. 2017. IITP at
SemEval-2017 Task 8: A Supervised Approach for
Rumour Evaluation. In Proceedings of SemEval.
ACL.
Ankit Srivastava, Georg Rehm, and Julian Moreno Schneider. 2017. DFKI-DKT at SemEval-
2017 Task 8: Rumour Detection and Classification
using Cascading Heuristics. In Proceedings of
SemEval. ACL.
Zeynep Tufekci. 2014. Big questions for social me-
dia big data: Representativeness, validity and other
methodological pitfalls. In Proceedings of the AAAI
International Conference on Weblogs and Social
Media.
Shari R Veil, Tara Buehner, and Michael J Palenchar.
2011. A work-in-process literature review: Incor-
porating social media in risk and crisis communica-
tion. Journal of contingencies and crisis manage-
ment 19(2):110–122.
Feixiang Wang, Man Lan, and Yuanbin Wu. 2017.
ECNU at SemEval-2017 Task 8: Rumour Evalua-
tion Using Effective Features and Supervised En-
semble Models. In Proceedings of SemEval. ACL.
Qiao Zhang, Shuiyuan Zhang, Jian Dong, Jinhua
Xiong, and Xueqi Cheng. 2015. Automatic de-
tection of rumor on social network. In Natu-
ral Language Processing and Chinese Computing,
Springer, pages 113–122.
Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva,
Maria Liakata, and Rob Procter. 2017. Detection
and resolution of rumours in social media: A survey.
arXiv preprint arXiv:1704.00656 .
Arkaitz Zubiaga, Maria Liakata, Rob Procter, Kalina
Bontcheva, and Peter Tolmie. 2015a. Crowdsourc-
ing the annotation of rumourous conversations in
social media. In Proceedings of the 24th Interna-
tional Conference on World Wide Web: Companion
volume. International World Wide Web Conferences
Steering Committee, pages 347–353.
Arkaitz Zubiaga, Maria Liakata, Rob Procter, Kalina
Bontcheva, and Peter Tolmie. 2015b. Towards de-
tecting rumours in social media. In Proceedings of
the AAAI Workshop on AI for Cities.
Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geral-
dine Wong Sak Hoi, and Peter Tolmie. 2016a.
PHEME rumour scheme dataset: Journalism use
case. doi:10.6084/m9.figshare.2068650.v1.
Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geral-
dine Wong Sak Hoi, and Peter Tolmie. 2016b.
Analysing how people orient to and spread ru-
mours in social media by looking at con-
versational threads. PLoS ONE 11(3):1–29.
https://doi.org/10.1371/journal.pone.0150989.

More Related Content

PDF
On Semantics and Deep Learning for Event Detection in Crisis Situations
PDF
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
PDF
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
PDF
Content-based link prediction
PPTX
Generating Storylines (Literature Survey)
PDF
Tweet Segmentation and Its Application to Named Entity Recognition
PDF
IRJET- Fake News Detection and Rumour Source Identification
PDF
IRJET- Fake News Detection
On Semantics and Deep Learning for Event Detection in Crisis Situations
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
Content-based link prediction
Generating Storylines (Literature Survey)
Tweet Segmentation and Its Application to Named Entity Recognition
IRJET- Fake News Detection and Rumour Source Identification
IRJET- Fake News Detection

What's hot (10)

PDF
Who gives a tweet
PPTX
Semantic Wide and Deep Learning for Detecting Crisis-Information Categories o...
PDF
Nannobloging pr
PDF
News construction from microblogging post using open data
PDF
Slides: Epidemiological Modeling of News and Rumors on Twitter
PDF
Who to follow and why: link prediction with explanations
PPT
Response modeling-iui-2013-talk
PDF
Unfollowing on twitter
 
PPTX
Frontiers of Computational Journalism week 2 - Text Analysis
PDF
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
Who gives a tweet
Semantic Wide and Deep Learning for Detecting Crisis-Information Categories o...
Nannobloging pr
News construction from microblogging post using open data
Slides: Epidemiological Modeling of News and Rumors on Twitter
Who to follow and why: link prediction with explanations
Response modeling-iui-2013-talk
Unfollowing on twitter
 
Frontiers of Computational Journalism week 2 - Text Analysis
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
Ad

Similar to SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours (20)

PPTX
NS-CUK Seminar: J.H.Lee, "Review on "Rumor Detection on Twitter with Claim-Gu...
PPTX
Detection and resolution of rumours in social media
PDF
Sentence embedding to improve rumour detection performance model
PDF
Joint Rumour Stance and Veracity
ODP
RumourEval
PDF
Enhancing prediction of user stance for social networks rumors
DOCX
An Emotion-Aware Multitask Approach to Fake News and Rumor Detection Using Tr...
PDF
Sending out an SOS (Summary of Summaries): A Brief Survey of Recent Work on A...
PDF
WeVerify at NILC - May 2019.pptx
PDF
Classification of Disastrous Tweets on Twitter using BERT Model
PPT
Detection and Resolution of Rumours in Social Media
PDF
IRJET- Authentic News Summarization
PPTX
Fake news -final.pptx
PDF
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
PDF
IRJET- Fake Message Deduction using Machine Learining
PPTX
DP1_160430723010_Divya.pptx
PDF
Fake News and Message Detection
PDF
IDENTIFYING THE DAMAGE ASSESSMENT TWEETS DURING DISASTER
DOCX
Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...
NS-CUK Seminar: J.H.Lee, "Review on "Rumor Detection on Twitter with Claim-Gu...
Detection and resolution of rumours in social media
Sentence embedding to improve rumour detection performance model
Joint Rumour Stance and Veracity
RumourEval
Enhancing prediction of user stance for social networks rumors
An Emotion-Aware Multitask Approach to Fake News and Rumor Detection Using Tr...
Sending out an SOS (Summary of Summaries): A Brief Survey of Recent Work on A...
WeVerify at NILC - May 2019.pptx
Classification of Disastrous Tweets on Twitter using BERT Model
Detection and Resolution of Rumours in Social Media
IRJET- Authentic News Summarization
Fake news -final.pptx
Crowdsourcing the Annotation of Rumourous Conversations in Social Media
IRJET- Fake Message Deduction using Machine Learining
DP1_160430723010_Divya.pptx
Fake News and Message Detection
IDENTIFYING THE DAMAGE ASSESSMENT TWEETS DURING DISASTER
Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...
Ad

More from COMRADES project (18)

PDF
COMRADES EU Project Factsheet
PDF
Evaluating Platforms for Community Sensemaking: Using the Case of the Kenyan ...
PDF
Helping Crisis Responders Find the Informative Needle in the Tweet Haystack
PDF
An Extensible Multilingual Open Source Lemmatizer
PDF
Classifying Crises-Information Relevancy with Semantics
PDF
D6.2 First report on Communication and Dissemination activities
PDF
D3.1 Multilingual content processing methods
PDF
D4.1 Enriched Semantic Models of Emergency Events
PDF
Prospecting Socially-Aware Concepts and Artefacts for Designing for Community...
PDF
A Semantic Graph-based Approach for Radicalisation Detection on Social Media
PDF
Behind the Scenes of Scenario-Based Training: Understanding Scenario Design a...
PDF
Sustainable Performance Measurement for Humanitarian Supply Chain Operations
PDF
Detecting Important Life Events on Twitter Using Frequent Semantic and Syntac...
PDF
DoRES — A Three-tier Ontology for Modelling Crises in the Digital Age
PDF
D2.1 Requirements for boosting community resilience in crisis situation
PDF
COMRADES EU Project Overall Presentation
PDF
COMRADES EU Project Factsheet
PDF
COMRADES EU Project Brochure
COMRADES EU Project Factsheet
Evaluating Platforms for Community Sensemaking: Using the Case of the Kenyan ...
Helping Crisis Responders Find the Informative Needle in the Tweet Haystack
An Extensible Multilingual Open Source Lemmatizer
Classifying Crises-Information Relevancy with Semantics
D6.2 First report on Communication and Dissemination activities
D3.1 Multilingual content processing methods
D4.1 Enriched Semantic Models of Emergency Events
Prospecting Socially-Aware Concepts and Artefacts for Designing for Community...
A Semantic Graph-based Approach for Radicalisation Detection on Social Media
Behind the Scenes of Scenario-Based Training: Understanding Scenario Design a...
Sustainable Performance Measurement for Humanitarian Supply Chain Operations
Detecting Important Life Events on Twitter Using Frequent Semantic and Syntac...
DoRES — A Three-tier Ontology for Modelling Crises in the Digital Age
D2.1 Requirements for boosting community resilience in crisis situation
COMRADES EU Project Overall Presentation
COMRADES EU Project Factsheet
COMRADES EU Project Brochure

Recently uploaded (20)

PDF
AP Vision-2047 and its importance & Role MI&MP.pdf
DOCX
Diplomatic Studies and Migration- Global Perspectives and Practices.docx
PPTX
IMPLEMENTING RULES AND REGULATIONS OF REPUBLIC ACT NO. 11058 ENTITLED “AN ACT...
PDF
The City of Stuart CDBG, Florida - Small Cities CDBG FloridaCommerce -Report ...
PDF
The Ways The Abhay Bhutada Foundation Is Helping Indian STEM Education
PDF
Covid-19 Immigration Effects - Key Slides - June 2025
PDF
The Landscape Observatory of Catalonia. Some projects and challenges
PDF
rs_9fsfssdgdgdgdgdgdgdgsdgdgdgdconverted.pdf
PPTX
Quiz Night Game Questions and Questions for interactive games
PDF
Oil Industry Ethics Evolution Report (1).pdf
PPTX
SlideEgg_66119-Responsible Sourcing.pptx
PPTX
IMPLEMENTING GUIDELINES OF SUSTAINABLE LIVELIHOOD PROGRAM -SLP MC 22 ORIENTAT...
PPTX
InnoTech Mahamba Presentation yearly.pptx
PPTX
PPT odisha stete tribal museum OSTM-13.08.25 - Copy.pptx
PPTX
Chapter 12 Public Enterprises and Regulatory Bodies in the Philippine Adminis...
PPTX
c. b. 3 Basics of BDP geared towards public service.pptx
PPTX
Unit 3 - Genetic engineering.ppvvxtm.pptx
PPTX
Avoiding Suspensions and Disallowances in Audit.pptx
PDF
Global Peace Index - 2025 - Ghana slips on 2025 Global Peace Index; drops out...
PPTX
ISO 9001 awarness for government offices 2015
AP Vision-2047 and its importance & Role MI&MP.pdf
Diplomatic Studies and Migration- Global Perspectives and Practices.docx
IMPLEMENTING RULES AND REGULATIONS OF REPUBLIC ACT NO. 11058 ENTITLED “AN ACT...
The City of Stuart CDBG, Florida - Small Cities CDBG FloridaCommerce -Report ...
The Ways The Abhay Bhutada Foundation Is Helping Indian STEM Education
Covid-19 Immigration Effects - Key Slides - June 2025
The Landscape Observatory of Catalonia. Some projects and challenges
rs_9fsfssdgdgdgdgdgdgdgsdgdgdgdconverted.pdf
Quiz Night Game Questions and Questions for interactive games
Oil Industry Ethics Evolution Report (1).pdf
SlideEgg_66119-Responsible Sourcing.pptx
IMPLEMENTING GUIDELINES OF SUSTAINABLE LIVELIHOOD PROGRAM -SLP MC 22 ORIENTAT...
InnoTech Mahamba Presentation yearly.pptx
PPT odisha stete tribal museum OSTM-13.08.25 - Copy.pptx
Chapter 12 Public Enterprises and Regulatory Bodies in the Philippine Adminis...
c. b. 3 Basics of BDP geared towards public service.pptx
Unit 3 - Genetic engineering.ppvvxtm.pptx
Avoiding Suspensions and Disallowances in Audit.pptx
Global Peace Index - 2025 - Ghana slips on 2025 Global Peace Index; drops out...
ISO 9001 awarness for government offices 2015

SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours

  • 1. SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours Leon Derczynski♥∗ and Kalina Bontcheva♥ and Maria Liakata♣ and Rob Procter♣ and Geraldine Wong Sak Hoi♦ and Arkaitz Zubiaga♣ ♥: Department of Computer Science, University of Sheffield, S1 4DP, UK ♣: Department of Computer Science, University of Warwick, CV4 7AL, UK ♦: swissinfo.ch, Bern, Switzerland ∗: [email protected] Abstract Media is full of false claims. Even Ox- ford Dictionaries named “post-truth” as the word of 2016. This makes it more important than ever to build systems that can identify the veracity of a story, and the nature of the discourse around it. Ru- mourEval is a SemEval shared task that aims to identify and handle rumours and reactions to them, in text. We present an annotation scheme, a large dataset cov- ering multiple topics – each having their own families of claims and replies – and use these to pose two concrete challenges as well as the results achieved by partici- pants on these challenges. 1 Introduction and Motivation Rumours are rife on the web. False claims affect people’s perceptions of events and their behaviour, sometimes in harmful ways. With the increasing reliance on the Web – social media, in particular – as a source of information and news updates by in- dividuals, news professionals, and automated sys- tems, the potential disruptive impact of rumours is further accentuated. The task of analysing and determining veracity of social media content has been of recent interest to the field of natural language processing. After initial work (Qazvinian et al., 2011), increasingly advanced systems and annotation schemas have been developed to support the analysis of rumour and misinformation in text (Kumar and Geethaku- mari, 2014; Zhang et al., 2015; Shao et al., 2016; Zubiaga et al., 2016b). Veracity judgment can be decomposed intuitively in terms of a compar- ison between assertions made in – and entailments from – a candidate text, and external world knowl- edge. Intermediate linguistic cues have also been shown to play a role. Critically, based on recent work the task appears deeply nuanced and very challenging, while having important applications in, for example, journalism and disaster mitigation (Hermida, 2012; Procter et al., 2013a; Veil et al., 2011). We propose a shared task where participants analyse rumours in the form of claims made in user-generated content, and where users respond to one another within conversations attempting to resolve the veracity of the rumour. We define a ru- mour as a “circulating story of questionable verac- ity, which is apparently credible but hard to verify, and produces sufficient scepticism and/or anxiety so as to motivate finding out the actual truth” (Zu- biaga et al., 2015b). While breaking news unfold, gathering opinions and evidence from as many sources as possible as communities react becomes crucial to determine the veracity of rumours and consequently reduce the impact of the spread of misinformation. Within this scenario where one needs to listen to, and assess the testimony of, different sources to make a final decision with respect to a rumour’s veracity, we ran a task in SemEval consisting of two subtasks: (a) stance classification towards ru- mours, and (b) veracity classification. Subtask A corresponds to the core problem in crowd response analysis when using discourse around claims to verify or disprove them. Subtask B corresponds to the AI-hard task of assessing directly whether or not a claim is false. 
1.1 Subtask A - SDQC Support/ Rumour stance classification Related to the objective of predicting a rumour’s veracity, Subtask A deals with the complementary objective of tracking how other sources orient to the accuracy of the rumourous story. A key step in the analysis of the surrounding discourse is to
  • 2. SDQC support classification. Example 1: u1: We understand there are two gunmen and up to a dozen hostages inside the cafe under siege at Sydney.. ISIS flags remain on display #7News [support] u2: @u1 not ISIS flags [deny] u3: @u1 sorry - how do you know it’s an ISIS flag? Can you actually confirm that? [query] u4: @u3 no she can’t cos it’s actually not [deny] u5: @u1 More on situation at Martin Place in Sydney, AU –LINK– [comment] u6: @u1 Have you actually confirmed its an ISIS flag or are you talking shit [query] SDQC support classification. Example 2: u1: These are not timid colours; soldiers back guarding Tomb of Unknown Soldier after today’s shooting #StandforCanada –PICTURE– [support] u2: @u1 Apparently a hoax. Best to take Tweet down. [deny] u3: @u1 This photo was taken this morning, before the shooting. [deny] u4: @u1 I don’t believe there are soldiers guarding this area right now. [deny] u5: @u4 wondered as well. I’ve reached out to someone who would know just to confirm that. Hopefully get response soon. [comment] u4: @u5 ok, thanks. [comment] Figure 1: Examples of tree-structured threads discussing the veracity of a rumour, where the label asso- ciated with each tweet is the target of the SDQC support classification task. determine how other users in social media regard the rumour (Procter et al., 2013b). We propose to tackle this analysis by looking at the conversa- tion stemming from direct and nested replies to the tweet originating the rumour (source tweet). To this effect RumourEval provided partici- pants with a tree-structured conversation formed of tweets replying to the originating rumourous tweet, directly or indirectly. Each tweet presents its own type of support with respect to the rumour (see Figure 1). We frame this in terms of support- ing, denying, querying or commenting on (SDQC) the original rumour (Zubiaga et al., 2016b). There- fore, we introduce a subtask where the goal is to label the type of interaction between a given state- ment (rumourous tweet) and a reply tweet (the lat- ter can be either direct or nested replies). We note that superficially this subtask may bear similarity to SemEval-2016 Task 6 on stance de- tection from tweets (Mohammad et al., 2016), where participants are asked to determine whether a tweet is in favour, against or neither, of a given target entity (e.g. Hillary Clinton) or topic (e.g. climate change). Our SQDC subtask differs in two aspects. Firstly, participants needed to determine the objective support towards a rumour, an entire statement, rather than individual target concepts. Moreover, they are asked to determine additional response types to the rumourous tweet that are rel- evant to the discourse, such as a request for more information (questioning, Q) and making a com- ment (C), where the latter doesn’t directly address support or denial towards the rumour, but pro- vides an indication of the conversational context surrounding rumours. For example, certain pat- terns of comments and questions can be indicative of false rumours and others indicative of rumours that turn out to be true. Secondly, participants need to determine the type of response towards a rumourous tweet from a tree-structured conversation, where each tweet is not necessarily sufficiently descriptive on its own, but needs to be viewed in the context of an aggre- gate discussion consisting of tweets preceding it in the thread. 
This is more closely aligned with stance classification as defined in other domains, such as public debates (Anand et al., 2011). The latter also relates somewhat to the SemEval-2015 Task 3 on Answer Selection in Community Ques- tion Answering (Moschitti et al., 2015), where the task was to determine the quality of responses in tree-structured threads in CQA platforms. Re- sponses to questions are classified as ‘good’, ‘po- tential’ or ‘bad’. Both tasks are related to tex- tual entailment and textual similarity. However, Semeval-2015 Task3 is clearly a question answer- ing task, the platform itself supporting a QA for- mat in contrast with the more free-form format of conversations in Twitter. Moreover, as a question answering task Semeval-2015 Task 3 is more con- cerned with relevance and retrieval whereas the task we propose here is about whether support or
  • 3. denial can be inferred towards the original state- ment (source tweet) from the reply tweets. Each tweet in the tree-structured thread is cate- gorised into one of the following four categories, following Procter et al. (2013b): • Support: the author of the response supports the veracity of the rumour. • Deny: the author of the response denies the veracity of the rumour. • Query: the author of the response asks for additional evidence in relation to the veracity of the rumour. • Comment: the author of the response makes their own comment without a clear contribu- tion to assessing the veracity of the rumour. Prior work in the area has found the task dif- ficult, compounded by the variety present in lan- guage use between different stories (Lukasik et al., 2015; Zubiaga et al., 2017). This indicates it is challenging enough to make for an interesting Se- mEval shared task. 1.2 Subtask B - Veracity prediction The goal of this subtask is to predict the verac- ity of a given rumour. The rumour is presented as a tweet, reporting an update associated with a newsworthy event, but deemed unsubstantiated at the time of release. Given such a tweet/claim, and a set of other resources provided, systems should return a label describing the anticipated veracity of the rumour as true or false – see Figure 2. The ground truth of this task has been manually established by journalist members of the team who identified official statements or other trustworthy sources of evidence that resolved the veracity of the given rumour. Examples of tweets annotated for veracity are shown in Figure 2. The participants in this subtask chose between two variants. In the first case – the closed vari- ant – the veracity of a rumour had to be predicted solely from the tweet itself (for example (Liu et al., 2015) rely only on the content of tweets to assess the veracity of tweets in real time, while systems such as Tweet-Cred (Gupta et al., 2014) follow a tweet level analysis for a similar task where the credibility of a tweet is predicted). In the second case – the open variant – additional context was provided as input to veracity prediction systems; this context consists of a Wikipedia dump. Criti- cally, no external resources could be used that con- tained information from after the rumour’s resolu- tion. To control this, we specified precise versions of external information that participants could use. This was important to make sure we introduced time sensitivity into the task of veracity prediction. In a practical system, the classified conversation threads from Subtask A could be used as context. We take a simple approach to this task, us- ing only true/false labels for rumours. In prac- tice, however, many claims are hard to verify; for example, there were many rumours concern- ing Vladimir Putin’s activities in early 2015, many wholly unsubstantiable. Therefore, we also expect systems to return a confidence value in the range of 0-1 for each rumour; if the rumour is unverifi- able, a confidence of 0 should be returned. 1.3 Impact Identifying the veracity of claims made on the web is an increasingly important task (Zubiaga et al., 2015b). Decision support, digital journalism and disaster response already rely on picking out such claims (Procter et al., 2013b). Additionally, web and social media are a more challenging environ- ment than e.g. newswire, which has traditionally provided the mainstay of similar tasks (such as RTE (Bentivogli et al., 2011)). 
Last year we ran a workshop at WWW 2015, Rumors and Decep- tion in Social Media: Detection, Tracking, and Visualization (RDSM 2015)1 which garnered in- terest from researchers coming from a variety of backgrounds, including natural language process- ing, web science and computational journalism. 2 Data & Resources To capture web claims and the community reac- tion around them, we take data from the “model organism” of social media, Twitter (Tufekci, 2014). Data for the task is available in the form of online discussion threads, each pertaining to a particular event and the rumours around it. These threads form a tree, where each tweet has a par- ent tweet it responds to. Together these form a conversation, initiated by a source tweet (see Fig- ure 1). The data has already been annotated for veracity and SDQC following a published anno- tation scheme (Zubiaga et al., 2016b), as part of the PHEME project (Derczynski and Bontcheva, 2014), in which the task organisers are partners. 1 http://www.pheme.eu/events/rdsm2015/
  • 4. Veracity prediction examples: u1: Hostage-taker in supermarket siege killed, reports say. #ParisAttacks –LINK– [true] u1: OMG. #Prince rumoured to be performing in Toronto today. Exciting! [false] Figure 2: Examples of source tweets with a veracity value, which has to be predicted in the veracity prediction task. Subtask A S D Q C Train 910 344 358 2,907 Test 94 71 106 778 Subtask B T F U Train 137 62 98 Test 8 12 8 Table 1: Label distribution of training and test datasets. 2.1 Training Data Our training dataset comprises 297 rumourous threads collected for 8 events in total, which in- clude 297 source and 4,222 reply tweets, amount- ing to 4,519 tweets in total. These events include well-known breaking news such as the Charlie Hebdo shooting in Paris, the Ferguson unrest in the US, and the Germanwings plane crash in the French Alps. The size of the dataset means it can be distributed without modifications, according to Twitter’s current data usage policy, as JSON files. This dataset is already publicly available (Zubi- aga et al., 2016a) and constitutes the training and development data. 2.2 Test Data For the test data, we annotated 28 additional threads. These include 20 threads extracted from the same events as the training set, and 8 threads from two newly collected events: (1) a rumour that Hillary Clinton was diagnosed with pneumo- nia during the 2016 US election campaign, and (2) a rumour that Youtuber Marina Joyce had been kidnapped. The test dataset includes, in total, 1,080 tweets, 28 of which are source tweets and 1,052 replies. The distribution of labels in the training and test datasets is summarised in Table 1. 2.3 Context Data Along with the tweet threads, we also provided ad- ditional context that participants could make use of. The context we provided was two-fold: (1) Wikipedia articles associated with the event in question. We provided the last revision of the ar- ticle prior to the source tweet being posted, and (2) content of linked URLs, using the Internet Archive to retrieve the latest revision prior to the link being tweeted, where available. 2.4 Data Annotation The annotation of rumours and their subsequent interactions was performed in two steps. In the first step, we sampled a subset of likely rumourous tweets from all the tweets associated with the event in question, where we used the high num- ber of retweets as an indication of a tweet be- ing potentially rumourous. These sampled tweets were fed to an annotation tool, by means of which our expert journalist annotators members manu- ally identified the ones that did indeed report un- verified updates and were considered to be ru- mours. Whenever possible, they also annotated rumours that had ultimately been proven true or the ones that had been debunked as false stories; the rest were annotated as “unverified”. In the second step, we collected conversations associ- ated with those rumourous tweets, which included all replies succeeding a rumourous source tweet. The type of support (SDQC) expressed by each participant in the conversation was then annotated through crowdsourcing. The methodology for per- forming this crowdsourced annotation process has been previously assessed and validated (Zubiaga et al., 2015a), and is further detailed in (Zubiaga et al., 2016b). The overall inter-annotator agree- ment rate of 63.7% showed the task to be chal- lenging, and easier for source tweets (81.1%) than for replying tweets (62.2%). 
The evaluation data was not available to those participating in any way in the task, and selection decisions were taken only by organisers not connected with any submission, to retain fairness across submissions.

Figure 1 shows an example of what a data instance looks like, where the source tweet in the tree presents a rumourous statement that is supported, denied, queried and commented on by others. Note that replies are nested: some tweets reply directly to the source, while other tweets reply to earlier replies, e.g., u4 and u5 engage in a short conversation replying to each other in the second example. The input to the veracity prediction task is simpler than this; here participants had to determine whether a rumour was true or false by looking only at the source tweet (see Figure 2), optionally making use of the additional context provided by the organisers.

To prepare the evaluation resources, we collected and sampled the tweets around which there is most interaction, placed these in an existing annotation tool to be annotated as rumour vs. non-rumour, categorised them into rumour sub-stories, and labelled them for veracity.

For Subtask A, the extra annotation for support/deny/query/comment at the tweet level within the conversations was performed through crowdsourcing, as had already been done to satisfactory quality for the existing training data (Zubiaga et al., 2015a).

3 Evaluation

The two subtasks were evaluated as follows.

SDQC stance classification: The evaluation of the SDQC labels needed careful consideration, as the distribution of the categories is clearly skewed towards comments. Evaluation is through classification accuracy.

Veracity prediction: The evaluation of the predicted veracity, which is either true or false for each instance, was done using macroaveraged accuracy, hence measuring the ratio of instances for which a correct prediction was made. Additionally, we calculated the RMSE ρ of the difference between system and reference confidence on correct examples, and provided the mean of these scores; incorrect examples have an RMSE of 1. This is normalised and combined with the macroaveraged accuracy to give a final score: score = (1 − ρ) · acc. The baseline is the most common class. For Task A, we also introduce a baseline excluding the common, low-impact "comment" class, considering accuracy over only support, deny and query. This is included as the SDQ baseline.

Team              Score
DFKI DKT          0.635
ECNU              0.778
IITP              0.641
IKM               0.701
Mama Edha         0.749
NileTMRG          0.709
Turing            0.784
UWaterloo         0.780
Baseline (4-way)  0.741
Baseline (SDQ)    0.391

Table 2: Results for Task A: support/deny/query/comment classification.
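To make the veracity scoring concrete, here is a minimal sketch of the metric described above. We assume the reference confidence for a correct prediction is 1 (so a correct example's confidence error is 1 minus the system confidence); the function and variable names are ours, not the official scorer's.

```python
# Sketch of the Subtask B scoring: accuracy combined with the confidence
# RMSE rho as score = (1 - rho) * acc. Incorrect predictions contribute
# a confidence error of 1, per the task description; reference confidence
# of 1 for correct predictions is our assumption.
import math

def veracity_score(gold, predicted, confidence):
    """gold, predicted: lists of 'true'/'false'; confidence: floats in [0, 1]."""
    correct = [g == p for g, p in zip(gold, predicted)]
    acc = sum(correct) / len(gold)
    errors = [(1.0 - c) if ok else 1.0 for ok, c in zip(correct, confidence)]
    rho = math.sqrt(sum(e * e for e in errors) / len(errors))
    return (1.0 - rho) * acc  # final combined score
```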
4 Participant Systems and Results

We received 13 system submissions at RumourEval: eight submissions for Subtask A (Kochkina et al., 2017; Bahuleyan and Vechtomova, 2017; Srivastava et al., 2017; Wang et al., 2017; Singh et al., 2017; Chen et al., 2017; García Lozano et al., 2017; Enayet and El-Beltagy, 2017), the identification of stance towards rumours, and five submissions for Subtask B (Srivastava et al., 2017; Wang et al., 2017; Singh et al., 2017; Chen et al., 2017; Enayet and El-Beltagy, 2017), the rumour veracity classification task. Participant teams came from four continents (Europe: Germany, Sweden, UK; North America: Canada; Asia: China, India, Taiwan; Africa: Egypt), showing the global reach of the issue of rumour veracity on social media.

Most participants tackled Subtask A, which involves classifying a tweet in a conversation thread as either supporting (S), denying (D), querying (Q) or commenting on (C) a rumour. Results are given in Table 2. The distribution of SDQC labels in the training, development and test sets favours comments (see Table 1). Including and recognising the items that fit in this class is important for reducing noise in the other, information-bearing classifications (support, deny and query). In fact, comments often express implicit support: the absence of dispute is a soft signal of agreement.

Systems generally viewed this task as a four-way single-tweet classification task, with the exception of the best performing system (Turing), which addressed it as a sequential classification problem where the SDQC label of each tweet depends on the features and labels of the previous tweets, and the ECNU and IITP systems. The IITP system takes as input pairs of source and reply tweets, whereas the ECNU system addressed class imbalance by decomposing the problem into a two-step classification task (comment vs. non-comment), with all non-comment tweets then classified as S, D or Q. Half of the systems employed ensemble classifiers, where classification was obtained through majority voting (ECNU, MamaEdha, UWaterloo, DFKI-DKT). In some cases the ensembles were hybrid, consisting of both machine learning classifiers and manually created rules, with differential weighting of classifiers for different class labels (ECNU, MamaEdha, DFKI-DKT). Three systems used deep learning: team Turing employed LSTMs for sequential classification; team IKM used convolutional neural networks (CNNs) to obtain a representation of each tweet, to which a softmax classifier assigned a class probability; and team Mama Edha used a CNN as one of the classifiers in their hybrid ensemble. The remaining two systems, NileTMRG and IITP, used support vector machines with linear and polynomial kernels respectively. Half of the systems invested in elaborate feature engineering, including cue words and expressions denoting Belief, Knowledge, Doubt and Denial (UWaterloo), as well as tweet-domain features such as metadata about users, hashtags and event-specific keywords (ECNU, UWaterloo, IITP, NileTMRG). The systems with the least elaborate features were IKM and Mama Edha (word embeddings feeding CNNs), DFKI-DKT (sparse word vectors as input to logistic regression) and Turing (average word vectors, punctuation, similarity between word vectors in the current tweet, source tweet and previous tweet, and the presence of negation, pictures and URLs). Five out of the eight systems used pre-trained word embeddings, mostly Google News word2vec embeddings, while ECNU used four different types of embeddings. Overall, elaborate feature engineering and a strategy for addressing class imbalance seemed to pay off, as can be seen from the high performance of the UWaterloo and ECNU systems. The success of the best performing system (Turing) can be attributed both to treating the problem as a sequential task using LSTMs and to the choice of word embeddings.

Team   Score  Confidence RMSE
IITP   0.393  0.746

Table 3: Results for Task B: Rumour veracity – open variant.

Team      Score  Confidence RMSE
DFKI DKT  0.393  0.845
ECNU      0.464  0.736
IITP      0.286  0.807
IKM       0.536  0.763
NileTMRG  0.536  –

Baseline  0.571  –

Table 4: Results for Task B: Rumour veracity – closed variant.

Subtask B, veracity classification of a source tweet, was viewed as either a three-way (NileTMRG, ECNU, IITP) or two-way (IKM, DFKI-DKT) single-tweet classification task. Results are given in Table 3 for the open variant, where external resources may be used,2 and Table 4 for the closed variant, where no external resource use was permitted. The systems used mostly similar features and classifiers to those in Subtask A, though some added features more specific to the distribution of SDQC labels in replies to the source tweet; e.g. the best performing system in this task, NileTMRG, considered the percentage of reply tweets classified as either S, D or Q (see the sketch below).

2 Namely, the 20160901 English Wikipedia dump.
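As an illustration of this kind of feature (a sketch in the spirit of, but not taken from, the NileTMRG system), the reply-stance fractions for a thread can be derived from predicted SDQC labels and fed into a downstream veracity classifier. The feature names and classifier choice are our own assumptions.

```python
# Sketch of reply-stance features for veracity prediction: the fraction
# of a thread's replies predicted as support, deny or query. The names
# below are illustrative, not the NileTMRG system's actual feature set.
def stance_features(reply_labels):
    """reply_labels: SDQC labels predicted for the replies in one thread."""
    n = max(len(reply_labels), 1)  # guard against threads with no replies
    return {
        "frac_support": reply_labels.count("support") / n,
        "frac_deny": reply_labels.count("deny") / n,
        "frac_query": reply_labels.count("query") / n,
    }
```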
5 Conclusion

Detecting and verifying rumours is a critical task in the current media landscape, vital if populations are to make decisions based on the truth. This shared task brought together many approaches to determining veracity in real media, working through community interactions and claims made on the web. Many systems were able to achieve good results in unravelling the argument around various claims, finding out whether a discussion supports, denies, questions or comments on rumours.

The commentary around a story often helps determine how true that story is, so this advance is a welcome step forward. However, accurately finding out whether a story is false or true remains very hard. Systems did not reach the most-common-class baseline, despite the data not being exceptionally skewed. Even the best systems could have the wrong level of confidence in a true/false judgment, weakly verifying stories that are true, and so on. This tells us that we are making progress, but that the problem is far from solved.

RumourEval leaves behind competitive results, a large number of approaches to be dissected by future researchers, and a benchmark dataset of thousands of documents and novel news stories. This sets a good baseline for the next steps in the area of fake news detection, and provides the material anyone needs to get started on the problem and to evaluate and improve their systems.

Acknowledgments

This work is supported by the European Commission's 7th Framework Programme for research, under grant No. 611223 PHEME. This work is also supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 687847 COMRADES. We are grateful to swissinfo.ch for their extended support in the form of journalistic advice, keeping the task well-grounded, and annotation and task design efforts. We also extend our thanks to the SemEval organisers for their sustained hard work, and to our participants for bearing with us during the first shared task of this nature and all the joy and trouble that comes with it.

References

Pranav Anand, Marilyn Walker, Rob Abbott, Jean E. Fox Tree, Robeson Bowmani, and Michael Minor. 2011. Cats rule and dogs drool!: Classifying stance in online debate. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA '11), pages 1–9. Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=2107653.2107654.

Hareesh Bahuleyan and Olga Vechtomova. 2017. UWaterloo at SemEval-2017 Task 8: Detecting Stance towards Rumours with Topic Independent Features. In Proceedings of SemEval. ACL.

Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa Dang, and Danilo Giampiccolo. 2011. The Seventh PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the Text Analysis Conference. NIST.

Yi-Chin Chen, Zhao-Yang Liu, and Hung-Yu Kao. 2017. IKM at SemEval-2017 Task 8: Convolutional Neural Networks for Stance Detection and Rumor Verification. In Proceedings of SemEval. ACL.

Leon Derczynski and Kalina Bontcheva. 2014. Pheme: Veracity in digital social networks. In UMAP Workshops.

Omar Enayet and Samhaa R. El-Beltagy. 2017. NileTMRG at SemEval-2017 Task 8: Determining Rumour and Veracity Support for Rumours on Twitter. In Proceedings of SemEval. ACL.

Marianela García Lozano, Hanna Lilja, Edward Tjörnhammar, and Maja Karasalo. 2017. Mama Edha at SemEval-2017 Task 8: Stance Classification with CNN and Rules. In Proceedings of SemEval. ACL.

Aditi Gupta, Ponnurangam Kumaraguru, Carlos Castillo, and Patrick Meier. 2014. TweetCred: Real-time credibility assessment of content on Twitter. In SocInfo, pages 228–243. https://doi.org/10.1007/978-3-319-13734-6_16.

Alfred Hermida. 2012. Tweets and truth: Journalism as a discipline of collaborative verification. Journalism Practice 6(5-6):659–668.

Elena Kochkina, Maria Liakata, and Isabelle Augenstein. 2017. Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM. In Proceedings of SemEval. ACL.
KP Krishna Kumar and G Geethakumari. 2014. Detecting misinformation in online social networks using cognitive psychology. Human-centric Computing and Information Sciences 4(1):1–22.

Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, Rui Fang, and Sameena Shah. 2015. Real-time rumor debunking on Twitter. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1867–1870. ACM.

Michal Lukasik, Trevor Cohn, and Kalina Bontcheva. 2015. Classifying tweet level judgements of rumours in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2, pages 2590–2595.

Saif M Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 Task 6: Detecting Stance in Tweets. In Proceedings of the Workshop on Semantic Evaluation.

Alessandro Moschitti, Preslav Nakov, Lluís Màrquez, Walid Magdy, James Glass, and Bilal Randeree. 2015. SemEval-2015 Task 3: Answer selection in community question answering. In Proceedings of SemEval-2015, page 269.

Rob Procter, Jeremy Crump, Susanne Karstedt, Alex Voss, and Marta Cantijoch. 2013a. Reading the riots: What were the Police doing on Twitter? Policing and Society 23(4):413–436.

Rob Procter, Farida Vis, and Alex Voss. 2013b. Reading the riots on Twitter: methodological innovation for the analysis of big data. International Journal of Social Research Methodology 16(3):197–214.

Vahed Qazvinian, Emily Rosengren, Dragomir R Radev, and Qiaozhu Mei. 2011. Rumor has it: Identifying misinformation in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1589–1599. Association for Computational Linguistics.

Chengcheng Shao, Giovanni Luca Ciampaglia, Alessandro Flammini, and Filippo Menczer. 2016. Hoaxy: A platform for tracking online misinformation. arXiv preprint arXiv:1603.01511.

Vikram Singh, Sunny Narayan, Md Shad Akhtar, Asif Ekbal, and Pushpak Bhattacharyya. 2017. IITP at SemEval-2017 Task 8: A Supervised Approach for Rumour Evaluation. In Proceedings of SemEval. ACL.

Ankit Srivastava, Georg Rehm, and Julian Moreno Schneider. 2017. DFKI-DKT at SemEval-2017 Task 8: Rumour Detection and Classification using Cascading Heuristics. In Proceedings of SemEval. ACL.

Zeynep Tufekci. 2014. Big questions for social media big data: Representativeness, validity and other methodological pitfalls. In Proceedings of the AAAI International Conference on Weblogs and Social Media.

Shari R Veil, Tara Buehner, and Michael J Palenchar. 2011. A work-in-process literature review: Incorporating social media in risk and crisis communication. Journal of Contingencies and Crisis Management 19(2):110–122.

Feixiang Wang, Man Lan, and Yuanbin Wu. 2017. ECNU at SemEval-2017 Task 8: Rumour Evaluation Using Effective Features and Supervised Ensemble Models. In Proceedings of SemEval. ACL.

Qiao Zhang, Shuiyuan Zhang, Jian Dong, Jinhua Xiong, and Xueqi Cheng. 2015. Automatic detection of rumor on social network. In Natural Language Processing and Chinese Computing, pages 113–122. Springer.

Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. 2017. Detection and resolution of rumours in social media: A survey. arXiv preprint arXiv:1704.00656.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Kalina Bontcheva, and Peter Tolmie. 2015a. Crowdsourcing the annotation of rumourous conversations in social media. In Proceedings of the 24th International Conference on World Wide Web: Companion Volume, pages 347–353. International World Wide Web Conferences Steering Committee.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Kalina Bontcheva, and Peter Tolmie. 2015b. Towards detecting rumours in social media. In Proceedings of the AAAI Workshop on AI for Cities.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. 2016a. PHEME rumour scheme dataset: Journalism use case. doi:10.6084/m9.figshare.2068650.v1.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. 2016b. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE 11(3):1–29. https://doi.org/10.1371/journal.pone.0150989.